How-To Tutorials

Create enterprise-grade Angular forms in TypeScript [Tutorial]

Sugandha Lahoti
04 Jul 2018
11 min read
TypeScript is an open-source programming language that adds optional static typing to JavaScript. To give you a flavor of its benefits, here is a quick look at some of the things TypeScript brings to the table:

- A compilation step
- Strong or static typing
- Type definitions for popular JavaScript libraries
- Encapsulation
- Private and public member variables
- Decorators

In this article, we will learn how to build forms with TypeScript. We will cover as much as it takes to build business applications that collect user information. Here is a breakdown of what you should expect from this article:

- Typed form input and output
- Form controls
- Validation
- Form submission and handling

This article is an excerpt from the book TypeScript 2.x for Angular Developers, written by Chris Nwamba.

Creating types for forms

We want to utilize TypeScript as much as possible, as it simplifies our development process and makes our app behavior more predictable. For this reason, we will create a simple data class to serve as a type for the form values.

First, create a new Angular project to follow along with the examples. Then, use the following command to create a new class:

```
ng g class flight
```

The class is generated in the app folder; replace its content with the following data class:

```typescript
export class Flight {
  constructor(
    public fullName: string,
    public from: string,
    public to: string,
    public type: string,
    public adults: number,
    public departure: Date,
    public children?: number,
    public infants?: number,
    public arrival?: Date,
  ) {}
}
```

This class represents all the values our form (yet to be created) will have. The properties followed by a question mark (?) are optional, which means that TypeScript will throw no errors when the respective values are not supplied.
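To see how the optional properties behave, here is a small, hypothetical sketch (not part of the book's code) that constructs Flight instances with and without the optional fields; the names and dates are placeholders:

```typescript
import { Flight } from './flight';

// All required fields supplied; optional fields omitted entirely.
const oneWay = new Flight('Ada Obi', 'Lagos', 'London', 'One Way', 2, new Date('2018-08-01'));

// Optional fields supplied as well.
const returnTrip = new Flight(
  'Ada Obi', 'Lagos', 'London', 'Return', 2, new Date('2018-08-01'),
  1,                      // children (optional)
  0,                      // infants (optional)
  new Date('2018-08-15')  // arrival (optional)
);

console.log(oneWay.children); // undefined -- and the compiler does not complain
```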
Before jumping into creating forms, let's start with a clean slate. Replace the contents of app.component.html with the following:

```html
<div class="container">
  <h3 class="text-center">Book a Flight</h3>
  <div class="col-md-offset-3 col-md-6">
    <!-- TODO: Form here -->
  </div>
</div>
```

Run the app and leave it running. You should see the page heading served at port 4200 of localhost (remember to include Bootstrap).

The form module

Now that we have a contract that we want the form to follow, let's generate the form's component:

```
ng g component flight-form
```

The command also adds the component as a declaration to our App module:

```typescript
import { BrowserModule } from '@angular/platform-browser';
import { NgModule } from '@angular/core';

import { AppComponent } from './app.component';
import { FlightFormComponent } from './flight-form/flight-form.component';

@NgModule({
  declarations: [
    AppComponent,
    // Component added after being generated
    FlightFormComponent
  ],
  imports: [
    BrowserModule
  ],
  providers: [],
  bootstrap: [AppComponent]
})
export class AppModule { }
```

What makes Angular forms special and easy to use are functionalities provided out of the box, such as the NgForm directive. Such functionalities are not available in the core browser module but in the forms module. Hence, we need to import it:

```typescript
import { BrowserModule } from '@angular/platform-browser';
import { NgModule } from '@angular/core';
// Import the forms module
import { FormsModule } from '@angular/forms';

import { AppComponent } from './app.component';
import { FlightFormComponent } from './flight-form/flight-form.component';

@NgModule({
  declarations: [
    AppComponent,
    FlightFormComponent
  ],
  imports: [
    BrowserModule,
    // Add the forms module to the imports array
    FormsModule
  ],
  providers: [],
  bootstrap: [AppComponent]
})
export class AppModule { }
```

Simply importing FormsModule and adding it to the imports array is all we need to do.

Two-way binding

The perfect time to start showing some form controls using the form component in the browser is right now. Keeping the state in sync between the data layer (model) and the view can be very challenging, but with Angular it's just a matter of using one directive exposed by FormsModule:

```html
<!-- ./app/flight-form/flight-form.component.html -->
<form>
  <div class="form-group">
    <label for="fullName">Full Name</label>
    <input
      type="text"
      id="fullName"
      class="form-control"
      [(ngModel)]="flightModel.fullName"
      name="fullName"
    >
  </div>
</form>
```

Angular relies on the name attribute internally to carry out binding. For this reason, the name attribute is required.

Pay attention to [(ngModel)]="flightModel.fullName"; it binds a property on the component class to the form. This model will be of the Flight type, which is the class we created earlier:

```typescript
// ./app/flight-form/flight-form.component.ts
import { Component, OnInit } from '@angular/core';
import { Flight } from '../flight';

@Component({
  selector: 'app-flight-form',
  templateUrl: './flight-form.component.html',
  styleUrls: ['./flight-form.component.css']
})
export class FlightFormComponent implements OnInit {
  flightModel: Flight;

  constructor() {
    this.flightModel = new Flight('', '', '', '', 0, new Date(), 0, 0, new Date());
  }

  ngOnInit() {}
}
```

The flightModel property is added to the component as a Flight type and initialized with some default values (note that departure and arrival are typed as Date, so they are initialized with Date objects rather than empty strings).

Include the component in the app HTML, so it can be displayed in the browser:

```html
<div class="container">
  <h3 class="text-center">Book a Flight</h3>
  <div class="col-md-offset-3 col-md-6">
    <app-flight-form></app-flight-form>
  </div>
</div>
```

The form field now shows up in the browser. To see two-way binding in action, use interpolation to display the value of flightModel.fullName. Then, enter a value and watch the live update:

```html
<form>
  <div class="form-group">
    <label for="fullName">Full Name</label>
    <input
      type="text"
      id="fullName"
      class="form-control"
      [(ngModel)]="flightModel.fullName"
      name="fullName"
    >
    {{flightModel.fullName}}
  </div>
</form>
```

More form fields

Let's get hands-on and add the remaining form fields. After all, we can't book a flight by just supplying our names.

The from and to fields are going to be select boxes with a list of cities we can fly into and out of. This list of cities will be stored right in our component class, and then we can iterate over it in the template and render it as a select box:

```typescript
export class FlightFormComponent implements OnInit {
  flightModel: Flight;
  // Array of cities
  cities: Array<string> = [
    'Lagos',
    'Mumbai',
    'New York',
    'London',
    'Nairobi'
  ];

  constructor() {
    this.flightModel = new Flight('', '', '', '', 0, new Date(), 0, 0, new Date());
  }
}
```

The array stores a few cities from around the world as strings.
Let's now use the ngFor directive to iterate over the cities and display them on the form using a select box:

```html
<div class="row">
  <div class="col-md-6">
    <label for="from">From</label>
    <select type="text" id="from" class="form-control" [(ngModel)]="flightModel.from" name="from">
      <option *ngFor="let city of cities" value="{{city}}">{{city}}</option>
    </select>
  </div>
  <div class="col-md-6">
    <label for="to">To</label>
    <select type="text" id="to" class="form-control" [(ngModel)]="flightModel.to" name="to">
      <option *ngFor="let city of cities" value="{{city}}">{{city}}</option>
    </select>
  </div>
</div>
```

Neat and clean! When clicked, each select drop-down shows the list of cities, as expected.

Next, let's add the trip type field (radio buttons), the departure date field (date control), and the arrival date field (date control):

```html
<div class="row" style="margin-top: 15px">
  <div class="col-md-5">
    <label for="" style="display: block">Trip Type</label>
    <label class="radio-inline">
      <input type="radio" name="type" [(ngModel)]="flightModel.type" value="One Way"> One way
    </label>
    <label class="radio-inline">
      <input type="radio" name="type" [(ngModel)]="flightModel.type" value="Return"> Return
    </label>
  </div>
  <div class="col-md-4">
    <label for="departure">Departure</label>
    <input type="date" id="departure" class="form-control" [(ngModel)]="flightModel.departure" name="departure">
  </div>
  <div class="col-md-3">
    <label for="arrival">Arrival</label>
    <input type="date" id="arrival" class="form-control" [(ngModel)]="flightModel.arrival" name="arrival">
  </div>
</div>
```

The data is bound to these controls in much the same way as the text and select fields we created previously. The major difference is the type of control (radio buttons and dates).

Lastly, add the number of passengers (adults, children, and infants):

```html
<div class="row" style="margin-top: 15px">
  <div class="col-md-4">
    <label for="adults">Adults</label>
    <input type="number" id="adults" class="form-control" [(ngModel)]="flightModel.adults" name="adults">
  </div>
  <div class="col-md-4">
    <label for="children">Children</label>
    <input type="number" id="children" class="form-control" [(ngModel)]="flightModel.children" name="children">
  </div>
  <div class="col-md-4">
    <label for="infants">Infants</label>
    <input type="number" id="infants" class="form-control" [(ngModel)]="flightModel.infants" name="infants">
  </div>
</div>
```

The passenger fields are all of the number type because we only need to pick the number of passengers coming on board in each category.

Validating the form and form fields

Angular greatly simplifies form validation by using its built-in directives and state properties. You can use the state properties to check whether a form field has been touched. If it's touched but violates a validation rule, you can use the ngIf directive to display the associated errors. Let's see an example of validating the full name field:

```html
<div class="form-group">
  <label for="fullName">Full Name</label>
  <input
    type="text"
    id="fullName"
    class="form-control"
    [(ngModel)]="flightModel.fullName"
    name="fullName"
    #name="ngModel"
    required
    minlength="6">
</div>
```

We just added three extra significant attributes to our form's full name field: #name, required, and minlength. The #name attribute is completely different from the name attribute: the former is a template variable that holds information about this field via the ngModel value, while the latter is the usual form input name attribute.
In Angular, validation rules are passed as attributes, which is why required and minlength are there. The fields are now validated, but there is no feedback to the user about what went wrong. Let's add some error messages to be shown when form fields are violated:

```html
<div *ngIf="name.invalid && (name.dirty || name.touched)" class="text-danger">
  <div *ngIf="name.errors.required">
    Name is required.
  </div>
  <div *ngIf="name.errors.minlength">
    Name must be at least 6 characters long.
  </div>
</div>
```

The ngIf directive shows these div elements conditionally:

- If the form field has been touched but there's no value in it, the "Name is required" error is shown.
- "Name must be at least 6 characters long" is shown when the field has been touched but the content length is less than 6.

Submitting forms

We need to consider a few factors before submitting a form:

- Is the form valid?
- Is there a handler for the form prior to submission?

To make sure that the form is valid, we can disable the Submit button:

```html
<form #flightForm="ngForm">
  <div class="form-group" style="margin-top: 15px">
    <button class="btn btn-primary btn-block" [disabled]="!flightForm.form.valid">
      Submit
    </button>
  </div>
</form>
```

First, we add a template variable called flightForm to the form and then use the variable to check whether the form is valid. If the form is invalid, we disable the button from being clicked.

To handle the submission, add an ngSubmit event to the form. This event will be fired when the button is clicked:

```html
<form #flightForm="ngForm" (ngSubmit)="handleSubmit()">
  ...
</form>
```

You can now add a class method, handleSubmit, to handle the form submission. A simple log to the console may be enough for this example:

```typescript
export class FlightFormComponent implements OnInit {
  flightModel: Flight;
  cities: Array<string> = [
    ...
  ];

  constructor() {
    this.flightModel = new Flight('', '', '', '', 0, new Date(), 0, 0, new Date());
  }

  // Handler for submission
  handleSubmit() {
    console.log(this.flightModel);
  }
}
```
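As a small extension (not from the book), the handler can also receive the NgForm itself through the flightForm template variable and reset it after a successful submission. A sketch, assuming you pass the form reference in from the template:

```html
<!-- hypothetical: pass the NgForm reference into the handler -->
<form #flightForm="ngForm" (ngSubmit)="handleSubmit(flightForm)">
  ...
</form>
```

```typescript
import { NgForm } from '@angular/forms';

// ...inside FlightFormComponent, a hypothetical variant of the handler shown above
handleSubmit(form: NgForm) {
  if (form.valid) {
    console.log(this.flightModel);
    form.resetForm(); // clears the values and resets the validation state
  }
}
```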
In this article, we discussed collecting user input via forms. We covered important features of forms, such as typed inputs, validation, two-way binding, and submission. All of this should prepare you for getting started with building business applications. If you liked this article, you may want to read the book TypeScript 2.x for Angular Developers to learn about typed DOM events and event handling, among other interesting things you can do with TypeScript.

- TypeScript 2.9 release candidate is here
- How to install and configure TypeScript
- How to work with classes in TypeScript

Why do React developers love Redux for state management?

Sugandha Lahoti
03 Jul 2018
3 min read
Redux is an implementation of Flux, a pattern for managing application state in React. Redux brings a clean and testable design to the table using a purely functional approach. It fills a missing piece in the React ecosystem and sits at the core of most complex React projects. This video tutorial talks about why Redux is needed and touches upon the Redux flow.

Why Redux?

If you have written a large-scale application before, you will know that managing application state can become a pain as the app grows. Application state includes server responses, cached data, and data that has not been persisted to the server yet. Furthermore, the user interface (UI) state constantly increases in complexity.

Let's take the example of an e-commerce website. Such a website contains a lot of components — for instance, the product view, the menu section, and the filter panel. Whenever we have such a complex app, whether it is a mobile or a web app, it becomes difficult for components to communicate and to know each other's updated state. For instance, when you interact with the price filter slider, the product view changes. This can work if a parent component calls the child component and they share properties, but that only scales to simple apps. For complex apps, it becomes difficult to manage the state and the update history across multiple components. Redux comes to the rescue here. In order to understand how Redux functions, we will go through its flow.

Redux Flow

Action

Whenever a state change occurs in the components, it triggers an action creator. An action creator is a function that returns an action. Actions are plain JavaScript objects of information that send data from your application to your store. They are the only source of information for the store.

Reducers

After the action creator returns this object, it is handled by reducers. Reducers specify how the application's state changes in response to the actions sent to the store, depending on the action type.

Store

The store is the object that brings them together. It holds the application state, allows access to that state, and allows the state to be updated.

Provider

The provider distributes the data retrieved from a store to all the other components by wrapping a main base component.

This may all seem highly theoretical and a bit difficult to digest at first, but once you apply it in practice, you will get used to the terminology and to how Redux flows.
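To make the flow above concrete, here is a minimal, hypothetical sketch of the action → reducer → store cycle using the redux package; the names priceFilterChanged and filtersReducer are illustrative, not from the video:

```javascript
const { createStore } = require('redux');

// Action creator: returns a plain action object
const priceFilterChanged = (maxPrice) => ({
  type: 'PRICE_FILTER_CHANGED',
  payload: maxPrice
});

// Reducer: computes the next state based on the action type
const filtersReducer = (state = { maxPrice: Infinity }, action) => {
  switch (action.type) {
    case 'PRICE_FILTER_CHANGED':
      return { ...state, maxPrice: action.payload };
    default:
      return state;
  }
};

// Store: holds the state and lets interested code subscribe to changes
const store = createStore(filtersReducer);
store.subscribe(() => console.log('new state:', store.getState()));

// Interacting with the price slider would dispatch the action
store.dispatch(priceFilterChanged(100)); // new state: { maxPrice: 100 }
```

In a React application, the Provider component from react-redux would then wrap the root component and make this store available to the rest of the component tree.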
Don't forget to watch the video tutorial from Learning React Native Development by Mifta Sintaha to know more about Redux. For a comprehensive guide to building React Native mobile apps, buy the full video course from the Packt store.

- Introduction to Redux
- Creating Reusable Generic Modals in React and Redux
- Minko Gechev: "Developers should learn all major front-end frameworks to go to the next level"

Automating OpenStack Networking and Security with Ansible 2 [Tutorial]

Vijin Boricha
03 Jul 2018
9 min read
OpenStack is software that helps us build a system similar to popular cloud providers such as AWS or GCP. OpenStack provides an API and a dashboard to manage the resources that it controls. Basic operations, such as creating and managing virtual machines, block storage, object storage, identity management, and so on, are supported out of the box.

This is an excerpt from Ansible 2 Cloud Automation Cookbook, written by Aditya Patawari and Vikas Aggarwal. No matter which cloud platform you are using, this book will help you orchestrate your cloud infrastructure.

In the case of OpenStack, we control the underlying hardware and network, which comes with its own pros and cons: we can use custom network solutions, and we can use economical equipment or high-end devices, depending on the actual need. This can get us the features we want and may end up saving money. In this article, we will leverage Ansible 2 to automate some not-so-common networking tasks in OpenStack.

Caution: Although OpenStack can be hosted on premises, several cloud providers offer OpenStack as a service. These providers may choose to turn off certain features or provide add-on features. Even while configuring OpenStack in a self-hosted environment, we may choose to toggle certain features or configure a few things differently. Therefore, inconsistencies may occur. All the code examples in this article were tested on a self-hosted OpenStack Pike (released in August 2017) running on CentOS 7.4.

Managing security groups

Security groups are firewalls that can be used to allow or disallow the flow of traffic. They can be applied to virtual machines. Security groups and virtual machines have a many-to-many relationship: a single security group can be applied to multiple virtual machines, and a single virtual machine can have multiple security groups.

How to do it…

Let's create a security group as follows:

```yaml
- name: create a security group for web servers
  os_security_group:
    name: web-sg
    state: present
    description: security group for web servers
```

The name parameter has to be unique. The description parameter is optional, but we recommend using it to state the purpose of the security group.

The preceding task will create a security group for us, but there are no rules attached to it. A firewall without any rules is of little use, so let's go ahead and add a rule to allow access to port 80 as follows:

```yaml
- name: allow port 80 for http
  os_security_group_rule:
    security_group: web-sg
    protocol: tcp
    port_range_min: 80
    port_range_max: 80
    remote_ip_prefix: 0.0.0.0/0
```

We also need SSH access to this server, so we should allow port 22 as well:

```yaml
- name: allow port 22 for SSH
  os_security_group_rule:
    security_group: web-sg
    protocol: tcp
    port_range_min: 22
    port_range_max: 22
    remote_ip_prefix: 0.0.0.0/0
```

How it works…

For this module, we need to specify the name of the security group; the rule that we are creating will be associated with this group. We have to supply the protocol and the port range information. If we want to whitelist only one port, then that port is both the upper and lower bound of the range. Lastly, we have to specify the allowed addresses in the form of a CIDR. The address 0.0.0.0/0 signifies that port 80 is open to everyone. This task adds an ingress-type rule and allows traffic on port 80 to reach the instance. Similarly, the second rule allows traffic on port 22.
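As a quick illustration (not part of the recipe), the security group can then be attached to an instance at boot time through the os_server module's security_groups parameter; the image, flavor, and keypair names below are placeholders for whatever exists in your environment:

```yaml
- name: boot a web server with the web-sg security group attached
  os_server:
    name: web01
    state: present
    image: centos-7.4            # placeholder image name
    flavor: m1.small             # placeholder flavor
    key_name: my-keypair         # placeholder keypair
    security_groups:
      - web-sg
      - default
```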
Managing network resources

A network is a basic building block of the infrastructure. Most cloud providers will supply a sample or default network, and while setting up a self-hosted OpenStack instance, a single network is typically created automatically. However, if the network is not created, or if we want to create another network for the purpose of isolation or compliance, we can do so using the os_network module.

How to do it…

Let's go ahead and create an isolated network and name it private, as follows:

```yaml
- name: creating a private network
  os_network:
    state: present
    name: private
```

In the preceding example, we created a logical network with no subnets. A network with no subnets is of little use, so the next step is to create a subnet:

```yaml
- name: creating a private subnet
  os_subnet:
    state: present
    network_name: private
    name: app
    cidr: 192.168.0.0/24
    dns_nameservers:
      - 8.8.4.4
      - 8.8.8.8
    host_routes:
      - destination: 0.0.0.0/0
        nexthop: 104.131.86.234
      - destination: 192.168.0.0/24
        nexthop: 192.168.0.1
```

How it works…

The preceding task creates a subnet named app in the network called private. We have also supplied a CIDR for the subnet, 192.168.0.0/24. We are using Google DNS for the nameservers as an example here, but this information should be obtained from the IT department of the organization. Similarly, we have set up example host routes, but this information should be obtained from the IT department as well. After successful execution of this recipe, our network is ready to use.
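To make instances on this subnet reachable from outside, the subnet is typically attached to a router with an external gateway. The following is a sketch only, assuming an external network named public already exists in your deployment:

```yaml
- name: creating a router that connects the app subnet to the outside world
  os_router:
    state: present
    name: app-router
    network: public          # assumed external (gateway) network
    interfaces:
      - app                  # the subnet created above
```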
User management

OpenStack provides an elaborate user management mechanism. If we are coming from a typical third-party cloud provider, such as AWS or GCP, it can look overwhelming. The following list explains the building blocks of user management:

- Domain: A collection of projects and users that defines an administrative entity. Typically, a domain represents a company or a customer account. For a self-hosted setup, domains could be created on the basis of departments or environments. A user with administrative privileges on the domain can further create projects, groups, and users.
- Group: A collection of users owned by a domain. We can add and remove privileges from a group, and our changes will be applied to all the users within the group.
- Project: A project creates a virtual isolation for resources and objects. This can be used to separate departments and environments as well.
- Role: A collection of privileges that can be applied to groups or users.
- User: A user can be a person or a virtual entity, such as a program, that accesses OpenStack services.

For complete documentation of the user management components, go through the OpenStack Identity document at https://docs.openstack.org/keystone/pike/admin/identity-concepts.html.

How to do it…

Let's go ahead and start creating some of these basic building blocks of user management. We should note that, most likely, a default version of these building blocks will already be present in most setups.

1. We'll start with a domain called demodomain, as follows:

```yaml
- name: creating a demo domain
  os_keystone_domain:
    name: demodomain
    description: Demo Domain
    state: present
  register: demo_domain
```

2. After we have the domain, let's create a role, as follows:

```yaml
- name: creating a demo role
  os_keystone_role:
    state: present
    name: demorole
```

3. Projects can be created as follows:

```yaml
- name: creating a demo project
  os_project:
    state: present
    name: demoproject
    description: Demo Project
    domain_id: "{{ demo_domain.id }}"
    enabled: True
```

4. Once we have a role and a project, we can create a group, as follows:

```yaml
- name: creating a demo group
  os_group:
    state: present
    name: demogroup
    description: "Demo Group"
    domain_id: "{{ demo_domain.id }}"
```

5. Let's create our first user:

```yaml
- name: creating a demo user
  os_user:
    name: demouser
    password: secret-pass
    update_password: on_create
    email: demo@example.com
    domain: "{{ demo_domain.id }}"
    state: present
```

6. Now we have a user and a group. Let's add the user to the group that we created before:

```yaml
- name: adding user to the group
  os_user_group:
    user: demouser
    group: demogroup
```

7. We can also associate a user or a group with a role:

```yaml
- name: adding role to the group
  os_user_role:
    group: demogroup
    role: demorole
    domain: "{{ demo_domain.id }}"
```

How it works…

In step 1, os_keystone_domain takes a name as a mandatory parameter. We also supplied a description for our convenience. We are going to use the details of this domain later, so we saved it in a variable called demo_domain.

In step 2, os_keystone_role just takes a name and creates a role. Note that a role is not tied to a domain.

In step 3, the os_project module requires a name. We added a description for our convenience. Projects are tied to a domain, so we used the demo_domain variable that we registered in a previous task.

In step 4, groups are tied to domains as well, so along with the name, we specify the description and domain ID as we did before. At this point, the group is empty, and there are no users associated with it.

In step 5, we supply a name along with a password for the user. The update_password parameter is set to on_create, which means that the password won't be modified for an existing user. This is great for the sake of idempotency. We also specify the email ID, which would be required for recovering the password and several other use cases. Lastly, we add the domain ID to create the user in the right domain.

In step 6, the os_user_group module helps us associate demouser with demogroup.

In step 7, os_user_role takes a parameter for a user or group and associates it with a role.

A lot of these divisions might not be required for every organization. We recommend going through the documentation and understanding the use case for each of them. Another point to note is that we might not even see the user management bits on a day-to-day basis. Depending on the setup and our responsibilities, we might only interact with modules that manage resources, such as virtual machines and storage.
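If the python-openstackclient package is available, a quick way to confirm that the playbook did what we expected is to list the objects it created; this is just a verification sketch, not part of the recipe:

```
# Verify the building blocks created above
openstack domain list
openstack project list --domain demodomain
openstack group list --domain demodomain
openstack user list --domain demodomain
openstack role assignment list --names
```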
We learned how to solve complex OpenStack networking tasks with Ansible 2. To learn more about managing other public cloud platforms, such as AWS and Azure, refer to the book Ansible 2 Cloud Automation Cookbook.

- Getting Started with Ansible 2
- System Architecture and Design of Ansible
- An In-depth Look at Ansible Plugins

Interactive dashboard with vRealize Operations Manager [Tutorial]

Vijin Boricha
03 Jul 2018
14 min read
Creating a dashboard is a relatively simple exercise; creating a good dashboard requires tuning and some tweaking. The tricky part, and one of the biggest challenges when creating a dashboard, is placing all the relevant information on a single pane of glass. The number one goal when creating a dashboard is to get all the information across at a glance.

This is an excerpt from Mastering vRealize Operations Manager - Second Edition, written by Spas Kaloferov, Scott Norris, and Christopher Slater.

Out of the 46 widgets vRealize Operations 6.6 has available, we will only use a handful regularly. The most commonly used widgets, from experience, are the scoreboard, metric picker, heat map, object list, and metric chart. The rest are generally only used for specific use cases.

There are basically two types of dashboards that we can create: an interactive dashboard or a static dashboard. An interactive dashboard is typically used for troubleshooting or similar activities, where you expect the user to interact with widgets to get the information they are after. A static or display dashboard typically uses self-providing widgets such as scoreboards and heatmaps, and is designed for display monitors or other situations where an administrator is keeping an eye on environment changes.

Each widget has the ability to be a self-provider, which means we set the information we want to display directly in the widget. The other option is to set up interactions and have widgets provide information based on an object or metric selection in another widget.

In this article, we will focus on the interactive dashboard. We will create a dashboard that looks at vSphere cluster information and, at a glance, shows the overall health and general cluster information an administrator would need. Working through this will give you the knowledge needed to create any type of dashboard; it shows how to configure the more common widgets in a way that can be replicated on a greater scale.

When creating a dashboard, you will generally go through the following steps:

1. Start the New Dashboard wizard from the Actions menu.
2. Configure the general dashboard settings.
3. Add and configure individual widgets.
4. (Optional) Configure widget interactions.
5. (Optional) Configure dashboard navigation.

Creating an interactive dashboard

You can create a dashboard by using the New Dashboard wizard. Alternatively, you can clone an existing dashboard and modify the clone. Perform the following steps to create a new dashboard:

1. Navigate to the Dashboards page, click Actions, and then click Create Dashboard.
2. Under Dashboard Configuration, give the dashboard a meaningful name and provide a description. If you set Is default to Yes, the dashboard appears on the home page when you log in. By default, the Recommendations dashboard appears on the home page when a user logs in; you can change the default dashboard.
3. Next, click Widget List to bring up all the available widgets. Click and drag the widgets we need from the left pane to the right. We will use the following:
   - Object List
   - Metric Picker
   - Metric Chart
   - Generic Scoreboard
   - Heat Map

You can arrange widgets in the dashboard by dragging them to the desired column position.
The left pane and the right pane are collapsed so that you have more room for your dashboard workspace, which is the center pane. To edit a widget, click the little pen icon at the top of the widget.

The Object List

The Object List widget configuration options include some of the more common options, such as Title, Refresh Content, and Refresh Interval. Options also exist that are specific to this widget:

- Mode: You can select Self, Children, or Parent mode. This setting is used by widget interactions.
- Auto Select First Row: This enables you to select whether or not to start with the first row of data.
- Select which tags to filter: This enables you to select objects from an object tree to observe. For example, you can choose to observe information about objects managed by the vCenter Server instance named VCVA01.

You can add different metrics using the Additional Column option during widget configuration. Using the Additional Column pane, you can add metrics that are specific to each object in the data grid columns.

Perform the following steps to edit the Object List widget in our example dashboard:

1. Click the pen icon on the Object List widget and the Edit Object List window will appear.
2. Edit the Title, select On for Auto Select First Row, and select Cluster Compute Resource under Object Types Tag. Click Save to continue.

With tag selection, multiple tags can be selected; if this is done, only objects that fall under both tag types will be shown in the widget.

The next thing we want to do is click Widget Interactions on the left pane; this is where we link the widgets. For example, when we select a virtual machine in an Object List widget, any linked widgets change to display the information of that object. We will see a Selected Object(s) drop-down list followed by a green arrow pointing to each widget, meaning that whatever we select in the drop-down list feeds the associated widget. Here, our new Cluster List will feed Metric Picker, Scoreboard, and Heatmap, while Metric Picker will feed Metric Chart. Notice that a widget like Metric Chart can be fed by more than one widget. Click APPLY INTERACTIONS to save the links.

The Metric Picker

Now, if we select a metric in the Metric Picker widget, it is plotted in the Metric Chart widget. Metric Picker contains all the available metrics for the selected object, such as an ESXi host or a virtual machine.

The Heatmap

Next up, we will edit the Heatmap widget. For this example, we will use the Heatmap widget to display the capacity remaining for the datastores attached to the vSphere cluster. This is the best way to see at a glance that none of the datastores are over 90% used or getting close. We need to make the following changes:

1. Give the widget a new name describing what we are trying to achieve.
2. Change Group By to Cluster Compute Resource - this is what we want the parent container to be.
3. Change Mode to Instance - this mode type is best used when interacting with other widgets to get its objects.
4. Change Object Type to Datastore - this is the object that we want displayed.
5. Change Attribute Kind to Disk Space | Capacity Remaining (%) - the metric of the object that we want to use.
6. Change the color range to 0 (Min) and 20 (Max) - because we really only want to know whether a datastore is getting close to the threshold, minimizing the range gives us more granular colors.
7. Swap the colors around, making it red on the left and green on the right. This is done by clicking the little color square at each end and picking a new color. We do this because the metric is capacity remaining, so 0% remaining needs to be red.

Click Save, and the heatmap now shows one square per datastore. Move the mouse over each box to reveal more detail about the object.

The Scoreboard

Time to modify the last widget. This one will be a little more complicated because of how we display what we want while remaining interactive. When we configured the widget interactions, the Scoreboard widget was populated automatically with a default set of metrics.

Now, let's go back to our dashboard creation and edit the Scoreboard widget. It has quite a lot of configuration options compared to the others, most of which control how the boxes are laid out, such as the number of columns, the box size, and the rounding of decimals. What we want to do for this widget is:

1. Name the scoreboard something meaningful.
2. Round the decimals to 1 - this cuts down the number of decimal places returned in the displayed value.
3. Under Metric Configuration, choose the Host-Util file from the drop-down list.

But what about the object selection in the lower half of the Scoreboard widget? It is only used if we make the widget a self-provider, which is an option at the top left of the edit window. We can choose objects and metrics there, but they are ignored when Self Provider is set to Off. If we now click Save, we should see the new configuration of the Scoreboard widget. I've also changed the Visual Theme to Original in the Scoreboard widget configuration options to change the way the scoreboard visualizes the information.

The Scoreboard widget may not always display the information we need. To get the widget to display the information we want, while remaining interactive to our selections in the Cluster List widget, we have to create a metric configuration (XML) file.

Metric Configuration Files (XML)

Most widgets are edited through the GUI with the objects and metrics we want displayed, but some require a metric configuration file to define which metrics the widget should display. Metric configuration files provide a custom set of metrics for the supported widgets, and they store the metric attribute keys in XML format.
These widgets support customization using metric configuration files:

- Scoreboard
- Metric Chart
- Property List
- Rolling View Chart
- Sparkline Chart
- Topology Graph

To keep this simple, we will configure four metrics to be displayed:

- CPU usage for the cluster in %
- CPU demand for the cluster
- Memory ballooning
- CPU usage for the cluster in MHz

Perform the following steps to create a metric configuration file:

1. Open a text editor, add the following code, and save it as an XML file; in this case we will call it clusterexample.xml:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<AdapterKinds>
  <AdapterKind adapterKindKey="VMWARE">
    <ResourceKind resourceKindKey="ClusterComputeResource">
      <Metric attrkey="cpu|capacity_usagepct_average" label="CPU" unit="%" yellow="50" orange="75" red="90" />
      <Metric attrkey="cpu|demandPct" label="CPU Demand" unit="%" yellow="50" orange="75" red="90" />
      <Metric attrkey="cpu|usagemhz_average" label="CPU Usage" unit="GHz" yellow="8" orange="16" red="20" />
      <Metric attrkey="mem|vmmemctl_average" label="Balloon Mem" unit="GB" yellow="100" orange="150" red="200" />
    </ResourceKind>
  </AdapterKind>
</AdapterKinds>
```

2. Using WinSCP or a similar product, upload this file to the following location on the vRealize Operations virtual appliance:

```
/usr/lib/vmware-vcops/tomcat-web-app/webapps/vcops-web-ent/WEB-INF/classes/resources/reskndmetrics
```

In this location, you will notice some built-in sample XML files. Alternatively, you can create the XML file from the vRealize Operations user interface: navigate to Administration | Configuration, and then Metric Configuration.

3. Now let's go back to our dashboard and edit the Scoreboard widget. Under Metric Configuration, choose the clusterexample.xml file that we just created from the drop-down list. Click Save to save the configuration.

We have now completed the new dashboard; click Save on the bottom right to save it. We can go back and edit this dashboard whenever we need to, and it is now available on the home page.

For the Scoreboard widget we have used an XML file so that the widget displays the metrics we want to see when an object is selected in another widget. How can we get the correct metric and adapter names to use in this file? Glad you asked. The simplest way is to create a non-interactive dashboard containing the widget we require, configured with all the information we want to display in our interactive one. For example, let's quickly create a temp dashboard with a single scoreboard widget and populate it by manually selecting the objects and metrics with Self Provider set to On:

1. Create another dashboard and drag and drop a single Scoreboard widget.
2. Edit the Scoreboard widget and configure it with all the information we would like.
3. Search for an object in the middle pane and select the metrics we want in the right pane.
4. Configure the box label and Measurement Unit.

A thing to note here is that we selected the memory balloon metric but gave it a label of GB. This is because of a new feature in 6.x that automatically upscales metrics when they are shown on a scoreboard; this also applies to datastore GB to TB, CPU MHz to GHz, and network throughput from KBps to MBps. Typically, in 5.x we would create super metrics to make this happen.
The downside to this is that the badge colors still have to be set in the metric's base unit.

Save this dashboard once it has the metrics we want. Locate it in the dashboard list, select it, click the little cog, and select Export Dashboards. This automatically downloads a file called Dashboard<Date>.json. Open this file in a text editor and look through it; it contains all the information we need to write our XML metric configuration file. First off are the resourceKindKey and adapterKindKey. These are pretty self-explanatory: the resource kind is the cluster resource, and the adapter is the one collecting the metrics, in this case the inbuilt vCenter adapter called VMWARE. Next come the resources, where we find each metricKey, which is the most important piece, as well as the color settings, unit, and label.
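For illustration only, a trimmed-down fragment of such an export might look like the following. The field names are the ones called out above; the real exported JSON varies by version and contains many more fields, so treat this as a sketch rather than the actual schema:

```json
{
  "resources": [
    {
      "adapterKindKey": "VMWARE",
      "resourceKindKey": "ClusterComputeResource",
      "metrics": [
        {
          "metricKey": "cpu|capacity_usagepct_average",
          "label": "CPU",
          "unit": "%",
          "yellow": 50,
          "orange": 75,
          "red": 90
        }
      ]
    }
  ]
}
```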
There it is: the adapterKindKey, resourceKindKey, metricKey, label, unit, and color values from the export map directly onto the attributes we used in clusterexample.xml earlier. Any widget with the Metric Configuration setting available can use the XML files you create, and an XML file can contain multiple AdapterKind entries, as you may need metrics from different adapters.

Today, we learned to create a dashboard that is interactive based on selections made within widgets. You also unraveled the mystery of the metric configuration XML file and how to get the information you require into it. To know more about super metrics and when to use them, check out the book Mastering vRealize Operations Manager - Second Edition.

- What to expect from vSphere 6.7
- How to ace managing the Endpoint Operations Management Agent with vROps
- Troubleshooting techniques in vRealize Operations components

Configuring and deploying HBase [Tutorial]

Natasha Mathur
02 Jul 2018
21 min read
HBase is inspired by the Google Bigtable architecture and is fundamentally a non-relational, open source, column-oriented distributed NoSQL database. Written in Java, it is designed and developed by many engineers under the framework of the Apache Software Foundation. Architecturally, it sits on Apache Hadoop and uses the Hadoop Distributed File System (HDFS) as its foundation; it is a column-oriented database empowered by that fault-tolerant distributed file system. In addition, it provides advanced features such as auto-sharding, load balancing, in-memory caching, replication, compression, near real-time lookups, and strong consistency (using multi-versioning). It uses block caches and Bloom filters to provide faster responses to online/real-time requests, and it supports multiple clients running on heterogeneous platforms by providing user-friendly APIs.

In this tutorial, we will discuss how to effectively set up a mid- to large-sized HBase cluster on top of the Hadoop/HDFS framework, and we will help you set up HBase on a fully distributed cluster. For the cluster setup, we will consider RHEL (Red Hat Enterprise Linux 6.2, 64-bit) and use six nodes. This article is an excerpt taken from the book HBase High Performance Cookbook, written by Ruchir Choudhry. The book provides a solid understanding of the HBase basics. Let's get started!

Configuring and deploying HBase

Before we start HBase in fully distributed mode, we will first set up Hadoop 2.2.0 in distributed mode, and then set up HBase on top of the Hadoop cluster, because HBase stores its data in HDFS.

Getting ready

The first step is to create a directory, /u/HBaseB, and download the tar file from the location given later. The location can be local, a mount point, or in a cloud environment; it can be block storage:

```
wget -b http://apache.mirrors.pair.com/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
```

The -b option downloads the tar file as a background process. The output is piped to wget-log; you can tail this log file using tail -200f wget-log.

Untar it using the following command:

```
tar -xzvf hadoop-2.2.0.tar.gz
```

This untars the file into a folder named hadoop-2.2.0 in your current directory.

Once the untar process is done, for clarity it's recommended to use two different folders: one for the NameNode and the other for the DataNode. I am assuming app is a user and app is a group on the Linux platform with read/write/execute access to these locations. If not, please create the app user and app group if you have sudo su - or root/admin access; otherwise, ask your administrator to create this user and group for you on all the nodes and directories you will be accessing.

To keep the NameNode data and the DataNode data clearly separated, let's create two folders inside /u/HBaseB using the following command:

```
mkdir NameNodeData DataNodeData
```

NameNodeData will hold the data used by the name nodes, and DataNodeData will hold the data used by the data nodes. Running ls -ltr will show the following results:
```
drwxrwxr-x 2 app app  4096 Jun 19 22:22 NameNodeData
drwxrwxr-x 2 app app  4096 Jun 19 22:22 DataNodeData

-bash-4.1$ pwd
/u/HBaseB/hadoop-2.2.0
-bash-4.1$ ls -ltr
total 60K
drwxr-xr-x 2 app app 4.0K Mar 31 08:49 bin
drwxrwxr-x 2 app app 4.0K Jun 19 22:22 DataNodeData
drwxr-xr-x 3 app app 4.0K Mar 31 08:49 etc
```

The steps in planning a Hadoop cluster are:

- Hardware details required for it
- Software required to do the setup
- OS required to do the setup
- Configuration steps

The HDFS core architecture is based on master/slave, where an HDFS cluster comprises a solo NameNode, which is used as the master node and owns the responsibility for orchestrating and handling the file system namespace and controlling access to files by clients. It performs this task by storing all modifications to the underlying file system and propagating these changes as logs and edits appended to the native file system files. The SecondaryNameNode is designed to merge the fsimage and the edits log files regularly, keeping the size of the edit logs within an acceptable limit. In a true cluster/distributed environment, it runs on a different machine; it works as a checkpoint in HDFS.

We will require the following for the NameNode:

| Component | Details | Used for nodes/systems |
| --- | --- | --- |
| Operating system | Red Hat 6.2 Linux x86_64 GNU/Linux, or another standard Linux kernel | All the setup for Hadoop/HBase and the other components used |
| Hardware/CPUs | 16 to 32 CPU cores | NameNode/Secondary NameNode |
| Hardware/CPUs | 2 quad-/hex-/octo-core CPUs | DataNodes |
| Hardware/RAM | 128 to 256 GB; in special cases, 128 GB to 512 GB RAM | NameNode/Secondary NameNode |
| Hardware/RAM | 128 GB to 512 GB of RAM | DataNodes |
| Hardware/storage | It's pivotal to have the NameNode server on a robust and reliable storage platform, as it is responsible for many key activities such as edit-log journaling. Because these machines are so important and the NameNode plays a central role in orchestrating everything, RAID or any robust storage device is acceptable. | NameNode/Secondary NameNode |
| Hardware/storage | 2 to 4 TB hard disks in a JBOD | DataNodes |

RAID stands for Redundant Array of Independent (or Inexpensive) Disks. There are many levels of RAID, but for a master or NameNode, RAID 1 will be enough. JBOD stands for Just a Bunch Of Disks: the design is to have multiple hard drives stacked over each other with no redundancy, so the calling software needs to take care of failure and redundancy. In essence, it works as a single logical volume.

Before we start the cluster setup, a quick recap of the Hadoop setup is essential, with brief descriptions.

How to do it

1. Let's create a directory where you will have all the software components to be downloaded. For simplicity, let's take it as /u/HBaseB.
2. Create different users for different purposes. The format will be user/group; this is essentially required to differentiate the roles used for specific purposes:
   - hdfs/hadoop is for handling the Hadoop-related setup
   - yarn/hadoop is for the YARN-related setup
   - hbase/hadoop
   - pig/hadoop
   - hive/hadoop
   - zookeeper/hadoop
   - hcat/hadoop
3. Set up directories for the Hadoop cluster. Let's assume /u is a shared mount point; we can create specific directories that will be used for specific purposes.

Please make sure that you have adequate privileges on the folders to add, edit, and execute commands. Also, you must set up passwordless communication between the different machines, for example from the name node to the data nodes and from the HBase master to all the region server nodes.
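As a quick aside (not from the book), passwordless SSH between the nodes is usually set up with an SSH key pair. A minimal sketch, assuming the app user and that hostnames such as datanode01 resolve in your environment:

```
# Run as the app user on the NameNode / HBase master
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

# Copy the public key to every DataNode / region server host (hostnames are placeholders)
ssh-copy-id app@datanode01
ssh-copy-id app@datanode02

# Verify that no password prompt appears
ssh app@datanode01 hostname
```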
Once the earlier-mentioned structure is created, we can download the tar files. The layout should look similar to the following:

```
-bash-4.1$ ls -ltr
total 32
drwxr-xr-x  9 app app 4096 hadoop-2.2.0
drwxr-xr-x 10 app app 4096 zookeeper-3.4.6
drwxr-xr-x 15 app app 4096 pig-0.12.1
drwxrwxr-x  7 app app 4096 hbase-0.98.3-hadoop2
drwxrwxr-x  8 app app 4096 apache-hive-0.13.1-bin
drwxrwxr-x  7 app app 4096 Jun 30 01:04 mahout-distribution-0.9
```

You can download these tar files from the following locations:

```
wget https://archive.apache.org/dist/hbase/hbase-0.98.3/hbase-0.98.3-hadoop1-bin.tar.gz
wget https://www.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
wget https://archive.apache.org/dist/mahout/0.9/mahout-distribution-0.9.tar.gz
wget https://archive.apache.org/dist/hive/hive-0.13.1/apache-hive-0.13.1-bin.tar.gz
wget https://archive.apache.org/dist/pig/pig-0.12.1/pig-0.12.1.tar.gz
```

The following steps walk through the configuration.

Let's assume that there is a /u directory and that you have downloaded the entire stack of software there. Go to /u/HBaseB/hadoop-2.2.0/etc/hadoop/ and look for the file core-site.xml. Place the following lines in this configuration file:

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://addressofbsdnsofmynamenode-hadoop:9001</value>
  </property>
</configuration>
```

You can specify any port that you want to use, as long as it does not clash with ports already in use by the system for other purposes. Save the file. This sets up the master/NameNode.

Now, let's set up the SecondaryNameNode. Edit /u/HBaseB/hadoop-2.2.0/etc/hadoop/core-site.xml on that node:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://custom-location-of-your-hdfs</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/u/HBaseB/dn001/hadoop/hdfs/secdn,/u/HBaseB/dn002/hadoop/hdfs/secdn</value>
  </property>
</configuration>
```

The separation of the directory structure gives a clean separation of the HDFS blocks and keeps the configuration as simple as possible; it also allows us to do proper maintenance.

Now, let's move on to changing the setup for HDFS; the file location will be /u/HBaseB/hadoop-2.2.0/etc/hadoop/hdfs-site.xml.
Add these properties to hdfs-site.xml.

For the NameNode:

```xml
<property>
  <name>dfs.name.dir</name>
  <value>/u/HBaseB/nn01/hadoop/hdfs/nn,/u/HBaseB/nn02/hadoop/hdfs/nn</value>
</property>
```

For the DataNode:

```xml
<property>
  <name>dfs.data.dir</name>
  <value>/u/HBaseB/dnn01/hadoop/hdfs/dn,/u/HBaseB/dnn02/hadoop/hdfs/dn</value>
</property>
```

Now, let's set the HTTP addresses for the NameNode and SecondaryNameNode:

```xml
<property>
  <name>dfs.http.address</name>
  <value>yournamenode.full.hostname:50070</value>
</property>
<property>
  <name>dfs.secondary.http.address</name>
  <value>secondary.yournamenode.full.hostname:50090</value>
</property>
```

We could also go for an HTTPS setup for the NameNode, but let's keep that optional for now.

Let's set up the YARN resource manager in /u/HBaseB/hadoop-2.2.0/etc/hadoop/yarn-site.xml.

For the resource tracker, part of the YARN resource manager:

```xml
<property>
  <name>yarn.yourresourcemanager.resourcetracker.address</name>
  <value>youryarnresourcemanager.full.hostname:8025</value>
</property>
```

For the resource scheduler, part of the YARN resource manager:

```xml
<property>
  <name>yarn.yourresourcemanager.scheduler.address</name>
  <value>yourresourcemanager.full.hostname:8030</value>
</property>
```

For the resource manager address:

```xml
<property>
  <name>yarn.yourresourcemanager.address</name>
  <value>yourresourcemanager.full.hostname:8050</value>
</property>
```

For the resource manager admin address:

```xml
<property>
  <name>yarn.yourresourcemanager.admin.address</name>
  <value>yourresourcemanager.full.hostname:8041</value>
</property>
```

To set up the local dirs:

```xml
<property>
  <name>yarn.yournodemanager.local-dirs</name>
  <value>/u/HBaseB/dnn01/hadoop/hdfs/yarn,/u/HBaseB/dnn02/hadoop/hdfs/yarn</value>
</property>
```

To set up the log location:

```xml
<property>
  <name>yarn.yournodemanager.logdirs</name>
  <value>/u/HBaseB/var/log/hadoop/yarn</value>
</property>
```

This completes the configuration changes required for YARN. Now, let's make the changes for MapReduce. Open /u/HBaseB/hadoop-2.2.0/etc/hadoop/mapred-site.xml and place the following property between the <configuration> and </configuration> tags:

```xml
<property>
  <name>mapreduce.yourjobhistory.address</name>
  <value>yourjobhistoryserver.full.hostname:10020</value>
</property>
```

Once we have configured the MapReduce job history details, we can move on to configure HBase. Go to /u/HBaseB/hbase-0.98.3-hadoop2/conf and open hbase-site.xml. You will see a template with empty <configuration> </configuration> tags; we need to add the following lines between them:

```xml
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://hbase.yournamenode.full.hostname:8020/apps/hbase/data</value>
</property>
<property>
  <name>hbase.yourmaster.info.bindAddress</name>
  <value>$hbase.yourmaster.full.hostname</value>
</property>
```

This completes the HBase changes.

ZooKeeper: now let's focus on the setup of ZooKeeper. For a distributed environment, go to /u/HBaseB/zookeeper-3.4.6/conf and rename zoo_sample.cfg to zoo.cfg. Open zoo.cfg with vi zoo.cfg and add the server entries as follows; this declares two ZooKeeper instances on different hosts:

```
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
```

If you want to test this setup locally, please use different port combinations. In a production-like setup, server.1=zoo1:2888:3888 breaks down as server.id=host:port:port, that is:

```
server.1 = server.id
zoo1     = host
2888     = port
3888     = port
```
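For reference, a working zoo.cfg usually also carries the timing, data directory, and client port settings. The following is a minimal sketch only; the dataDir path is an assumption based on the layout used in this setup, and the hostnames are placeholders:

```
# Minimal zoo.cfg sketch
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/u/HBaseB/zookeeper-3.4.6/zooData
clientPort=2181
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
```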
Atomic broadcasting is an atomic messaging system that keeps all the servers in sync and provides reliable delivery, total order, causal order, and so on.

Region servers: before concluding, let's go through the region server setup process. Go to the folder /u/HBaseB/hbase-0.98.3-hadoop2/conf and edit the regionservers file, specifying the region servers accordingly:

```
RegionServer1
RegionServer2
RegionServer3
RegionServer4
```

RegionServer1 is the IP or fully qualified CNAME of the first region server. You can have as many region servers as you like (N=4 in our case), but each one's CNAME and mapping in the regionservers file need to be different.

Copy all the configuration files of HBase and ZooKeeper to the relevant hosts dedicated to HBase and ZooKeeper. As the setup is in fully distributed cluster mode, we will be using different hosts for HBase and its components and a dedicated host for ZooKeeper.

Next, we validate the setup by adding the required environment variables to .bashrc; this makes sure we are able to configure the NameNode as expected. It is preferable to put them in your profile, essentially /etc/profile, so that only the intended shell is affected.

Now let's format the NameNode:

```
sudo su $HDFS_USER
/u/HBaseB/hadoop-2.2.0/bin/hadoop namenode -format
```

HDFS is implemented on top of the existing local file system of your cluster. When you start the Hadoop setup for the first time, you need to start with a clean slate, and hence any existing data needs to be formatted and erased. Before formatting, check whether there is another Hadoop cluster running and using the same HDFS; if it is formatted accidentally, all of its data will be lost.

Once formatting is complete, start the NameNode:

```
/u/HBaseB/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode
```

Now let's start the SecondaryNameNode:

```
sudo su $HDFS_USER
/u/HBaseB/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start secondarynamenode
```

Repeat the same procedure on the DataNodes:

```
sudo su $HDFS_USER
/u/HBaseB/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode
```

Test 01: see whether you can reach http://namenode.full.hostname:50070 from your browser.

Test 02: create a file and copy it into HDFS:

```
sudo su $HDFS_USER
touch /tmp/hello.txt
/u/HBaseB/hadoop-2.2.0/bin/hadoop dfs -mkdir -p /app
/u/HBaseB/hadoop-2.2.0/bin/hadoop dfs -mkdir -p /app/apphduser
/u/HBaseB/hadoop-2.2.0/bin/hadoop dfs -copyFromLocal /tmp/hello.txt /app/apphduser
/u/HBaseB/hadoop-2.2.0/bin/hadoop dfs -ls /app/apphduser
```

This creates a specific directory for this application user in the HDFS file system (/app/apphduser). apphduser is a directory created in HDFS for a specific user, so that data is separated by user; in a true production environment many users will be using the cluster. You can also use hdfs dfs -ls / if the hadoop command is reported as deprecated. You must see hello.txt once the last command executes.

Test 03: browse http://datanode.full.hostname:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/&nnaddr=$datanode.full.hostname:8020 (it is important to change the datanode host name and other parameters accordingly). You should see the details of the DataNode.
Hitting the Test 03 URL shows the DataNode's browse screen in the browser; the command-line listing shows the same contents.

Validate the YARN/MapReduce setup. Execute this command from the ResourceManager host:

<login as $YARN_USER>
/u/HBaseB/hadoop-2.2.0/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

Execute the following command from the NodeManager host:

<login as $YARN_USER>
/u/HBaseB/hadoop-2.2.0/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager

Executing the following commands will create the log directory in HDFS and apply the required access rights:

cd /u/HBaseB/hadoop-2.2.0/bin
hadoop fs -mkdir /app-logs              # creates the directory in HDFS
hadoop fs -chown $YARN_USER /app-logs   # changes the ownership
hadoop fs -chmod 1777 /app-logs         # sets the sticky bit so users can only remove their own files

To execute MapReduce jobs, start the job history server:

<login as $MAPRED_USER>
/u/HBaseB/hadoop-2.2.0/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR

Let's run a few tests to be sure we have configured everything properly:

Test 01: From a browser, or with curl, browse http://yourresourcemanager.full.hostname:8088/.

Test 02:

sudo su $HDFS_USER
/u/HBaseB/hadoop-2.2.0/bin/hadoop jar /u/HBaseB/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar teragen 100 /test/10gsort/input
/u/HBaseB/hadoop-2.2.0/bin/hadoop jar /u/HBaseB/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar

Validate the HBase setup. Login as $HDFS_USER and create the HBase directory in HDFS:

/u/HBaseB/hadoop-2.2.0/bin/hadoop fs -mkdir -p /apps/hbase
/u/HBaseB/hadoop-2.2.0/bin/hadoop fs -chown -R app:app /apps/hbase

Now login as $HBASE_USER:

/u/HBaseB/hbase-0.98.3-hadoop2/bin/hbase-daemon.sh --config $HBASE_CONF_DIR start master

This command starts the master node. Now let's move to the HBase region server nodes:

/u/HBaseB/hbase-0.98.3-hadoop2/bin/hbase-daemon.sh --config $HBASE_CONF_DIR start regionserver

This command starts the region servers. For a single machine, a direct sudo ./hbase master start can also be used. Check the log files at /opt/HBaseB/hbase-0.98.5-hadoop2/logs for any errors.

Now let's log in and open the shell, which connects us to the HBase master:

sudo su - $HBASE_USER
/u/HBaseB/hbase-0.98.3-hadoop2/bin/hbase shell

Validate the ZooKeeper setup. If you want to use an external ZooKeeper (an existing ensemble that is not managed by HBase), make sure no HBase-managed internal ZooKeeper is running at the same time. For this, edit /opt/HBaseB/hbase-0.98.5-hadoop2/conf/hbase-env.sh and change the following statement from true to false (HBASE_MANAGES_ZK=false):

# Tell HBase whether it should manage its own instance of ZooKeeper or not.
export HBASE_MANAGES_ZK=false

Once this is done we can add zoo.cfg to HBase's CLASSPATH; HBase uses zoo.cfg as its default lookup for ZooKeeper configuration:

dataDir=/opt/HBaseB/zookeeper-3.4.6/zooData   # this is the place where the ZooKeeper data will be present
server.1=172.28.182.45:2888:3888              # IP and ports for server 01
server.2=172.29.75.37:4888:5888               # IP and ports for server 02

You can edit the log4j.properties file, located at /opt/HBaseB/zookeeper-3.4.6/conf, and point it at the location where you want to keep the logs:

# Define some default values that can be overridden by system properties
zookeeper.root.logger=INFO, CONSOLE
zookeeper.console.threshold=INFO
zookeeper.log.dir=.
zookeeper.log.file=zookeeper.log
zookeeper.log.threshold=DEBUG
zookeeper.tracelog.dir=.
# you can specify the location here
zookeeper.tracelog.file=zookeeper_trace.log

Once this is done, start ZooKeeper with the following command:

-bash-4.1$ sudo /u/HBaseB/zookeeper-3.4.6/bin/zkServer.sh start
Starting zookeeper ... STARTED

You can also redirect the console output to a ZooKeeper log file, for example:

sudo /u/HBaseB/zookeeper-3.4.6/bin/zkServer.sh start > /u/HBaseB/zookeeper-3.4.6/zoo.out 2>&1

Here 2 refers to the second file descriptor of the process, that is stderr; > means redirect; and &1 means the target of the redirection should be the same location as the first file descriptor, that is stdout.
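Once every ensemble member reports STARTED, you can also smoke-test the quorum from code. The snippet below is not part of the recipe: it uses the third-party kazoo Python client (pip install kazoo) and assumes the default clientPort of 2181, which the zoo.cfg fragment above does not show since it only lists the quorum and leader-election ports.

# zk_smoke_test.py -- optional sketch, not part of the original recipe.
# Requires the kazoo library and assumes clientPort=2181 in zoo.cfg.
from kazoo.client import KazooClient

zk = KazooClient(hosts="172.28.182.45:2181,172.29.75.37:2181")
zk.start(timeout=10)                      # raises an exception if no ensemble member answers
print("Connected. Top-level znodes:", zk.get_children("/"))
zk.stop()

If HBase has already registered with this ensemble, an /hbase znode should typically appear in the listing.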
How it works

Sizing the environment is critical to the success of any project, and optimizing it to your needs is a complex task. We dissect the cluster into a master and a slave setup, which breaks down into the following parts:

Master - NameNode
Master - Secondary NameNode
Master - JobTracker
Master - YARN ResourceManager
Master - HBase Master
Slave - DataNode
Slave - MapReduce TaskTracker
Slave - YARN NodeManager
Slave - HBase region server

NameNode: The architecture of Hadoop gives us the capability to set up a fully fault-tolerant, highly available Hadoop/HBase cluster. Doing so requires a master and slave setup. In a fully HA setup, the NameNodes are configured in an active-passive way: one node is active at any given point in time and the other remains passive. The active node interacts with the clients and works as their coordinator, while the standby node keeps itself synchronized with the active node so that its state stays intact and live; in the case of a failover it is ready to take the load without any downtime. We also have to make sure that when the passive node takes over after a failure, it is in perfect sync with the node that was taking the traffic. This is done by JournalNodes (JNs), which use daemon threads to keep the primary and secondary in perfect sync.

JournalNode: By design, the JournalNodes only ever allow a single NameNode, the active/primary, to act as a writer at a time. If the active node fails, the passive NameNode immediately takes charge and transforms itself into the active node, which means the newly active node starts writing to the JournalNodes. This prevents the other NameNode from staying in the active state and confirms that the newly active node is working as the failover node.

JobTracker: This is an integral part of the Hadoop ecosystem. It works as a service that farms out MapReduce tasks to specific nodes in the cluster.

ResourceManager (RM): Its responsibility is limited to scheduling, that is, mediating the available resources in the system between the different needs of applications, along with membership tasks such as registering new nodes and retiring dead ones, which it does by constantly monitoring heartbeats based on its internal configuration. Because of this explicit separation of responsibilities and clear modularity, and thanks to the inbuilt, robust scheduler API, the ResourceManager can scale and support different design needs on one end and cater to different programming models on the other.

HBase Master: The Master server is the main orchestrator for all the region servers in the HBase cluster. Usually it is placed on the ZooKeeper nodes; in a real cluster configuration, you will have 5 to 6 ZooKeeper nodes.
DataNode: This is the real workhorse and does most of the heavy lifting; it runs the MapReduce jobs and stores the chunks of HDFS data. The core objective of the DataNode is to run on commodity hardware and to be agnostic to failures. It keeps part of the HDFS data, and multiple copies of the same data are sprinkled around the cluster, which makes the DataNode architecture fully fault tolerant. This is the reason a DataNode can use JBOD rather than relying on expensive RAID.

MapReduce: Jobs run on these DataNodes in parallel as subtasks, and the resulting data stays consistent across the cluster.

So we learned about the HBase basics and how to configure and set it up. We set up HBase to store data in the Hadoop Distributed File System. We also explored the working structure of RAID and JBOD and the differences between the two storage approaches.

If you found this post useful, be sure to check out the book 'HBase High Performance Cookbook' to learn more about configuring HBase in terms of administering and managing clusters, as well as other concepts in HBase.

Understanding the HBase Ecosystem
Configuring HBase
5 Mistakes Developers make when working with HBase


Administration rights for Power BI users

Pravin Dhandre
02 Jul 2018
8 min read
In this tutorial, you will understand and learn administration rights/rules for Power BI users. This includes setting and monitoring rules like; who in the organization can utilize which feature, how Power BI Premium capacity is allocated and by whom, and other settings such as embed codes and custom visuals. This article is an excerpt from a book written by Brett Powell titled Mastering Microsoft Power BI. The admin portal is accessible to Office 365 Global Administrators and users mapped to the Power BI service administrator role. To open the admin portal, log in to the Power BI service and select the Admin portal item from the Settings (Gear icon) menu in the top right, as shown in the following screenshot: All Power BI users, including Power BI free users, are able to access the Admin portal. However, users who are not admins can only view the Capacity settings page. The Power BI service administrators and Office 365 global administrators have view and edit access to the following seven pages: Administrators of Power BI most commonly utilize the Tenant settings and Capacity settings as described in the Tenant Settings and Power BI Premium Capacities sections later in this tutorial. However, the admin portal can also be used to manage any approved custom visuals for the organization. Usage metrics The Usage metrics page of the Admin portal provides admins with a Power BI dashboard of several top metrics, such as the most consumed dashboards and the most consumed dashboards by workspace. However, the dashboard cannot be modified and the tiles of the dashboard are not linked to any underlying reports or separate dashboards to support further analysis. Given these limitations, alternative monitoring solutions are recommended, such as the Office 365 audit logs and usage metric datasets specific to Power BI apps. Details of both monitoring options are included in the app usage metrics and Power BI audit log activities sections later in this chapter. Users and Audit logs The Users and Audit logs pages only provide links to the Office 365 admin center. In the admin center, Power BI users can be added, removed and managed. If audit logging is enabled for the organization via the Create audit logs for internal activity and auditing and compliance tenant setting, this audit log data can be retrieved from the Office 365 Security & Compliance Center or via PowerShell. This setting is noted in the following section regarding the Tenant settings tab of the Power BI admin portal. An Office 365 license is not required to utilize the Office 365 admin center for Power BI license assignments or to retrieve Power BI audit log activity. Tenant settings The Tenant settings page of the Admin portal allows administrators to enable or disable various features of the Power BI web service. Likewise, the administrator could allow only a certain security group to embed Power BI content in SaaS applications such as SharePoint Online. The following diagram identifies the 18 tenant settings currently available in the admin portal and the scope available to administrators for configuring each setting: From a data security perspective, the first seven settings within the Export and Sharing and Content packs and apps groups are most important. For example, many organizations choose to disable the Publish to web feature for the entire organization. Additionally, only certain security groups may be allowed to export data or to print hard copies of reports and dashboards. 
As shown in the Scope column of the previous table and the following example, granular security group configurations are available to minimize risk and manage the overall deployment. Currently, only one tenant setting is available for custom visuals and this setting (Custom visuals settings) can be enabled or disabled for the entire organization only. For organizations that wish to restrict or prohibit custom visuals for security reasons, this setting can be used to eliminate the ability to add, view, share, or interact with custom visuals. More granular controls to this setting are expected later in 2018, such as the ability to define users or security groups of users who are allowed to use custom visuals. In the following screenshot from the Tenant settings page of the Admin portal, only the users within the BI Admin security group who are not also members of the BI Team security group are allowed to publish apps to the entire organization: For example, a report author who also helps administer the On-premises data gateway via the BI Admin security group would be denied the ability to publish apps to the organization given membership in the BI Team security group. Many of the tenant setting configurations will be more simple than this example, particularly for smaller organizations or at the beginning of Power BI deployments. However, as adoption grows and the team responsible for Power BI changes, it's important that the security groups created to help administer these settings are kept up to date. Embed Codes Embed Codes are created and stored in the Power BI service when the Publish to web feature is utilized. As described in the Publish to web section of the previous chapter, this feature allows a Power BI report to be embedded in any website or shared via URL on the public internet. Users with edit rights to the workspace of the published to web content are able to manage the embed codes themselves from within the workspace. However, the admin portal provides visibility and access to embed codes across all workspaces, as shown in the following screenshot: Via the Actions commands on the far right of the Embed Codes page, a Power BI Admin can view the report in a browser (diagonal arrow) or remove the embed code. The Embed Codes page can be helpful to periodically monitor the usage of the Publish to web feature and for scenarios in which data was included in a publish to web report that shouldn't have been, and thus needs to be removed. As shown in the Power BI Tenant settings table referenced in the previous section, this feature can be enabled or disabled for the entire organization or for specific users within security groups. Organizational Custom visuals The Custom Visuals page allows admins to upload and manage custom visuals (.pbiviz files) that have been approved for use within the organization. For example, an organization may have proprietary custom visuals developed internally, which it wishes to expose to business users. Alternatively, the organization may wish to define a set of approved custom visuals, such as only the custom visuals that have been certified by Microsoft. In the following screenshot, the Chiclet Slicer custom visual is added as an organizational custom visual from the Organizational visuals page of the Power BI admin portal: The Organizational visuals page provides a link (Add a custom visual) to launch the form and identifies all uploaded visuals, as well as their last update. 
Once a visual has been uploaded, it can be deleted but not updated or modified. Therefore, when a new version of an organizational visual becomes available, this visual can be added to the list of organizational visuals with a descriptive title (Chiclet Slicer v2.0). Deleting an organizational custom visual will cause any reports that use this visual to stop rendering. The following screenshot reflects the uploaded Chiclet Slicer custom visual on the Organization visuals page: Once the custom visual has been uploaded as an organizational custom visual, it will be accessible to users in Power BI Desktop. In the following screenshot from Power BI Desktop, the user has opened the MARKETPLACE of custom visuals and selected MY ORGANIZATION: In this screenshot, rather than searching through the MARKETPLACE, the user can go directly to visuals defined by the organization. The marketplace of custom visuals can be launched via either the Visualizations pane or the From Marketplace icon on the Home tab of the ribbon. Organizational custom visuals are not supported for reports or dashboards shared with external users. Additionally, organizational custom visuals used in reports that utilize the publish to web feature will not render outside the Power BI tenant. Moreover, Organizational custom visuals are currently a preview feature. Therefore, users must enable the My organization custom visuals feature via the Preview features tab of the Options window in Power BI Desktop. With this, we got you acquainted with features and processes applicable in administering Power BI for an organization. This includes the configuration of tenant settings in the Power BI admin portal, analyzing the usage of Power BI assets, and monitoring overall user activity via the Office 365 audit logs. If you found this tutorial useful, do check out the book Mastering Microsoft Power BI to develop visually rich, immersive, and interactive Power BI reports and dashboards. Unlocking the secrets of Microsoft Power BI A tale of two tools: Tableau and Power BI Building a Microsoft Power BI Data Model

How to migrate Power BI datasets to Microsoft Analysis Services models [Tutorial]

Pravin Dhandre
29 Jun 2018
5 min read
The Azure Analysis Services web designer, supports the ability to import a data model contained within a Power BI Desktop file. The imported or migrated model can then take advantage of the resources available to the Azure Analysis Services server and can be accessed from client tools such as Power BI Desktop. Additionally, Azure Analysis Services provides a Visual Studio project file and a Model.bim file for the migrated model that a corporate BI team can use in SSDT for Visual Studio. In this tutorial, you will learn how to migrate your Power BI data to Microsoft Analysis Services for further self-service BI solutions and delivering flexibility to a huge network of stakeholders. This article is an excerpt from a book written by Brett Powell titled Mastering Microsoft Power BI. The following process migrates the model within a Power BI Desktop file to an Azure Analysis Server and downloads the Visual Studio project file for the migrated model: Open the Web designer from the Overview page of the Azure Analysis Services resource in the Azure portal On the Models form, click Add and then provide a name for the new model in the New model form Select the Power BI Desktop File source icon at the bottom and choose the file on the Import menu Click Import to begin the migration process The following screenshot represents these four steps from the Azure Analysis Services web designer: In this example, a Power BI Desktop file (AdWorks Enterprise.pbix) that contains an import mode model based on two on-premises sources (SQL Server and Excel) is imported via the Azure Analysis Services web designer. Once the import is complete, the Field list from the model will be exposed on the right and the imported model will be accessible from client tools like any other Azure Analysis Services model. For example, refreshing the Azure AS server in SQL Server Management Studio will expose the new database (AdWorks Enterprise). Likewise, the Azure Analysis Services database connection in Power BI Desktop (Get Data | Azure) can be used to connect to the migrated model, as shown in the following screenshot: Just like the SQL Server Analysis Services database connection (Get Data | Database), the only required field is the name of the server which is provided in the Azure portal. From the Overview page of the Azure Analysis Services resource, select the Open in Visual Studio project option from the context menu on the far right, as shown in the following screenshot: Save the zip file provided by Azure Analysis Services to a secure local network location. Extract the files from the zip file to expose the Analysis Services project and .bim file, as shown in the following screenshot: In Visual Studio, open a project/solution (File | Open | Project/Solution) and navigate to the downloaded project file (.smproj). Select the project file and click Open. Double-click the Model.bim file in the Solution Explorer window to expose the metadata of the migrated model. All of the objects of the data model built into the Power BI Desktop file including Data Sources, Queries, and Measures are accessible in SSDT just like standard Analysis Services projects, as shown in the following screenshot: The preceding screenshot from Diagram view in SQL Server Data Tools exposes the two on-premises sources of the imported PBIX file via the Tabular Model Explorer window. By default, the deployment server of the Analysis Services project in SSDT is set to the Azure Analysis Services server. 
As an alternative to a new solution with a single project, an existing solution with an existing Analysis Services project could be opened and the new project from the migration could be added to this solution. This can be accomplished by right-clicking the existing solution's name in the Solution Explorer window and selecting the Existing project from the Add menu (Add | Existing project). This approach allows the corporate BI developer to view and compare both models and optionally implement incremental changes, such as new columns or measures that were exclusive to the Power BI Desktop file. The following screenshot from a solution in Visual Studio includes both the migrated model (via the project file) and an existing Analysis Services model (AdWorks Import): The ability to quickly migrate Power BI datasets to Analysis Services models complements the flexibility and scale of Power BI Premium capacity in allowing organizations to manage and deploy Power BI on their terms. By now, you have successfully migrated your Power BI datasets to Analysis Services and can enjoy the complete flexibility of making further edits to your model for mining much better insights out of it. If you found this tutorial useful, do check out the book Mastering Microsoft Power BI and start producing insightful and beautiful reports from hundreds of data sources and scale across the enterprise. How to use M functions within Microsoft Power BI for querying data Building a Microsoft Power BI Data Model How to build a live interactive visual dashboard in Power BI with Azure Stream


Indexing, Replicating, and Sharding in MongoDB [Tutorial]

Amey Varangaonkar
29 Jun 2018
11 min read
MongoDB is an open source, document-oriented, and cross-platform database. It is primarily written in C++. It is also the leading NoSQL database and tied with the SQL database in the fifth position after PostgreSQL. It provides high performance, high availability, and easy scalability. MongoDB uses JSON-like documents with schema. MongoDB, developed by MongoDB Inc., is free to use. It is published under a combination of the GNU Affero General Public License and the Apache License. In this article, we look at the indexing, replication and sharding features offered by MongoDB. The following excerpt is taken from the book 'Seven NoSQL Databases in a Week' written by Aaron Ploetz et al. Introduction to MongoDB indexing Indexes allow efficient execution of MongoDB queries. If we don't have indexes, MongoDB has to scan all the documents in the collection to select those documents that match the criteria. If proper indexing is used, MongoDB can limit the scanning of documents and select documents efficiently. Indexes are a special data structure that store some field values of documents in an easy-to-traverse way. Indexes store the values of specific fields or sets of fields, ordered by the values of fields. The ordering of field values allows us to apply effective algorithms of traversing, such as the mid-search algorithm, and also supports range-based operations effectively. In addition, MongoDB can return sorted results easily. Indexes in MongoDB are the same as indexes in other database systems. MongoDB defines indexes at the collection level and supports indexes on fields and sub-fields of documents. The default _id index MongoDB creates the default _id index when creating a document. The _id index prevents users from inserting two documents with the same _id value. You cannot drop an index on an _id field. The following syntax is used to create an index in MongoDB: >db.collection.createIndex(<key and index type specification>, <options>); The preceding method creates an index only if an index with the same specification does not exist. MongoDB indexes use the B-tree data structure. The following are the different types of indexes: Single field: In addition to the _id field index, MongoDB allows the creation of an index on any single field in ascending or descending order. For a single field index, the order of the index does not matter as MongoDB can traverse indexes in any order. The following is an example of creating an index on the single field where we are creating an index on the firstName field of the user_profiles collection: The query gives acknowledgment after creating the index: This will create an ascending index on the firstName field. To create a descending index, we have to provide -1 instead of 1. Compound index: MongoDB also supports user-defined indexes on multiple fields. The order of fields defined while creating an index has a significant effect. For example, a compound index defined as {firstName:1, age:-1} will sort data by firstName first and then each firstName with age. Multikey index: MongoDB uses multi-key indexes to index the content in the array. If you index the field that contains the array values, MongoDB creates an index for each field in the object of an array. These indexes allow queries to select the document by matching the element or set of elements of the array. MongoDB automatically decides whether to create multi-key indexes or not. Text indexes: MongoDB provides text indexes that support the searching of string contents in the MongoDB collection. 
To create text indexes, we have to use the db.collection.createIndex() method, but we need to pass a text string literal in the query: You can also create text indexes on multiple fields, for example: Once the index is created, we get an acknowledgment: Compound indexes can be used with text indexes to define an ascending or descending order of the index. Hashed index: To support hash-based sharding, MongoDB supports hashed indexes. In this approach, indexes store the hash value and query, and the select operation checks the hashed indexes. Hashed indexes can support only equality-based operations. They are limited in their performance of range-based operations. Indexes have the following properties: Unique indexes: Indexes should maintain uniqueness. This makes MongoDB drop the duplicate value from indexes. Partial Indexes: Partial indexes apply the index on documents of a collection that match a specified condition. By applying an index on the subset of documents in the collection, partial indexes have a lower storage requirement as well as a reduced performance cost. Sparse index: In the sparse index, MongoDB includes only those documents in the index in which the index field is present, other documents are discarded. We can combine unique indexes with a sparse index to reject documents that have duplicate values but ignore documents that have an indexed key. TTL index: TTL indexes are a special type of indexes where MongoDB will automatically remove the document from the collection after a certain amount of time. Such indexes are ideal to remove machine-generated data, logs, and session information that we need for a finite duration. The following TTL index will automatically delete data from the log table after 3000 seconds: Once the index is created, we get an acknowledgment message: The limitations of indexes: A single collection can have up to 64 indexes only. The qualified index name is <database-name>.<collection-name>.$<index-name> and cannot have more than 128 characters. By default, the index name is a combination of index type and field name. You can specify an index name while using the createIndex() method to ensure that the fully-qualified name does not exceed the limit. There can be no more than 31 fields in the compound index. The query cannot use both text and geospatial indexes. You cannot combine the $text operator, which requires text indexes, with some other query operator required for special indexes. For example, you cannot combine the $text operator with the $near operator. Fields with 2d sphere indexes can only hold geometry data. 2d sphere indexes are specially provided for geometric data operations. For example, to perform operations on co-ordinate, we have to provide data as points on a planer co-ordinate system, [x, y]. For non-geometries, the data query operation will fail. The limitation on data: The maximum number of documents in a capped collection must be less than 2^32. We should define it by the max parameter while creating it. If you do not specify, the capped collection can have any number of documents, which will slow down the queries. The MMAPv1 storage engine will allow 16,000 data files per database, which means it provides the maximum size of 32 TB. We can set the storage.mmapv1.smallfile parameter to reduce the size of the database to 8 TB only. Replica sets can have up to 50 members. Shard keys cannot exceed 512 bytes. Replication in MongoDB A replica set is a group of MongoDB instances that store the same set of data. 
Replicas are basically used in production to ensure a high availability of data. Redundancy and data availability: because of replication, we have redundant data across the MongoDB instances. We are using replication to provide a high availability of data to the application. If one instance of MongoDB is unavailable, we can serve data from another instance. Replication also increases the read capacity of applications as reading operations can be sent to different servers and retrieve data faster. By maintaining data on different servers, we can increase the locality of data and increase the availability of data for distributed applications. We can use the replica copy for backup, reporting, as well as disaster recovery. Working with replica sets A replica set is a group of MongoDB instances that have the same dataset. A replica set has one arbiter node and multiple data-bearing nodes. In data-bearing nodes, one node is considered the primary node while the other nodes are considered the secondary nodes. All write operations happen at the primary node. Once a write occurs at the primary node, the data is replicated across the secondary nodes internally to make copies of the data available to all nodes and to avoid data inconsistency. If a primary node is not available for the operation, secondary nodes use election algorithms to select one of their nodes as a primary node. A special node, called an arbiter node, is added in the replica set. This arbiter node does not store any data. The arbiter is used to maintain a quorum in the replica set by responding to a heartbeat and election request sent by the secondary nodes in replica sets. As an arbiter does not store data, it is a cost-effective resource used in the election process. If votes in the election process are even, the arbiter adds a voice to choose a primary node. The arbiter node is always the arbiter, it will not change its behavior, unlike a primary or secondary node. The primary node can step down and work as secondary node, while secondary nodes can be elected to perform as primary nodes. Secondary nodes apply read/write operations from a primary node to secondary nodes asynchronously. Automatic failover in replication Primary nodes always communicate with other members every 10 seconds. If it fails to communicate with the others in 10 seconds, other eligible secondary nodes hold an election to choose a primary-acting node among them. The first secondary node that holds the election and receives the majority of votes is elected as a primary node. If there is an arbiter node, its vote is taken into consideration while choosing primary nodes. Read operations Basically, the read operation happens at the primary node only, but we can specify the read operation to be carried out from secondary nodes also. A read from a secondary node does not affect data at the primary node. Reading from secondary nodes can also give inconsistent data. Sharding in MongoDB Sharding is a methodology to distribute data across multiple machines. Sharding is basically used for deployment with a large dataset and high throughput operations. The single database cannot handle a database with large datasets as it requires larger storage, and bulk query operations can use most of the CPU cycles, which slows down processing. For such scenarios, we need more powerful systems. One approach is to add more capacity to a single server, such as adding more memory and processing units or adding more RAM on the single server, this is also called vertical scaling. 
Another approach is to divide a large dataset across multiple systems and serve a data application to query data from multiple servers. This approach is called horizontal scaling. MongoDB handles horizontal scaling through sharding. Sharded clusters MongoDB's sharding consists of the following components: Shard: Each shard stores a subset of sharded data. Also, each shard can be deployed as a replica set. Mongos: Mongos provide an interface between a client application and sharded cluster to route the query. Config server: The configuration server stores the metadata and configuration settings for the cluster. The MongoDB data is sharded at the collection level and distributed across sharded clusters. Shard keys: To distribute documents in collections, MongoDB partitions the collection using the shard key. MongoDB shards data into chunks. These chunks are distributed across shards in sharded clusters. Advantages of sharding Here are some of the advantages of sharding: When we use sharding, the load of the read/write operations gets distributed across sharded clusters. As sharding is used to distribute data across a shard cluster, we can increase the storage capacity by adding shards horizontally. MongoDB allows continuing the read/write operation even if one of the shards is unavailable. In the production environment, shards should deploy with a replication mechanism to maintain high availability and add fault tolerance in a system. Indexing, sharding and replication are three of the most important tasks to perform on any database, as they ensure optimal querying and database performance. In this article, we saw how MongoDB facilitates these tasks and makes them as easy as possible for the administrators to take care of. If you found the excerpt to be useful, make sure you check out our book Seven NoSQL Databases in a Week to learn more about the different database administration techniques in MongoDB, as well as the other popularly used NoSQL databases such as Redis, HBase, Neo4j, and more. Read more Top 5 programming languages for crunching Big Data effectively Top 5 NoSQL Databases Is Apache Spark today’s Hadoop?


Build an Actuator app for controlling Illumination with Raspberry Pi 3

Gebin George
28 Jun 2018
10 min read
In this article, we will look at how to build an actuator application for controlling illuminations. This article is an excerpt from the book, Mastering Internet of Things, written by Peter Waher. This book will help you design and implement IoT solutions with single board computers. Preparing our project Let's create a new Universal Windows Platform application project. This time, we'll call it Actuator. We can also use the Raspberry Pi 3, even though we will only use the relay in this project. To make the persistence of application states even easier, we'll also include the latest version of the NuGet package Waher.Runtime.Settings in the project. It uses the underlying object database defined by Waher.Persistence to persist application settings. Defining control parameters Actuators come in all sorts, types, and sizes, from the very complex to the very simple. While it would be possible to create a proprietary format that configures the actuator in a bulk operation, such a method is doomed to fail if you aim for any type of interoperable communication. Since the internet is based on interoperability as a core principle, we should consider this from the start, during the design phase. Interoperability means devices can learn to operate together, even if they are from different manufacturers. To achieve this, devices must be able to describe what they can do, in a way that each participant understands. To be able to do this, we need a way to break down (divide and conquer) a complex actuator into parts that are easily described and understood. One way is to see an actuator as a collection of control parameters. Each control parameter is a named parameter with a simple and recognizable data type. (In the same way, we can see a sensor as a collection of sensor data fields.): For our example, we will only need one control parameter: A Boolean control parameter controlling the state of our relay. We'll just call it Output, for simplicity. Understanding relays Relays, simply put, are electric switches that we can control using a small output signal. They're perfect for small controllers, like Raspberry Pi, to switch other circuits with higher voltages on and off. The simplest example is to use a relay to switch a lamp on and off. We can't light the lamp using the voltage available to us in Raspberry Pi, but we can use a relay as a switch to control the lamp. The principal part of a normal relay is a coil. When electricity runs through it, it magnetizes an iron core, which in turn moves a lever from the Normally Closed (NC) connector to the Normally Open (NO) connector. When electricity is cut, a spring returns the lever from the NO connector to the NC connector. This movement of the lever from one connector to the other causes a characteristic clicking sound. This tells you that the relay works. The lever in turn is connected to the Common Ground (COM) connector. The following figure illustrates how a simple relay is constructed. We control the flow of the current through the coil (L1) using our output SIGNAL (D1 in our case). Internally, in the relay, a resistor (R1) is placed before the base pin of the transistor (T1), to adapt the signal voltage to an appropriate level. When we connect or cut the current through the coil, it will induce a reverse current. This may be harmful for the transistor when the current is being cut. 
For that reason, a fly-back diode (D1) is added, allowing excess current to be fed back, avoiding harm to the transistor: Connecting our lamp Now that we know how a relay works, it's relatively easy to connect our lamp to it. Since we want the lamp to be illuminated when we turn the relay on (set D1to HIGH), we will use the NO and COM connectors, and let the NC connector be. If the lamp has a normal two-wire AC cable, we can insert the relay into the AC circuit by simply cutting one of the wires, inserting one end into the NO connector and the other into the COM connector, as is illustrated in the following figure: Be sure to follow appropriate safety regulations when working with electricity. Connecting an LED An alternative to working with the alternating current (AC) is to use a low-power direct current (DC) source and an LED to simulate a lamp. You can connect the COM connector to a resistor and an LED, and then to ground (GND) on one end, and the NO directly to the 5V or 3.3V source on the Raspberry Pi on the other end. The size of the resistor is determined by how much current the LED needs to light up, and the voltage source you choose. If the LED needs 20 mA, and you connect it to a 5V source, Ohms Law tells us we need an R = U/I = 5V/0.02A = 250 Ω resistor. The following figure illustrates this: Controlling the output The relay is connected to our digital output pin 9 on the Arduino board. As such, controlling it is a simple call to the digitalWrite() method on our arduino object. Since we will need to perform this control action from various locations in code in later chapters, we'll create a method for it: internal async Task SetOutput(bool On, string Actor) { if (this.arduino != null) { this.arduino.digitalWrite(9, On ? PinState.HIGH : PinState.LOW); The first parameter simply states the new value of the control parameter. We'll add a second parameter that describes who is making the requested change. This will come in handy later, when we allow online users to change control parameters. Persisting control parameter states If the device reboots for some reason, for instance after a power outage, it's normally desirable that it returns to the state it was in before it shut down. For this, we need to persist the output value. We can use the object database defined in Waher.Persistence and Waher.Persistence.Files for this. But for simple control states, we don't need to create our own data-bearing classes. That has already been done by Waher.Runtime.Settings. To use it, we first include the NuGet, as described earlier. We must also include its assembly when we initialize the runtime inventory, which is used by the object database: Types.Initialize( typeof(FilesProvider).GetTypeInfo().Assembly, typeof(App).GetTypeInfo().Assembly, typeof(RuntimeSettings).GetTypeInfo().Assembly); Depending on the build version selected when creating your UWP application, different versions of .NET Standard will be supported. Build 10586 for instance, only supports .NET Standard up to v1.4. Build 16299, however, supports .NET Standard up to v2.0. The Waher.Runtime.Inventory.Loader library, available as a NuGet package, provides the capability to loop through existing assemblies in a simple manner, but it requires support for .NET Standard 1.5. You can call its TypesLoader.Initialize() method to initialize Waher.Runtime.Inventory with all assemblies available in the runtime. It also dynamically loads all permitted assemblies available in the application folder that have not been loaded. 
Saving the current control state is then simply a matter of calling the Set() or SetAsync() methods on the static RuntimeSettings class, defined in the Waher.Runtime.Settings namespace: await RuntimeSettings.SetAsync("Actuator.Output", On); During the initialization of the device, we then call the Get() or GetAsync() methods to get the last value, if it exists. If it does not exist, a default value we define is returned: bool LastOn = await RuntimeSettings.GetAsync("Actuator.Output", false); this.arduino.digitalWrite(1, LastOn ? PinState.HIGH : PinState.LOW); Logging important control events In distributed IoT control applications, it's vitally important to make sure unauthorized access to the system is avoided. While we will dive deeper into this subject in later chapters, one important tool we can start using it to log everything of a security interest in the event log. We can decide what to do with the event log later, whether we want to analyze or store it locally or distribute it in the network for analysis somewhere else. But unless we start logging events of security interest directly when we develop, we risk forgetting logging certain events later. So, let's log an event every time the output is set: Log.Informational("Setting Control Parameter.", string.Empty, Actor ?? "Windows user", new KeyValuePair<string, object>("Output", On)); If the Actor parameter is null, we assume the control parameter has been set from the Windows GUI. We use this fact, to update the window if the change has been requested from somewhere else: if (Actor != null) await MainPage.Instance.OutputSet(On); Using Raspberry Pi GPIO pins directly The Raspberry Pi can also perform input and output without an Arduino board. But the General-Purpose Input/Output (GPIO) pins available only supports digital input and output. Since the relay module is controlled through a digital output, we can connect it directly to the Raspberry Pi, if we want. That way, we don't need the Arduino board. (We wouldn't be able to test-run the application on the local machine either, though.) Checking whether GPIO is available GPIO pins are accessed through the GpioController class defined in the Windows.Devices.Gpio namespace. First, we must check that GPIO is available on the machine. We do this by getting the default controller, and checking whether it's available: gpio = GpioController.GetDefault(); if (gpio != null) { ... } else Log.Error("Unable to get access to GPIO pin " + gpioOutputPin.ToString()); Initializing the GPIO output pin Once we have access to the controller, we can try to open exclusive access to the GPIO pin we've connected the relay to: if (gpio.TryOpenPin(gpioOutputPin, GpioSharingMode.Exclusive, out this.gpioPin, out GpioOpenStatus Status) && Status == GpioOpenStatus.PinOpened) { ... } else Log.Error("Unable to get access to GPIO pin " + gpioOutputPin.ToString()); Through the GpioPin object gpioPin, we can now control the pin. The first step is to set the operating mode for the pin. This is done by calling the SetDriveMode() method. There are many different modes a pin can be set to, not all necessarily supported by the underlying firmware and hardware. To check that a mode is supported, call the IsDriveModeSupported() method first: if (this.gpioPin.IsDriveModeSupported(GpioPinDriveMode.Output)) { This.gpioPin.SetDriveMode(GpioPinDriveMode.Output); ... 
} else Log.Error("Output mode not supported for GPIO pin " + gpioOutputPin.ToString()); There are various output modes available: Output, OutputOpenDrain, OutputOpenDrainPullUp, OutputOpenSource, and OutputOpenSourcePullDown. The code documentation for each flag describes the particulars of each option. Setting the GPIO pin output To set the actual output value, we call the Write() method on the pin object: bool LastOn = await RuntimeSettings.GetAsync("Actuator.Output", false); this.gpioPin.Write(LastOn ? GpioPinValue.High : GpioPinValue.Low); We need to make a similar change in the SetOutput() method. The Actuator project in the MIOT repository uses the Arduino use case by default. The GPIO code is also available through conditional compiling. It is activated by uncommenting the GPIO switch definition on the first row of the App.xaml.cs file. You can also perform Digital Input using principles similar to the preceding ones, with some differences. First, you select an input drive mode: Input, InputPullUp or InputPullDown. You then use the Read() method to read the current state of the pin. You can also use the ValueChanged event to get a notification whenever the input pin changes value. We saw how to create a simple actuator app for the Raspberry Pi using C#. If you found our post useful, do check out this title Mastering Internet of Things, to build complex projects using motions detectors, controllers, sensors, and Raspberry Pi 3. Should you go with Arduino Uno or Raspberry Pi 3 for your next IoT project? Build your first Raspberry Pi project Meet the Coolest Raspberry Pi Family Member: Raspberry Pi Zero W Wireless


Build and train an RNN chatbot using TensorFlow [Tutorial]

Sunith Shetty
28 Jun 2018
21 min read
Chatbots are increasingly used as a way to provide assistance to users. Many companies, including banks, mobile/landline companies and large e-sellers now use chatbots for customer assistance and for helping users in pre and post sales queries. They are a great tool for companies which don't need to provide additional customer service capacity for trivial questions: they really look like a win-win situation! In today’s tutorial, we will understand how to train an automatic chatbot that will be able to answer simple and generic questions, and how to create an endpoint over HTTP for providing the answers via an API. This article is an excerpt from a book written by Luca Massaron, Alberto Boschetti, Alexey Grigorev, Abhishek Thakur, and Rajalingappaa Shanmugamani titled TensorFlow Deep Learning Projects. There are mainly two types of chatbot: the first is a simple one, which tries to understand the topic, always providing the same answer for all questions about the same topic. For example, on a train website, the questions Where can I find the timetable of the City_A to City_B service? and What's the next train departing from City_A? will likely get the same answer, that could read Hi! The timetable on our network is available on this page: <link>. This types of chatbots use classification algorithms to understand the topic (in the example, both questions are about the timetable topic). Given the topic, they always provide the same answer. Usually, they have a list of N topics and N answers; also, if the probability of the classified topic is low (the question is too vague, or it's on a topic not included in the list), they usually ask the user to be more specific and repeat the question, eventually pointing out other ways to do the question (send an email or call the customer service number, for example). The second type of chatbots is more advanced, smarter, but also more complex. For those, the answers are built using an RNN, in the same way, that machine translation is performed. Those chatbots are able to provide more personalized answers, and they may provide a more specific reply. In fact, they don't just guess the topic, but with an RNN engine, they're able to understand more about the user's questions and provide the best possible answer: in fact, it's very unlikely you'll get the same answers with two different questions using these types of chatbots. The input corpus Unfortunately, we haven't found any consumer-oriented dataset that is open source and freely available on the Internet. Therefore, we will train the chatbot with a more generic dataset, not really focused on customer service. Specifically, we will use the Cornell Movie Dialogs Corpus, from the Cornell University. The corpus contains the collection of conversations extracted from raw movie scripts, therefore the chatbot will be able to answer more to fictional questions than real ones. The Cornell corpus contains more than 200,000 conversational exchanges between 10+ thousands of movie characters, extracted from 617 movies. The dataset is available here: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. We would like to thank the authors for having released the corpus: that makes experimentation, reproducibility and knowledge sharing easier. The dataset comes as a .zip archive file. After decompressing it, you'll find several files in it: README.txt contains the description of the dataset, the format of the corpora files, the details on the collection procedure and the author's contact. 
Chameleons.pdf is the original paper for which the corpus has been released. Although the goal of the paper is strictly not around chatbots, it studies the language used in dialogues, and it's a good source of information to understanding more movie_conversations.txt contains all the dialogues structure. For each conversation, it includes the ID of the two characters involved in the discussion, the ID of the movie and the list of sentences IDs (or utterances, to be more precise) in chronological order. For example, the first line of the file is: u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197'] That means that user u0 had a conversation with user u2 in the movie m0 and the conversation had 4 utterances: 'L194', 'L195', 'L196' and 'L197' movie_lines.txt contains the actual text of each utterance ID and the person who produced it. For example, the utterance L195 is listed here as: L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you. So, the text of the utterance L195 is Well, I thought we'd start with pronunciation, if that's okay with you. And it was pronounced by the character u2 whose name is CAMERON in the movie m0. movie_titles_metadata.txt contains information about the movies, including the title, year, IMDB rating, the number of votes in IMDB and the genres. For example, the movie m0 here is described as: m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance'] So, the title of the movie whose ID is m0 is 10 things i hate about you, it's from 1999, it's a comedy with romance and it received almost 63 thousand votes on IMDB with an average score of 6.9 (over 10.0) movie_characters_metadata.txt contains information about the movie characters, including the name the title of the movie where he/she appears, the gender (if known) and the position in the credits (if known). For example, the character “u2” appears in this file with this description: u2 +++$+++ CAMERON +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ m +++$+++ 3 The character u2 is named CAMERON, it appears in the movie m0 whose title is 10 things i hate about you, his gender is male and he's the third person appearing in the credits. raw_script_urls.txt contains the source URL where the dialogues of each movie can be retrieved. For example, for the movie m0 that's it: m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html As you will have noticed, most files use the token  +++$+++  to separate the fields. Beyond that, the format looks pretty straightforward to parse. Please take particular care while parsing the files: their format is not UTF-8 but ISO-8859-1. Creating the training dataset Let's now create the training set for the chatbot. We'd need all the conversations between the characters in the correct order: fortunately, the corpora contains more than what we actually need. For creating the dataset, we will start by downloading the zip archive, if it's not already on disk. We'll then decompress the archive in a temporary folder (if you're using Windows, that should be C:Temp), and we will read just the movie_lines.txt and the movie_conversations.txt files, the ones we really need to create a dataset of consecutive utterances. Let's now go step by step, creating multiple functions, one for each step, in the file corpora_downloader.py. The first function we need is to retrieve the file from the Internet, if not available on disk. 
def download_and_decompress(url, storage_path, storage_dir): import os.path directory = storage_path + "/" + storage_dir zip_file = directory + ".zip" a_file = directory + "/cornell movie-dialogs corpus/README.txt" if not os.path.isfile(a_file): import urllib.request import zipfile urllib.request.urlretrieve(url, zip_file) with zipfile.ZipFile(zip_file, "r") as zfh: zfh.extractall(directory) return This function does exactly that: it checks whether the “README.txt” file is available locally; if not, it downloads the file (thanks for the urlretrieve function in the urllib.request module) and it decompresses the zip (using the zipfile module). The next step is to read the conversation file and extract the list of utterance IDS. As a reminder, its format is: u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197'], therefore what we're looking for is the fourth element of the list after we split it on the token  +++$+++ . Also, we'd need to clean up the square brackets and the apostrophes to have a clean list of IDs. For doing that, we shall import the re module, and the function will look like this. import re def read_conversations(storage_path, storage_dir): filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_conversations.txt" with open(filename, "r", encoding="ISO-8859-1") as fh: conversations_chunks = [line.split(" +++$+++ ") for line in fh] return [re.sub('[[]']', '', el[3].strip()).split(", ") for el in conversations_chunks] As previously said, remember to read the file with the right encoding, otherwise, you'll get an error. The output of this function is a list of lists, each of them containing the sequence of utterance IDS in a conversation between characters. Next step is to read and parse the movie_lines.txt file, to extract the actual utterances texts. As a reminder, the file looks like this line: L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you. Here, what we're looking for are the first and the last chunks. def read_lines(storage_path, storage_dir): filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_lines.txt" with open(filename, "r", encoding="ISO-8859-1") as fh: lines_chunks = [line.split(" +++$+++ ") for line in fh] return {line[0]: line[-1].strip() for line in lines_chunks} The very last bit is about tokenization and alignment. We'd like to have a set whose observations have two sequential utterances. In this way, we will train the chatbot, given the first utterance, to provide the next one. Hopefully, this will lead to a smart chatbot, able to reply to multiple questions. Here's the function: def get_tokenized_sequencial_sentences(list_of_lines, line_text): for line in list_of_lines: for i in range(len(line) - 1): yield (line_text[line[i]].split(" "), line_text[line[i+1]].split(" ")) Its output is a generator containing a tuple of the two utterances (the one on the right follows temporally the one on the left). Also, utterances are tokenized on the space character. Finally, we can wrap up everything into a function, which downloads the file and unzip it (if not cached), parse the conversations and the lines, and format the dataset as a generator. 
As a default, we will store the files in the /tmp directory: def retrieve_cornell_corpora(storage_path="/tmp", storage_dir="cornell_movie_dialogs_corpus"): download_and_decompress("http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip", storage_path, storage_dir) conversations = read_conversations(storage_path, storage_dir) lines = read_lines(storage_path, storage_dir) return tuple(zip(*list(get_tokenized_sequencial_sentences(conversations, lines)))) At this point, our training set looks very similar to the training set used in the translation project. We can, therefore, use some pieces of code we've developed in the machine learning translation article. For example, the corpora_tools.py file can be used here without any change (also, it requires the data_utils.py). Given that file, we can dig more into the corpora, with a script to check the chatbot input. To inspect the corpora, we can use the corpora_tools.py, and the file we've previously created. Let's retrieve the Cornell Movie Dialog Corpus, format the corpora and print an example and its length: from corpora_tools import * from corpora_downloader import retrieve_cornell_corpora sen_l1, sen_l2 = retrieve_cornell_corpora() print("# Two consecutive sentences in a conversation") print("Q:", sen_l1[0]) print("A:", sen_l2[0]) print("# Corpora length (i.e. number of sentences)") print(len(sen_l1)) assert len(sen_l1) == len(sen_l2) This code prints an example of two tokenized consecutive utterances, and the number of examples in the dataset, that is more than 220,000: # Two consecutive sentences in a conversation Q: ['Can', 'we', 'make', 'this', 'quick?', '', 'Roxanne', 'Korrine', 'and', 'Andrew', 'Barrett', 'are', 'having', 'an', 'incredibly', 'horrendous', 'public', 'break-', 'up', 'on', 'the', 'quad.', '', 'Again.'] A: ['Well,', 'I', 'thought', "we'd", 'start', 'with', 'pronunciation,', 'if', "that's", 'okay', 'with', 'you.'] # Corpora length (i.e. number of sentences) 221616 Let's now clean the punctuation in the sentences, lowercase them and limits their size to 20 words maximum (that is examples where at least one of the sentences is longer than 20 words are discarded). This is needed to standardize the tokens: clean_sen_l1 = [clean_sentence(s) for s in sen_l1] clean_sen_l2 = [clean_sentence(s) for s in sen_l2] filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2) print("# Filtered Corpora length (i.e. number of sentences)") print(len(filt_clean_sen_l1)) assert len(filt_clean_sen_l1) == len(filt_clean_sen_l2) This leads us to almost 140,000 examples: # Filtered Corpora length (i.e. number of sentences) 140261 Then, let's create the dictionaries for the two sets of sentences. Practically, they should look the same (since the same sentence appears once on the left side, and once in the right side) except there might be some changes introduced by the first and last sentences of a conversation (they appear only once). 
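Note that clean_sentence and filter_sentence_length live in corpora_tools.py, reused from the translation project, and are not reproduced in this article. As a rough mental model only (the real implementation may differ in its regular expression and edge cases; the _sketch suffix marks these as stand-ins, not project code), they behave roughly like this:

import re

# Rough sketches, not the actual corpora_tools.py code.
def clean_sentence_sketch(tokens):
    # Lowercase the tokens and keep words and basic punctuation as separate tokens.
    return re.findall(r"[a-z0-9]+|[.,!?']", " ".join(tokens).lower())

def filter_sentence_length_sketch(sentences_l1, sentences_l2, max_len=20):
    # Keep only the pairs where both sentences fit within max_len tokens.
    kept_l1, kept_l2 = [], []
    for s1, s2 in zip(sentences_l1, sentences_l2):
        if 0 < len(s1) <= max_len and 0 < len(s2) <= max_len:
            kept_l1.append(s1)
            kept_l2.append(s2)
    return kept_l1, kept_l2

The key point is that the length filter is applied to the pair, not to each list independently, which is how the question and answer lists stay aligned.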
To make the best out of our corpora, let's build two dictionaries of words and then encode all the words in the corpora with their dictionary indexes: dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=15000, storage_path="/tmp/l1_dict.p") dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=15000, storage_path="/tmp/l2_dict.p") idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1) idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2) print("# Same sentences as before, with their dictionary ID") print("Q:", list(zip(filt_clean_sen_l1[0], idx_sentences_l1[0]))) print("A:", list(zip(filt_clean_sen_l2[0], idx_sentences_l2[0]))) That prints the following output. We also notice that a dictionary of 15 thousand entries doesn't contain all the words and more than 16 thousand (less popular) of them don't fit into it: [sentences_to_indexes] Did not find 16823 words [sentences_to_indexes] Did not find 16649 words # Same sentences as before, with their dictionary ID Q: [('well', 68), (',', 8), ('i', 9), ('thought', 141), ('we', 23), ("'", 5), ('d', 83), ('start', 370), ('with', 46), ('pronunciation', 3), (',', 8), ('if', 78), ('that', 18), ("'", 5), ('s', 12), ('okay', 92), ('with', 46), ('you', 7), ('.', 4)] A: [('not', 31), ('the', 10), ('hacking', 7309), ('and', 23), ('gagging', 8761), ('and', 23), ('spitting', 6354), ('part', 437), ('.', 4), ('please', 145), ('.', 4)] As the final step, let's add paddings and markings to the sentences: data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2) print("# Prepared minibatch with paddings and extra stuff") print("Q:", data_set[0][0]) print("A:", data_set[0][1]) print("# The sentence pass from X to Y tokens") print("Q:", len(idx_sentences_l1[0]), "->", len(data_set[0][0])) print("A:", len(idx_sentences_l2[0]), "->", len(data_set[0][1])) And that, as expected, prints: # Prepared minibatch with paddings and extra stuff Q: [0, 68, 8, 9, 141, 23, 5, 83, 370, 46, 3, 8, 78, 18, 5, 12, 92, 46, 7, 4] A: [1, 31, 10, 7309, 23, 8761, 23, 6354, 437, 4, 145, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0] # The sentence pass from X to Y tokens Q: 19 -> 20 A: 11 -> 22 Training the chatbot After we're done with the corpora, it's now time to work on the model. This project requires again a sequence to sequence model, therefore we can use an RNN. Even more, we can reuse part of the code from the previous project: we'd just need to change how the dataset is built, and the parameters of the model. We can then copy the training script, and modify the build_dataset function, to use the Cornell dataset. Mind that the dataset used in this article is bigger than the one used in the machine learning translation article, therefore you may need to limit the corpora to a few dozen thousand lines. On a 4 years old laptop with 8GB RAM, we had to select only the first 30 thousand lines, otherwise, the program ran out of memory and kept swapping. As a side effect of having fewer examples, even the dictionaries are smaller, resulting in less than 10 thousands words each. 
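The numbers in the prepared pair follow a convention worth spelling out: judging from the output, questions are left-padded with 0 up to the maximum question length, while answers are wrapped between a start marker (1) and an end marker (2) and then right-padded with 0. Assuming those special IDs (check data_utils.py for the exact constants), a toy sketch of the scheme looks like this; pad_question and pad_answer are illustrative names only:

# Toy sketch of the padding scheme visible in the printout above.
# Assumed special IDs; they match the numbers shown, but verify against data_utils.py.
PAD, GO, EOS = 0, 1, 2

def pad_question(ids, max_len):
    # Questions are left-padded with PAD up to max_len.
    return [PAD] * (max_len - len(ids)) + ids

def pad_answer(ids, max_len):
    # Answers get a GO marker, an EOS marker, then right padding up to max_len + 2.
    out = [GO] + ids + [EOS]
    return out + [PAD] * (max_len + 2 - len(out))

print(pad_question([68, 8, 9], 5))   # [0, 0, 68, 8, 9]
print(pad_answer([31, 10], 5))       # [1, 31, 10, 2, 0, 0, 0]

With the dataset in this shape, all that is left for training is a build_dataset function adapted to the new corpus, which is exactly what follows.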
def build_dataset(use_stored_dictionary=False): sen_l1, sen_l2 = retrieve_cornell_corpora() clean_sen_l1 = [clean_sentence(s) for s in sen_l1][:30000] ### OTHERWISE IT DOES NOT RUN ON MY LAPTOP clean_sen_l2 = [clean_sentence(s) for s in sen_l2][:30000] ### OTHERWISE IT DOES NOT RUN ON MY LAPTOP filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2, max_len=10) if not use_stored_dictionary: dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=10000, storage_path=path_l1_dict) dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=10000, storage_path=path_l2_dict) else: dict_l1 = pickle.load(open(path_l1_dict, "rb")) dict_l2 = pickle.load(open(path_l2_dict, "rb")) dict_l1_length = len(dict_l1) dict_l2_length = len(dict_l2) idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1) idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2) max_length_l1 = extract_max_length(idx_sentences_l1) max_length_l2 = extract_max_length(idx_sentences_l2) data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2) return (filt_clean_sen_l1, filt_clean_sen_l2), data_set, (max_length_l1, max_length_l2), (dict_l1_length, dict_l2_length) By inserting this function into the train_translator.py file and rename the file as train_chatbot.py, we can run the training of the chatbot. After a few iterations, you can stop the program and you'll see something similar to this output: [sentences_to_indexes] Did not find 0 words [sentences_to_indexes] Did not find 0 words global step 100 learning rate 1.0 step-time 7.708967611789704 perplexity 444.90090078460474 eval: perplexity 57.442316329639176 global step 200 learning rate 0.990234375 step-time 7.700247814655302 perplexity 48.8545568311572 eval: perplexity 42.190180314697045 global step 300 learning rate 0.98046875 step-time 7.69800933599472 perplexity 41.620538109894945 eval: perplexity 31.291903031786116 ... ... ... global step 2400 learning rate 0.79833984375 step-time 7.686293318271639 perplexity 3.7086356605442767 eval: perplexity 2.8348589631663046 global step 2500 learning rate 0.79052734375 step-time 7.689657487869262 perplexity 3.211876894960698 eval: perplexity 2.973809378544393 global step 2600 learning rate 0.78271484375 step-time 7.690396382808681 perplexity 2.878854805600354 eval: perplexity 2.563583924617356 Again, if you change the settings, you may end up with a different perplexity. To obtain these results, we set the RNN size to 256 and 2 layers, the batch size of 128 samples, and the learning rate to 1.0. At this point, the chatbot is ready to be tested. Although you can test the chatbot with the same code as in the test_translator.py, here we would like to do a more elaborate solution, which allows exposing the chatbot as a service with APIs. Chatbox API First of all, we need a web framework to expose the API. In this project, we've chosen Bottle, a lightweight simple framework very easy to use. To install the package, run pip install bottle from the command line. To gather further information and dig into the code, take a look at the project webpage, https://bottlepy.org. Let's now create a function to parse an arbitrary sentence provided by the user as an argument. All the following code should live in the test_chatbot_aas.py file. 
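If Bottle is new to you, a tiny smoke test, separate from test_chatbot_aas.py and unrelated to the model, shows the routing pattern that the service relies on. The /ping route, the name parameter and port 8081 are arbitrary choices made up for this example:

from bottle import route, run, request

@route('/ping')
def ping():
    # GET /ping?name=abc returns {"pong": "abc"}.
    return {"pong": request.query.name}

run(host='127.0.0.1', port=8081)

Bottle serializes a returned dict to JSON automatically, which is exactly how the chatbot endpoint below returns its reply.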
Let's start with some imports and the function to clean, tokenize and prepare the sentence using the dictionary: import pickle import sys import numpy as np import tensorflow as tf import data_utils from corpora_tools import clean_sentence, sentences_to_indexes, prepare_sentences from train_chatbot import get_seq2seq_model, path_l1_dict, path_l2_dict model_dir = "/home/abc/chat/chatbot_model" def prepare_sentence(sentence, dict_l1, max_length): sents = [sentence.split(" ")] clean_sen_l1 = [clean_sentence(s) for s in sents] idx_sentences_l1 = sentences_to_indexes(clean_sen_l1, dict_l1) data_set = prepare_sentences(idx_sentences_l1, [[]], max_length, max_length) sentences = (clean_sen_l1, [[]]) return sentences, data_set The function prepare_sentence does the following: Tokenizes the input sentence Cleans it (lowercase and punctuation cleanup) Converts tokens to dictionary IDs Add markers and paddings to reach the default length Next, we will need a function to convert the predicted sequence of numbers to an actual sentence composed of words. This is done by the function decode, which runs the prediction given the input sentence and with softmax predicts the most likely output. Finally, it returns the sentence without paddings and markers: def decode(data_set): with tf.Session() as sess: model = get_seq2seq_model(sess, True, dict_lengths, max_sentence_lengths, model_dir) model.batch_size = 1 bucket = 0 encoder_inputs, decoder_inputs, target_weights = model.get_batch( {bucket: [(data_set[0][0], [])]}, bucket) _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket, True) outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits] if data_utils.EOS_ID in outputs: outputs = outputs[1:outputs.index(data_utils.EOS_ID)] tf.reset_default_graph() return " ".join([tf.compat.as_str(inv_dict_l2[output]) for output in outputs]) Finally, the main function, that is, the function to run in the script: if __name__ == "__main__": dict_l1 = pickle.load(open(path_l1_dict, "rb")) dict_l1_length = len(dict_l1) dict_l2 = pickle.load(open(path_l2_dict, "rb")) dict_l2_length = len(dict_l2) inv_dict_l2 = {v: k for k, v in dict_l2.items()} max_lengths = 10 dict_lengths = (dict_l1_length, dict_l2_length) max_sentence_lengths = (max_lengths, max_lengths) from bottle import route, run, request @route('/api') def api(): in_sentence = request.query.sentence _, data_set = prepare_sentence(in_sentence, dict_l1, max_lengths) resp = [{"in": in_sentence, "out": decode(data_set)}] return dict(data=resp) run(host='127.0.0.1', port=8080, reloader=True, debug=True) Initially, it loads the dictionary and prepares the inverse dictionary. Then, it uses the Bottle API to create an HTTP GET endpoint (under the /api URL). The route decorator sets and enriches the function to run when the endpoint is contacted via HTTP GET. In this case, the api() function is run, which first reads the sentence passed as HTTP parameter, then calls the prepare_sentence function, described above, and finally runs the decoding step. What's returned is a dictionary containing both the input sentence provided by the user and the reply of the chatbot. Finally, the webserver is turned on, on the localhost at port 8080. Isn't very easy to have a chatbot as a service with Bottle? It's now time to run it and check the outputs. 
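As an aside, you can also sanity-check the two helpers without the web server at all. One minimal way, shown purely as a sketch, is to temporarily add the following two lines inside the __main__ block, just before the run(...) call, so that the dictionaries and length settings are already loaded; the test sentence is arbitrary:

# Temporary offline check: run one sentence through the helpers, bypassing Bottle.
_, data_set = prepare_sentence("how are you ?", dict_l1, max_lengths)
print(decode(data_set))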
To run it, run the following from the command line:

$> python3 -u test_chatbot_aas.py

Then, let's start querying the chatbot with some generic questions. To do so we can use curl, a simple command-line tool; any browser is fine too, just remember that the URL should be encoded, for example, the space character should be replaced with its encoding, that is, %20. curl makes things easier, as it has a simple way to encode the URL request. Here are a couple of examples:

$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=where are you?"
{"data": [{"out": "i ' m here with you .", "in": "where are you?"}]}
$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you here?"
{"data": [{"out": "yes .", "in": "are you here?"}]}
$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you a chatbot?"
{"data": [{"out": "you ' for the stuff to be right .", "in": "are you a chatbot?"}]}
$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=what is your name ?"
{"data": [{"out": "we don ' t know .", "in": "what is your name ?"}]}
$> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=how are you?"
{"data": [{"out": "that ' s okay .", "in": "how are you?"}]}

If you are not using --data-urlencode (for example, when querying from a browser), encode the URL yourself:

$> curl -X GET http://127.0.0.1:8080/api?sentence=how%20are%20you?
{"data": [{"out": "that ' s okay .", "in": "how are you?"}]}

Replies are quite funny; always remember that we trained the chatbot on movies, therefore the replies follow that style. To turn off the webserver, use Ctrl + C. To summarize, we've learned to implement a chatbot, which is able to respond to questions through an HTTP endpoint and a GET API. To know more about how to design deep learning systems for a variety of real-world scenarios using TensorFlow, do check out the book TensorFlow Deep Learning Projects. Facebook's Wit.ai: Why we need yet another chatbot development framework? How to build a chatbot with Microsoft Bot framework Top 4 chatbot development frameworks for developers

How to implement immutability functions in Kotlin [Tutorial]

Aaron Lazar
27 Jun 2018
8 min read
Unlike Clojure, Haskell, F#, and the likes, Kotlin is not a pure functional programming language, where immutability is forced; rather, we may refer to Kotlin as a perfect blend of functional programming and OOP languages. It contains the major benefits of both worlds. So, instead of forcing immutability like pure functional programming languages, Kotlin encourages immutability, giving it automatic preference wherever possible. In this article, we'll understand the various methods of implementing immutability in Kotlin. This article has been taken from the book, Functional Kotlin, by Mario Arias and Rivu Chakraborty. In other words, Kotlin has immutable variables (val), but no language mechanisms that would guarantee true deep immutability of the state. If a val variable references a mutable object, its contents can still be modified. We will have a more elaborate discussion and a deeper dive on this topic, but first let us have a look at how we can get referential immutability in Kotlin and the differences between var, val, and const val. By true deep immutability of the state, we mean a property will always return the same value whenever it is called and that the property never changes its value; we can easily avoid this if we have a val  property that has a custom getter. You can find more details at the following link: https://artemzin.com/blog/kotlin-val-does-not-mean-immutable-it-just-means-readonly-yeah/ The difference between var and val So, in order to encourage immutability but still let the developers have the choice, Kotlin introduced two types of variables. The first one is var, which is just a simple variable, just like in any imperative language. On the other hand, val brings us a bit closer to immutability; again, it doesn't guarantee immutability. So, what exactly does the val variable provide us? It enforces read-only, you cannot write into a val variable after initialization. So, if you use a val variable without a custom getter, you can achieve referential immutability. Let's have a look; the following program will not compile: fun main(args: Array<String>) { val x:String = "Kotlin" x+="Immutable"//(1) } As I mentioned earlier, the preceding program will not compile; it will give an error on comment (1). As we've declared variable x as val, x will be read-only and once we initialize x; we cannot modify it afterward. So, now you're probably asking why we cannot guarantee immutability with val ? Let's inspect this with the following example: object MutableVal { var count = 0 val myString:String = "Mutable" get() {//(1) return "$field ${++count}"//(2) } } fun main(args: Array<String>) { println("Calling 1st time ${MutableVal.myString}") println("Calling 2nd time ${MutableVal.myString}") println("Calling 3rd time ${MutableVal.myString}")//(3) } In this program, we declared myString as a val property, but implemented a custom get function, where we tweaked the value of myString before returning it. Have a look at the output first, then we will further look into the program: As you can see, the myString property, despite being val, returned different values every time we accessed it. So, now, let us look into the code to understand such behavior. On comment (1), we declared a custom getter for the val property myString. On comment (2), we pre-incremented the value of count and added it after the value of the field value, myString, and returned the same from the getter. 
So, whenever we requested the myString property, count got incremented and, on the next request, we got a different value. As a result, we broke the immutable behavior of a val property. Compile time constants So, how can we overcome this? How can we enforce immutability? The const val properties are here to help us. Just modify val myString with const val myString and you cannot implement the custom getter. While val properties are read-only variables, const val, on the other hand, are compile time constants. You cannot assign the outcome (result) of a function to const val. Let's discuss some of the differences between val and const val: The val properties are read-only variables, while const val are compile time constants The val properties can have custom getters, but const val cannot We can have val properties anywhere in our Kotlin code, inside functions, as a class member, anywhere, but const val has to be a top-level member of a class/object You cannot write delegates for the const val properties We can have the val property of any type, be it our custom class or any primitive data type, but only primitive data types and String are allowed with a const val property We cannot have nullable data types with the const val properties; as a result, we cannot have null values for the const val properties either As a result, the const val properties guarantee immutability of value but have lesser flexibility and you are bound to use only primitive data types with const val, which cannot always serve our purposes. Now, that I've used the word referential immutability quite a few times, let us now inspect what it means and how many types of immutability there are. Types of immutability There are basically the following two types of immutability: Referential immutability Immutable values Immutable reference  (referential immutability) Referential immutability enforces that, once a reference is assigned, it can't be assigned to something else. Think of having it as a val property of a custom class, or even MutableList or MutableMap; after you initialize the property, you cannot reference something else from that property, except the underlying value from the object. For example, take the following program: class MutableObj { var value = "" override fun toString(): String { return "MutableObj(value='$value')" } } fun main(args: Array<String>) { val mutableObj:MutableObj = MutableObj()//(1) println("MutableObj $mutableObj") mutableObj.value = "Changed"//(2) println("MutableObj $mutableObj") val list = mutableListOf("a","b","c","d","e")//(3) println(list) list.add("f")//(4) println(list) } Have a look at the output before we proceed with explaining the program: So, in this program we've two val properties—list and mutableObj. We initialized mutableObj with the default constructor of MutableObj, since it's a val property it'll always refer to that specific object; but, if you concentrate on comment (2), we changed the value property of mutableObj, as the value property of the MutableObj class is mutable (var). It's the same with the list property, we can add items to the list after initialization, changing its underlying value. Both list and mutableObj are perfect examples of immutable reference; once initialized, the properties can't be assigned to something else, but their underlying values can be changed (you can refer the output). The reason behind that is the data type we used to assign to those properties. 
Both the MutableObj class and the MutableList<String> data structures are mutable themselves, so we cannot restrict value changes for their instances. Immutable values The immutable values, on the other hand, enforce no change on values as well; it is really complex to maintain. In Kotlin, the const val properties enforce immutability of value, but they lack flexibility (we already discussed them) and you're bound to use only primitive types, which can be troublesome in real-life scenarios. Immutable collections Kotlin gives preference to immutability wherever possible, but leaves the choice to the developer whether or when to use it. This power of choice makes the language even more powerful. Unlike most languages, where they have either only mutable (like Java, C#, and so on) or only immutable collections (like F#, Haskell, Clojure, and so on), Kotlin has both and distinguishes between them, leaving the developer with the freedom to choose whether to use an immutable or mutable one. Kotlin has two interfaces for collection objects—Collection<out E> and MutableCollection<out E>; all the collection classes (for example, List, Set, or Map) implement either of them. As the name suggests, the two interfaces are designed to serve immutable and mutable collections respectively. Let us have an example: fun main(args: Array<String>) { val immutableList = listOf(1,2,3,4,5,6,7)//(1) println("Immutable List $immutableList") val mutableList:MutableList<Int> = immutableList.toMutableList()//(2) println("Mutable List $mutableList") mutableList.add(8)//(3) println("Mutable List after add $mutableList") println("Mutable List after add $immutableList") } The output is as follows: So, in this program, we created an immutable list with the help of the listOf method of Kotlin, on comment (1). The listOf method creates an immutable list with the elements (varargs) passed to it. This method also has a generic type parameter, which can be skipped if the elements array is not empty. The listOf method also has a mutable version—mutableListOf() which is identical except that it returns MutableList instead. We can convert an immutable list to a mutable one with the help of the toMutableList() extension function, we did the same in comment (2), to add an element to it on comment (3). However, if you check the output, the original Immutable List remains the same without any changes, the item is, however, added to the newly created MutableList instead. So now you know how to implement immutability in Kotlin. If you found this tutorial helpful, and would like to learn more, head on over to purchase the full book, Functional Kotlin, by Mario Arias and Rivu Chakraborty. Extension functions in Kotlin: everything you need to know Building RESTful web services with Kotlin Building chat application with Kotlin using Node.js, the powerful Server-side JavaScript platform

Build an IoT application with Google Cloud [Tutorial]

Gebin George
27 Jun 2018
19 min read
In this tutorial, we will build a sample internet of things application using Google Cloud IoT. We will start off by implementing the end-to-end solution, where we take the data from the DHT11 sensor and post it to the Google IoT Core state topic. This article is an excerpt from the book, Enterprise Internet of Things Handbook, written by Arvind Ravulavaru. End-to-end communication To get started with Google IoT Core, we need to have a Google account. If you do not have a Google account, you can create one by navigating to this URL: https://accounts.google.com/SignUp?hl=en. Once you have created your account, you can login and navigate to Google Cloud Console: https://console.cloud.google.com. Setting up a project The first thing we are going to do is create a project. If you have already worked with Google Cloud Platform and have at least one project, you will be taken to the first project in the list or you will be taken to the Getting started page. As of the time of writing this book, Google Cloud Platform has a free trial for 12 months with $300 if the offer is still available when you are reading this chapter, I would highly recommend signing up: Once you have signed up, let's get started by creating a new project. From the top menu bar, select the Select a Project dropdown and click on the plus icon to create a new project. You can fill in the details as illustrated in the following screenshot: Click on the Create button. Once the project is created, navigate to the Project and you should land on the Home page. Enabling APIs Following are the steps to be followed for enabling APIs: From the menu on the left-hand side, select APIs & Services | Library as shown in the following screenshot: On the following screen, search for pubsub and select the Pub/Sub API from the results and we should land on a page similar to the following: Click on the ENABLE button and we should now be able to use these APIs in our project. Next, we need to enable the real-time API; search for realtime and we should find something similar to the following: Click on the ENABLE & button. Enabling device registry and devices The following steps should be used for enabling device registry and devices: From the left-hand side menu, select IoT Core and we should land on the IoT Core home page: Instead of the previous screen, if you see a screen to enable APIs, please enable the required APIs from here. Click on the & Create device registry button. On the Create device registry screen, fill the details as shown in the following table: Field Value Registry ID Pi3-DHT11-Nodes Cloud region us-central1 Protocol MQTT HTTP Default telemetry topic device-events Default state topic dht11 After completing all the details, our form should look like the following: We will add the required certificates later on. Click on the Create button and a new device registry will be created. From the Pi3-DHT11-Nodes registry page, click on the Add device button and set the Device ID as Pi3-DHT11-Node or any other suitable name. Leave everything as the defaults and make sure the Device communication is set to Allowed and create a new device. On the device page, we should see a warning as highlighted in the following screenshot: Now, we are going to add a new public key. To generate a public/private key pair, we need to have OpenSSL command line available. You can download and set up OpenSSL from here: https://www.openssl.org/source/. 
Use the following command to generate a certificate pair at the default location on your machine: openssl req -x509 -newkey rsa:2048 -keyout rsa_private.pem -nodes -out rsa_cert.pem -subj "/CN=unused" If everything goes well, you should see an output as shown here: Do not share these certificates anywhere; anyone with these certificates can connect to Google IoT Core as a device and start publishing data. Now, once the certificates are created, we will attach them to the device we have created in IoT Core. Head back to the device page of the Google IoT Core service and under Authentication click on Add public key. On the following screen, fill it in as illustrated: The public key value is the contents of rsa_cert.pem that we generated earlier. Click on the ADD button. Now that the public key has been successfully added, we can connect to the cloud using the private key. Setting up Raspberry Pi 3 with DHT11 node Now that we have our device set up in Google IoT Core, we are going to complete the remaining operation on Raspberry Pi 3 to send data. Pre-requisites The requirements for setting up Raspberry Pi 3 on a DHT11 node are: One Raspberry Pi 3: https://www.amazon.com/Raspberry-Pi-Desktop-Starter-White/dp/B01CI58722 One breadboard: https://www.amazon.com/Solderless-Breadboard-Circuit-Circboard-Prototyping/dp/B01DDI54II/ One DHT11 sensor: https://www.amazon.com/HiLetgo-Temperature-Humidity-Arduino-Raspberry/dp/B01DKC2GQ0 Three male-to-female jumper cables: https://www.amazon.com/RGBZONE-120pcs-Multicolored-Dupont-Breadboard/dp/B01M1IEUAF/ If you are new to the world of Raspberry Pi GPIO's interfacing, take a look at this Raspberry Pi GPIO Tutorial: The Basics Explained on YouTube: https://www.youtube.com/watch?v=6PuK9fh3aL8. The following steps are to be used for the setup process: Connect the DHT11 sensor to Raspberry Pi 3 as shown in the following diagram: Next, power up Raspberry Pi 3 and log in to it. On the desktop, create a new folder named Google-IoT-Device. Open a new Terminal and cd into this folder. Setting up Node.js Refer to the following steps to install Node.js: Open a new Terminal and run the following commands: $ sudo apt update $ sudo apt full-upgrade This will upgrade all the packages that need upgrades. Next, we will install the latest version of Node.js. We will be using the Node 7.x version: $ curl -sL https://deb.nodesource.com/setup_7.x | sudo -E bash - $ sudo apt install nodejs This will take a moment to install, and once your installation is done, you should be able to run the following commands to see the version of Node.js and npm: $ node -v $ npm -v Developing the Node.js device app Now, we will set up the app and write the required code: From the Terminal, once you are inside the Google-IoT-Device folder, run the following command: $ npm init -y Next, we will install jsonwebtoken (https://www.npmjs.com/package/jsonwebtoken) and mqtt (https://www.npmjs.com/package/mqtt) from npm. Execute the following command: $ npm install jsonwebtoken mqtt--save Next, we will install rpi-dht-sensor (https://www.npmjs.com/package/rpi-dht-sensor) from npm. 
This module will help in reading the DHT11 temperature and humidity values: $ npm install rpi-dht-sensor --save Your final package.json file should look similar to the following code snippet: { "name": "Google-IoT-Device", "version": "1.0.0", "description": "", "main": "index.js", "scripts": { "test": "echo "Error: no test specified" && exit 1" }, "keywords": [], "author": "", "license": "ISC", "dependencies": { "jsonwebtoken": "^8.1.1", "mqtt": "^2.15.3", "rpi-dht-sensor": "^0.1.1" } } Now that we have the required dependencies installed, let's continue. Create a new file named index.js at the root of the Google-IoT-Device folder. Next, create a folder named certs at the root of the Google-IoT-Device folder and move the two certificates we created using OpenSSL there. Your final folder structure should look something like this: Open index.js in any text editor and update it as shown here: var fs = require('fs'); var jwt = require('jsonwebtoken'); var mqtt = require('mqtt'); var rpiDhtSensor = require('rpi-dht-sensor'); var dht = new rpiDhtSensor.DHT11(2); // `2` => GPIO2 var projectId = 'pi-iot-project'; var cloudRegion = 'us-central1'; var registryId = 'Pi3-DHT11-Nodes'; var deviceId = 'Pi3-DHT11-Node'; var mqttHost = 'mqtt.googleapis.com'; var mqttPort = 8883; var privateKeyFile = '../certs/rsa_private.pem'; var algorithm = 'RS256'; var messageType = 'state'; // or event var mqttClientId = 'projects/' + projectId + '/locations/' + cloudRegion + '/registries/' + registryId + '/devices/' + deviceId; var mqttTopic = '/devices/' + deviceId + '/' + messageType; var connectionArgs = { host: mqttHost, port: mqttPort, clientId: mqttClientId, username: 'unused', password: createJwt(projectId, privateKeyFile, algorithm), protocol: 'mqtts', secureProtocol: 'TLSv1_2_method' }; console.log('connecting...'); var client = mqtt.connect(connectionArgs); // Subscribe to the /devices/{device-id}/config topic to receive config updates. client.subscribe('/devices/' + deviceId + '/config'); client.on('connect', function(success) { if (success) { console.log('Client connected...'); sendData(); } else { console.log('Client not connected...'); } }); client.on('close', function() { console.log('close'); }); client.on('error', function(err) { console.log('error', err); }); client.on('message', function(topic, message, packet) { console.log(topic, 'message received: ', Buffer.from(message, 'base64').toString('ascii')); }); function createJwt(projectId, privateKeyFile, algorithm) { var token = { 'iat': parseInt(Date.now() / 1000), 'exp': parseInt(Date.now() / 1000) + 86400 * 60, // 1 day 'aud': projectId }; var privateKey = fs.readFileSync(privateKeyFile); return jwt.sign(token, privateKey, { algorithm: algorithm }); } function fetchData() { var readout = dht.read(); var temp = readout.temperature.toFixed(2); var humd = readout.humidity.toFixed(2); return { 'temp': temp, 'humd': humd, 'time': new Date().toISOString().slice(0, 19).replace('T', ' ') // https://stackoverflow.com/a/11150727/1015046 }; } function sendData() { var payload = fetchData(); payload = JSON.stringify(payload); console.log(mqttTopic, ': Publishing message:', payload); client.publish(mqttTopic, payload, { qos: 1 }); console.log('Transmitting in 30 seconds'); setTimeout(sendData, 30000); } In the previous code, we first define the projectId, cloudRegion, registryId, and deviceId based on what we have created. Next, we build the connectionArgs object, using which we are going to connect to Google IoT Core using MQTT-SN. 
Do note that the password property is a JSON Web Token (JWT), based on the projectId and privateKeyFile algorithm. The token that is created by this function is valid only for one day. After one day, the cloud will refuse connection to this device if the same token is used. The username value is the Common Name (CN) of the certificate we have created, which is unused. Using mqtt.connect(), we are going to connect to the Google IoT Core. And we are subscribing to the device config topic, which can be used to send device configurations when connected. Once the connection is successful, we callsendData() every 30 seconds to send data to the state topic. Save the previous file and run the following command: $ sudo node index.js And we should see something like this: As you can see from the previous Terminal logs, the device first gets connected then starts transmitting the temperature and humidity along with time. We are sending time as well, so we can save it in the BigQuery table and then build a time series chart quite easily. Now, if we head back to the Device page of Google IoT Core and navigate to the Configuration & state history tab, we should see the data that we are sending to the state topic here: Now that the device is sending data, let's actually read the data from another client. Reading the data from the device For this, you can either use the same Raspberry Pi 3 or another computer. I am going to use MacBook as a client that is interested in the data sent by the Thing. Setting up credentials Before we start reading data from Google IoT Core, we have to set up our computer (for example, MacBook) as a trusted device, so our computer can request data. Let's perform the following steps to set the credentials: To do this, we need to create a new Service account key. From the left-hand-side menu of the Google Cloud Console, select APIs & Services | Credentials. Then click on the Create credentials dropdown and select Service account key as shown in the following screenshot: Now, fill in the details as shown in the following screenshot: We have given access to the entire project for this client and as an Owner. Do not select these settings if this is a production application. Click on Create and you will be asked to download and save the file. Do not share this file; this file is as good as giving someone owner-level permissions to all assets of this project. Once the file is downloaded somewhere safe, create an environment variable with the name GOOGLE_APPLICATION_CREDENTIALS and point it to the path of the downloaded file. You can refer to Getting Started with Authentication at https://cloud.google.com/docs/authentication/getting-started if you are facing any difficulties. Setting up subscriptions The data from the device is being sent to Google IoT Core using the state topic. If you recall, we have named that topic dht11. Now, we are going to create a subscription for this topic: From the menu on the left side, select Pub/Sub | Topics. Now, click on New subscription for the dht11 topic, as shown in the following screenshot: Create a new subscription by setting up the options selected in this screenshot: We are going to use the subscription named dht11-data to get the data from the state topic. Setting up the client Now that we have provided the required credentials as well as subscribed to a Pub/Sub topic, we will set up the Pub/Sub client. Follow these steps: Create a folder named test_client inside the test_client directory. 
Now, run the following command: $ npm init -y Next, install the @google-cloud/pubsub (https://www.npmjs.com/package/@google-cloud/pubsub) module with the help of the following command: $ npm install @google-cloud/pubsub --save Create a file inside the test_client folder named index.js and update it as shown in this code snippet: var PubSub = require('@google-cloud/pubsub'); var projectId = 'pi-iot-project'; var stateSubscriber = 'dht11-data' // Instantiates a client var pubsub = new PubSub({ projectId: projectId, }); var subscription = pubsub.subscription('projects/' + projectId + '/subscriptions/' + stateSubscriber); var messageHandler = function(message) { console.log('Message Begin >>>>>>>>'); console.log('message.connectionId', message.connectionId); console.log('message.attributes', message.attributes); console.log('message.data', Buffer.from(message.data, 'base64').toString('ascii')); console.log('Message End >>>>>>>>>>'); // "Ack" (acknowledge receipt of) the message message.ack(); }; // Listen for new messages subscription.on('message', messageHandler); Update the projectId and stateSubscriber in the previous code. Now, save the file and run the following command: $ node index.js We should see the following output in the console: This way, any client that is interested in the data of this device can use this approach to get the latest data. With this, we conclude the section on posting data to Google IoT Core and fetching the data. In the next section, we are going to work on building a dashboard. Building a dashboard Now that we have seen how a client can read the data from our device on demand, we will move on to building a dashboard, where we display data in real time. For this, we are going to use Google Cloud Functions, Google BigQuery, and Google Data Studio. Google Cloud Functions Cloud Functions are solution for serverless services. Cloud Functions is a lightweight solution for creating standalone and single-purpose functions that respond to cloud events. You can read more about Google Cloud Functions at https://cloud.google.com/functions/. Google BigQuery Google BigQuery is an enterprise data warehouse that solves this problem by enabling super-fast SQL queries using the processing power of Google's infrastructure. You can read more about Google BigQuery at https://cloud.google.com/bigquery/. Google Data Studio Google Data Studio helps to build dashboards and reports using various data connectors, such as BigQuery or Google Analytics. You can read more about Google Data Studio at https://cloud.google.com/data-studio/. As of April 2018, these three services are still in beta. As we have already seen in the Architecture section, once the data is published on the state topic, we are going to create a cloud function that will get triggered by the data event on the Pub/Sub client. And inside our cloud function, we are going to get a copy of the published data and then insert it into the BigQuery dataset. Once the data is inserted, we are going to use Google Data Studio to create a new report by linking the BigQuery dataset to the input. So, let's get started. Setting up BigQuery The first thing we are going to do is set up BigQuery: From the side menu of the Google Cloud Platform Console, our project page, click on the BigQuery URL and we should be taken to the Google BigQuery home page. 
Select Create new dataset, as shown in the following screenshot: Create a new dataset with the values illustrated in the following screenshot: Once the dataset is created, click on the plus sign next to the dataset and create an empty table. We are going to name the table dht11_data and we are going have three fields in it, as shown here: Click on the Create Table button to create the table. Now that we have our table ready, we will write a cloud function to insert the incoming data from Pub/Sub into this table. Setting up Google Cloud Function Now, we are going to set up a cloud function that will be triggered by the incoming data: From the Google Cloud Console's left-hand-side menu, select Cloud Functions under Compute. Once you land on the Google Cloud Functions homepage, you will be asked to enable the cloud functions API. Click on Enable API: Once the API is enabled, we will be on the Create function page. Fill in the form as shown here: The Trigger is set to Cloud Pub/Sub topic and we have selected dht11 as the Topic. Under the Source code section; make sure you are in the index.js tab and update it as shown here: var BigQuery = require('@google-cloud/bigquery'); var projectId = 'pi-iot-project'; var bigquery = new BigQuery({ projectId: projectId, }); var datasetName = 'pi3_dht11_dataset'; var tableName = 'dht11_data'; exports.pubsubToBQ = function(event, callback) { var msg = event.data; var data = JSON.parse(Buffer.from(msg.data, 'base64').toString()); // console.log(data); bigquery .dataset(datasetName) .table(tableName) .insert(data) .then(function() { console.log('Inserted rows'); callback(); // task done }) .catch(function(err) { if (err && err.name === 'PartialFailureError') { if (err.errors && err.errors.length > 0) { console.log('Insert errors:'); err.errors.forEach(function(err) { console.error(err); }); } } else { console.error('ERROR:', err); } callback(); // task done }); }; In the previous code, we were using the BigQuery Node.js module to insert data into our BigQuery table. Update projectId, datasetName, and tableName as applicable in the code. Next, click on the package.json tab and update it as shown: { "name": "cloud_function", "version": "0.0.1", "dependencies": { "@google-cloud/bigquery": "^1.0.0" } } Finally, for the Function to execute field, enter pubsubToBQ. pubsubToBQ is the name of the function that has our logic and this function will be called when the data event occurs. Click on the Create button and our function should be deployed in a minute. Running the device Now that the entire setup is done, we will start pumping data into BigQuery: Head back to Raspberry Pi 3 which was sending the DHT11 temperature and humidity data, and run the application. We should see the data being published to the state topic: Now, if we head back to the Cloud Functions page, we should see the requests coming into the cloud function: You can click on VIEW LOGS to view the logs of each function execution: Now, head over to our table in BigQuery and click on the RUN QUERY button; run the query as shown in the following screenshot: Now, all the data that was generated by the DHT11 sensor is timestamped and stored in BigQuery. You can use the Save to Google Sheets button to save this data to Google Sheets and analyze the data there or plot graphs, as shown here: Or we can go one step ahead and use the Google Data Studio to do the same. 
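The pipeline in this article is Node.js end to end, but once the rows land in BigQuery they can be read from any language with a BigQuery client. Purely as an aside, a rough Python sketch would look like the following; it assumes the google-cloud-bigquery package is installed (it is not used anywhere else in this project), that the GOOGLE_APPLICATION_CREDENTIALS variable set earlier is available, and it reuses the project, dataset, table, and field names created above:

from google.cloud import bigquery

# Assumes: pip install google-cloud-bigquery and GOOGLE_APPLICATION_CREDENTIALS set.
client = bigquery.Client(project="pi-iot-project")
query = """
    SELECT time, temp, humd
    FROM `pi-iot-project.pi3_dht11_dataset.dht11_data`
    ORDER BY time DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.time, row.temp, row.humd)

That aside, let's now build the dashboard itself.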
Google Data Studio reports Now that the data is ready in BigQuery, we are going to set up Google Data Studio and then connect both of them, so we can access the data from BigQuery in Google Data Studio: Navigate to https://datastudio.google.com and log in with your Google account. Once you are on the Home page of Google Data Studio, click on the Blank report template. Make sure you read and agree to the terms and conditions before proceeding. Name the report PI3 DHT11 Sensor Data. Using the Create new data source button, we will create a new data source. Click on Create new data source and we should land on a page where we need to create a new Data Source. From the list of Connectors, select BigQuery; you will be asked to authorize Data Studio to interface with BigQuery, as shown in the following screenshot: Once we authorized, we will be shown our projects and related datasets and tables: Select the dht11_data table and click on Connect. This fetches the metadata of the table as shown here: Set the Aggregation for the temp and humd fields to Max and set the Type for time as Date & Time. Pick Minute (mm) from the sub-list. Click on Add to report and you will be asked to authorize Google Data Studio to read data from the table. Once the data source has been successfully linked, we will create a new time series chart. From the menu, select Insert | Time Series link. Update the data configuration of the chart as shown in the following screenshot: You can play with the styles as per your preference and we should see something similar to the following screenshot: This report can then be shared with any user. With this, we have seen the basic features and implementation process needed to work with Google Cloud IoT Core as well other features of the platform. If you found this post useful, do check out the book,  Enterprise Internet of Things Handbook, to build state of the art IoT applications best-fit for Enterprises. Cognitive IoT: How Artificial Intelligence is remoulding Industrial and Consumer IoT Five Most Surprising Applications of IoT How IoT is going to change tech teams

Create a data model in Splunk to enable interactive reports and dashboards

Pravin Dhandre
26 Jun 2018
8 min read
Data models enable you to create Splunk reports and dashboards without having to develop Splunk search. Typically, data models are designed by those that understand the specifics around the format, the semantics of certain data, and the manner in which users may expect to work with that data. In building a typical data model, knowledge managers use knowledge object types (such as lookups, transactions, search-time field extractions, and calculated fields). Today we are going to learn how to create a Splunk data model and how to describe that model with various fields and lookup attributes. This article is an excerpt from a book written by James D. Miller titled Implementing Splunk 7 - Third Edition.  Creating a data model So now that we have a general idea of what a Splunk data model is, let's go ahead and create one. Before we can get started, we need to verify that our user ID is set up with the proper access required to create a data model. By default, only users with an admin or power role can create data models. For other users, the ability to create a data model depends on whether their roles have write access to an app. To begin (once you have verified that you do have access to create a data model), you can click on Settings and then on Data models (under KNOWLEDGE): This takes you to the Data Models (management) page, shown in the next screenshot. This is where a list of data models is displayed. From here, you can manage permissions, acceleration, cloning, and removal of existing data models. You can also use this page to upload a data model or create new data models, using the Upload Data Model and New Data Model buttons on the upper-right corner, respectively. Since this is a new data model, you can click on the button labeled New Data Model. This will open the New Data Model dialog box (shown in the following image). We can fill in the required information in this dialog box: Filling in the new data model dialog You have four fields to fill in order to describe your new Splunk data model (Title, ID, App, and Description): Title: Here you must enter a Title for your data model. This field accepts any character, as well as spaces. The value you enter here is what will appear on the data model listing page. ID: This is an optional field. It gets prepopulated with what you entered for your data model title (with any spaces replaced with underscores. Take a minute to make sure you have a good one, since once you enter the data model ID, you can't change it. App: Here you select (from a drop-down list) the Splunk app that your data model will serve. Description: The description is also an optional field, but I recommend adding something descriptive to later identify your data model. Once you have filled in these fields, you can click on the button labeled Create. This opens the data model (in our example, Aviation Games) in the Splunk Edit Objects page as shown in the following screenshot: The next step in defining a data model is to add the first object. As we have already stated, data models are typically composed of object hierarchies built on root event objects. Each root event object represents a set of data that is defined by a constraint, which is a simple search that filters out events that are not relevant to the object. Getting back to our example, let's create an object for our data model to track purchase requests on our Aviation Games website. 
To define our first event-based object, click on Add Dataset (as shown in the following screenshot): Our data model's first object can either be a Root Event, or Root Search. We're going to add a Root Event, so select Root Event. This will take you to the Add Event Dataset editor: Our example event will expose events that contain the phrase error, which represents processing errors that have occurred within our data source. So, for Dataset Name, we will enter Processing Errors. The Dataset ID will automatically populate when you type in the Dataset Name (you can edit it if you want to change it). For our object's constraint, we'll enter sourcetype=tm1* error. This constraint defines the events that will be reported on (all events that contain the phrase error that are indexed in the data sources starting with tml). After providing Constraints for the event-based object, you can click on Preview to test whether the constraints you've supplied return the kind of events that you want. The following screenshot depicts the preview of the constraints given in this example: After reviewing the output, click on Save. The list of attributes for our root object is displayed: host, source, sourcetype, and _time. If you want to add child objects to client and server errors, you need to edit the attributes list to include additional attributes: Editing fields (attributes) Let's add an auto-extracted attribute, as mentioned earlier in this chapter, to our data model. Remember, auto-extracted attributes are derived by Splunk at search time. To start, click on Add Field: Next, select Auto-Extracted. The Add Auto-Extracted Field window opens: You can scroll through the list of automatically extracted fields and check the fields that you want to include. Since my data model example deals with errors that occurred, I've selected date_mday, date_month, and date_year. Notice that to the right of the field list, you have the opportunity to rename and type set each of the fields that you selected. Rename is self-explanatory, but for Type, Splunk allows you to select String, Number, Boolean, or IPV$ and indicate if the attribute is Required, Optional, Hidden, or Hidden &amp; Required. Optional means that the attribute doesn't have to appear in every event represented by the object. The attribute may appear in some of the object events and not others. Once you have reviewed your selected field types, click on Save: Lookup attributes Let's discuss lookup attributes now. Splunk can use the existing lookup definitions to match the values of an attribute that you select to values of a field in the specified lookup table. It then returns the corresponding field/value combinations and applies them to your object as (lookup) attributes. Once again, if you click on Add Field and select Lookup, Splunk opens the Add Fields with a Lookup page (shown in the following screenshot) where you can select from your currently defined lookup definitions. For this example, we select dnslookup: The dnslookup converts clienthost to clientip. We can configure a lookup attribute using this lookup to add that result to the processing errors objects. Under Input, select clienthost for Field in Lookup and Field in Dataset. Field in Lookup is the field to be used in the lookup table. Field in Dataset is the name of the field used in the event data. In our simple example, Splunk will match the field clienthost with the field host: Under Output, I have selected host as the output field to be matched with the lookup. 
You can provide a Display Name for the selected field. This display name is the name used for the field in your events. I simply typed AviationLookupName for my display name (see the following screenshot): Again, Splunk allows you to click on Preview to review the fields that you want to add. You can use the tabs to view the Events in a table, or view the values of each of the fields that you selected in Output. For example, the following screenshot shows the values of AviationLookupName: Finally, we can click on Save: Add Child object to our model We have just added a root (or parent) object to our data model. The next step is to add some children. Although a child object inherits all the constraints and attributes from its parent, when you create a child, you will give it additional constraints with the intention of further filtering the dataset that the object represents. To add a child object to our data model, click on Add Field and select Child: Splunk then opens the editor window, Add Child Dataset (shown in the following screenshot): On this page, follow these steps: Enter the Object Name: Dimensional Errors. Leave the Object ID as it is: Dimensional_Errors. Under Inherit From, select Processing Errors. This means that this child object will inherit all the attributes from the parent object, Processing Errors. Add the Additional Constraints, dimension, which means that the data models search for the events in this object; when expanded, it will look something like sourcetype=tm1* error dimension. Finally, click on Save to save your changes: Following the previously outlined steps, you can add more objects, each continuing to filter the results until you have the results that you need. With this we learned to create data models, and manage permissions, cloning and accelerating operational data models with ease. If you found this tutorial useful, do check out the book Implementing Splunk 7 - Third Edition and start transforming machine-generated data into valuable and actionable business insights. How to use R to boost your Data Model Building a Microsoft Power BI Data Model Splunk’s Input Methods and Data Feeds

How to set up the Scala Plugin in IntelliJ IDE [Tutorial]

Pavan Ramchandani
26 Jun 2018
2 min read
The Scala Plugin turns a normal IntelliJ IDEA installation into a convenient Scala development environment. In this article, we will discuss how to set up the Scala Plugin for the IntelliJ IDEA IDE. If you do not have IntelliJ IDEA, you can download it from here. By default, IntelliJ IDEA does not come with Scala features. The Scala Plugin adds those features, which means that we can create Scala/Play projects, Scala applications, Scala worksheets, and more.

The Scala Plugin supports the following technologies:
Scala
Play Framework
SBT
Scala.js

It supports three popular OS environments: Windows, Mac, and Linux.

Setting up the Scala Plugin for the IntelliJ IDE

Perform the following steps to install the Scala Plugin for the IntelliJ IDE so that we can develop our Scala-based projects:

1. Open the IntelliJ IDE, go to Configure at the bottom right, and click on the Plugins option available in the drop-down, as shown here. This opens the Plugins window.
2. Now click on Install JetBrains plugins, as shown in the preceding screenshot.
3. Next, type the word Scala in the search bar to see the Scala Plugin, as shown here.
4. Click on the Install button to install the Scala Plugin for IntelliJ IDEA.
5. Now restart IntelliJ IDEA to enable the Scala Plugin features.

After we re-open IntelliJ IDEA, if we access the File | New Project option, we will see a Scala option in the New Project window, as shown in the following screenshot, for creating new Scala or Play Framework-based SBT projects (a minimal SBT project sketch follows at the end of this article). We can see the Play Framework option only in the IntelliJ IDEA Ultimate Edition; as we are using CE (Community Edition), we cannot see that option. It's now time to start developing Scala/Play-based applications using the IntelliJ IDE.

To summarize, we gained an understanding of the Scala Plugin and covered its installation steps for IntelliJ. To learn more about taking a reactive programming approach with Scala, please refer to the book Scala Reactive Programming.

What Scala 3.0 Roadmap looks like!
Building Scalable Microservices
Exploring Scala Performance
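For readers who want to verify the setup, here is a minimal sketch of the kind of SBT-based Scala project that IntelliJ IDEA's New Project wizard generates. The project name, Scala version, and file contents below are illustrative assumptions, not values mandated by the plugin or by this article.

build.sbt (sbt's Scala-based build definition):

name := "hello-scala"        // illustrative project name
version := "0.1.0"
scalaVersion := "2.12.8"     // any Scala version supported by the plugin

src/main/scala/Hello.scala:

// A tiny application used only to confirm that the Scala SDK,
// the SBT import, and the Scala Plugin are wired together correctly.
object Hello extends App {
  println("Hello from the Scala Plugin!")
}

Running Hello from the IDE (or with sbt run from a terminal) should print the greeting, which confirms that the plugin's SBT integration is working.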


How to interact with HBase using HBase shell [Tutorial]

Amey Varangaonkar
25 Jun 2018
9 min read
HBase is among the top five most popular and widely-deployed NoSQL databases. It is used to support critical production workloads across hundreds of organizations. It is supported by multiple vendors (in fact, it is one of the few databases that is multi-vendor), and more importantly has an active and diverse developer and user community. In this article, we see how to work with the HBase shell in order to work efficiently on massive amounts of data. The following excerpt is taken from the book '7 NoSQL Databases in a Week' authored by Aaron Ploetz et al.

Working with the HBase shell

The best way to get started with understanding HBase is through the HBase shell. Before we do that, we first need to install HBase. An easy way to get started is to use the Hortonworks sandbox. You can download the sandbox for free from https://hortonworks.com/products/sandbox/. The sandbox can be installed on Linux, Mac, and Windows. Follow the instructions to get this set up.

On any cluster where the HBase client or server is installed, type hbase shell to get a prompt into HBase:

hbase(main):004:0> version
1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug 4 18:41:44 UTC 2016

This tells you the version of HBase that is running on the cluster. In this instance, the HBase version is 1.1.2, provided by a particular Hadoop distribution, in this case HDP 2.3.6:

hbase(main):001:0> help
HBase Shell, version 1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug 4 18:41:44 UTC 2016
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.

This provides the set of operations that are possible through the HBase shell, which includes DDL, DML, and admin operations.

hbase(main):001:0> create 'sensor_telemetry', 'metrics'
0 row(s) in 1.7250 seconds
=> Hbase::Table - sensor_telemetry

This creates a table called sensor_telemetry, with a single column family called metrics. As we discussed before, HBase doesn't require column names to be defined in the table schema (and in fact, has no provision for you to be able to do so):

hbase(main):001:0> describe 'sensor_telemetry'
Table sensor_telemetry is ENABLED
sensor_telemetry
COLUMN FAMILIES DESCRIPTION
{NAME => 'metrics', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.5030 seconds

This describes the structure of the sensor_telemetry table. The command output indicates that there's a single column family present called metrics, with various attributes defined on it.

BLOOMFILTER indicates the type of bloom filter defined for the table, which can either be a bloom filter of the ROW type, which probes for the presence/absence of a given row key, or of the ROWCOL type, which probes for the presence/absence of a given row key, col-qualifier combination. You can also choose to have BLOOMFILTER set to None.

The BLOCKSIZE configures the minimum granularity of an HBase read. By default, the block size is 64 KB, so if the average cells are less than 64 KB and there's not much locality of reference, you can lower your block size to ensure there's not more I/O than necessary and, more importantly, that your block cache isn't wasted on data that is not needed.
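If you would rather drive HBase from code than from the shell, the following is a minimal Scala sketch of the equivalent table creation through HBase's Java client API. This is not part of the excerpt: the connection settings (an hbase-site.xml on the classpath) and the HBase 1.x-era HTableDescriptor/HColumnDescriptor classes are assumptions based on the 1.1.2 version shown above.

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

object CreateSensorTelemetry extends App {
  // Picks up hbase-site.xml from the classpath; assumes a reachable cluster
  val conf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(conf)
  val admin = connection.getAdmin

  // A table with a single column family, 'metrics', mirroring the shell's create command
  val desc = new HTableDescriptor(TableName.valueOf("sensor_telemetry"))
  desc.addFamily(new HColumnDescriptor("metrics"))

  if (!admin.tableExists(desc.getTableName)) admin.createTable(desc)

  admin.close()
  connection.close()
}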
VERSIONS refers to the maximum number of cell versions that are to be kept around:

hbase(main):004:0> alter 'sensor_telemetry', {NAME => 'metrics', BLOCKSIZE => '16384', COMPRESSION => 'SNAPPY'}
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.9660 seconds

Here, we are altering the table and column family definition to change the BLOCKSIZE to 16 KB and the COMPRESSION codec to SNAPPY:

hbase(main):005:0> describe 'sensor_telemetry'
Table sensor_telemetry is ENABLED
sensor_telemetry
COLUMN FAMILIES DESCRIPTION
{NAME => 'metrics', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '16384', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0410 seconds

This is what the table definition now looks like after our ALTER table statement. Next, let's scan the table to see what it contains:

hbase(main):007:0> scan 'sensor_telemetry'
ROW COLUMN+CELL
0 row(s) in 0.0750 seconds

No surprises, the table is empty. So, let's populate some data into the table:

hbase(main):007:0> put 'sensor_telemetry', '/94555/20170308/18:30', 'temperature', '65'
ERROR: Unknown column family! Valid column names: metrics:*

Here, we are attempting to insert data into the sensor_telemetry table: the value '65' for the column qualifier 'temperature' for the row key '/94555/20170308/18:30'. This is unsuccessful because the column 'temperature' is not associated with any column family. In HBase, you always need the row key, the column family, and the column qualifier to uniquely specify a value. So, let's try this again:

hbase(main):008:0> put 'sensor_telemetry', '/94555/20170308/18:30', 'metrics:temperature', '65'
0 row(s) in 0.0120 seconds

OK, that seemed to be successful. Let's confirm that we now have some data in the table:

hbase(main):009:0> count 'sensor_telemetry'
1 row(s) in 0.0620 seconds
=> 1

OK, it looks like we are on the right track. Let's scan the table to see what it contains:

hbase(main):010:0> scan 'sensor_telemetry'
ROW COLUMN+CELL
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810397402, value=65
1 row(s) in 0.0190 seconds

This tells us we've got data for a single row and a single column. The insert time epoch in milliseconds was 1501810397402. In addition to a scan operation, which scans through all of the rows in the table, HBase also provides a get operation, where you can retrieve data for one or more rows, if you know the keys:

hbase(main):011:0> get 'sensor_telemetry', '/94555/20170308/18:30'
COLUMN CELL
metrics:temperature timestamp=1501810397402, value=65

OK, that returns the row as expected. Next, let's look at the effect of cell versions. As we've discussed before, a value in HBase is defined by a combination of Row-key, Col-family, Col-qualifier, and Timestamp.
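As a rough programmatic counterpart to the shell session above, here is a minimal Scala sketch of the same put and get through HBase's Java client API. The connection details are assumed as before; the row key, column family, and qualifier simply mirror the shell example, and this is a sketch rather than anything prescribed by the excerpt.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object PutAndGetTelemetry extends App {
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = connection.getTable(TableName.valueOf("sensor_telemetry"))

  // Equivalent of: put 'sensor_telemetry', '/94555/20170308/18:30', 'metrics:temperature', '65'
  val put = new Put(Bytes.toBytes("/94555/20170308/18:30"))
  put.addColumn(Bytes.toBytes("metrics"), Bytes.toBytes("temperature"), Bytes.toBytes("65"))
  table.put(put)

  // Equivalent of: get 'sensor_telemetry', '/94555/20170308/18:30'
  val result = table.get(new Get(Bytes.toBytes("/94555/20170308/18:30")))
  val value = Bytes.toString(result.getValue(Bytes.toBytes("metrics"), Bytes.toBytes("temperature")))
  println(s"metrics:temperature = $value")

  table.close()
  connection.close()
}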
To understand this, let's insert the value '66' for the same row key and column qualifier as before:

hbase(main):012:0> put 'sensor_telemetry', '/94555/20170308/18:30', 'metrics:temperature', '66'
0 row(s) in 0.0080 seconds

Now let's read the value for the row key back:

hbase(main):013:0> get 'sensor_telemetry', '/94555/20170308/18:30'
COLUMN CELL
metrics:temperature timestamp=1501810496459, value=66
1 row(s) in 0.0130 seconds

This is in line with what we expect, and this is the standard behavior we'd expect from any database. A put in HBase is the equivalent of an upsert in an RDBMS. Like an upsert, put inserts a value if it doesn't already exist and updates it if a prior value exists. Now, this is where things get interesting. The get operation in HBase allows us to retrieve data associated with a particular timestamp:

hbase(main):015:0> get 'sensor_telemetry', '/94555/20170308/18:30', {COLUMN => 'metrics:temperature', TIMESTAMP => 1501810397402}
COLUMN CELL
metrics:temperature timestamp=1501810397402, value=65
1 row(s) in 0.0120 seconds

We are able to retrieve the old value of 65 by providing the right timestamp. So, puts in HBase don't overwrite the old value, they merely hide it; we can always retrieve the old values by providing the timestamps. Now, let's insert more data into the table:

hbase(main):028:0> put 'sensor_telemetry', '/94555/20170307/18:30', 'metrics:temperature', '43'
0 row(s) in 0.0080 seconds
hbase(main):029:0> put 'sensor_telemetry', '/94555/20170306/18:30', 'metrics:temperature', '33'
0 row(s) in 0.0070 seconds

Now, let's scan the table back:

hbase(main):030:0> scan 'sensor_telemetry'
ROW COLUMN+CELL
/94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956, value=33
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67
3 row(s) in 0.0310 seconds

We can also scan the table in reverse key order:

hbase(main):031:0> scan 'sensor_telemetry', {REVERSED => true}
ROW COLUMN+CELL
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43
/94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956, value=33
3 row(s) in 0.0520 seconds

What if we wanted all the rows, but in addition, wanted all the cell versions from each row? We can easily retrieve that:

hbase(main):032:0> scan 'sensor_telemetry', {RAW => true, VERSIONS => 10}
ROW COLUMN+CELL
/94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956, value=33
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810496459, value=66
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810397402, value=65

Here, we are retrieving all three values of the row key /94555/20170308/18:30 in the scan result set.
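The same kind of scan can be issued from code. The following minimal Scala sketch combines the reversed scan and the multi-version scan from the shell session above into one Scan object via the Java client API; as before, the connection is assumed, and this is an illustrative sketch rather than the book's own example.

import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

object ScanTelemetry extends App {
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = connection.getTable(TableName.valueOf("sensor_telemetry"))

  val scan = new Scan()
  scan.setReversed(true)   // walk the row keys in descending order, like {REVERSED => true}
  scan.setMaxVersions(10)  // return up to 10 cell versions per column, like {VERSIONS => 10}

  val scanner = table.getScanner(scan)
  for (result <- scanner.asScala; cell <- result.rawCells()) {
    println(Bytes.toString(result.getRow) + " -> " + Bytes.toString(CellUtil.cloneValue(cell)))
  }

  scanner.close()
  table.close()
  connection.close()
}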
HBase scan operations don't need to go from the beginning to the end of the table; you can optionally specify the row to start scanning from and the row to stop the scan operation at:

hbase(main):034:0> scan 'sensor_telemetry', {STARTROW => '/94555/20170307'}
ROW COLUMN+CELL
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67
2 row(s) in 0.0550 seconds

HBase also provides the ability to supply filters to the scan operation to restrict what rows are returned. It's possible to implement your own filters, but there's rarely a need to, as there's a large collection of filters that are already implemented:

hbase(main):033:0> scan 'sensor_telemetry', {ROWPREFIXFILTER => '/94555/20170307'}
ROW COLUMN+CELL
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43
1 row(s) in 0.0300 seconds

This returns all the rows whose keys have the prefix /94555/20170307:

hbase(main):033:0> scan 'sensor_telemetry', { FILTER => SingleColumnValueFilter.new( Bytes.toBytes('metrics'), Bytes.toBytes('temperature'), CompareFilter::CompareOp.valueOf('EQUAL'), BinaryComparator.new(Bytes.toBytes('66')))}

The SingleColumnValueFilter can be used to scan a table and look for all rows with a given column value.

We saw how easy it is to work with your data in HBase using the HBase shell. If you found this excerpt useful, make sure you check out the book 'Seven NoSQL Databases in a Week' to get more hands-on information about HBase and the other popular NoSQL databases out there today.

Read More
Level Up Your Company's Big Data with Mesos
2018 is the year of graph databases. Here's why.
Top 5 NoSQL Databases