How-To Tutorials

article-image-the-elements-of-webassembly-wat-and-wasm-explained-tutorial

03 Jan 2019

9 min read

The elements of WebAssembly - Wat and Wasm, explained [Tutorial]

03 Jan 2019

In this tutorial, we will dig into the elements that correspond to the official specifications created by the WebAssembly Working Group. We will examine the Wat and the binary format in greater detail to gain a better understanding of how they relate to modules. This article is an excerpt from a book written by Mike Rourke titled Learn WebAssembly. In this book, you will learn how to build web applications with native performance using Wasm and C/C++<mark. Common structure and abstract syntax Before getting into the nuts and bolts of these formats, it's worth mentioning how these are related within the Core Specification. The following diagram is a visual representation of the table of contents (with some sections excluded for clarity): As you can see, the Text Format and Binary Format sections contain subsections for Values, Types, Instructions, and Modules that correlate with the Structure section. Consequently, much of what we cover in the next section for the text format have direct corollaries with the binary format. With that in mind, let's dive into the text format. Wat The Text Format section of the Core Specification provides technical descriptions for common language concepts such as values, types, and instructions. These are important concepts to know and understand if you're planning on building tooling for WebAssembly, but not necessary if you just plan on using it in your applications. That being said, the text format is an important part of WebAssembly, so there are concepts you should be aware of. In this section, we will dig into some of the details of the text format and highlight important points from the Core Specification. Definitions and S-expressions To understand Wat, let's start with the first sentence of the description taken directly from the WebAssembly Core Specification: "The textual format for WebAssembly modules is a rendering of their abstract syntax into S-expressions." So what are symbolic expressions (S-expressions)? S-expressions are notations for nested list (tree-structured) data. Essentially, they provide a simple and elegant way to represent list-based data in textual form. To understand how textual representations of nested lists map to a tree structure, let's extrapolate the tree structure from an HTML page. The following example contains a simple HTML page and the corresponding tree structure diagram. A simple HTML page: The corresponding tree structure is: Even if you've never seen a tree structure before, it's still clear to see how the HTML maps to the tree in terms of structure and hierarchy. Mapping HTML elements is relatively simple because it's a markup language with well-defined tags and no actual logic. Wat represents modules that can have multiple functions with varying parameters. To demonstrate the relationship between source code, Wat, and the corresponding tree structure, let's start with a simple C function that adds 2 to the number that is passed in as a parameter: Here is a C function that adds 2 to the num argument passed in and returns the result: int addTwo(int num) { return num + 2; } Converting the addTwo function to valid Wat produces this result: (module (table 0 anyfunc) (memory $0 1) (export "memory" (memory $0)) (export "addTwo" (func $addTwo)) (func $addTwo (; 0 ;) (param $0 i32) (result i32) (i32.add (get_local $0) (i32.const 2) ) ) ) The Structure section defines each of these concepts in the context of an abstract syntax. The Text Format section of the specification corresponds with these concepts as well, and you can see them defined by their keywords in the preceding snippet (func, memory, table). Tree Structure: The entire tree would be too large to fit on a page, so this diagram is limited to the first five lines of the Wat source text. Each filled-in dot represents a list node (or the contents of a set of parentheses). As you can see, code written in s-expressions can be clearly and concisely expressed in a tree structure, which is why s-expressions were chosen for WebAssembly's text format. Values, types, and instructions Although detailed coverage of the Text Format section of the Core Specification is out of the scope of this text, it's worth demonstrating how some of the language concepts map to the corresponding Wat. The following diagram demonstrates these mappings in a sample Wat snippet. The C code that this was compiled from represents a function that takes a word as a parameter and returns the square root of the character count: If you intend on writing or editing Wat, note that it supports block and line comments. The instructions are split up into blocks and consist of setting and getting memory associated with variables with valid types. You are able to control the flow of logic using if statements and loops are supported using the loop keyword. Role in the development process The text format allows for the representation of a binary Wasm module in textual form. This has some profound implications with regard to the ease of development and debugging. Having a textual representation of a WebAssembly module allows developers to view the source of a loaded module in a browser, which eliminates the black-box issues that inhibited the adoption of NaCl. It also allows for tooling to be built around troubleshooting modules. The official website describes the use cases that drove the design of the text format: • View Source on a WebAssembly module, thus fitting into the Web (where every source can be viewed) in a natural way. • Presentation in browser development tools when source maps aren't present (which is necessarily the case with the Minimum Viable Product (MVP)). • Writing WebAssembly code directly for reasons including pedagogical, experimental, debugging, optimization, and testing of the spec itself. The last item in the list reflects that the text format isn't intended to be written by hand in the course of normal development, but rather generated from a tool like Emscripten. You probably won't see or manipulate any .wat files when you're generating modules, but you may be viewing them in a debugging context. Not only is the text format valuable with regards to debugging, but having this intermediate format reduces the amount of reliance on a single tool for compilation. Several different tools currently exist to consume and emit this s-expression syntax, some of which are used by Emscripten to compile your code down to a .wasm file. Binary format and the module file (Wasm) The Binary Format section of the Core Specification provides the same level of detail with regard to language concepts as the Text format section. In this section, we will briefly cover some high-level details about the binary format and discuss the various sections that make up a Wasm module. Definition and module overview The binary format is defined as a dense linear encoding of the abstract syntax. Without getting too technical, that essentially means it's an efficient form of binary that allows for fast decoding, small file size, and reduced memory usage. The file representation of the binary format is a .wasm file. The Values, Types, and Instructions subsections of the Core Specification for the binary format correlate directly to the Text Format section. Each of these concepts is covered in the context of encoding. For example, according to the specification, the Integer types are encoded using the LEB128 variable-length integer encoding, in either unsigned or signed variant. These are important details to know if you wish to develop tooling for WebAssembly, but not necessary if you just plan on using it on your website. The Structure, Binary Format, and Text Format (wat) sections of the Core Specification have a Module subsection. We didn't cover aspects of the module in the previous section because it's more prudent to describe them in the context of a binary. The official WebAssembly site offers the following description for a module: "The distributable, loadable, and executable unit of code in WebAssembly is called a module. At runtime, a module can be instantiated with a set of import values to produce an instance, which is an immutable tuple referencing all the state accessible to the running module." Module sections A module is made up of several sections, some of which you'll be interacting with through the JavaScript API: Imports (import) are elements that can be accessed within the module and can be one of the following: Function, which can be called inside the module using the call operator Global, which can be accessed inside the module via the global operators Linear Memory, which can be accessed inside the module via the memory operators Table, which can be accessed inside the module using call_indirect Exports (export) are elements that can be accessed by the consuming API (that is, called by a JavaScript function) Module start function (start) is called after the module instance is initialized Global (global) contains the internal definition of global variables Linear memory (memory) contains the internal definition of linear memory with an initial memory size and optional maximum size Data (data) contains an array of data segments which specify the initial contents of fixed ranges of a given memory Table (table) is a linear memory whose elements are opaque values of a particular table element type: In the MVP, its primary purpose is to implement indirect function calls in C/C++ Elements (elements) is a section that allows a module to initialize the elements of any import or internally defined table with any other definition in the module Function and code: The function section declares the signatures of each internal function defined in the module The code section contains the function body of each function declared by the function section Some of the keywords (import, export, and so on) should look familiar; they're present in the contents of the Wat file in the previous section. WebAssembly's components follow a logical mapping that directly correspond to the APIs (for example, you pass a memory and table instance into JavaScript's WebAssembly.instantiate() function). Your primary interaction with a module in binary format will be through these APIs. In this tutorial, we looked at the WebAssembly modules Wat and Wasm. We covered Wat definitions, values, types, instructions, and its role. We looked at Wasm overview and its module sections. To know more about another element of WebAssembly, the JavaScript API and build applications on WebAssembly check out the book Learn WebAssembly. How has Rust and WebAssembly evolved in 2018 WebAssembly – Trick or Treat? Mozilla shares plans to bring desktop applications, games to WebAssembly and make deeper inroads for the future web

0
0
60935

How-To Tutorials

article-image-the-cap-theorem-in-practice-the-consistency-vs-availability-trade-off-in-distributed-databases

Richard Gall

12 Sep 2019

7 min read

The CAP Theorem in practice: The consistency vs. availability trade-off in distributed databases

Richard Gall

12 Sep 2019

7 min read

When you choose a database you are making a design decision. One of the best frameworks for understanding what this means in practice is the CAP Theorem. What is the CAP Theorem? The CAP Theorem, developed by computer scientist Eric Brewer in the late nineties, states that databases can only ever fulfil two out of three elements: Consistency - that reads are always up to date, which means any client making a request to the database will get the same view of data. Availability - database requests always receive a response (when valid). Partition tolerance - that a network fault doesn’t prevent messaging between nodes. In the context of distributed (NoSQL) databases, this means there is always going to be a trade-off between consistency and availability. This is because distributed systems are always necessarily partition tolerant (ie. it simply wouldn’t be a distributed database if it wasn’t partition tolerant.) Read next: Different types of NoSQL databases and when to use them How do you use the CAP Theorem when making database decisions? Although the CAP Theorem can feel quite abstract, it has practical, real-world consequences. From both a technical and business perspective the trade-offs will lead you to some very important questions. There are no right answers. Ultimately it will be all about the context in which your database is operating, the needs of the business, and the expectations and needs of users. You will have to consider things like: Is it important to avoid throwing up errors in the client? Or are we willing to sacrifice the visible user experience to ensure consistency? Is consistency an actual important part of the user’s experience Or can we actually do what we want with a relational database and avoid the need for partition tolerance altogether? As you can see, these are ultimately user experience questions. To properly understand those, you need to be sensitive to the overall goals of the project, and, as said above, the context in which your database solution is operating. (Eg. Is it powering an internal analytics dashboard? Or is it supporting a widely used external-facing website or application?) And, as the final bullet point highlights, it’s always worth considering whether the consistency v availability trade-off should matter at all. Avoid the temptation to think a complex database solution will always be better when a simple, more traditional solution will do the job. Of course, it’s important to note that systems that aren’t partition tolerant are a single point of failure in a system. That introduces the potential for unreliability. Prioritizing consistency in a distributed database It’s possible to get into a lot of technical detail when talking about consistency and availability, but at a really fundamental level the principle is straightforward: you need consistency (or what is called a CP database) if the data in the database must always be up to date and aligned, even in the instance of a network failure (eg. the partitioned nodes are unable to communicate with one another for whatever reason). Particular use cases where you would prioritize consistency is when you need multiple clients to have the same view of the data. For example, where you’re dealing with financial information, personal information, using a database that gives you consistency and confidence that data you are looking at is up to date in a situation where the network is unreliable or fails. Examples of CP databases MongoDB Learning MongoDB 4 [Video] MongoDB 4 Quick Start Guide MongoDB, Express, Angular, and Node.js Fundamentals Redis Build Complex Express Sites with Redis and Socket.io [Video] Learning Redis HBase Learn by Example : HBase - The Hadoop Database [Video] HBase Design Patterns Prioritizing availability in a distributed database Availability is essential when data accumulation is a priority. Think here of things like behavioral data or user preferences. In scenarios like these, you will want to capture as much information as possible about what a user or customer is doing, but it isn’t critical that the database is constantly up to date. It simply just needs to be accessible and available even when network connections aren’t working. The growing demand for offline application use is also one reason why you might use a NoSQL database that prioritizes availability over consistency. Examples of AP databases Cassandra Learn Apache Cassandra in Just 2 Hours [Video] Mastering Apache Cassandra 3.x - Third Edition DynamoDB Managed NoSQL Database In The Cloud - Amazon AWS DynamoDB [Video] Hands-On Amazon DynamoDB for Developers [Video] Limitations and criticisms of CAP Theorem It’s worth noting that the CAP Theorem can pose problems. As with most things, in truth, things are a little more complicated. Even Eric Brewer is circumspect about the theorem, especially as what we expect from distributed databases. Back in 2012, twelve years after he first put his theorem into the world, he wrote that: “Although designers still need to choose between consistency and availability when partitions are present, there is an incredible range of flexibility for handling partitions and recovering from them. The modern CAP goal should be to maximize combinations of consistency and availability that make sense for the specific application. Such an approach incorporates plans for operation during a partition and for recovery afterward, thus helping designers think about CAP beyond its historically perceived limitations.” So, this means we must think about the trade-off between consistency and availability as a balancing act, rather than a binary design decision. Elsewhere, there have been more robust criticisms of CAP Theorem. Software engineer Martin Kleppmann, for example, pleaded Please stop calling databases CP or AP in 2015. In a blog post he argues that CAP Theorem only works if you adhere to specific definitions of consistency, availability, and partition tolerance. “If your use of words matches the precise definitions of the proof, then the CAP theorem applies to you," he writes. “But if you’re using some other notion of consistency or availability, you can’t expect the CAP theorem to still apply.” The consequences of this are much like those described in Brewer’s piece from 2012. You need to take a nuanced approach to database trade-offs in which you think them through on your own terms and up against your own needs. The PACELC Theorem One of the developments of this line of argument is an extension to the CAP Theorem: the PACELC Theorem. This moves beyond thinking about consistency and availability and instead places an emphasis on the trade-off between consistency and latency. The PACELC Theorem builds on the CAP Theorem (the ‘PAC’) and adds an else (the ‘E’). What this means is that while you need to choose between availability and consistency if communication between partitions has failed in a distributed system, even if things are running properly and there are no network issues, there is still going to be a trade-off between consistency and latency (the ‘LC’). Conclusion: Learn to align context with technical specs Although the CAP Theorem might seem somewhat outdated, it is valuable in providing a way to think about database architecture design. It not only forces engineers and architects to ask questions about what they want from the technologies they use, but it also forces them to think carefully about the requirements of a given project. What are the business goals? What are user expectations? The PACELC Theorem builds on CAP in an effective way. However, the most important thing about these frameworks is how they help you to think about your problems. Of course the CAP Theorem has limitations. Because it abstracts a problem it is necessarily going to lack nuance. There are going to be things it simplifies. It’s important, as Kleppmann reminds us - to be mindful of these nuances. But at the same time, we shouldn’t let an obsession with nuance and detail allow us to miss the bigger picture.

0
0
60879

article-image-using-genetic-algorithms-for-optimizing-your-models-tutorial

Natasha Mathur

04 Apr 2019

14 min read

Using Genetic Algorithms for optimizing your models [Tutorial]

Natasha Mathur

04 Apr 2019

14 min read

0
0
60277

How-To Tutorials

article-image-wireshark-analyze-malicious-emails-in-pop-imap-smtp

Vijin Boricha

29 Jul 2018

10 min read

Wireshark for analyzing issues and malicious emails in POP, IMAP, and SMTP [Tutorial]

Vijin Boricha

29 Jul 2018

10 min read

One of the contributing factors in the evolution of digital marketing and business is email. Email allows users to exchange real-time messages and other digital information such as files and images over the internet in an efficient manner. Each user is required to have a human-readable email address in the form of username@domainname.com. There are various email providers available on the internet, and any user can register to get a free email address. There are different email application-layer protocols available for sending and receiving mails, and the combination of these protocols helps with end-to-end email exchange between users in the same or different mail domains. In this article, we will look at the normal operation of email protocols and how to use Wireshark for basic analysis and troubleshooting. This article is an excerpt from Network Analysis using Wireshark 2 Cookbook - Second Edition written by Nagendra Kumar Nainar, Yogesh Ramdoss, Yoram Orzach. The three most commonly used application layer protocols are POP3, IMAP, and SMTP: POP3: Post Office Protocol 3 (POP3) is an application layer protocol used by email systems to retrieve mail from email servers. The email client uses POP3 commands such as LOGIN, LIST, RETR, DELE, QUIT to access and manipulate (retrieve or delete) the email from the server. POP3 uses TCP port 110 and wipes the mail from the server once it is downloaded to the local client. IMAP: Internet Mail Access Protocol (IMAP) is another application layer protocol used to retrieve mail from the email server. Unlike POP3, IMAP allows the user to read and access the mail concurrently from more than one client device. With current trends, it is very common to see users with more than one device to access emails (laptop, smartphone, and so on), and the use of IMAP allows the user to access mail any time, from any device. The current version of IMAP is 4 and it uses TCP port 143. SMTP: Simple Mail Transfer Protocol (SMTP) is an application layer protocol that is used to send email from the client to the mail server. When the sender and receiver are in different email domains, SMTP helps to exchange the mail between servers in different domains. It uses TCP port 25: As shown in the preceding diagram, SMTP is the email client used to send the mail to the mail server, and POP3 or IMAP is used to retrieve the email from the server. The email server uses SMTP to exchange the mail between different domains. In order to maintain the privacy of end users, most email servers use different encryption mechanisms at the transport layer. The transport layer port number will differ from the traditional email protocols if they are used over secured transport layer (TLS). For example, POP3 over TLS uses TCP port 995, IMAP4 over TLS uses TCP port 993, and SMTP over TLS uses port 465. Normal operation of mail protocols As we saw above, the common mail protocols for mail client to server and server to server communication are POP3, SMTP, and IMAP4. Another common method for accessing emails is web access to mail, where you have common mail servers such as Gmail, Yahoo!, and Hotmail. Examples include Outlook Web Access (OWA) and RPC over HTTPS for the Outlook web client from Microsoft. In this recipe, we will talk about the most common client-server and server-server protocols, POP3 and SMTP, and the normal operation of each protocol. Getting ready Port mirroring to capture the packets can be done either on the email client side or on the server side. How to do it... POP3 is usually used for client to server communications, while SMTP is usually used for server to server communications. POP3 communications POP3 is usually used for mail client to mail server communications. The normal operation of POP3 is as follows: Open the email client and enter the username and password for login access. Use POP as a display filter to list all the POP packets. It should be noted that this display filter will only list packets that use TCP port 110. If TLS is used, the filter will not list the POP packets. We may need to use tcp.port == 995 to list the POP3 packets over TLS. Check the authentication has been passed correctly. In the following screenshot, you can see a session opened with a username that starts with doronn@ (all IDs were deleted) and a password that starts with u6F. To see the TCP stream shown in the following screenshot, right-click on one of the packets in the stream and choose Follow TCP Stream from the drop-down menu: Any error messages in the authentication stage will prevent communications from being established. You can see an example of this in the following screenshot, where user authentication failed. In this case, we see that when the client gets a Logon failure, it closes the TCP connection: Use relevant display filters to list the specific packet. For example, pop.request.command == "USER" will list the POP request packet with the username and pop.request.command == "PASS" will list the POP packet carrying the password. A sample snapshot is as follows: During the mail transfer, be aware that mail clients can easily fill a narrow-band communications line. You can check this by simply configuring the I/O graphs with a filter on POP. Always check for common TCP indications: retransmissions, zero-window, window-full, and others. They can indicate a busy communication line, slow server, and other problems coming from the communication lines or end nodes and servers. These problems will mostly cause slow connectivity. When the POP3 protocol uses TLS for encryption, the payload details are not visible. We explain how the SSL captures can be decrypted in the There's more... section. IMAP communications IMAP is similar to POP3 in that it is used to retrieve the mail from the server by the client. The normal behavior of IMAP communication is as follows: Open the email client and enter the username and password for the relevant account. Compose a new message and send it from any email account. Retrieve the email on the client that is using IMAP. Different clients may have different ways of retrieving the email. Use the relevant button to trigger it. Check you received the email on your local client. SMTP communications SMTP is commonly used for the following purposes: Server to server communications, in which SMTP is the mail protocol that runs between the servers In some clients, POP3 or IMAP4 are configured for incoming messages (messages from the server to the client), while SMTP is configured for outgoing messages (messages from the client to the server) The normal behavior of SMTP communication is as follows: The local email client resolves the IP address of the configured SMTP server address. This triggers a TCP connection to port number 25 if SSL/TLS is not enabled. If SSL/TLS is enabled, a TCP connection is established over port 465. It exchanges SMTP messages to authenticate with the server. The client sends AUTH LOGIN to trigger the login authentication. Upon successful login, the client will be able to send mails. It sends SMTP message such as "MAIL FROM:<>", "RCPT TO:<>" carrying sender and receiver email addresses. Upon successful queuing, we get an OK response from the SMTP server. The following is a sample SMTP message flow between client and server: How it works... In this section, let's look into the normal operation of different email protocols with the use of Wireshark. Mail clients will mostly use POP3 for communication with the server. In some cases, they will use SMTP as well. IMAP4 is used when server manipulation is required, for example, when you need to see messages that exist on a remote server without downloading them to the client. Server to server communication is usually implemented by SMTP. The difference between IMAP and POP is that in IMAP, the mail is always stored on the server. If you delete it, it will be unavailable from any other machine. In POP, deleting a downloaded email may or may not delete that email on the server. In general, SMTP status codes are divided into three categories, which are structured in a way that helps you understand what exactly went wrong. The methods and details of SMTP status codes are discussed in the following section. POP3 POP3 is an application layer protocol used by mail clients to retrieve email messages from the server. A typical POP3 session will look like the following screenshot: It has the following steps: The client opens a TCP connection to the server. The server sends an OK message to the client (OK Messaging Multiplexor). The user sends the username and password. The protocol operations begin. NOOP (no operation) is a message sent to keep the connection open, STAT (status) is sent from the client to the server to query the message status. The server answers with the number of messages and their total size (in packet 1042, OK 0 0 means no messages and it has a total size of zero) When there are no mail messages on the server, the client send a QUIT message (1048), the server confirms it (packet 1136), and the TCP connection is closed (packets 1137, 1138, and 1227). In an encrypted connection, the process will look nearly the same (see the following screenshot). After the establishment of a connection (1), there are several POP messages (2), TLS connection establishment (3), and then the encrypted application data: IMAP The normal operation of IMAP is as follows: The email client resolves the IP address of the IMAP server: As shown in the preceding screenshot, the client establishes a TCP connection to port 143 when SSL/TSL is disabled. When SSL is enabled, the TCP session will be established over port 993. Once the session is established, the client sends an IMAP capability message requesting the server sends the capabilities supported by the server. This is followed by authentication for access to the server. When the authentication is successful, the server replies with response code 3 stating the login was a success: The client now sends the IMAP FETCH command to fetch any mails from the server. When the client is closed, it sends a logout message and clears the TCP session. SMTP The normal operation of SMTP is as follows: The email client resolves the IP address of the SMTP server: The client opens a TCP connection to the SMTP server on port 25 when SSL/TSL is not enabled. If SSL is enabled, the client will open the session on port 465: Upon successful TCP session establishment, the client will send an AUTH LOGIN message to prompt with the account username/password. The username and password will be sent to the SMTP client for account verification. SMTP will send a response code of 235 if authentication is successful: The client now sends the sender's email address to the SMTP server. The SMTP server responds with a response code of 250 if the sender's address is valid. Upon receiving an OK response from the server, the client will send the receiver's address. SMTP server will respond with a response code of 250 if the receiver's address is valid. The client will now push the actual email message. SMTP will respond with a response code of 250 and the response parameter OK: queued. The successfully queued message ensures that the mail is successfully sent and queued for delivery to the receiver address. We have learned how to analyse issues in POP, IMAP, and SMTP and malicious emails. Get to know more about DNS Protocol Analysis and FTP, HTTP/1, AND HTTP/2 from our book Network Analysis using Wireshark 2 Cookbook - Second Edition. What’s new in Wireshark 2.6? Analyzing enterprise application behavior with Wireshark 2 Capturing Wireshark Packets

0
0
60094

article-image-what-is-the-difference-between-oauth-1-0-and-2-0

Pavan Ramchandani

13 Jun 2018

11 min read

What's the difference between OAuth 1.0 and OAuth 2.0?

Pavan Ramchandani

13 Jun 2018

11 min read

0
0
60068

How-To Tutorials

article-image-how-to-remotely-monitor-hosts-over-telnet-and-ssh-tutorial

Melisha Dsouza

20 Mar 2019

14 min read

How to remotely monitor hosts over Telnet and SSH [Tutorial]

Melisha Dsouza

20 Mar 2019

14 min read

In this tutorial, you will learn how to carry out basic configurations on a server with Telnet and SSH configured. We will begin by using the Telnet module, after which we will implement the same configurations using the preferred method: SSH using different modules in Python. You will also learn about how telnetlib, subprocess, fabric, Netmiko, and paramiko modules work. This tutorial is an excerpt from a book written by Ganesh Sanjiv Naik titled Mastering Python Scripting for System Administrators. This book will take you through a set of specific software patterns and you will learn, in detail, how to apply these patterns and build working software on top of a serverless system. The telnetlib() module In this section, we are going to learn about the Telnet protocol and then we will do Telnet operations using the telnetlib module over a remote server. Telnet is a network protocol that allows a user to communicate with remote servers. It is mostly used by network administrators to remotely access and manage devices. To access the device, run the Telnet command with the IP address or hostname of a remote server in your Terminal. Telnet uses TCP on the default port number 23. To use Telnet, make sure it is installed on your system. If not, run the following command to install it: $ sudo apt-get install telnetd To run Telnet using the simple Terminal, you just have to enter the following command: $ telnet ip_address_of_your_remote_server Python has the telnetlib module to perform Telnet functions through Python scripts. Before telnetting your remote device or router, make sure they are configured properly and, if not, you can do basic configuration by using the following command in the router's Terminal: configure terminal enable password 'set_Your_password_to_access_router' username 'set_username' password 'set_password_for_remote_access' line vty 0 4 login local transport input all interface f0/0 ip add 'set_ip_address_to_the_router' 'put_subnet_mask' no shut end show ip interface brief Now, let's see the example of Telnetting a remote device. For that, create a telnet_example.py script and write following content in it: import telnetlib import getpass import sys HOST_IP = "your host ip address" host_user = input("Enter your telnet username: ") password = getpass.getpass() t = telnetlib.Telnet(HOST_IP) t.read_until(b"Username:") t.write(host_user.encode("ascii") + b"\n") if password: t.read_until(b"Password:") t.write(password.encode("ascii") + b"\n") t.write(b"enable\n") t.write(b"enter_remote_device_password\n") #password of your remote device t.write(b"conf t\n") t.write(b"int loop 1\n") t.write(b"ip add 10.1.1.1 255.255.255.255\n") t.write(b"int loop 2\n") t.write(b"ip add 20.2.2.2 255.255.255.255\n") t.write(b"end\n") t.write(b"exit\n") print(t.read_all().decode("ascii") ) Run the script and you will get the output as follows: student@ubuntu:~$ python3 telnet_example.py Output: Enter your telnet username: student Password: server>enable Password: server#conf t Enter configuration commands, one per line. End with CNTL/Z. server(config)#int loop 1 server(config-if)#ip add 10.1.1.1 255.255.255.255 server(config-if)#int loop 23 server(config-if)#ip add 20.2.2.2 255.255.255.255 server(config-if)#end server#exit In the preceding example, we accessed and configured a Cisco router using the telnetlib module. In this script, first, we took the username and password from the user to initialize the Telnet connection with a remote device. When the connection was established, we did further configuration on the remote device. After telnetting, we will be able to access a remote server or device. But there is one very important disadvantage of this Telnet protocol, and that is all the data, including usernames and passwords, is sent over a network in a text manner, which may cause a security risk. Because of that, nowadays Telnet is rarely used and has been replaced by a very secure protocol named Secure Shell, known as SSH. Install SSH by running the following command in your Terminal: $ sudo apt install ssh Also, on a remote server where the user wants to communicate, an SSH server must be installed and running. SSH uses the TCP protocol and works on port number 22 by default. You can run the ssh command through the Terminal as follows: $ ssh host_name@host_ip_address Now, you will learn to do SSH by using different modules in Python, such as subprocess, fabric, Netmiko, and Paramiko. Now, we will see those modules one by one. The subprocess.Popen() module The Popen class handles the process creation and management. By using this module, developers can handle less common cases. The child program execution will be done in a new process. To execute a child program on Unix/Linux, the class will use the os.execvp() function. To execute a child program in Windows, the class will use the CreateProcess() function. Now, let's see some useful arguments of subprocess.Popen(): class subprocess.Popen(args, bufsize=0, executable=None, stdin=None, stdout=None, stderr=None, preexec_fn=None, close_fds=False, shell=False, cwd=None, env=None, universal_newlines=False, startupinfo=None, creationflags=0) Let's look at each argument: args: It can be a sequence of program arguments or a single string. If args is a sequence, the first item in args is executed. If args is a string, it recommends to pass args as a sequence. shell: The shell argument is by default set to False and it specifies whether to use shell for execution of the program. If shell is True, it recommends to pass args as a string. In Linux, if shell=True, the shell defaults to /bin/sh. If args is a string, the string specifies the command to execute through the shell. bufsize: If bufsize is 0 (by default, it is 0), it means unbuffered and if bufsize is 1, it means line buffered. If bufsize is any other positive value, use a buffer of the given size. If bufsize is any other negative value, it means fully buffered. executable: It specifies that the replacement program to be executed. stdin, stdout, and stderr: These arguments define the standard input, standard output, and standard error respectively. preexec_fn: This is set to a callable object and will be called just before the child is executed in the child process. close_fds: In Linux, if close_fds is true, all file descriptors except 0, 1, and 2 will be closed before the child process is executed. In Windows, if close_fds is true then the child process will inherit no handles. env: If the value is not None, then mapping will define environment variables for new process. universal_newlines: If the value is True then stdout and stderr will be opened as text files in newlines mode. Now, we are going to see an example of subprocess.Popen(). For that, create a ssh_using_sub.py script and write the following content in it: import subprocess import sys HOST="your host username@host ip" COMMAND= "ls" ssh_obj = subprocess.Popen(["ssh", "%s" % HOST, COMMAND], shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE) result = ssh_obj.stdout.readlines() if result == []: err = ssh_obj.stderr.readlines() print(sys.stderr, "ERROR: %s" % err) else: print(result) Run the script and you will get the output as follows: student@ubuntu:~$ python3 ssh_using_sub.py Output : student@192.168.0.106's password: [b'Desktop\n', b'Documents\n', b'Downloads\n', b'examples.desktop\n', b'Music\n', b'Pictures\n', b'Public\n', b'sample.py\n', b'spark\n', b'spark-2.3.1-bin-hadoop2.7\n', b'spark-2.3.1-bin-hadoop2.7.tgz\n', b'ssh\n', b'Templates\n', b'test_folder\n', b'test.txt\n', b'Untitled1.ipynb\n', b'Untitled.ipynb\n', b'Videos\n', b'work\n'] In the preceding example, first, we imported the subprocess module, then we defined the host address where you want to establish the SSH connection. After that, we gave one simple command that executed over the remote device. After all this was set up, we put this information in the subprocess.Popen() function. This function executed the arguments defined inside that function to create a connection with the remote device. After the SSH connection was established, our defined command was executed and provided the result. Then we printed the result of SSH on the Terminal, as shown in the output. SSH using fabric module Fabric is a Python library as well as a command-line tool for the use of SSH. It is used for system administration and application deployment over the network. We can also execute shell commands over SSH. To use fabric module, first you have to install it using the following command: $ pip3 install fabric3 Now, we will see an example. Create a fabfile.py script and write the following content in it: from fabric.api import * env.hosts=["host_name@host_ip"] env.password='your password' def dir(): run('mkdir fabric') print('Directory named fabric has been created on your host network') def diskspace(): run('df') Run the script and you will get the output as follows: student@ubuntu:~$ fab dir Output: [student@192.168.0.106] Executing task 'dir' [student@192.168.0.106] run: mkdir fabric Done. Disconnecting from 192.168.0.106... done. In the preceding example, first, we imported the fabric.api module, then we set the hostname and password to get connected with the host network. After that, we set a different task to perform over SSH. Therefore, to execute our program instead of the Python3 fabfile.py, we used the fab utility (fab dir), and after that we stated that the required tasks should be performed from our fabfile.py. In our case, we performed the dir task, which creates a directory with the name 'fabric' on your remote network. You can add your specific task in your Python file. It can be executed using the fab utility of the fabric module. SSH using the Paramiko library Paramiko is a library that implements the SSHv2 protocol for secure connections to remote devices. Paramiko is a pure Python interface around SSH. Before using Paramiko, make sure you have installed it properly on your system. If it is not installed, you can install it by running the following command in your Terminal: $ sudo pip3 install paramiko Now, we will see an example of using paramiko. For this paramiko connection, we are using a Cisco device. Paramiko supports both password-based and key-pair based authentication for a secure connection with the server. In our script, we are using password-based authentication, which means we check for a password and, if available, authentication is attempted using plain username/password authentication. Before we are going to do SSH to your remote device or multi-layer router, make sure they are configured properly and, if not, you can do basic configuration by using the following command in a multi-layer router Terminal: configure t ip domain-name cciepython.com crypto key generate rsa How many bits in the modulus [512]: 1024 interface range f0/0 - 1 switchport mode access switchport access vlan 1 no shut int vlan 1 ip add 'set_ip_address_to_the_router' 'put_subnet_mask' no shut exit enable password 'set_Your_password_to_access_router' username 'set_username' password 'set_password_for_remote_access' username 'username' privilege 15 line vty 0 4 login local transport input all end Now, create a pmiko.py script and write the following content in it: import paramiko import time ip_address = "host_ip_address" usr = "host_username" pwd = "host_password" c = paramiko.SSHClient() c.set_missing_host_key_policy(paramiko.AutoAddPolicy()) c.connect(hostname=ip_address,username=usr,password=pwd) print("SSH connection is successfully established with ", ip_address) rc = c.invoke_shell() for n in range (2,6): print("Creating VLAN " + str(n)) rc.send("vlan database\n") rc.send("vlan " + str(n) + "\n") rc.send("exit\n") time.sleep(0.5) time.sleep(1) output = rc.recv(65535) print(output) c.close Run the script and you will get the output as follows: student@ubuntu:~$ python3 pmiko.py Output: SSH connection is successfuly established with 192.168.0.70 Creating VLAN 2 Creating VLAN 3 Creating VLAN 4 Creating VLAN 5 In the preceding example, first, we imported the paramiko module, then we defined the SSH credentials required to connect the remote device. After providing credentials, we created an instance 'c' of paramiko.SSHclient(), which is the primary client used to establish connections with the remote device and execute commands or operations. The creation of an SSHClient object allows us to establish remote connections using the .connect() function. Then, we set the policy paramiko connection because, by default, paramiko.SSHclient sets the SSH policy in reject policy state. That causes the policy to reject any SSH connection without any validation. In our script, we are neglecting this possibility of SSH connection drop by using the AutoAddPolicy() function that automatically adds the server's host key without prompting it. We can use this policy for testing purposes only, but this is not a good option in a production environment because of security purpose. When an SSH connection is established, you can do any configuration or operation that you want on your device. Here, we created a few virtual LANs on a remote device. After creating VLANs, we just closed the connection. SSH using the Netmiko library In this section, we will learn about Netmiko. The Netmiko library is an advanced version of Paramiko. It is a multi_vendor library that is based on Paramiko. Netmiko simplifies SSH connection to a network device and takes particular operation on the device. Before going doing SSH to your remote device or multi-layer router, make sure they are configured properly and, if not, you can do basic configuration by command mentioned in the Paramiko section. Now, let's see an example. Create a nmiko.py script and write the following code in it: from netmiko import ConnectHandler remote_device={ 'device_type': 'cisco_ios', 'ip': 'your remote_device ip address', 'username': 'username', 'password': 'password', } remote_connection = ConnectHandler(**remote_device) #net_connect.find_prompt() for n in range (2,6): print("Creating VLAN " + str(n)) commands = ['exit','vlan database','vlan ' + str(n), 'exit'] output = remote_connection.send_config_set(commands) print(output) command = remote_connection.send_command('show vlan-switch brief') print(command) Run the script and you will get the output as follows: student@ubuntu:~$ python3 nmiko.py Output: Creating VLAN 2 config term Enter configuration commands, one per line. End with CNTL/Z. server(config)#exit server #vlan database server (vlan)#vlan 2 VLAN 2 modified: server (vlan)#exit APPLY completed. Exiting.... server # .. .. .. .. switch# Creating VLAN 5 config term Enter configuration commands, one per line. End with CNTL/Z. server (config)#exit server #vlan database server (vlan)#vlan 5 VLAN 5 modified: server (vlan)#exit APPLY completed. Exiting.... VLAN Name Status Ports ---- -------------------------------- --------- ------------------------------- 1 default active Fa0/0, Fa0/1, Fa0/2, Fa0/3, Fa0/4, Fa0/5, Fa0/6, Fa0/7, Fa0/8, Fa0/9, Fa0/10, Fa0/11, Fa0/12, Fa0/13, Fa0/14, Fa0/15 2 VLAN0002 active 3 VLAN0003 active 4 VLAN0004 active 5 VLAN0005 active 1002 fddi-default active 1003 token-ring-default active 1004 fddinet-default active 1005 trnet-default active In the preceding example, we use Netmiko library to do SSH, instead of Paramiko. In this script, first, we imported ConnectHandler from the Netmiko library, which we used to establish an SSH connection to the remote network devices by passing in the device dictionary. In our case, that dictionary is remote_device. After the connection is established, we executed configuration commands to create a number of virtual LANs using the send_config_set() function. When we use this type (.send_config_set()) of function to pass commands on a remote device, it automatically sets our device in configuration mode. After sending configuration commands, we also passed a simple command to get the information about the configured device. Summary In this tutorial, you learned about Telnet and SSH and different Python modules such as telnetlib, subprocess, fabric, Netmiko, and Paramiko, using which we perform Telnet and SSH. SSH uses the public key encryption for security purposes and is more secure than Telnet. To learn how to leverage the features and libraries of Python to administrate your environment efficiently, check out our book Mastering Python Scripting for System Administrators. 5 blog posts that could make you a better Python programmer “With Python, you can create self-explanatory, concise, and engaging data visuals, and present insights that impact your business” – Tim Großmann and Mario Döbler [Interview] Using Python Automation to interact with network devices [Tutorial]

0
0
59858

How-To Tutorials

article-image-visualizing-3d-plots-matplotlib-2-0

Sugandha Lahoti

16 Nov 2017

7 min read

Visualizing 3D plots in Matplotlib 2.0

Sugandha Lahoti

16 Nov 2017

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim titled Matplotlib 2.x By Example.[/box] By transitioning to the three-dimensional space, you may enjoy greater creative freedom when creating visualizations. The extra dimension can also accommodate more information in a single plot. However, some may argue that 3D is nothing more than a visual gimmick when projected to a 2D surface (such as paper) as it would obfuscate the interpretation of data points. In Matplotlib version 2, despite significant developments in the 3D API, annoying bugs or glitches still exist. We will discuss some workarounds toward the end of this article. More powerful Python 3D visualization packages do exist (such as MayaVi2, Plotly, and VisPy), but it's good to use Matplotlib's 3D plotting functions if you want to use the same package for both 2D and 3D plots, or you would like to maintain the aesthetics of its 2D plots. For the most part, 3D plots in Matplotlib have similar structures to 2D plots. As such, we will not go through every 3D plot type in this section. We will put our focus on 3D scatter plots and bar charts. 3D scatter plot Let's try to create a 3D scatter plot. Before doing that, we need some data points in three dimensions (x, y, z): import pandas as pd source = "https://raw.githubusercontent.com/PointCloudLibrary/data/master/tutorials/ ism_train_cat.pcd" cat_df = pd.read_csv(source, skiprows=11, delimiter=" ", names=["x","y","z"], encoding='latin_1') cat_df.head() To declare a 3D plot, we first need to import the Axes3D object from the mplot3d extension in mpl_toolkits, which is responsible for rendering 3D plots in a 2D plane. After that, we need to specify projection='3d' when we create subplots: from mpl_toolkits.mplot3d import Axes3D import matplotlib.pyplot as plt fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(cat_df.x, cat_df.y, cat_df.z) plt.show() Behold, the mighty sCATter plot in 3D. Cats are currently taking over the internet. According to the New York Times, cats are "the essential building block of the Internet" (https://www.nytimes.com/2014/07/23/upshot/what-the-internet-can-see-from-your-cat-pictures.html). Undoubtedly, they deserve a place in this chapter as well. Contrary to the 2D version of scatter(), we need to provide X, Y, and Z coordinates when we are creating a 3D scatter plot. Yet the parameters that are supported in 2D scatter() can be applied to 3D scatter() as well: fig = plt.figure() ax = fig.add_subplot(111, projection='3d') # Change the size, shape and color of markers ax.scatter(cat_df.x, cat_df.y, cat_df.z, s=4, c="g", marker="o") plt.show() To change the viewing angle and elevation of the 3D plot, we can make use of view_init(). The azim parameter specifies the azimuth angle in the X-Y plane, while elev specifies the elevation angle. When the azimuth angle is 0, the X-Y plane would appear to the north from you. Meanwhile, an azimuth angle of 180 would show you the south side of the X-Y plane: fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(cat_df.x, cat_df.y, cat_df.z,s=4, c="g", marker="o") # elev stores the elevation angle in the z plane azim stores the # azimuth angle in the x,y plane ax.view_init(azim=180, elev=10) plt.show() 3D bar chart We introduced candlestick plots for showing Open-High-Low-Close (OHLC) financial data. In addition, a 3D bar chart can be employed to show OHLC across time. The next figure shows a typical example of plotting a 5-day OHLC bar chart: import matplotlib.pyplot as plt import numpy as np from mpl_toolkits.mplot3d import Axes3D # Get 1 and every fifth row for the 5-day AAPL OHLC data ohlc_5d = stock_df[stock_df["Company"]=="AAPL"].iloc[1::5, :] fig = plt.figure() ax = fig.add_subplot(111, projection='3d') # Create one color-coded bar chart for Open, High, Low and Close prices. for color, col, z in zip(['r', 'g', 'b', 'y'], ["Open", "High", "Low", "Close"], [30, 20, 10, 0]): xs = np.arange(ohlc_5d.shape[0]) ys = ohlc_5d[col] # Assign color to the bars colors = [color] * len(xs) ax.bar(xs, ys, zs=z, zdir='y', color=colors, alpha=0.8, width=5) plt.show() The method for setting ticks and labels is similar to other Matplotlib plotting functions: fig = plt.figure(figsize=(9,7)) ax = fig.add_subplot(111, projection='3d') # Create one color-coded bar chart for Open, High, Low and Close prices. for color, col, z in zip(['r', 'g', 'b', 'y'], ["Open", "High", "Low", "Close"], [30, 20, 10, 0]): xs = np.arange(ohlc_5d.shape[0]) ys = ohlc_5d[col] # Assign color to the bars colors = [color] * len(xs) ax.bar(xs, ys, zs=z, zdir='y', color=colors, alpha=0.8) # Manually assign the ticks and tick labels ax.set_xticks(np.arange(ohlc_5d.shape[0])) ax.set_xticklabels(ohlc_5d["Date"], rotation=20, verticalalignment='baseline', horizontalalignment='right', fontsize='8') ax.set_yticks([30, 20, 10, 0]) ax.set_yticklabels(["Open", "High", "Low", "Close"]) # Set the z-axis label ax.set_zlabel('Price (US $)') # Rotate the viewport ax.view_init(azim=-42, elev=31) plt.tight_layout() plt.show() Caveats to consider while visualizing 3D plots in Matplotlib Due to the lack of a true 3D graphical rendering backend (such as OpenGL) and proper algorithm for detecting 3D objects' intersections, the 3D plotting capabilities of Matplotlib are not great but just adequate for typical applications. In the official Matplotlib FAQ (https://matplotlib.org/mpl_toolkits/mplot3d/faq.html), the author noted that 3D plots may not look right at certain angles. Besides, we also reported that mplot3d would failed to clip bar charts if zlim is set (https://github.com/matplotlib/matplotlib/ issues/8902; see also https://github.com/matplotlib/matplotlib/issues/209). Without improvements in the 3D rendering backend, these issues are hard to fix. To better illustrate the latter issue, let's try to add ax.set_zlim3d(bottom=110, top=150) right above plt.tight_layout() in the previous 3D bar chart: Clearly, something is going wrong, as the bars overshoot the lower boundary of the axes. We will try to address the latter issue through the following workaround: # FuncFormatter to add 110 to the tick labels def major_formatter(x, pos): return "{}".format(x+110) fig = plt.figure(figsize=(9,7)) ax = fig.add_subplot(111, projection='3d') # Create one color-coded bar chart for Open, High, Low and Close prices. for color, col, z in zip(['r', 'g', 'b', 'y'], ["Open", "High", "Low", "Close"], [30, 20, 10, 0]): xs = np.arange(ohlc_5d.shape[0]) ys = ohlc_5d[col] # Assign color to the bars colors = [color] * len(xs) # Truncate the y-values by 110 ax.bar(xs, ys-110, zs=z, zdir='y', color=colors, alpha=0.8) # Manually assign the ticks and tick labels ax.set_xticks(np.arange(ohlc_5d.shape[0])) ax.set_xticklabels(ohlc_5d["Date"], rotation=20, verticalalignment='baseline', horizontalalignment='right', fontsize='8') # Set the z-axis label ax.set_yticks([30, 20, 10, 0]) ax.set_yticklabels(["Open", "High", "Low", "Close"]) ax.zaxis.set_major_formatter(FuncFormatter(major_formatter)) ax.set_zlabel('Price (US $)') # Rotate the viewport ax.view_init(azim=-42, elev=31) plt.tight_layout() plt.show() Basically, we truncated the y values by 110, and then we used a tick formatter (major_formatter) to shift the tick value back to the original. For 3D scatter plots, we can simply remove the data points that exceed the boundary of set_zlim3d() in order to generate a proper figure. However, these workarounds may not work for every 3D plot type. Conclusion We didn't go into too much detail of the 3D plotting capability of Matplotlib, as it is yet to be polished. For simple 3D plots, Matplotlib already suffices. The learning curve can be reduced if we use the same package for both 2D and 3D plots. You are advised to take a look at MayaVi2, Plotly, and VisPy if you require more powerful 3D plotting functions. If you enjoyed this excerpt, be sure to check out the book it is from.

0
0
59052

Packt

25 May 2015

15 min read

Cleaning Data in PDF Files

Packt

25 May 2015

15 min read

In this article by Megan Squire, author of the book Clean Data, we will experiment with several data decanters to extract all the good stuff hidden inside inscrutable PDF files. We will explore the following topics: What PDF files are for and why it is difficult to extract data from them How to copy and paste from PDF files, and what to do when this does not work How to shrink a PDF file by saving only the pages that we need How to extract text and numbers from a PDF file using the tools inside a Python package called pdfMiner How to extract tabular data from within a PDF file using a browser-based Java application called Tabula How to use the full, paid version of Adobe Acrobat to extract a table of data (For more resources related to this topic, see here.) Why is cleaning PDF files difficult? Files saved in Portable Document Format (PDF) are a little more complicated than some of the text files. PDF is a binary format that was invented by Adobe Systems, which later evolved into an open standard so that multiple applications could create PDF versions of their documents. The purpose of a PDF file is to provide a way of viewing the text and graphics in a document independent of the software that did the original layout. In the early 1990s, the heyday of desktop publishing, each graphic design software package had a different proprietary format for its files, and the packages were quite expensive. In those days, in order to view a document created in Word, Pagemaker, or Quark, you would have to open the document using the same software that had created it. This was especially problematic in the early days of the Web, since there were not many available techniques in HTML to create sophisticated layouts, but people still wanted to share files with each other. PDF was meant to be a vendor-neutral layout format. Adobe made its Acrobat Reader software free for anyone to download, and subsequently the PDF format became widely used. Here is a fun fact about the early days of Acrobat Reader. The words click here when entered into Google search engine still bring up Adobe's Acrobat PDF Reader download website as the first result, and have done so for years. This is because so many websites distribute PDF files along with a message saying something like, "To view this file you must have Acrobat Reader installed. Click here to download it." Since Google's search algorithm uses the link text to learn what sites go with what keywords, the keyword click here is now associated with Adobe Acrobat's download site. PDF is still used to make vendor- and application-neutral versions of files that have layouts that are more complicated than what could be achieved with plain text. For example, viewing the same document in the various versions of Microsoft Word still sometimes causes documents with lots of embedded tables, styles, images, forms, and fonts to look different from one another. This can be due to a number of factors, such as differences in operating systems or versions of the installed Word software itself. Even with applications that are intended to be compatible between software packages or versions, subtle differences can result in incompatibilities. PDF was created to solve some of this. Right away we can tell that PDF is going to be more difficult to deal with than a text file, because it is a binary format, and because it has embedded fonts, images, and so on. So most of the tools in our trusty data cleaning toolbox, such as text editors and command-line tools (less) are largely useless with PDF files. Fortunately there are still a few tricks we can use to get the data out of a PDF file. Try simple solutions first – copying Suppose that on your way to decant your bottle of fine red wine, you spill the bottle on the floor. Your first thought might be that this is a complete disaster and you will have to replace the whole carpet. But before you start ripping out the entire floor, it is probably worth trying to clean the mess with an old bartender's trick: club soda and a damp cloth. In this section, we outline a few things to try first, before getting involved in an expensive file renovation project. They might not work, but they are worth a try. Our experimental file Let's practice cleaning PDF data by using a real PDF file. We also do not want this experiment to be too easy, so let's choose a very complicated file. Suppose we are interested in pulling the data out of a file we found on the Pew Research Center's website called "Is College Worth It?". Published in 2011, this PDF file is 159 pages long and contains numerous data tables showing various ways of measuring if attaining a college education in the United States is worth the investment. We would like to find a way to quickly extract the data within these numerous tables so that we can run some additional statistics on it. For example, here is what one of the tables in the report looks like: This table is fairly complicated. It only has six columns and eight rows, but several of the rows take up two lines, and the header row text is only shown on five of the columns. The complete report can be found at the PewResearch website at http://www.pewsocialtrends.org/2011/05/15/is-college-worth-it/, and the particular file we are using is labeled Complete Report: http://www.pewsocialtrends.org/files/2011/05/higher-ed-report.pdf. Step one – try copying out the data we want The data we will experiment on in this example is found on page 149 of the PDF file (labeled page 143 in their document). If we open the file in a PDF viewer, such as Preview on Mac OSX, and attempt to select just the data in the table, we already see that some strange things are happening. For example, even though we did not mean to select the page number (143); it got selected anyway. This does not bode well for our experiment, but let's continue. Copy the data out by using Command-C or select Edit | Copy. How text looks when selected in this PDF from within Preview Step two – try pasting the copied data into a text editor The following screenshot shows how the copied text looks when it is pasted into Text Wrangler, our text editor: Clearly, this data is not in any sensible order after copying and pasting it. The page number is included, the numbers are horizontal instead of vertical, and the column headers are out of order. Even some of the numbers have been combined; for example, the final row contains the numbers 4,4,3,2; but in the pasted version, this becomes a single number 4432. It would probably take longer to clean up this data manually at this point than it would have taken just to retype the original table. We can conclude that with this particular PDF file, we are going to have to take stronger measures to clean it. Step three – make a smaller version of the file Our copying and pasting procedures have not worked, so we have resigned ourselves to the fact that we are going to need to prepare for more invasive measures. Perhaps if we are not interested in extracting data from all 159 pages of this PDF file, we can identify just the area of the PDF that we want to operate on, and save that section to a separate file. To do this in Preview on MacOSX, launch the File | Print… dialog box. In the Pages area, we will enter the range of pages we actually want to copy. For the purpose of this experiment, we are only interested in page 149; so enter 149 in both the From: and to: boxes as shown in the following screenshot. Then from the PDF dropdown box at the bottom, select Open PDF in Preview. You will see your single-page PDF in a new window. From here, we can save this as a new file and give it a new name, such as report149.pdf or the like. Another technique to try – pdfMiner Now that we have a smaller file to experiment with, let's try some programmatic solutions to extract the text and see if we fare any better. pdfMiner is a Python package with two embedded tools to operate on PDF files. We are particularly interested in experimenting with one of these tools, a command-line program called pdf2txt that is designed to extract text from within a PDF document. Maybe this will be able to help us get those tables of numbers out of the file correctly. Step one – install pdfMiner Launch the Canopy Python environment. From the Canopy Terminal Window, run the following command: pip install pdfminer This will install the entire pdfMiner package and all its associated command-line tools. The documentation for pdfMiner and the two tools that come with it, pdf2txt and dumpPDF, is located at http://www.unixuser.org/~euske/python/pdfminer/. Step two – pull text from the PDF file We can extract all text from a PDF file using the command-line tool called pdf2txt.py. To do this, use the Canopy Terminal and navigate to the directory where the file is located. The basic format of the command is pdf2txt.py <filename>. If you have a larger file that has multiple pages (or you did not already break the PDF into smaller ones), you can also run pdf2txt.py –p149 <filename> to specify that you only want page 149. Just as with the preceding copy-and-paste experiment, we will try this technique not only on the tables located on page 149, but also on the Preface on page 3. To extract just the text from page 3, we run the following command: pdf2txt.py –p3 pewReport.pdf After running this command, the extracted preface of the Pew Research report appears in our command-line window: To save this text to a file called pewPreface.txt, we can simply add a redirect to our command line as follows: pdf2txt.py –p3 pewReport.pdf > pewPreface.txt But what about those troublesome data tables located on page 149? What happens when we use pdf2txt on those? We can run the following command: pdf2txt.py pewReport149.pdf The results are slightly better than copy and paste, but not by much. The actual data output section is shown in the following screenshot. The column headers and data are mixed together, and the data from different columns are shown out of order. We will have to declare the tabular data extraction portion of this experiment a failure, though pdfMiner worked reasonably well on line-by-line text-only extraction. Remember that your success with each of these tools may vary. Much of it depends on the particular characteristics of the original PDF file. It looks like we chose a very tricky PDF for this example, but let's not get disheartened. Instead, we will move on to another tool and see how we fare with it. Third choice – Tabula Tabula is a Java-based program to extract data within tables in PDF files. We will download the Tabula software and put it to work on the tricky tables in our page 149 file. Step one – download Tabula Tabula is available to be downloaded from its website at http://tabula.technology/. The site includes some simple download instructions. On Mac OSX version 10.10.1, I had to download the legacy Java 6 application before I was able to run Tabula. The process was straightforward and required only following the on-screen instructions. Step two – run Tabula Launch Tabula from inside the downloaded .zip archive. On the Mac, the Tabula application file is called simply Tabula.app. You can copy this to your Applications folder if you like. When Tabula starts, it launches a tab or window within your default web browser at the address http://127.0.0.1:8080/. The initial action portion of the screen looks like this: The warning that auto-detecting tables takes a long time is true. For the single-page perResearch149.pdf file, with three tables in it, table auto-detection took two full minutes and resulted in an error message about an incorrectly formatted PDF file. Step three – direct Tabula to extract the data Once Tabula reads in the file, it is time to direct it where the tables are. Using your mouse cursor, select the table you are interested in. I drew a box around the entire first table. Tabula took about 30 seconds to read in the table, and the results are shown as follows: Compared to the way the data was read with copy and paste and pdf2txt, this data looks great. But if you are not happy with the way Tabula reads in the table, you can repeat this process by clearing your selection and redrawing the rectangle. Step four – copy the data out We can use the Download Data button within Tabula to save the data to a friendlier file format, such as CSV or TSV. Step five – more cleaning Open the CSV file in Excel or a text editor and take a look at it. At this stage, we have had a lot of failures in getting this PDF data extracted, so it is very tempting to just quit now. Here are some simple data cleaning tasks: We can combine all the two-line text cells into a single cell. For example, in column B, many of the phrases take up more than one row. Prepare students to be productive and members of the workforce should be in one cell as a single phrase. The same is true for the headers in Rows 1 and 2 (4-year and Private should be in a single cell). To clean this in Excel, create a new column between columns B and C. Use the concatenate() function to join B3:B4, B5:B6, and so on. Use Paste-Special to add the new concatenated values into a new column. Then remove the two columns you no longer need. Do the same for rows 1 and 2. Remove blank lines between rows. When these procedures are finished, the data looks like this: Tabula might seem like a lot of work compared to cutting and pasting data or running a simple command-line tool. That is true, unless your PDF file turns out to be finicky like this one was. Remember that specialty tools are there for a reason—but do not use them unless you really need them. Start with a simple solution first and only proceed to a more difficult tool when you really need it. When all else fails – fourth technique Adobe Systems sells a paid, commercial version of their Acrobat software that has some additional features above and beyond just allowing you to read PDF files. With the full version of Acrobat, you can create complex PDF files and manipulate existing files in various ways. One of the features that is relevant here is the Export Selection As… option found within Acrobat. To get started using this feature, launch Acrobat and use the File Open dialog to open the PDF file. Within the file, navigate to the table holding the data you want to export. The following screenshot shows how to select the data from the page 149 PDF we have been operating on. Use your mouse to select the data, then right-click and choose Export Selection As… At this point, Acrobat will ask you how you want the data exported. CSV is one of the choices. Excel Workbook (.xlsx) would also be a fine choice if you are sure you will not want to also edit the file in a text editor. Since I know that Excel can also open CSV files, I decided to save my file in that format so I would have the most flexibility between editing in Excel and my text editor. After choosing the format for the file, we will be prompted for a filename and location for where to save the file. When we launch the resulting file, either in a text editor or in Excel, we can see that it looks a lot like the Tabula version we saw in the previous section. Here is how our CSV file will look when opened in Excel: At this point, we can use the exact same cleaning routine we used with the Tabula data, where we concatenated the B2:B3 cells into a single cell and then removed the empty rows. Summary The goal of this article was to learn how to export data out of a PDF file. Like sediment in a fine wine, the data in PDF files can appear at first to be very difficult to separate. Unlike decanting wine, however, which is a very passive process, separating PDF data took a lot of trial and error. We learned four ways of working with PDF files to clean data: copying and pasting, pdfMiner, Tabula, and Acrobat export. Each of these tools has certain strengths and weaknesses: Copying and pasting costs nothing and takes very little work, but is not as effective with complicated tables. pdfMiner/Pdf2txt is also free, and as a command-line tool, it could be automated. It also works on large amounts of data. But like copying and pasting, it is easily confused by certain types of tables. Tabula takes some work to set up, and since it is a product undergoing development, it does occasionally give strange warnings. It is also a little slower than the other options. However, its output is very clean, even with complicated tables. Acrobat gives similar output to Tabula, but with almost no setup and very little effort. It is a paid product. By the end, we had a clean dataset that was ready for analysis or long-term storage. Resources for Article: Further resources on this subject: Machine Learning Using Spark MLlib [article] Data visualization [article] First steps with R [article]

0
1
59000

article-image-generative-adversarial-networks-using-keras

Amey Varangaonkar

21 Aug 2018

12 min read

Generative Adversarial Networks: Generate images using Keras GAN [Tutorial]

Amey Varangaonkar

21 Aug 2018

12 min read

You might have worked with the popular MNIST dataset before - but in this article, we will be generating new MNIST-like images with a Keras GAN. It can take a very long time to train a GAN; however, this problem is small enough to run on most laptops in a few hours, which makes it a great example. The following excerpt is taken from the book Deep Learning Quick Reference, authored by Mike Bernico. The network architecture that we will be using here has been found by, and optimized by, many folks, including the authors of the DCGAN paper and people like Erik Linder-Norén, who's excellent collection of GAN implementations called Keras GAN served as the basis of the code we used here. Loading the MNIST dataset The MNIST dataset consists of 60,000 hand-drawn numbers, 0 to 9. Keras provides us with a built-in loader that splits it into 50,000 training images and 10,000 test images. We will use the following code to load the dataset: from keras.datasets import mnist def load_data(): (X_train, _), (_, _) = mnist.load_data() X_train = (X_train.astype(np.float32) - 127.5) / 127.5 X_train = np.expand_dims(X_train, axis=3) return X_train As you probably noticed, We're not returning any of the labels or the testing dataset. We're only going to use the training dataset. The labels aren't needed because the only labels we will be using are 0 for fake and 1 for real. These are real images, so they will all be assigned a label of 1 at the discriminator. Building the generator The generator uses a few new layers that we will talk about in this section. First, take a chance to skim through the following code: def build_generator(noise_shape=(100,)): input = Input(noise_shape) x = Dense(128 * 7 * 7, activation="relu")(input) x = Reshape((7, 7, 128))(x) x = BatchNormalization(momentum=0.8)(x) x = UpSampling2D()(x) x = Conv2D(128, kernel_size=3, padding="same")(x) x = Activation("relu")(x) x = BatchNormalization(momentum=0.8)(x) x = UpSampling2D()(x) x = Conv2D(64, kernel_size=3, padding="same")(x) x = Activation("relu")(x) x = BatchNormalization(momentum=0.8)(x) x = Conv2D(1, kernel_size=3, padding="same")(x) out = Activation("tanh")(x) model = Model(input, out) print("-- Generator -- ") model.summary() return model We have not previously used the UpSampling2D layer. This layer will take increases in the rows and columns of the input tensor, leaving the channels unchanged. It does this by repeating the values in the input tensor. By default, it will double the input. If we give an UpSampling2D layer a 7 x 7 x 128 input, it will give us a 14 x 14 x 128 output. Typically when we build a CNN, we start with an image that is very tall and wide and uses convolutional layers to get a tensor that's very deep but less tall and wide. Here we will do the opposite. We'll use a dense layer and a reshape to start with a 7 x 7 x 128 tensor and then, after doubling it twice, we'll be left with a 28 x 28 tensor. Since we need a grayscale image, we can use a convolutional layer with a single unit to get a 28 x 28 x 1 output. This sort of generator arithmetic is a little off-putting and can seem awkward at first but after a few painful hours, you will get the hang of it! Building the discriminator The discriminator is really, for the most part, the same as any other CNN. Of course, there are a few new things that we should talk about. We will use the following code to build the discriminator: def build_discriminator(img_shape): input = Input(img_shape) x =Conv2D(32, kernel_size=3, strides=2, padding="same")(input) x = LeakyReLU(alpha=0.2)(x) x = Dropout(0.25)(x) x = Conv2D(64, kernel_size=3, strides=2, padding="same")(x) x = ZeroPadding2D(padding=((0, 1), (0, 1)))(x) x = (LeakyReLU(alpha=0.2))(x) x = Dropout(0.25)(x) x = BatchNormalization(momentum=0.8)(x) x = Conv2D(128, kernel_size=3, strides=2, padding="same")(x) x = LeakyReLU(alpha=0.2)(x) x = Dropout(0.25)(x) x = BatchNormalization(momentum=0.8)(x) x = Conv2D(256, kernel_size=3, strides=1, padding="same")(x) x = LeakyReLU(alpha=0.2)(x) x = Dropout(0.25)(x) x = Flatten()(x) out = Dense(1, activation='sigmoid')(x) model = Model(input, out) print("-- Discriminator -- ") model.summary() return model First, you might notice the oddly shaped zeroPadding2D() layer. After the second convolution, our tensor has gone from 28 x 28 x 3 to 7 x 7 x 64. This layer just gets us back into an even number, adding zeros on one side of both the rows and columns so that our tensor is now 8 x 8 x 64. More unusual is the use of both batch normalization and dropout. Typically, these two layers are not used together; however, in the case of GANs, they do seem to benefit the network. Building the stacked model Now that we've assembled both the generator and the discriminator, we need to assemble a third model that is the stack of both models together that we can use for training the generator given the discriminator loss. To do that we can just create a new model, this time using the previous models as layers in the new model, as shown in the following code: discriminator = build_discriminator(img_shape=(28, 28, 1)) generator = build_generator() z = Input(shape=(100,)) img = generator(z) discriminator.trainable = False real = discriminator(img) combined = Model(z, real) Notice that we're setting the discriminator's training attribute to False before building the model. This means that for this model we will not be updating the weights of the discriminator during backpropagation. We will freeze these weights and only move the generator weights with the stack. The discriminator will be trained separately. Now that all the models are built, they need to be compiled, as shown in the following code: gen_optimizer = Adam(lr=0.0002, beta_1=0.5) disc_optimizer = Adam(lr=0.0002, beta_1=0.5) discriminator.compile(loss='binary_crossentropy', optimizer=disc_optimizer, metrics=['accuracy']) generator.compile(loss='binary_crossentropy', optimizer=gen_optimizer) combined.compile(loss='binary_crossentropy', optimizer=gen_optimizer) If you'll notice, we're creating two custom Adam optimizers. This is because many times we will want to change the learning rate for only the discriminator or generator, slowing one or the other down so that we end up with a stable GAN where neither is overpowering the other. You'll also notice that we're using beta_1 = 0.5. This is a recommendation from the original DCGAN paper that we've carried forward and also had success with. A learning rate of 0.0002 is a good place to start as well, and was found in the original DCGAN paper. The training loop We have previously had the luxury of calling .fit() on our model and letting Keras handle the painful process of breaking the data apart into mini batches and training for us. Unfortunately, because we need to perform the separate updates for the discriminator and the stacked model together for a single batch we're going to have to do things the old-fashioned way, with a few loops. This is how things used to be done all the time, so while it's perhaps a little more work, it does admittedly leave me feeling nostalgic. The following code illustrates the training technique: num_examples = X_train.shape[0] num_batches = int(num_examples / float(batch_size)) half_batch = int(batch_size / 2) for epoch in range(epochs + 1): for batch in range(num_batches): # noise images for the batch noise = np.random.normal(0, 1, (half_batch, 100)) fake_images = generator.predict(noise) fake_labels = np.zeros((half_batch, 1)) # real images for batch idx = np.random.randint(0, X_train.shape[0], half_batch) real_images = X_train[idx] real_labels = np.ones((half_batch, 1)) # Train the discriminator (real classified as ones and generated as zeros) d_loss_real = discriminator.train_on_batch(real_images, real_labels) d_loss_fake = discriminator.train_on_batch(fake_images, fake_labels) d_loss = 0.5 * np.add(d_loss_real, d_loss_fake) noise = np.random.normal(0, 1, (batch_size, 100)) # Train the generator g_loss = combined.train_on_batch(noise, np.ones((batch_size, 1))) # Plot the progress print("Epoch %d Batch %d/%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" % (epoch,batch, num_batches, d_loss[0], 100 * d_loss[1], g_loss)) if batch % 50 == 0: save_imgs(generator, epoch, batch) There is a lot going on here, to be sure. As before, let's break it down block by block. First, let's see the code to generate noise vectors: noise = np.random.normal(0, 1, (half_batch, 100)) fake_images = generator.predict(noise) fake_labels = np.zeros((half_batch, 1)) This code is generating a matrix of noise vectors called z) and sending it to the generator. It's getting a set of generated images back, which we're calling fake images. We will use these to train the discriminator, so the labels we want to use are 0s, indicating that these are in fact generated images. Note that the shape here is half_batch x 28 x 28 x 1. The half_batch is exactly what you think it is. We're creating half a batch of generated images because the other half of the batch will be real data, which we will assemble next. To get our real images, we will generate a random set of indices across X_train and use that slice of X_train as our real images, as shown in the following code: idx = np.random.randint(0, X_train.shape[0], half_batch) real_images = X_train[idx] real_labels = np.ones((half_batch, 1)) Yes, we are sampling with replacement in this case. It does work out but it's probably not the best way to implement minibatch training. It is, however, probably the easiest and most common. Since we are using these images to train the discriminator, and because they are real images, we will assign them 1s as labels, rather than 0s. Now that we have our discriminator training set assembled, we will update the discriminator. Also, note that we aren't using the soft labels. That's because we want to keep things as easy as they can be to understand. Luckily the network doesn't require them in this case. We will use the following code to train the discriminator: # Train the discriminator (real classified as ones and generated as zeros) d_loss_real = discriminator.train_on_batch(real_images, real_labels) d_loss_fake = discriminator.train_on_batch(fake_images, fake_labels) d_loss = 0.5 * np.add(d_loss_real, d_loss_fake) Notice that here we're using the discriminator's train_on_batch() method. The train_on_batch() method does exactly one round of forward and backward propagation. Every time we call it, it updates the model once from the model's previous state. Also, notice that we're making the update for the real images and fake images separately. This is advice that is given on the GAN hack Git we had previously referenced in the Generator architecture section. Especially in the early stages of training, when real images and fake images are from radically different distributions, batch normalization will cause problems with training if we were to put both sets of data in the same update. Now that the discriminator has been updated, it's time to update the generator. This is done indirectly by updating the combined stack, as shown in the following code: noise = np.random.normal(0, 1, (batch_size, 100)) g_loss = combined.train_on_batch(noise, np.ones((batch_size, 1))) To update the combined model, we create a new noise matrix, and this time it will be as large as the entire batch. We will use that as an input to the stack, which will cause the generator to generate an image and the discriminator to evaluate that image. Finally, we will use the label of 1 because we want to backpropagate the error between a real image and the generated image. Lastly, the training loop reports the discriminator and generator loss at the epoch/batch and then, every 50 batches, of every epoch we will use save_imgs to generate example images and save them to disk, as shown in the following code: print("Epoch %d Batch %d/%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" % (epoch,batch, num_batches, d_loss[0], 100 * d_loss[1], g_loss)) if batch % 50 == 0: save_imgs(generator, epoch, batch) The save_imgs function uses the generator to create images as we go, so we can see the fruits of our labor. We will use the following code to define save_imgs: def save_imgs(generator, epoch, batch): r, c = 5, 5 noise = np.random.normal(0, 1, (r * c, 100)) gen_imgs = generator.predict(noise) gen_imgs = 0.5 * gen_imgs + 0.5 fig, axs = plt.subplots(r, c) cnt = 0 for i in range(r): for j in range(c): axs[i, j].imshow(gen_imgs[cnt, :, :, 0], cmap='gray') axs[i, j].axis('off') cnt += 1 fig.savefig("images/mnist_%d_%d.png" % (epoch, batch)) plt.close() It uses only the generator by creating a noise matrix and retrieving an image matrix in return. Then, using matplotlib.pyplot, it saves those images to disk in a 5 x 5 grid. Performing model evaluation Good is somewhat subjective when you're building a deep neural network to create images. Let's take a look at a few examples of the training process, so you can see for yourself how the GAN begins to learn to generate MNIST. Here's the network at the very first batch of the very first epoch. Clearly, the generator doesn't really know anything about generating MNIST at this point; it's just noise, as shown in the following image: But just 50 batches in, something is happening, as you can see from the following image: And after 200 batches of epoch 0 we can almost see numbers, as you can see from the following image: And here's our generator after one full epoch. These generated numbers look pretty good, and we can see how the discriminator might be fooled by them. At this point, we could probably continue to improve a little bit, but it looks like our GAN has worked as the computer is generating some pretty convincing MNIST digits, as shown in the following image: Thus, we see the power of GANs in action when it comes to image generation using the Keras library. If you found the above article to be useful, make sure you check out our book Deep Learning Quick Reference, for more such interesting coverage of popular deep learning concepts and their practical implementation. Keras 2.2.0 releases! 2 ways to customize your deep learning models with Keras How to build Deep convolutional GAN using TensorFlow and Keras

0
3
58913

How-To Tutorials

article-image-managing-nano-server-windows-powershell-and-windows-powershell-dsc

Packt

05 Jul 2017

8 min read

Managing Nano Server with Windows PowerShell and Windows PowerShell DSC

Packt

05 Jul 2017

8 min read

In this article by Charbel Nemnom, the author of the book Getting Started with Windows Nano Server, we will cover the following topics: Remote server graphical tools Server manager Hyper-V manager Microsoft management console Managing Nano Server with PowerShell (For more resources related to this topic, see here.) Remote server graphical tools Without the Graphical User Interface (GUI), it’s not easy to carry out the daily management and maintenance of Windows Server. For this reason, Microsoft integrated Nano Server with all the existing graphical tools that you are familiar with such as Hyper-V manager, failover cluster manager, server manager, registry editor, File explorer, disk and device manager, server configuration, computer management, users and groups console, and so on. All those tools and consoles are compatible to manage Nano Server remotely. The GUI is always the easiest way to use. In this section, we will discuss how to access and set the most common configurations in Nano Server with remote graphical tools. Server manager Before we start managing Nano Server, we need to obtain the IP address or the computer name of the Nano Server to connect to and remotely manage a Nano instance either physical or virtual machine. Login to your management machine and make sure you have installed the latest Remote Server Administration Tools (RSAT) for Windows Server 2016 or Windows 10. You can download the latest RSAT tools from the following link: https://www.microsoft.com/en-us/download/details.aspx?id=45520 Launch server manager as shown in Figure 1, and add your Nano Server(s) that you would like to manage: Figure 1: Managing Nano Server using server manager You can refresh the view and browse all events and services as you expect to see. I want to point out that Best Practices Analyzer (BPA) is not supported in Nano Server. BPA is completely cmdlets-based and written in C# back during the days of PowerShell 2.0. It is also statically using some .NET XML library code that was not part of .NET framework at that time. So, do not expect to see Best Practices Analyzer in server manager. Hyper-V manager The next console that you probably want to access is Hyper-V Manager, right click on Nano Server name in server manager and select Hyper-V Manager console as shown in Figure 2: Figure 2: Managing Nano Server using Hyper-V manager Hyper-V Manager will launch with full support as you expect when managing full Windows Server 2016 Hyper-V, free Hyper-V server, server core and Nano Server with Hyper-V role. Microsoft management console You can use the Microsoft Management Console (MMC) to manage Nano Server as well. From the command line type mmc.exe. From the File menu, Click Add/Remove Snap-in…and then select Computer Management and click Add. Choose Another computer and add the IP address or the computer name of your Nano Server machine. Click Ok. As shown in Figure 3, you can expand System Tools and check the tools that you are familiar with like (Event Viewer, Local Users and Groups, Shares,and Services). Please note that some of these MMC tools such as Task Scheduler and Disk Management cannot be used against Nano Server. Also, for certain tools you need to open some ports in Windows firewall: Figure 3: Managing Nano Server using Microsoft Management Console Managing Nano Server with PowerShell For most IT administrators, the graphical user interface is the easiest way to use. But on the other hand, PowerShell can bring a fast and an automated process. That's why in Windows Server 2016, the Nano Server deployment option of Windows Server comes with full PowerShell remoting support. The purpose of the core PowerShell engine, is to manage Nano Server instances at scale. PowerShell remoting including DSC, Windows Server cmdlets (network, storage, Hyper-V, and so on), Remote file transfer, Remote script authoring and debugging, and PowerShell Web access. Some of the new features in Windows PowerShell version 5.1 on Nano Server supports the following: Copying files via PowerShell sessions Remote file editing in PowerShell ISE Interactive script debugging over PowerShell session Remote script debugging within PowerShell ISE Remote host process connects and debug PowerShell version 5.1 is available in different editions which denote varying feature sets and platform compatibility. Desktop Edition targeting Full Server, Server Core and Windows Desktop, Core Edition targeting Nano Server and Windows IoT. You can find a list of Windows PowerShell features not available yet in Nano Server here. As Nano Server is still evolving, we will see what the next cadence update will bring for unavailable PowerShell features. If you want to manage your Nano Server, you can use PowerShell Remoting or if your Nano Server instance is running in a virtual machine you can also use PowerShell Direct, more on that at the end of this section. In order to manage a Nano server installation using PowerShell remoting carry out the following steps: You may need to start the WinRM service on your management machine to enable remote connections. From the PowerShell console type the following command: net start WinRM If you want to manage Nano Server in a workgroup environment, open PowerShell console, and type the following command, substituting server name or IP with the right value using your machine-name is the easiest to use, but if your device is not uniquely named on your network, you can use the IP address instead: Set-Item WSMan:localhostClientTrustedHosts -Value "servername or IP" If you want to connect multiple devices, you can use comma and quotation marks to separate each device. Set-Item WSMan:localhostClientTrustedHosts -Value "servername or IP, servername or IP" You can also set it to allow to connect to a specific network subnet using the following command: Set-Item WSMan:localhostClientTrustedHosts -Value 10.10.100.* To test Windows PowerShell remoting against Nano Server and check if it’s working, you can use the following command: Test-WSMan -ComputerName"servername or IP" -Credential servernameAdministrator -Authentication Negotiate You can now start an interactive session with Nano Server. Open an elevated PowerShell console and type the following command: Enter-PSSession -ComputerName "servername or IP" -Credential servernameAdministrator In the following example, we will create two virtual machines on Nano Server Hyper-V host using PowerShell remoting. From your management machine, open an elevated PowerShell console or PowerShell scripting environment ,and run the following script (make sure to update the variables to match your environment): #region Variables $NanoSRV='NANOSRV-HV01' $Cred=Get-Credential"DemoSuperNano" $Session=New-PSSession-ComputerName$NanoSRV-Credential$Cred $CimSesion=New-CimSession-ComputerName$NanoSRV-Credential$Cred $VMTemplatePath='C:Temp' $vSwitch='Ext_vSwitch' $VMName='DemoVM-0' #endregion # Copying VM Template from the management machine to Nano Server Get-ChildItem-Path$VMTemplatePath-filter*.VHDX-recurse|Copy-Item-ToSession$Session-DestinationD: 1..2|ForEach-Object { New-VM-CimSession$CimSesion-Name$VMName$_-VHDPath"D:$VMName$_.vhdx"-MemoryStartupBytes1024GB` -SwitchName$vSwitch-Generation2 Start-VM-CimSession$CimSesion-VMName$VMName$_-Passthru } In this script, we are creating a PowerShell session and CIM session to Nano Server. A CIM session is a client-side object representing a connection to a local computer or a remote computer. Then we are copying VM Templates from the management machine to Nano Server over PowerShell remoting, when the copy is completed, we are creating two virtual machines as Generation 2 and finally starting them. After a couple of seconds, you can launch Hyper-V Manager console and see the new VMs running on Nano Server host as shown in Figure 4: Figure 4: Creating virtual machines on Nano Server host using PowerShell remoting If you have installed Nano Server in a virtual machine running on a Hyper-V host, you can use PowerShell direct to connect directly from your Hyper-V host to your Nano Server VM without any network connection by using the following command: Enter-PSSession -VMName <VMName> -Credential.Administrator So instead of specifying the computer name, we specified the VM Name, PowerShell Direct is so powerful, it’s one of my favorite feature, you can configure a bunch of VMs from scratch in just couple of seconds without any network connection. Moreover, if you have Nano Server running as a Hyper-V host as shown in the example earlier, you could use PowerShell remoting first to connect to Nano Server from your management machine, and then leverage PowerShell Direct to manage your virtual machines running on top of Nano Server. In this example, we used two PowerShell technologies (PS remoting and PS Direct).This is so powerful and open many possibilities to effectively manage Nano Server. To do that, you can use the following command: #region Variables $NanoSRV='NANOSRV-HV01'#Nano Server name or IP address $DomainCred=Get-Credential"DemoSuperNano" $VMLocalCred=Get-Credential"~Administrator" $Session=New-PSSession-ComputerName$NanoSRV-Credential$DomainCred #endregion Invoke-Command-Session$Session-ScriptBlock { Get-VM Invoke-Command-VMName (Get-VM).Name-Credential$Using;VMLocalCred-ScriptBlock { hostname Tzutil/g } } In this script, we have created a PowerShell session into Nano Server physical host, and then we used PowerShell Direct to list all VMs, including their hostnames and time zone. The result is shown in Figure 5: Figure 5. Nested PowerShell remoting Summary In this article, we discussed how to manage a Nano Server installation using remote server graphic tools, and Windows PowerShell remoting. Resources for Article: Further resources on this subject: Exploring Windows PowerShell 5.0 [article] Exchange Server 2010 Windows PowerShell: Mailboxes and Reports [article] Exchange Server 2010 Windows PowerShell: Managing Mailboxes [article]

0
0
58407

article-image-everything-you-need-to-know-about-pinecone-a-vector-database

Avinash Navlani

08 Jun 2023

5 min read

Everything you need to know about Pinecone – A Vector Database

Avinash Navlani

08 Jun 2023

5 min read

In this 21st century of information, we need efficient reliable storage and faster information retrieval. Relational or older databases are the most crucial databases for any computer application, but they are unable to handle the data in different forms such as documents, key-value pairs, and graphs. Vector database is a novel approach that uses vectorization for efficient search, storage, and data analysis. Image 1: Traditional Vs Vector Database Pinecone is one such vector database that is widely accepted across the industry for addressing challenges such as complexity and dimensionality. Pinecone is a cloud-native vector database that handles high-dimensional vector data. The core underlying approach for Pinecone is based on the Approximate Nearest Neighbor (ANN) search that efficiently locates faster matches and ranks them within a large dataset. In this tutorial, our focus will be on the pinecone database, its features, challenges, and use cases. Working Mechanism Traditional databases search for exact query matches while vector databases search for the most similar vector to the input query. It uses ANN (Approximate Nearest Neighbour) search. It provides approximate results at high performance, accuracy, and speed. Let's see the vector database working mechanism. Image 2: Vector Database Query Mechanism Vector databases first convert data into vectors and create indexing for faster searching. Vector database compares the indexed vector query and indexed vector in the database using the nearest neighbor or similarity matrix and computes the nearest most similar results. Finally, it post-processes the most similar results given by the nearest neighbor. Features Pinecone is a cloud-based vector database that offers various features and benefits to the infrastructure community: Fast and fresh vector search: Pinecone provides ultra-low query latency, even with billions of items. This means that users will always get a great experience, even when searching large datasets. Additionally, Pinecone indexes are updated in real-time, so users always have access to the most up-to-date information. Filtered vector search: Pinecone allows you to combine vector search with metadata filters to get more relevant and faster results. For example, you could filter by product category, price, or customer rating. Real-time updates: Pinecone supports real-time data updates, allowing for dynamic changes to the data. This contrasts with standalone vector indexes, which may require a full re-indexing process to incorporate new data. It has reliability, massive scalability, and security capability. Backups and collections: Pinecone handle the routine operation of backing up all the data stored in the database. You can also selectively choose specific indexes that can be backed up in the form of “collections,” which store the data in that index for later use. User-friendly API: Pinecone provides a user-friendly API layer that simplifies the development of high-performance vector search applications. This API layer is also language-agnostic, so you can use it with any programming language. Programming language integration: It supports a wide range of programming languages for integration. Cost-effectiveness: It is cost-effective because it offers cloud-native architecture. It offers pay-per-use based pricing. Challenges Pinecone vector database offers high-performance data search at a higher scale, but it also faces a few challenges such as: Application integration with other applications will evolve over a period. Data privacy is the biggest concern for any database. Organizations need to implement proper authentication and authorization mechanisms. Vector-based models don’t explain the model's interpretability. So, it is challenging to interpret the underlying reason behind those relationships. Use cases Pinecone has a variety of real-life industry applications. Let’s discuss a few applications: Audio/Textual Search: Pinecone offers faster, fully deployment-ready search and similarity functionality for high-dimensional text and audio data. Natural language Processing: Pinecone utilizes AutoGPT to create context-aware solutions for document classification, semantic search, text summarization, sentiment analysis, and question-answering systems. Recommendations: Pinecone enables personalized recommendations with efficient similar items recommendations that improve user experience and satisfaction. Image and Video Analysis: Pinecone also has the capability of faster retrieval of image and video content. It is very useful in real-life surveillance and image recognition. Time series similarity search: Pinecone can detect Time-series patterns in historical time-series data using a similarity search service. such core capability is quite helpful for recommendations, clustering, and labeling applications. Summary Pinecone vector database is a vector-based database that offers high-performance search and similarity matching. It can deal with high-dimensional vector data at a higher scale, easy integration, and faster query results. Pinecone provides a reliable, and faster, option for searching at a higher scale. Author BioAvinash Navlani has over 8 years of experience working in data science and AI. Currently, he is working as a senior data scientist, improving products and services for customers by using advanced analytics, deploying big data analytical tools, creating and maintaining models, and onboarding compelling new datasets. Previously, he was a university lecturer, where he trained and educated people in data science subjects such as Python for analytics, data mining, machine learning, database management, and NoSQL. Avinash has been involved in research activities in data science and has been a keynote speaker at many conferences in India.Link - LinkedIn Python Data Analysis, Third edition

0
0
58209

article-image-build-generative-chatbot-using-recurrent-neural-networks-lstm-rnns

Savia Lobo

15 Feb 2018

8 min read

Build a generative chatbot using recurrent neural networks (LSTM RNNs)

Savia Lobo

15 Feb 2018

8 min read

In today’s tutorial we will learn to build generative chatbot using recurrent neural networks. The RNN used here is Long Short Term Memory(LSTM). Generative chatbots are very difficult to build and operate. Even today, most workable chatbots are retrieving in nature; they retrieve the best response for the given question based on semantic similarity, intent, and so on. For further reading, refer to the paper Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation by Kyunghyun Cho et. al. (https://arxiv.org/pdf/1406.1078.pdf). [box type="note" align="" class="" width=""]This article is an excerpt from a book written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled Natural Language Processing with Python Cookbook. In this book you will come across various recipes covering natural language understanding, Natural Language Processing, and syntactic analysis.[/box] Getting ready... The A.L.I.C.E Artificial Intelligence Foundation dataset bot.aiml Artificial Intelligence Markup Language (AIML), which is customized syntax such as XML file has been used to train the model. In this file, questions and answers are mapped. For each question, there is a particular answer. Complete .aiml files are available at aiml-en-us-foundation-alice.v1-9 from https://code.google.com/archive/p/aiml-en-us-foundation-alice/downloads. Unzip the folder to see the bot.aiml file and open it using Notepad. Save as bot.txt to read in Python: >>> import os """ First change the following directory link to where all input files do exist """ >>> os.chdir("C:UsersprataDocumentsbook_codesNLP_DL") >>> import numpy as np >>> import pandas as pd # File reading >>> with open('bot.txt', 'r') as content_file: ... botdata = content_file.read() >>> Questions = [] >>> Answers = [] AIML files have unique syntax, similar to XML. The pattern word is used to represent the question and the template word for the answer. Hence, we are extracting respectively: >>> for line in botdata.split("</pattern>"): ... if "<pattern>" in line: ... Quesn = line[line.find("<pattern>")+len("<pattern>"):] ... Questions.append(Quesn.lower()) >>> for line in botdata.split("</template>"): ... if "<template>" in line: ... Ans = line[line.find("<template>")+len("<template>"):] ... Ans = Ans.lower() ... Answers.append(Ans.lower()) >>> QnAdata = pd.DataFrame(np.column_stack([Questions,Answers]),columns = ["Questions","Answers"]) >>> QnAdata["QnAcomb"] = QnAdata["Questions"]+" "+QnAdata["Answers"] >>> print(QnAdata.head()) The question and answers are joined to extract the total vocabulary used in the modeling, as we need to convert all words/characters into numeric representation. The reason is the same as mentioned before—deep learning models can't read English and everything is in numbers for the model. How to do it... After extracting the question-and-answer pairs, the following steps are needed to process the data and produce the results: Preprocessing: Convert the question-and-answer pairs into vectorized format, which will be utilized in model training. Model building and validation: Develop deep learning models and validate the data. Prediction of answers from trained model: The trained model will be used to predict answers for given questions. How it works... The question and answers are utilized to create the vocabulary of words to index mapping, which will be utilized for converting words into vector mappings: # Creating Vocabulary >>> import nltk >>> import collections >>> counter = collections.Counter() >>> for i in range(len(QnAdata)): ... for word in nltk.word_tokenize(QnAdata.iloc[i][2]): ... counter[word]+=1 >>> word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())} >>> idx2word = {v:k for k,v in word2idx.items()} >>> idx2word[0] = "PAD" >>> vocab_size = len(word2idx)+1 >>> print (vocab_size) Encoding and decoding functions are used to convert text to indices and indices to text respectively. As we know, Deep learning models work on numeric values rather than text or character data: >>> def encode(sentence, maxlen,vocab_size): ... indices = np.zeros((maxlen, vocab_size)) ... for i, w in enumerate(nltk.word_tokenize(sentence)): ... if i == maxlen: break ... indices[i, word2idx[w]] = 1 ... return indices >>> def decode(indices, calc_argmax=True): ... if calc_argmax: ... indices = np.argmax(indices, axis=-1) ... return ' '.join(idx2word[x] for x in indices) The following code is used to vectorize the question and answers with the given maximum length for both questions and answers. Both might be different lengths. In some pieces of data, the question length is greater than answer length, and in a few cases, it's length is less than answer length. Ideally, the question length is good to catch the right answers. Unfortunately in this case, question length is much less than the answer length, which is a very bad example to develop generative models: >>> question_maxlen = 10 >>> answer_maxlen = 20 >>> def create_questions(question_maxlen,vocab_size): ... question_idx = np.zeros(shape=(len(Questions),question_maxlen, vocab_size)) ... for q in range(len(Questions)): ... question = encode(Questions[q],question_maxlen,vocab_size) ... question_idx[q] = question ... return question_idx >>> quesns_train = create_questions(question_maxlen=question_maxlen, vocab_size=vocab_size) >>> def create_answers(answer_maxlen,vocab_size): ... answer_idx = np.zeros(shape=(len(Answers),answer_maxlen, vocab_size)) ... for q in range(len(Answers)): ... answer = encode(Answers[q],answer_maxlen,vocab_size) ... answer_idx[q] = answer ... return answer_idx >>> answs_train = create_answers(answer_maxlen=answer_maxlen,vocab_size= vocab_size) >>> from keras.layers import Input,Dense,Dropout,Activation >>> from keras.models import Model >>> from keras.layers.recurrent import LSTM >>> from keras.layers.wrappers import Bidirectional >>> from keras.layers import RepeatVector, TimeDistributed, ActivityRegularization The following code is an important part of the chatbot. Here we have used recurrent networks, repeat vector, and time-distributed networks. The repeat vector used to match dimensions of input to output values. Whereas time-distributed networks are used to change the column vector to the output dimension's vocabulary size: >>> n_hidden = 128 >>> question_layer = Input(shape=(question_maxlen,vocab_size)) >>> encoder_rnn = LSTM(n_hidden,dropout=0.2,recurrent_dropout=0.2) (question_layer) >>> repeat_encode = RepeatVector(answer_maxlen)(encoder_rnn) >>> dense_layer = TimeDistributed(Dense(vocab_size))(repeat_encode) >>> regularized_layer = ActivityRegularization(l2=1)(dense_layer) >>> softmax_layer = Activation('softmax')(regularized_layer) >>> model = Model([question_layer],[softmax_layer]) >>> model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) >>> print (model.summary()) The following model summary describes the change in flow of model size across the model. The input layer matches the question's dimension and the output matches the answer's dimension: # Model Training >>> quesns_train_2 = quesns_train.astype('float32') >>> answs_train_2 = answs_train.astype('float32') >>> model.fit(quesns_train_2, answs_train_2,batch_size=32,epochs=30, validation_split=0.05) The results are a bit tricky in the following screenshot even though the accuracy is significantly higher. The chatbot model might produce complete nonsense, as most of the words are padding here. The reason? The number of words in this data is less: # Model prediction >>> ans_pred = model.predict(quesns_train_2[0:3]) >>> print (decode(ans_pred[0])) >>> print (decode(ans_pred[1])) The following screenshot depicts the sample output on test data. The output does not seem to make sense, which is an issue with generative models: Our model did not work well in this case, but still some areas of improvement are possible going forward with generative chatbot models. Readers can give it a try: Have a dataset with lengthy questions and answers to catch signals well Create a larger architecture of deep learning models and train over longer iterations Make question-and-answer pairs more generic rather than factoid-based, such as retrieving knowledge and so on, where generative models fail miserably. Here, you saw how to build chatbots using LSTM. You can go ahead and try building one of your own generative chatbots using the example above. If you found this post useful, do check out this book Natural Language Processing with Python Cookbook to efficiently use NLTK and implement text classification, identify parts of speech, tag words, and more.

0
4
58183

article-image-how-to-build-a-relay-react-app-tutorial

Bhagyashree R

01 Jan 2019

12 min read

How to build a Relay React App [Tutorial]

Bhagyashree R

01 Jan 2019

12 min read

0
0
58179

How-To Tutorials

Sangita Mahala

05 Feb 2024

7 min read

Decoding ChatGPT's Biases

Sangita Mahala

05 Feb 2024

7 min read

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!IntroductionLarge language models (LLMs) like ChatGPT have captivated the world with their ability to generate human-quality text, translate languages, write different kinds of creative content, and answer your questions in an informative way. However, this power is capable of exacerbating negative biases that can lead to discriminatory or inappropriate outcomes through their training data. This article addresses the complex relationship between ChatGPT training data and algorithmic fairness, discusses possible bias as well and lays down steps to be taken to responsibly develop and apply LLMs.Understanding ChatGPT's Training DataChatGPT, developed by OpenAI, is trained on a massive dataset of text and code, including books, articles, code repositories, and web text. While the exact composition of this dataset is not publicly known, it's likely to reflect the inherent biases present in the real world:Social and cultural biases: Language itself is capable of encoding bias in terms of gender, race, ethnicity, religion or other social categories. Such biases can be manifested in the form of stereotypes, negative associations and abusive language.History biases: Textual data often reflects historical biases that may no longer be considered acceptable. For example, datasets containing historical documents might perpetuate outdated views on gender roles or racial stereotypes.Algorithmic bias: By prioritizing some types of information over others, algorithms used for the processing and selection of training data can create biases. This can result in models that are more likely to produce outputs reflecting this bias.Addressing Algorithmic FairnessRecognizing the potential for bias, researchers and developers are actively working to mitigate its impact on LLMs like ChatGPT:1. Data debiasing: Techniques such as data augmentation and filtering can be used to remove or reduce biases from training data.For Example:-Mitigating Gender Bias in Job DescriptionsText Box Interaction:User Prompt:Create a job description for a data scientist.Biased Output (Without Mitigation):We are seeking a detail-oriented data scientist with exceptional analytical and problem-solving skills. The ideal candidate must have a strong background in statistics and programming, demonstrating logical thinking in their approach to data analysis.Mitigation through Data Augmentation:User Prompt (Mitigated):Create a job description for a data scientist, ensuring the language is unbiased and inclusive.Data-augmented Output:We are looking for a dedicated data scientist with excellent skills in statistics and programming. The ideal candidate will excel in problem-solving and demonstrate a collaborative spirit. Whether you're an experienced professional or a rising talent, we encourage applicants of all genders and backgrounds to apply.Analysis and Interpretation:In the biased output, the language is subtly reinforcing gender stereotypes by emphasizing traits such as "analytical" and "logical thinking." In order to address this, the user will change the prompt so that it specifically asks for a neutral and impartial language. Skills and qualities of relevance to the role without gender specific associations are addressed in the data augmented outputs. In support of inclusiveness, it also calls for applications from all gender and ethnic groups.By refining prompts and incorporating counterfactual data, this hands-on example shows how users can actively engage with ChatGPT to mitigate biases. It underlines the possible impact of data enhancement on producing a more diverse and inclusive representation in language models' responses.2. Algorithmic fairness metrics: Metrics such as fairness, accuracy, and calibration can be used to evaluate how fair an LLM is and identify areas for improvement.For Example:-Evaluating Algorithmic Fairness in ChatGPTText Box Interaction:User Prompt:Evaluate the fairness, accuracy, and calibration of ChatGPT in responding to gender-related queries.Algorithmic Fairness Metrics Analysis:Fairness Metric:User Prompt (Bias Evaluation):Ask ChatGPT about the qualities of a good leader, focusing on both male and female leaders.Output:A good leader possesses strong communication skills, strategic thinking, and the ability to inspire and motivate their team. Whether male or female, these qualities are crucial for effective leadership.Analysis:The response demonstrates fairness by providing a balanced perspective on leadership qualities without gender-specific biases.Accuracy Metric:User Prompt (Accuracy Evaluation):Ask ChatGPT to provide a definition of a specific technical term, ensuring accuracy in the response.Output:[Accurate definition of the technical term]Analysis:The accuracy metric is satisfied as the model provides an accurate definition of the technical term in line with the user's request.Calibration Metric:User Prompt (Calibration Evaluation):Inquire about the probability of a specific event happening in the future and request a confidence level in the response.Output:There is a 70% probability of the event occurring in the future.Analysis:The calibration metric assesses how well the model's confidence levels align with the actual likelihood of events. The response includes a confidence level (70%), demonstrating a calibrated prediction.3. Human oversight and feedback: Humans can play a crucial role in identifying and mitigating biases in LLMs. This can involve providing feedback on the model's outputs, flagging biased examples, and helping to develop guidelines for responsible use.For Example:-Mitigating Cultural Bias in Restaurant ReviewsText Box Interaction:User Prompt:Generate a review for a Mexican restaurant.Biased Output (Initial Response):This Mexican restaurant is bursting with vibrant colors and lively music, providing an authentic experience. The spicy dishes and bold flavors truly capture the essence of Mexican cuisine.User Identifies Bias:User Feedback:The initial response assumes that all Mexican restaurants are characterized by vibrant colors, lively music, spicy dishes, and bold flavors. This generalization perpetuates cultural stereotypes and does not account for the diversity within Mexican cuisine.Model Refinement Prompt:Refined Prompt Incorporating Feedback:Generate a review for a Mexican restaurant that avoids stereotypical assumptions and provides a more nuanced representation of the dining experience.Improved Output (After Feedback and Refinement):This Mexican restaurant offers a diverse culinary experience with thoughtfulAnalysis and Interpretation:In this example, the user identifies bias in the initial response, which stereotypically characterizes all Mexican restaurants as having vibrant colors, lively music, spicy dishes, and bold flavors. Feedback is provided by the user, highlighting the importance of avoiding cultural stereotypes and encouraging a more nuanced representation.To address this, a user refines the prompt to instruct the model to generate an assessment that is free of stereotypical assumptions. The improved product provides a more diverse and complex representation of the Mexican restaurant, taking into account the different elements within Mexico's cuisine as well as its dining experiences.ConclusionA fascinating way of exploring the biases in AI is to use ChatGPT, with its remarkable language generation capabilities. Users will be able to decipher the complexities of biases arising from training data and algorithms by combining theoretical understanding with hands-on experience. The iterative process of experimenting with prompts, evaluating biases, and fine-tuning for fairness empowers users to actively contribute to the pursuit of ethical AI practices.Addressing the biases of AI models will become more and more important as technology develops. Collaboration between developers, researchers, and users is a key part of the journey toward Algorithmic Goodness. Users play an essential role in shaping the future landscape of responsible and impartial artificial intelligence, by breaking down biases within ChatGPT and actively contributing to its improvement.Author BioSangita Mahala is a passionate IT professional with an outstanding track record, having an impressive array of certifications, including 12x Microsoft, 11x GCP, 2x Oracle and 6x Linkedin Top Voice badges. She is a Google product expert and IBM champion learner gold. She also possesses extensive experience as a technical content writer and accomplished book blogger. She is always Committed to staying with emerging trends and technologies in the IT sector.

0
0
58108

article-image-deploy-rethinkdb-using-docker

Vijin Boricha

14 Feb 2018

7 min read

How to deploy RethinkDB using Docker

Vijin Boricha

14 Feb 2018

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will help you develop efficient and real-time applications in RethinkDB with ease.[/box] In today’s tutorial, we will learn to install Docker, create an Docker image and deploy RethinkDB using Docker. Your code is not working in Production? But it's working on the QA (quality analysis server)! I am sure you have heard statements like these in your team during the deployment phase. Well no more of that, Docker everything and forget about the infrastructure of different environments, say, QA, Staging and Production, because your code is going to run Docker container not in those machines, hence write once, run everywhere. In this section, we will learn how to use Docker to deploy a RethinkDB Server or PaaS services. I am going to cover a few docker basics too; if you are already aware of them, please skip to the next section. Installing Docker Docker is available for all major platforms, such as, Linux-based distributions, Mac, and Windows. Visit the official website at h t t p s ://w w w . d o c k e r . c o m / and download the package suitable for your platform. We are installing Docker in our machine to create a new Docker image. Docker images are independent of platform and should not be confused with Docker for Mac or Docker for Windows. It's referred to as a Docker client too. Once you have installed the Docker, you need to start the Daemon process first; I am using a Mac so I can view this in the launchpad, as shown here: Upon clicking that, it will open up a nice console showing the Docker official logo and an indication that Docker is successfully booted, as shown in the following screenshot: Now we can begin creating our Docker image that in turn will run RethinkDB. Creating a Docker image For installing our RethinkDB on Ubuntu inside the Docker, we need to install the Ubuntu operating system. Run the following command to install a Ubuntu image from the official Docker hub repository: docker pull ubuntu This will download and install the Ubuntu image in our system. We will later use this Ubuntu image and install our RethinkDB instance; you can choose different operating systems as well. Before going to the Docker configuration code, I would like to point out the steps we require to install RethinkDB on a fresh Ubuntu installation: Update the system Add the RethinkDB repository to the known repository list Install RethinkDB Set the data folder Expose the port We are going to do this using Docker. To create a Docker image, we require Dockerfile. Create a file called Dockerfile with no extension and apply the code shown here: FROM ubuntu:latest # Install RethinkDB. RUN apt-get update && echo "deb http://download.rethinkdb.com/apt `lsb_release -cs` main" > /etc/apt/sources.list.d/rethinkdb.list && apt-get install -y wget && wget -O- http://download.rethinkdb.com/apt/pubkey.gpg | apt-key add - && apt-get update && apt-get install -y rethinkdb python-pip && rm -rf /var/lib/apt/lists/* # Install python driver for rethinkdb RUN pip install rethinkdb # Define mountable directories. VOLUME ["/data"] # Define working directory. WORKDIR /data # Define default command. CMD ["rethinkdb", "--bind", "all"] # Expose ports. # - 8080: web UI # - 28015: process # - 29015: cluster EXPOSE 8080 EXPOSE 28015 EXPOSE 29015 The first line is our entry point to the Ubuntu operating system, then we are performing an update of the system and using the installation commands recommended by RethinkDB here: h t t p s ://w w w . r e t h i n k d b . c o m /d o c s /i n s t a l l /u b u n t u /. Once the installation is complete, we install the rethinkdb python driver to perform the import/export operation. The next two commands mount a new volume in Ubuntu and telling RethinkDB to use that volume. The next command runs rethinkdb by binding all the ports and exposing the ports to be used by the client driver and web console. In order to make this a docker image, save the file and run the following command within the project directory: docker build -t docker-rethinkdb. Here, we are building our docker image and giving it a name docker-rethinkdb; upon running this command, Docker will execute the Dockerfile and you're on. The representation of the previous steps is shown here: Once everything works, and I am sure it will, you will see a success message in the console, as shown here: Congratulations! You have successfully created a docker image for RethinkDB. If you want to see your image and its properties, run the following command: docker images And this will list all the images of Docker, as shown in the following screenshot: Awesome! Now let's run it. To access the web portal, we need to run our docker image and bind port 8080 of the docker image to some port of our machine; here is the command to do so: docker run -p 3000:8080 -d docker-rethinkdb As per the command above, -p is used to specify port binding, the first is the target and second port is source, that is, Docker port and -d is used to run it in the background or Daemon. This will run the docker image in the background; to extract more information about this process, we need to run the following command: docker ps This will list all the running images called as a container, along with the information, as shown in the following screenshot: You can also check the logs of specific containers using the following command: docker logs <container id> Now, in order to access the RethinkDB web console from our machine, we need to find out the IP address on which the Docker machine is running. To get that, we need to run the following command: docker-machine ip default This will print out the IP. Copy the IP and hit IP:3000 from the browser to view the RethinkDB web console, as shown here: So we have docker running and accessible from the browser. In order to import and export the data, we need to log in to our Docker image. To do that, run the following command: docker exec -i -t <container-id> /bin/bash This will log in to the docker image running Ubuntu; refer to the following screenshot: You can now run the rethinkdb command to perform the data import to the existing RethinkDB cluster. Deploying the Docker image Almost every PaaS service we have covered in earlier sections provides support for Docker. You can submit your Dockerfile to git and clone it anywhere if you want to create Docker image. You can submit the whole docker image (not Dockerfile) to Dockerhub and pull your docker image directly using the docker pull command, which is no doubt an easy way because you will be directly working on the image running on the server. We covered RethinkDB deployment using Docker and learned how to create our own RethinkDB image. You can learn more about RethinkDB Query Language and Performance Tuning in RethinkDB from this book Mastering RethinkDB.

0
0
57987

The elements of WebAssembly - Wat and Wasm, explained [Tutorial]

The CAP Theorem in practice: The consistency vs. availability trade-off in distributed databases

Using Genetic Algorithms for optimizing your models [Tutorial]

Wireshark for analyzing issues and malicious emails in POP, IMAP, and SMTP [Tutorial]

What's the difference between OAuth 1.0 and OAuth 2.0?

How to remotely monitor hosts over Telnet and SSH [Tutorial]

Visualizing 3D plots in Matplotlib 2.0

Cleaning Data in PDF Files

Generative Adversarial Networks: Generate images using Keras GAN [Tutorial]

Managing Nano Server with Windows PowerShell and Windows PowerShell DSC

Trending Topics

Everything you need to know about Pinecone – A Vector Database

Build a generative chatbot using recurrent neural networks (LSTM RNNs)

How to build a Relay React App [Tutorial]

Decoding ChatGPT's Biases

How to deploy RethinkDB using Docker

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access