Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials

7018 Articles
article-image-common-wlan-protection-mechanisms-and-their-flaws
Packt
07 Mar 2016
19 min read
Save for later

Common WLAN Protection Mechanisms and their Flaws

Packt
07 Mar 2016
19 min read
In this article by Vyacheslav Fadyushin, the author of the book Building a Pentesting Lab for Wireless Networks, we will discuss various WLAN protection mechanisms and the flaws present in them. To be able to protect a wireless network, it is crucial to clearly understand which protection mechanisms exist and which security flaws do they have. This topic will be useful not only for those readers who are new to Wi-Fi security, but also as a refresher for experienced security specialists. (For more resources related to this topic, see here.) Hiding SSID Let's start with one of the common mistakes done by network administrators: relying only on the security by obscurity. In the frames of the current subject, it means using a hidden WLAN SSID (short for service set identification) or simply WLAN name. Hidden SSID means that a WLAN does not send its SSID in broadcast beacons advertising itself and doesn't respond to broadcast probe requests, thus making itself unavailable in the list of networks on Wi-Fi-enabled devices. It also means that normal users do not see that WLAN in their available networks list. But the lack of WLAN advertising does not mean that a SSID is never transmitted in the air—it is actually transmitted in plaintext with a lot of packet between access points and devices connected to them regardless of a security type used. Therefore, SSIDs are always available for all the Wi-Fi network interfaces in a range and visible for any attacker using various passive sniffing tools. MAC filtering To be honest, MAC filtering cannot be even considered as a security or protection mechanism for a wireless network, but it is still called so in various sources. So let's clarify why we cannot call it a security feature. Basically, MAC filtering means allowing only those devices that have MAC addresses from a pre-defined list to connect to a WLAN, and not allowing connections from other devices. MAC addresses are transmitted unencrypted in Wi-Fi and are extremely easy for an attacker to intercept without even being noticed (refer to the following screenshot): An example of wireless traffic sniffing tool easily revealing MAC addresses Keeping in mind the extreme simplicity of changing a physical address (MAC address) of a network interface, it becomes obvious why MAC filtering should not be treated as a reliable security mechanism. MAC filtering can be used to support other security mechanisms, but it should not be used as an only security measure for a WLAN. WEP Wired equivalent privacy (WEP) was born almost 20 years ago at the same time as the Wi-Fi technology and was integrated as a security mechanism for the IEEE 802.11 standard. Like it often happens with new technologies, it soon became clear that WEP contains weaknesses by design and cannot provide reliable security for wireless networks. Several attack techniques were developed by security researchers that allowed them to crack a WEP key in a reasonable time and use it to connect to a WLAN or intercept network communications between WLAN and client devices. Let's briefly review how WEP encryption works and why is it so easy to break. WEP uses so-called initialization vectors (IV) concatenated with a WLAN's shared key to encrypt transmitted packets. After encrypting a network packet, an IV is added to a packet as it is and sent to a receiving side, for example, an access point. This process is depicted in the following flowchart: The WEP encryption process An attacker just needs to collect enough IVs, which is also a trivial task using additional reply attacks to force victims to generate more IVs. Even worse, there are attack techniques that allow an attacker to penetrate WEP-protected WLANs even without connected clients, which makes those WLANs vulnerable by default. Additionally, WEP does not have a cryptographic integrity control, which makes it vulnerable not only to attacks on confidentiality. There are numerous ways how an attacker can abuse a WEP-protected WLAN, for example: Decrypt network traffic using passive sniffing and statistical cryptanalysis Decrypt network traffic using active attacks (reply attack, for example) Traffic injection attacks Unauthorized WLAN access Although WEP was officially superseded by the WPA technology in 2003, it still can be sometimes found in private home networks and even in some corporate networks (mostly belonging to small companies nowadays). But this security technology has become very rare and will not be used anymore in future, largely due to awareness in corporate networks and because manufacturers no longer activate WEP by default on new devices. In our humble opinion, device manufacturers should not include WEP support in their new devices to avoid its usage and increase their customers' security. From the security specialist's point of view, WEP should never be used to protect a WLAN, but it can be used for Wi-Fi security training purposes. Regardless of a security type in use, shared keys always add the additional security risk; users often tend to share keys, thus increasing the risk of key compromise and reducing accountability for key privacy. Moreover, the more devices use the same key, the greater amount of traffic becomes suitable for an attacker during cryptanalytic attacks, increasing their performance and chances of success. This risk can be minimized by using personal identifiers (key, certificate) for users and devices. WPA/WPA2 Due to numerous WEP security flaws, the next generation of Wi-Fi security mechanism became available in 2003: Wi-Fi Protected Access (WPA). It was announced as an intermediate solution until WPA2 is available and contained significant security improvements over WEP. Those improvements include: Stronger encryption Cryptographic integrity control: WPA uses an algorithm called Michael instead of CRC used in WEP. This is supposed to prevent altering data packets on the fly and protects from resending sniffed packets. Usage of temporary keys: The Temporal Key Integrity Protocol (TKIP) automatically changes encryption keys generated for every packet. This is the major improvement over the static WEP where encryption keys should be entered manually in an AP config. TKIP also operates RC4, but the way how it is used was improved. Support of client authentication: The capability to use dedicated authentication servers for user and device authentication made WPA suitable for use in large enterprise networks. The support of cryptographically strong algorithm Advanced Encryption Standard (AES) was implemented in WPA, but it was not set as mandatory, only optional. Although WPA was a significant improvement over WEP, it was a temporary solution before WPA2 was released in 2004 and became mandatory for all new Wi-Fi devices. WPA2 works very similar to WPA and the main differences between WPA and WPA2 is in the algorithms used to provide security: AES became the mandatory algorithm for encryption in WPA2 instead of default RC4 in WPA TKIP used in WPA was replaced by Counter Cipher Mode with Block Chaining Message Authentication Code Protocol (CCMP) Because of the very similar workflows, WPA and WPA2 are also vulnerable to the similar or the same attacks and usually called and spelled in one word WPA/WPA2. Both WPA and WPA2 can work in two modes: pre-shared key (PSK) or personal mode and enterprise mode. Pre-shared key mode Pre-shared key or personal mode was intended for home and small office use where networks have low complexity. We are more than sure that all our readers have met this mode and that most of you use it at home to connect your laptops, mobile phones, tablets, and so on to home networks. The general idea of PSK mode is using the same secret key on an access point and on a client device to authenticate the device and establish an encrypted connection for networking. The process of WPA/WPA2 authentication using a PSK consists of four phases and is also called a 4-way handshake. It is depicted in the following diagram: WPA/WPA2 4-way handshake The main WPA/WPA2 flaw in PSK mode is the possibility to sniff a whole 4-way handshake and brute-force a security key offline without any interaction with a target WLAN. Generally, the security of a WLAN mostly depends on the complexity of a chosen PSK. Computing a PMK (short for primary master key) used in 4-way handshakes (refer to the handshake diagram) is a very time-consuming process compared to other computing operations and computing hundreds of thousands of them can take very long. But in case of a short and low complexity PSK in use, a brute-force attack does not take long even on a not-so-powerful computer. If a key is complex and long enough, cracking it can take much longer, but still there are ways to speed up this process: Using powerful computers with CUDA (short for Compute Unified Device Architecture), which allows a software to directly communicate with GPUs for computing. As GPUs are natively designed to perform mathematical operations and do it much faster than CPUs, the process of cracking works several times faster with CUDA. Using rainbow tables that contain pairs of various PSKs and their corresponding precomputed hashes. They save a lot of time for an attacker because the cracking software just searches for a value from an intercepted 4-way handshake in rainbow tables and returns a key corresponding to the given PMK if there was a match instead of computing PMKs for every possible character combination. Because WLAN SSIDs are used in 4-way handshakes analogous to a cryptographic salt, PMKs for the same key will differ for different SSIDs. This limits the application of rainbow tables to a number of the most popular SSIDs. Using cloud computing is another way to speed up the cracking process, but it usually costs additional money. The more computing power can an attacker rent (or get through another ways), the faster goes the process. There are also online cloud-cracking services available in Internet for various cracking purposes including cracking 4-way handshakes. Furthermore, as with WEP, the more users know a WPA/WPA2 PSK, the greater the risk of compromise—that's why it is also not an option for big complex corporate networks. WPA/WPA2 PSK mode provides the sufficient level of security for home and small office networks only when a key is long and complex enough and is used with a unique (or at least not popular) WLAN SSID. Enterprise mode As it was already mentioned in the previous section, using shared keys itself poses a security risk and in case of WPA/WPA2 highly relies on a key length and complexity. But there are several factors in enterprise networks that should be taken into account when talking about WLAN infrastructure: flexibility, manageability, and accountability. There are various components that implement those functions in big networks, but in the context of our topic, we are mostly interested in two of them: AAA (short for authentication, authorization, and accounting) servers and wireless controllers. WPA-Enterprise or 802.1x mode was designed for enterprise networks where a high security level is needed and the use of AAA server is required. In most cases, a RADIUS server is used as an AAA server and the following EAP (Extensible Authentication Protocol) types are supported (and several more, depending on a wireless device) with WPA/WPA2 to perform authentication: EAP-TLS EAP-TTLS/MSCHAPv2 PEAPv0/EAP-MSCHAPv2 PEAPv1/EAP-GTC PEAP-TLS EAP-FAST You can find a simplified WPA-Enterprise authentication workflow in the following diagram: WPA-Enterprise authentication Depending on an EAP-type configured, WPA-Enterprise can provide various authentication options. The most popular EAP-type (based on our own experience in numerous pentests) is PEAPv0/MSCHAPv2, which is relatively easily integrated with existing Microsoft Active Directory infrastructures and is relatively easy to manage. But this type of WPA-protection is relatively easy to defeat by stealing and bruteforcing user credentials with a rogue access point. The most secure EAP-type (at least, when configured and managed correctly) is EAP-TLS, which employs certificate-based authentication for both users and authentication servers. During this type of authentication, clients also check server's identity and a successful attack with a rogue access point becomes possible only if there are errors in configuration or insecurities in certificates maintenance and distribution. It is recommended to protect enterprise WLANs with WPA-Enterprise in EAP-TLS mode with mutual client and server certificate-based authentication. But this type of security requires additional work and resources. WPS Wi-Fi Protected Setup (WPS) is actually not a security mechanism, but a key exchange mechanism which plays an important role in establishing connections between devices and access points. It was developed to make the process of connecting a device to an access point easier, but it turned out to be one of the biggest wholes in modern WLANs if activated. WPS works with WPA/WPA2-PSK and allows connecting devices to WLANs with one of the following methods: PIN: Entering a PIN on a device. A PIN is usually printed on a sticker at the back side of a Wi-Fi access point. Push button: Special buttons should be pushed on both an access point and a client device during the connection phase. Buttons on devices can be physical and virtual. NFC: A client should bring a device close to an access point to utilize the Near Field Communication technology. USB drive: Necessary connection information exchange between an access point and a device is done using a USB drive. Because WPS PINs are very short and their first and second parts are validated separately, online brute-force attack on a PIN can be done in several hours allowing an attacker to connect to a WLAN. Furthermore, the possibility of offline PIN cracking was found in 2014 which allows attackers to crack pins in 1 to 30 seconds, but it works only on certain devices. You should also not forget that a person who is not permitted to connect to a WLAN but who can physically access a Wi-Fi router or access point can also read and use a PIN or connect via push button method. Getting familiar with the Wi-Fi attack workflow In our opinion (and we hope you agree with us), planning and building a secure WLAN is not possible without the understanding of various attack methods and their workflow. In this topic, we will give you an overview of how attackers work when they are hacking WLANs. General Wi-Fi attack methodology After refreshing our knowledge about wireless threats and Wi-Fi security mechanisms, let's have a look at the attack methodology used by attackers in the real world. Of course, as with all other types of network attack, wireless attack workflows depend on certain situations and targets, but they still align with the following general sequence in almost all cases: The first step is planning. Normally, attackers need to plan what are they going to attack, how can they do it, which tools are necessary for the task, when is the best time and place to attack certain targets, and which configuration templates will be useful so that they can be prepared in advance. White-hat hackers or penetration testers need to set schedules and coordinate project plans with their customers, choose contact persons on a customer side, define project deliverables, and do some other organizational work if demanded. As with every penetration testing project, the better a project was planned (and we can use the word "project" for black-hat hackers' tasks), the more chances of a successful result. The next step is survey. Getting as accurate as possible and as much as possible information about a target is crucial for a successful hack, especially in uncommon network infrastructures. To hack a WLAN or its wireless clients, an attacker would normally collect at least SSIDs or MAC addresses of access points and clients and information about a security type in use. It is also very helpful for an attacker to understand if WPS is enabled on a target access point. All that data allows attackers not only to set proper configs and choose the right options for their tools, but also to choose appropriate attack types and conditions for a certain WLAN or Wi-Fi client. All collected information, especially non-technical (for example, company and department names, brands, or employee names), can also become useful at the cracking phase to build dictionaries for brute-force attacks. Depending on a type of security and attacker's luck, a data collected at the survey phase can even make the active attack phase unnecessary and allow an attacker to proceed directly with the cracking phase. The active attacks phase involves active interaction between an attacker and targets (WLANs and Wi-Fi clients). At this phase, attackers have to create conditions necessary for a chosen attack type and execute it. It includes sending various Wi-Fi management and control frames and installing rogue access points. If an attacker wants to cause a denial of service in a target WLAN as a goal, such attacks are also executed at this phase. Some of active attacks are essential to successfully hack a WLAN, but some of them are intended to just speed up hacking and can be omitted to avoid causing alarm on various wireless intrusion detection/prevention systems (wIDPS), which can possibly be installed in a target network. Thus, the active attacks phase can be called optional. Cracking is another important phase where an attacker cracks 4-way-hadshakes, WEP data, NTLM hashes, and so on, which were intercepted at the previous phases. There are plenty of various free and commercial tools and services including cloud cracking services. In case of success at this phase, an attacker gets the target WLAN's secret(s) and can proceed with connecting to the WLAN, decrypt intercepted traffic, and so on. The active attacking phase Let's have a closer look to the most interesting parts of the active attack phase—WPA-PSK and WPA-Enterprise attacks—in the following topics. WPA-PSK attacks As both WPA and WPA2 are based on the 4-way handshake, attacking them doesn't differ—an attacker needs to sniff a 4-way handshake in the moment, establishing a connection between an access point and an arbitrary wireless client and bruteforcing a matching PSK. It does not matter whose handshake is intercepted, because all clients use the same PSK for a given target WLAN. Sometimes, attackers have to wait long until a device connects to a WLAN to intercept a 4-way handshake and of course they would like to speed up the process when possible. For that purpose, they force already connected device to disconnect from an access point sending control frames (deauthentication attack) on behalf of a target access point. When a device receives such a frame, it disconnects from a WLAN and tries to reconnect again if the "automatic reconnect" feature is enabled (it is enabled by default on most devices), thus performing another one 4-way handshake that can be intercepted by an attacker. Another possibility to hack a WPA-PSK protected network is to crack a WPS PIN if WPS is enabled on a target WLAN. Enterprise WLAN attacks Attacking becomes a little bit more complicated if WPA-Enterprise security is in place, but could be executed in several minutes by a properly prepared attacker by imitating a legitimate access point with a RADIUS server and by gathering user credentials for further analysis (cracking). To settle this attack, an attacker needs to install a rogue access point with a SSID identical to a target WLAN's SSID and set other parameter (like EAP type) similar to the target WLAN to increase chances on success and reduce the probability of the attack to be quickly detected. Most user Wi-Fi devices choose an access point for a connection to a certain WLAN by a signal strength—they connect to that one which has the strongest signal. That is why an attacker needs to use a powerful Wi-Fi interface for a rogue access point to override signals from legitimate ones and make devices around connect to the rogue access point. A RADIUS server used during such attacks should have a capability to record authentication data, NTLM hashes, for example. From a user perspective, being attacked in such way looks just being unable to connect to a WLAN for an unknown reason and could even be not seen if a user is not using a device at the moment and just passing by a rogue access point. It is worth mentioning that classic physical security or wireless IDPS solutions are not always effective in such cases. An attacker or a penetration tester can install a rogue access point outside of the range of a target WLAN. It will allow hacker to attack user devices without the need to get into a physically controlled area (for example, office building), thus making the rogue access point unreachable and invisible for wireless IDPS systems. Such a place could be a bus or train station, parking or a café where a lot of users of a target WLAN go with their Wi-Fi devices. Unlike WPA-PSK with only one key shared between all WLAN users, the Enterprise mode employs personified credentials for each user whose credentials could be more or less complex depending only on a certain user. That is why it is better to collect as many user credentials and hashes as possible thus increasing the chances of successful cracking. Summary In this article, we looked at the security mechanisms that are used to secure access to wireless networks, their typical threads, and common misconfigurations that lead to security breaches and allow attacker to harm corporate and private wireless networks. The brief attack methodology overview has given us a general understanding of how attackers normally act during wireless attacks and how they bypass common security mechanisms by abusing certain flaws in those mechanisms. We also saw that the most secure and preferable way to protect a wireless network is to use WPA2-Enterprise security along with a mutual client and server authentication, which we are going to implement in our penetration testing lab. Resources for Article: Further resources on this subject: Pentesting Using Python [article] Advanced Wireless Sniffing [article] Wireless and Mobile Hacks [article]
Read more
  • 0
  • 0
  • 9789

Packt
31 Jul 2013
6 min read
Save for later

Quick start – creating your first grid

Packt
31 Jul 2013
6 min read
(For more resources related to this topic, see here.) So, let's do that right now for this example. Create a copy of the entire jqGrid folder named quickstart; easy as pie. Let's begin by creating the simplest grid possible; open up the index.html file from the newly created quickstart folder, and modify the body section to look like the following code: <body><table id="quickstart_grid"></table><script>var dataArray = [{name: 'Bob', phone: '232-532-6268'},{name: 'Jeff', phone: '365-267-8325'}];$("#quickstart_grid").jqGrid({datatype: 'local',data: dataArray,colModel: [{name: 'name', label: 'Name'},{name: 'phone', label: 'Phone Number'}]});</script></body> The first element in the body tags is a standard table element; this is what will be converted into our grid, and it's literally all the HTML needed to make one. The first four lines are just defining some data. I didn't want to get into AJAX or opening other files, so I decided to just create a simple JavaScript array with two entries. The next line is where the magic happens; this single command is being used to create and populate a grid using the information provided. We select the table element with jQuery, and call the jqGrid function, passing it all the properties needed to make a grid. The first two options set the data along with its type, in our case the data is the array we made and the type is local, which is in contrast to some of the other datatypes which use AJAX to retrieve remote data. After that we set the columns, this is done with the colModel property. Now, there are a lot of options for the colModel property, which we will get to later on, numerous settings for customizing and manipulating the data in the table. But for this simple example, we are just specifying the name and label properties, which tell jqGrid the column's label for the header and the value's key from the data array. Now, open index.html in your browser and you should see something like the following screenshot: Not particularly pretty, but you can see that with just a few short lines, we have created a grid and populated it with data that can be sorted. But we can do better; first off, we are only using two of the four standard layers we talked about: the header layer and the body layer. Let's add a caption layer to provide little context, and let's adjust the size of the grid to fit our data. So, modify the call to jqGrid with the following: $("#grid").jqGrid({datatype: 'local',data: dataArray,colModel: [{name: 'name', label: 'Name'},{name: 'phone', label: 'Phone Number'}],height: 'auto',caption: 'Users Grid',}); And refresh your browser; your grid should now look like the following screenshot: That's looking much better. Now, you may be wondering, we only set the height property to auto, so how come the width seems to have snapped to the content? This is due to the fact that the right margin we saw earlier is actually a column for the scroll bar. By default jqGrid sets your grid's height to 150 pixels, this means, regardless of whether you have only one row, or a thousand rows, the height will remain the same, so that there is a gap to hold the scroll bar in an event when you have more rows than that would fit in the given space. When we set the height to auto, it will stretch the grid vertically to contain all the items, making the scroll bar irrelevant and therefore it knows not to place it. Now this is a pretty good quick start example, but to finish things off, let's take a look at the navigation layer, just so we can say we did. For this next part, although we are going to need more data, I can't really show pagination with just two entries, luckily there is a site http://www.json-generator.com/ created by Vazha Omanashvili for doing exactly this. The way it works is, you specify the format and number of rows you want, and it generates it with random data. We are going to keep the format we have been using, of name and phone number, so in the box on the left, enter the following code: ['{{repeat(50)}}',{name: '{{firstName}} {{lastName}}',phone: '{{phone}}'}] Here we're saying we would like 50 rows with a name field containing both firstname and lastname, and we would also like a phone number. This site is not really in the scope of this book, but if you would like to know about other fields, you can click on the help button at the top. Click on Generate to produce a result object, and then just click on the copy to clipboard button. With that done, open your index.html file and replace the array for dataArray with what you copied. Your code should now look like the following code: var dataArray = {"id": 1,"jsonrpc": "2.0","total": 50,"result": [ /* the 50 rows */ ]}; As you can see, the actual rows are under a property named result, so we will need to change the data key in the call to jqGrid from dataArray to dataArray.result. Refreshing the page now you will see the first 20 rows being displayed (that is the default limit). But how can we get to the rest? Well, jqGrid provides a special navigation layer named "pager", which contains a pagination control. To display it, we will need to create an HTML element for it. So, right underneath the table element add a div like: <table id="grid"></table><div id="pager"></div> Then, we just need to add a key to the jqGrid method for the pager and row limit: caption: 'Users Grid',height: 'auto',rowNum: 10,pager: '#pager'}); And that's all there to it, you can adjust the rowNum property to display more or less entries at once, and the pages will be automatically calculated for you. Summary In this article we learned how to create a simple Grid and customize it. We used different properties and functions for the same. We also used different layers to customize the jqGrid. Resources for Article : Further resources on this subject: jQuery and Ajax Integration in Django [Article] Implementing AJAX Grid using jQuery data grid plugin jqGrid [Article] Django JavaScript Integration: jQuery In-place Editing Using Ajax [Article]
Read more
  • 0
  • 0
  • 9785

article-image-installing-openvpn-linux-and-unix-systems-part-1
Packt
31 Dec 2009
10 min read
Save for later

Installing OpenVPN on Linux and Unix Systems: Part 1

Packt
31 Dec 2009
10 min read
Prerequisites All Linux/Unix systems must meet the following requirements to install OpenVPN successfully: Your system must provide support for the Universal TUN/TAP driver. The kernels newer than version 2.4 of almost all modern Linux distributions provide support for TUN/TAP devices. Only if you are using an old distribution or if you have built your own kernel, will you have to add this support to your configuration. This project's web site can be found at http://vtun.sourceforge.net/tun/. OpenSSL libraries have to be installed on your system. I have never encountered any modern Linux/Unix system that does not meet this requirement. However, if you want to compile OpenVPN from source code, the SSL development package may be necessary. The web site is http://www.openssl.org/. The Lempel-Ziv-Oberhumer (LZO) Compression library has to be installed. Again, most modern Linux/Unix systems provide these packages, so there shouldn't be any problem. LZO is a real-time compression library that is used by OpenVPN to compress data before sending. Packages can be found on http://openvpn.net/download.html, and the web site of this project is http://www.oberhumer.com/opensource/lzo/. Most Linux/Unix systems' installation tools are able to resolve these so-called dependencies on their own, but it might be helpful to know where to get the required software. Most commercial Linux systems, like SuSE, provide installation tools, like Yet another Setup Tool (YaST), and contain up-to-date versions of OpenVPN on their installation media (CD or DVD). Furthermore, systems based on RPM software can also install and manage OpenVPN software at the command line. Linux systems, like Debian, use sophisticated package management tools that can install software that is provided by repositories on web servers. No local media is needed, the package management will resolve potential dependencies by itself, and install the newest and safest possible version of OpenVPN. FreeBSD and other BSD-style systems use their package management tools such as pkg_add or the ports system. Like all open source projects, OpenVPN source code is available for download. These compressed tar.gz or tar.bz2 archives can be downloaded from http://openvpn.net/download.html and unpacked to a local directory. This source code has to be configured and translated (compiled) for your operating system. You can also install unstable, developer, or older versions of OpenVPN from http://openvpn.net/download.html. This may be interesting if you want to test new features of forthcoming versions. Daily (unstable!) OpenVPN source code extracts can be obtained from http://sourceforge.net/cvs/?group_id=48978. Here you find the Concurrent Versions System (CVS) repository, where all OpenVPN developers post their changes to the project files. Installing OpenVPN on SuSE Linux Installing OpenVPN on SuSE Linux is almost as easy as installing under Windows or Mac OS X. Linux users may consider it even easier. On SuSE Linux almost all administrative tasks can be carried out using the administration interface YaST. OpenVPN can be installed completely using this. The people distributing SuSE have always tried to include up-to-date software in their distribution. Thus, the installation media of OpenSuSE 11 already contains version 2.0.9 of OpenVPN, and both the Enterprise editions SLES 10 and the forthcoming SLES 11 that offer five years of support. Updates include up-to-date versions of OpenVPN. Both OpenSuSE and SLES use YaST for installing software. Using YaST to install software Start YaST. Under both GNOME and the K Desktop Environment (KDE—the standard desktop under SuSE Linux), you will find YaST in the main menu under System | YaST, or as an icon on the Desktop. If you are logged in as a normal user, you will be prompted to enter your root password and confirm the same. The YaST control center is started. This administration interface consists of many different modules, which are represented by symbols in the right half of the window and grouped by the labels on the left. After starting YaST, click on the symbol labeled Software Management in the right column to start the software management interface of YaST. The software management tool in YaST is very powerful. Under SuSE, data about the installed and installable software is kept in a database, which can be searched very easily. Select the entry Search in the drop-down list Filter: and enter openvpn in the Search field. YaST will find at least one entry that matches your search value openvpn. Depending on the (online) installation sources that you have configured, various add-ons and tools for OpenVPN will be found. If you chose to add the community repositories like I did on this system, then OpenSuSE will list more than 10 hits. Select the entry openvpn by checking the box besides the entry in the first column. If you want to obtain information about the OpenVPN package, have a look at the lower half of the right side—here you will find the software Description, Technical Data, Dependencies, and more information about the package that you have selected. Click on the Accept button to start the OpenVPN installation. If you installed from a local medium, then put your CD or DVD in your local drive now. YaST will retrieve the OpenVPN files from your installation media. If you have configured your system to use one of the web/FTP servers of SuSE for installation, then this might take a while. The files are unpacked and installed on your system, and YaST updates the configuration. This is managed by the script SuSEconfig and other scripts that are called by it. SuSEconfig and YaST were once very infamous for deleting local configuration created by the local administrator or omitting relevant changes. This problem only occurred when updating and re-installing software that was previously installed. However, the latest SuSE versions have proven very reliable, and the system configuration tools never delete configuration files that you have added manually. Instead, the standard configuration files installed with the new software package may be renamed to <file>.rpmnew or similar, and your configuration is loaded. During installation, SuSEconfig calls several helper scripts and updates your configuration, and informs you of the progress in a separate window. After successful software installation, you are prompted if you want to install more packages or exit the installation. Click on the Finish button. The Novell/OpenSuSE teams have added a very handy tool called zypper to their package management. From version 10.1 onwards, you can simply install software from a root console by typing zypper in openvpn. Of course this only works if you know the exact name of the package that you want to install. If not, then you will have to search for it, for example, by using zypper search vpn. Installing OpenVPN on Red Hat Fedora using yum If you are using Red Hat Fedora, the Yellow dog Updater, Modified (yum) is probably the easiest way to install software. It can be found on http://linux.duke.edu/projects/yum/, and provides many interesting features, such as automatic updates, solving dependency problems, and managing installation of software packages. Even though OpenVPN installation on Fedora can only be done on the command line, it still is a very easy task. The installation makes use of the commands wget, rpm, and yum. wget: A command-line download manager suitable for ftp or http downloads. rpm: The Red Hat Package Manager is a software management system used by distributions like SuSE or Red Hat. It keeps track of changes and can solve dependencies between programs. yum: This provides a simple installation program for RPM-based software. To use yum, you have to adapt its configuration file as follows: Log in as administrator (root). Change to Fedora's configuration directory /etc. Save the old, probably original, configurations file yum.conf by renaming or moving it. You can use commands such as mv yum.conf yum.conf_fedora_org to accomplish this. The web site http://www.fedorafaq.org/ provides a suitable configuration file for yum. Download the file http://www.fedorafaq.org/samples/yum.conf using wget. The command-line syntax is: wget http://www.fedorafaq.org/samples/yum.conf At the same web site a sophisticated yum configuration is available for downloading. Install this as well: rpm -Uvh http://www.fedorafaq.org/yum The following excerpt shows the output of these five steps on the system: [root@fedora ~]# cd /etc [root@fedora etc]# mv yum.conf yum.conf.org [root@fedora etc]# wget http://www.fedorafaq.org/samples/yum.conf --11:33:25-- http://www.fedorafaq.org/samples/yum.conf => `yum.conf' Resolving www.fedorafaq.org... 70.84.209.18 Connecting to www.fedorafaq.org[70.84.209.18]:80... connected. HTTP request sent, awaiting response... 200 OK Length: 595 [text/plain] 100%[========================================================== ================================================>] 595 --.- -K/s 11:33:25 (405.20 KB/s) - `yum.conf' saved [595/595] [root@fedora etc]# rpm -Uvh http://www.fedorafaq.org/yum Retrieving http://www.fedorafaq.org/yum Preparing... ######################################### ## [100%] 1:yum-fedorafaq ######################################### ## [100%] [root@fedora etc]# The rest of the OpenVPN installation is very simple. Just enter yum install openvpn in your root shell. Now yum will start and give you a lot of output. We will have a short look at the things yum does. [root@fedora ~]#yum install openvpn Setting up Install Process Setting up repositories livna 100% |=========================| 951 B 00:00 updates-released 100% |=========================| 951 B 00:00 base 100% |=========================| 1.1 kB 00:00 extras 100% |=========================| 1.1 kB 00:00 Reading repository metadata in from local files primary.xml.gz 100% |=========================| 127 kB 00:00 livna : ################################################## 380/380 Added 380 new packages, deleted 0 old in 1.36 seconds primary.xml.gz 100% |=========================| 371 kB 00:00 updates-re: ################################################## 1053/1053 Added 0 new packages, deleted 13 old in 0.93 seconds yum has set up the installation process and integrated online repositories for the installation of software. This feature is the reason why Fedora does not need a URL source for installing OpenVPN. The repository metadata contains information about location, availability, and dependencies between packages. Resolving the dependencies is the next step. Parsing package install arguments Resolving Dependencies --> Populating transaction set with selected packages. Please wait. ---> Downloading header for openvpn to pack into transaction set. openvpn-2.0.9-1.fc5.i386. 100% |=========================| 18 kB 00:00 ---> Package openvpn.i386 0:2.0.9-1.fc5 set to be updated --> Running transaction check --> Processing Dependency: liblzo.so.1 for package: openvpn --> Restarting Dependency Resolution with new changes. --> Populating transaction set with selected packages. Please wait. ---> Downloading header for lzo to pack into transaction set. lzo-1.08-4.i386.rpm 100% |=========================| 3.2 kB 00:00 ---> Package lzo.i386 0:1.08-4 set to be updated --> Running transaction check Dependencies Resolved OpenVPN needs the LZO library for installation, and yum is about to resolve this dependency. As a next step, yum tests whether this library has unresolved dependencies. If this is not the case, we are presented with an overview of the packages to be installed. Confirm by entering y and press the Enter key. yum will start downloading the required packages. If the RPM process that yum is using to install the software packages encounters a missing encryption key, then confirm the import of this key from http://www.fedoraproject.org by entering y and pressing the Enter key. This GPG key is used to control the authenticity of the packages selected for installation. Key imported successfully Running Transaction Test Finished Transaction Test Transaction Test Succeeded Running Transaction Installing: lzo ######################### [1/2] Installing: openvpn ######################### [2/2] Installed: openvpn.i386 0:2.0.9-1.fc5 Dependency Installed: lzo.i386 0:1.08-4 Complete! [root@fedora etc]# That's all! yum has been downloaded, checked, and has installed OpenVPN and the LZO libraries.
Read more
  • 0
  • 0
  • 9773

Packt
27 Apr 2015
9 min read
Save for later

Apache Solr and Big Data – integration with MongoDB

Packt
27 Apr 2015
9 min read
In this article by Hrishikesh Vijay Karambelkar, author of the book Scaling Big Data with Hadoop and Solr - Second Edition, we will go through Apache Solr and MongoDB together. In an enterprise, data is generated from all the software that is participating in day-to-day operations. This data has different formats, and bringing in this data for big-data processing requires a storage system that is flexible enough to accommodate a data with varying data models. A NoSQL database, by its design, is best suited for this kind of storage requirements. One of the primary objectives of NoSQL is horizontal scaling, that is, the P in CAP theorem, but this works at the cost of sacrificing Consistency or Availability. Visit http://en.wikipedia.org/wiki/CAP_theorem to understand more about CAP theorem (For more resources related to this topic, see here.) What is NoSQL and how is it related to Big Data? As we have seen, data models for NoSQL differ completely from that of a relational database. With the flexible data model, it becomes very easy for developers to quickly integrate with the NoSQL database, and bring in large sized data from different data sources. This makes the NoSQL database ideal for Big Data storage, since it demands that different data types be brought together under one umbrella. NoSQL also has different data models, like KV store, document store and Big Table storage. In addition to flexible schema, NoSQL offers scalability and high performance, which is again one of the most important factors to be considered while running big data. NoSQL was developed to be a distributed type of database. When traditional relational stores rely on the high computing power of CPUs and the high memory focused on a centralized system, NoSQL can run on your low-cost, commodity hardware. These servers can be added or removed dynamically from the cluster running NoSQL, making the NoSQL database easier to scale. NoSQL enables most advanced features of a database, like data partitioning, index sharding, distributed query, caching, and so on. Although NoSQL offers optimized storage for big data, it may not be able to replace the relational database. While relational databases offer transactional (ACID), high CRUD, data integrity, and a structured database design approach, which are required in many applications, NoSQL may not support them. Hence, it is most suited for Big Data where there is less possibility of need for data to be transactional. MongoDB at glance MongoDB is one of the popular NoSQL databases, (just like Cassandra). MongoDB supports the storing of any random schemas in the document oriented storage of its own. MongoDB supports the JSON-based information pipe for any communication with the server. This database is designed to work with heavy data. Today, many organizations are focusing on utilizing MongoDB for various enterprise applications. MongoDB provides high availability and load balancing. Each data unit is replicated and the combination of a data with its copes is called a replica set. Replicas in MongoDB can either be primary or secondary. Primary is the active replica, which is used for direct read-write operations, while the secondary replica works like a backup for the primary. MongoDB supports searches by field, range queries, and regular expression searches. Queries can return specific fields of documents and also include user-defined JavaScript functions. Any field in a MongoDB document can be indexed. More information about MongoDB can be read at https://www.mongodb.org/. The data on MongoDB is eventually consistent. Apache Solr can be used to work with MongoDB, to enable database searching capabilities on a MongoDB-based data store. Unlike Cassandra, where the Solr indexes are stored directly in Cassandra through solandra, MongoDB integration with Solr brings in the indexes in the Solr-based optimized storage. There are various ways in which the data residing in MongoDB can be analyzed and searched. MongoDB's replication works by recording all operations made on a database in a log file, called the oplog (operation log). Mongo's oplog keeps a rolling record of all operations that modify the data stored in your databases. Many of the implementers suggest reading this log file using a standard file IO program to push the data directly to Apache Solr, using CURL, SolrJ. Since oplog is a collection of data with an upper limit on maximum storage, it is feasible to synch such querying with Apache Solr. Oplog also provides tailable cursors on the database. These cursors can provide a natural order to the documents loaded in MongoDB, thereby, preserving their order. However, we are going to look at a different approach. Let's look at the schematic following diagram: In this case, MongoDB is exposed as a database to Apache Solr through the custom database driver. Apache Solr reads MongoDB data through the DataImportHandler, which in turns calls the JDBC-based MongoDB driver for connecting to MongoDB and running data import utilities. Since MongoDB supports replica sets, it manages the distribution of data across nodes. It also supports Sharding just like Apache Solr. Installing MongoDB To install MongoDB in your development environment, please follow the following steps: Download the latest version of MongoDB from https://www.mongodb.org/downloads for your supported operating system. Unzip the zipped folder. MongoDB comes up with a default set of different command-line components and utilities:      bin/mongod: The database process.      bin/mongos: Sharding controller.      bin/mongo: The database shell (uses interactive JavaScript). Now, create a directory for MongoDB, which it will use for user data creation and management, and run the following command to start the single node server: $ bin/mongod –dbpath <path to your data directory> --rest In this case, --rest parameter enables support for simple rest APIs that can be used for getting the status. Once the server is started, access http://localhost:28017 from your favorite browser, you should be able to see following administration status page: Now that you have successfully installed MongoDB, try loading a sample data set from the book on MongoDB by opening a new command-line interface. Change the directory to $MONGODB_HOME and run the following command: $ bin/mongoimport --db solr-test --collection zips --file "<file-dir>/samples/zips.json" Please note that the database name is solr-test. You can see the stored data using the MongoDB-based CLI by running the following set of commands from your shell: $ bin/mongo MongoDB shell version: 2.4.9 connecting to: test Welcome to the MongoDB shell. For interactive help, type "help". For more comprehensive documentation, see        http://docs.mongodb.org/ Questions? Try the support group        http://groups.google.com/group/mongodb-user > use test Switched to db test > show dbs exampledb       0.203125GB local   0.078125GB test   0.203125GB > db.zips.find({city:"ACMAR"}) { "city" : "ACMAR", "loc" : [ -86.51557, 33.584132 ], "pop" : 6055, "state" :"AL", "_id" : "35004" } Congratulations! MongoDB is installed successfully Creating Solr indexes from MongoDB To run MongoDB as a database, you will need a JDBC driver built for MongoDB. However, the Mongo-JDBC driver has certain limitations, and it does not work with the Apache Solr DataImportHandler. So, I have extended Mongo-JDBC to work under the Solr-based DataImportHandler. The project repository is available at https://github.com/hrishik/solr-mongodb-dih. Let's look at the setting-up procedure for enabling MongoDB based Solr integration: You may not require a complete package from the solr-mongodb-dih repository, but just the jar file. This can be downloaded from https://github.com/hrishik/solr-mongodb-dih/tree/master/sample-jar. You will also need the following additional jar files:      jsqlparser.jar      mongo.jar These jars are available with the book Scaling Big Data with Hadoop and Solr, Second Edition for download. In your Solr setup, copy these jar files into the library path, that is, the $SOLR_WAR_LOCATION/WEB-INF/lib folder. Alternatively, point your container classpath variable to link them up. Using simple Java source code DataLoad.java (link https://github.com/hrishik/solr-mongodb-dih/blob/master/examples/DataLoad.java), populate the database with some sample schema and tables that you will use to load in Apache Solr. Now create a data source file (data-source-config.xml) as follows: <dataConfig> <dataSource name="mongod" type="JdbcDataSource" driver="com.mongodb. jdbc.MongoDriver" url="mongodb://localhost/solr-test"/> <document>    <entity name="nameage" dataSource="mongod" query="select name, price from grocery">        <field column="name" name="name"/>        <field column="name" name="id"/>        <!-- other files -->    </entity> </document> </dataConfig> Copy the solr-dataimporthandler-*.jar from your contrib directory to a container/application library path. Modify $SOLR_COLLECTION_ROOT/conf/solr-config.xml with DIH entry: <!-- DIH Starts --> <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">    <lst name="defaults">    <str name="config"><path to config>/data-source-config.xml</str>    </lst> </requestHandler>    <!-- DIH ends --> Once this configuration is done, you are ready to test it out. Access http://localhost:8983/solr/dataimport?command=full-import from your browser to run the full import on Apache Solr, where you will see that your import handler has successfully ran, and has loaded the data in Solr store, as shown in the following screenshot: You can validate the content created by your new MongoDB DIH by accessing the Solr Admin page, and running a query: Using this connector, you can perform operations for full-import on various data elements. Since MongoDB is not a relational database, it does support join queries. However, it supports selects, order by, and so on. Summary In this article, we have understood the distributed aspects of any enterprise search where went through Apache Solr and MongoDB together. Resources for Article: Further resources on this subject: Evolution of Hadoop [article] In the Cloud [article] Learning Data Analytics with R and Hadoop [article]
Read more
  • 0
  • 0
  • 9767

article-image-analyzing-moby-dick-frequency-distribution-nltk
Richard Gall
30 Mar 2018
2 min read
Save for later

Analyzing Moby Dick through frequency distribution with NLTK

Richard Gall
30 Mar 2018
2 min read
What is frequency distribution and why does it matter? In the context of natural language processing, frequency distribution is simply a tally of the number of times each unique word is used in a text. Recording the individual word counts of a text can better help us understand not only what topics are being discussed and what information is important but also how that information is being discussed as well. It's a useful method for better understanding language and different types of texts. This video tutorial has been taken from from Natural Language Processing with Python. Word frequency distribution is central to performing content analysis with NLP. Its applications are wide ranging. From understanding and characterizing an author’s writing style to analyzing the vocabulary of rappers, the technique is playing a large part in wider cultural conversations. It’s also used in psychological research in a number of ways to analyze how patients use language to form frameworks for thinking about themselves and the world. Trivial or serious, word frequency distribution is becoming more and more important in the world of research. Of course, manually creating such a word frequency distribution models would be time consuming and inconvenient for data scientists. Fortunately for us, NLTK, Python’s toolkit for natural language processing, makes life much easier. How to use NLTK for frequency distribution Take a look at how to use NLTK to create a frequency distribution for Herman Melville’s Moby Dick in the video tutorial above. In it, you'll find a step by step guide to performing an important data analysis task. Once you've done that, you can try it for yourself, or have a go at performing a similar analysis on another data set. Read Analyzing Textual information using the NLTK library. Learn more about natural language processing - read How to create a conversational assistant or chatbot using Python.
Read more
  • 0
  • 0
  • 9766

article-image-mastering-centos-7-linux-server-0
Packt
30 Dec 2015
19 min read
Save for later

Mastering CentOS 7 Linux Server

Packt
30 Dec 2015
19 min read
In this article written by Bhaskarjyoti Roy, author of the book Mastering CentOS 7 Linux Server, will introduce some advanced user and group management scenarios along with some examples on how to handle advanced level options such as password aging, managing sudoers, and so on, on a day to day basis. Here, we are assuming that we have already successfully installed CentOS 7 along with a root and user credentials as we do in the traditional format. Also, the command examples, in this chapter, assume you are logged in or switched to the root user. (For more resources related to this topic, see here.)  The following topics will be covered: User and group management from GUI and command line Quotas Password aging Sudoers Managing users and groups from GUI and command line We can add a user to the system using useradd from the command line with a simple command as follows: useradd testuser This creates a user entry in the /etc/passwd file and automatically creates the home directory for the user in /home. The /etc/passwd entry looks like this: testuser:x:1001:1001::/home/testuser:/bin/bash But, as we all know, the user is in a locked state and cannot login to the system unless we add a password for the user using the command: passwd testuser This will, in turn, modify the /etc/shadow file, at the same time unlock the user, and the user will be able to login to the system. By default, the preceding set of commands will create both a user and a group for the testuser on the system. What if we want a certain set of users to be a part of a common group? We will use the -g option along with the useradd command to define the group for the user, but we have to make sure that the group already exists. So, to create users such as testuser1, testuser2, and testuser3 and make them part of a common group called testgroup, we will first create the group and then we create the users using the -g or -G switch. So we will do this: # To create the group : groupadd testgroup # To create the user with the above group and provide password and unlock user at the same time : useradd testuser1 -G testgroup passwd testuser1 useradd testuser2 -g 1002 passwd testuser2 Here, we have used both -g and -G. The difference between them is: with -G, we create the user with its default group and assign the user to the common testgroup as well, but with -g, we created the user as part of the testgroup only. In both cases, we can use either the gid or the group name obtained from the /etc/group file. There are a couple of more options that we can use for an advanced level user creation, for example, for system users with uid less than 500, we have to use the -r option, which will create a user on the system but the uid will be less than 500. We also can use -u to define a specific uid, which must be unique and greater than 499. Common options that we can use with the useradd command are: -c: This option is used for comments, generally to define the user's real name such as -c "John Doe". -d: This option is used to define home-dir; by default, the home directory is created in /home such as -d /var/<user name>. -g: This option is used for the group name or the group number for the user's default group. The group must already have been created earlier. -G: This option is used for additional group names or group numbers, separated by commas, of which the user is a member. Again, these groups must also have been created earlier. -r: This option is used to create a system account with a UID less than 500 and without a home directory. -u: This option is the user ID for the user. It must be unique and greater than 499. There are few quick options that we use with the passwd command as well. These are: -l: This option is to lock the password for the user's account -u: This option is to unlock the password for the user's account -e: This option is to expire the password for the user -x: This option is to define the maximum days for the password lifetime -n: This option is to define the minimum days for the password lifetime Quotas In order to control the disk space used in the Linux filesystem, we must use quota, which enables us to control the disk space and thus helps us resolve low disk space issues to a great extent. For this, we have to enable user and group quota on the Linux system. In CentOS 7, the user and group quota are not enabled by default so we have to enable them first. To check whether quota is enabled, or not, we issue the following command: mount | grep ' / ' The image shows that the root filesystem is enabled without quota as mentioned by the noquota in the output. Now, we have to enable quota on the root (/) filesystem, and to do that, we have to first edit the file /etc/default/grub and add the following to the GRUB_CMDLINE_LINUX: rootflags=usrquota,grpquota The GRUB_CMDLINE_LINUX line should read as follows: GRUB_CMDLINE_LINUX="rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=auto  vconsole.keymap=us rhgb quiet rootflags=usrquota,grpquota" The /etc/default/grub should like the following screenshot: Since we have to reflect the changes we just made, we should backup the grub configuration using the following command: cp /boot/grub2/grub.cfg /boot/grub2/grub.cfg.original Now, we have to rebuild the grub with the changes we just made using the command: grub2-mkconfig -o /boot/grub2/grub.cfg Next, reboot the system. Once it's up, login and verify that the quota is enabled using the command we used before: mount | grep ' / ' It should now show us that the quota is enabled and will show us an output as follows: /dev/mapper/centos-root on / type xfs (rw,relatime,attr2,inode64,usrquota,grpquota) Now, since quota is enabled, we will further install quota using the following to operate quota for different users and groups, and so on: yum -y install quota Once quota is installed, we check the current quota for users using the following command: repquota -as The preceding command will report user quotas in a human readable format. From the preceding screenshot, there are two ways we can limit quota for users and groups, one is setting soft and hard limits for the size of disk space used or another is limiting the user or group by limiting the number of files they can create. In both cases, soft and hard limits are used. A soft limit is something that warns the user when the soft limit is reached and the hard limit is the limit that they cannot bypass. We will use the following command to modify a user quota: edquota -u username Now, we will use the following command to modify the group quota: edquota -g groupname If you have other partitions mounted separately, you have to modify the /etc/fstab to enable quota on the filesystem by adding usrquota and grpquota after the defaults for that specific partition as in the following screenshot, where we have enabled the quota for the /var partition: Once you are finished enabling quota, remount the filesystem and run the following commands: To remount /var : mount -o remount /var To enable quota : quotacheck -avugm quotaon -avug Quota is something all system admins use to handle disk space consumed on a server by users or groups and limit over usage of the space. It thus helps them manage the disk space usage on the system. In this regard, it should be noted that you plan before your installation and create partitions accordingly as well so that the disk space is used properly. Multiple separate partitions such as /var and /home etc are always suggested, as generally, these are the partitions, which consume most space on a Linux system. So, if we keep them on a separate partition, it will not eat up the root ('/') filesystem space and will be more failsafe than using an entire filesystem mounted as only root. Password aging It is a good policy to have password aging so that the users are forced to change their password at a certain interval. This, in turn, helps to keep the security of the system as well. We can use chage to configure the password to expire the first time the user logs in to the system. Note: This process will not work if the user logs in to the system using SSH. This method of using chage will ensure that the user is forced to change the password right away. If we use only chage <username>, it will display the current password aging value for the specified user and will allow them to be changed interactively. The following steps need to be performed to accomplish password aging: Lock the user. If the user doesn't exist, we will use the useradd command to create the user. However, we will not assign any password to the user so that it remains locked. But, if the user already exists on the system, we will use the usermod command to lock the user: Usermod -L <username> Force immediate password change using the following command: chage -d 0 <username> Unlock the account, This can be achieved in two ways. One is to assign an initial password and the other way is to assign a null password. We will take the first approach as the second one, though possible, is not a good practice in terms of security. Therefore, here is what we do to assign an initial password: Use the python command to start the command-line python interpreter: import crypt; print crypt.crypt("Q!W@E#R$","Bing0000/") Here, we have used the Q!W@E#R$ password with a salt combination of the alphanumeric character: Bing0000 followed by a (/) character. The output is the encrypted password, similar to 'BiagqBsi6gl1o'. Press Ctrl + D to exit the Python interpreter. At the shell, enter the following command with the encrypted output of the Python interpreter: usermod -p "<encrypted-password>" <username> So, here, in our case, if the username is testuser, we will use the following command: usermod -p "BiagqBsi6gl1o" testuser Now, upon initial login using the "Q!W@E#R$" password, the user will be prompted for a new password. Setting the password policy This is a set of rules defined in some files, which have to be followed when a system user is setting up. It's an important factor in security because one of the many security breach histories was started with hacking user passwords. This is the reason why most organizations set a password policy for their users. All usernames and passwords must comply with this. A password policy usually is defined by the following: Password aging Password length Password complexity Limit login failures Limit prior password reuse Configuring password aging and password length Password aging and password length are defined in /etc/login.defs. Aging basically means the maximum number of days a password might be used, minimum number of days allowed between password changes, and number of warnings before the password expires. Length refers to the number of characters required for creating the password. To configure password aging and length, we should edit the /etc/login.defs file and set different PASS values according to the policy set by the organization. Note: The password aging controls defined here does not affect existing users; it only affects the newly created users. So, we must set these policies when setting up the system or the server at the beginning. The values we modify are: PASS_MAX_DAYS: The maximum number of days a password can be used PASS_MIN_DAYS: The minimum number of days allowed between password changes PASS_MIN_LEN: The minimum acceptable password length PASS_WARN_AGE: The number of days warning to be given before a password expires Let's take a look at a sample configuration of the login.defs file: Configuring password complexity and limiting reused password usage By editing the /etc/pam.d/system-auth file, we can configure the password complexity and the number of reused passwords to be denied. A password complexity refers to the complexity of the characters used in the password, and the reused password deny refers to denying the desired number of passwords the user used in the past. By setting the complexity, we force the usage of the desired number of capital characters, lowercase characters, numbers, and symbols in a password. The password will be denied by the system until and unless the complexity set by the rules are met. We do this using the following terms: Force capital characters in passwords: ucredit=-X, where X is the number of capital characters required in the password Force lower case characters in passwords: lcredit=-X, where X is the number of lower case characters required in the password Force numbers in passwords: dcredit=-X, where X is the number numbers required in the password Force the use of symbols in passwords: ocredit=-X, where X is the number of symbols required in the password For example: password requisite pam_cracklib.so try_first_pass retry=3 type= ucredit=-2 lcredit=-2 dcredit=-2 ocredit=-2 Deny reused passwords: remember=X, where X is the number of past passwords to be denied For example: password sufficient pam_unix.so sha512 shadow nullok try_first_pass use_authtok remember=5 Let's now take a look at a sample configuration of /etc/pam.d/system-auth: Configuring login failures We set the number of login failures allowed by a user in the /etc/pam.d/password-auth, /etc/pam.d/system-auth, and /etc/pam.d/login files. When a user's failed login attempts are higher than the number defined here, the account is locked and only a system administrator can unlock the account. To configure this, make the following additions to the files. The following deny=X parameter configures this, where X is the number of failed login attempts allowed: Add these two lines to the /etc/pam.d/password-auth and /etc/pam.d/system-auth files and only the first line to the /etc/pam.d/login file: auth        required    pam_tally2.so file=/var/log/tallylog deny=3 no_magic_root unlock_time=300 account     required    pam_tally2.so The following screenshot is a sample /etc/pam.d/system-auth file: The following is a sample /etc/pam.d/login file: To see failures, use the following command: pam_tally2 –user=<User Name> To reset the failure attempts and to enable the user to login again, use the following command: pam_tally2 –user=<User Name> --reset Sudoers Separation of user privilege is one of the main features in Linux operating systems. Normal users operate in limited privileged sessions to limit the scope of their influence on the entire system. One special user exists on Linux that we know about already is root, which has super-user privileges. This account doesn't have any restrictions that are present to normal users. Users can execute commands with super-user or root privileges in a number of different ways. There are mainly three different ways to obtain root privileges on a system: Login to the system as root Login to the system as any user and then use the su - command. This will ask you for the root password and once authenticated, will give you the root shell session. We can disconnect this root shell using Ctrl + D or using the command exit. Once exited, we will come back to our normal user shell. Run commands with root privileges using sudo without spawning a root shell or logging in as root. This sudo command works as follows: sudo <command to execute> Unlike su, sudo will request the password of the user calling the command, not the root password. Sudo doesn't work by default and requires to be set up before it functions correctly. In the following section, we will see how to configure sudo and modify the /etc/sudoers file so that it works the way we want it to. visudo Sudo is modified or implemented using the /etc/sudoers file, and visudo is the command that enables us to edit the file. Note: This file should not be edited using a normal text editor to avoid potential race conditions in updating the file with other processes. Instead, the visudo command should be used. The visudo command opens a text editor normally, but then validates the syntax of the file upon saving. This prevents configuration errors from blocking sudo operations. By default, visudo opens the /etc/sudoers file in Vi Editor, but we can configure it to use the nano text editor instead. For that, we have to make sure nano is already installed or we can install nano using: yum install nano -y Now, we can change it to use nano by editing the ~/.bashrc file: export EDITOR=/usr/bin/nano Then, source the file using: . ~/.bashrc Now, we can use visudo with nano to edit the /etc/sudoers file. So, let's open the /etc/sudoers file using visudo and learn a few things. We can use different kinds of aliases for different set of commands, software, services, users, groups, and so on. For example: Cmnd_Alias NETWORKING = /sbin/route, /sbin/ifconfig, /bin/ping, /sbin/dhclient, /usr/bin/net, /sbin/iptables, /usr/bin/rfcomm, /usr/bin/wvdial, /sbin/iwconfig, /sbin/mii-tool Cmnd_Alias SOFTWARE = /bin/rpm, /usr/bin/up2date, /usr/bin/yum Cmnd_Alias SERVICES = /sbin/service, /sbin/chkconfig and many more ... We can use these aliases to assign a set of command execution rights to a user or a group. For example, if we want to assign the NETWORKING set of commands to the group netadmin we will define: %netadmin ALL = NETWORKING Otherwise, if we want to allow the wheel group users to run all the commands, we use the following command: %wheel  ALL=(ALL)  ALL If we want a specific user, john, to get access to all commands we use the following command: john  ALL=(ALL)  ALL We can create different groups of users, with overlapping membership: User_Alias      GROUPONE = abby, brent, carl User_Alias      GROUPTWO = brent, doris, eric, User_Alias      GROUPTHREE = doris, felicia, grant Group names must start with a capital letter. We can then allow members of GROUPTWO to update the yum database and all the commands assigned to the preceding software by creating a rule like this: GROUPTWO    ALL = SOFTWARE If we do not specify a user/group to run, sudo defaults to the root user. We can allow members of GROUPTHREE to shutdown and reboot the machine by creating a command alias and using that in a rule for GROUPTHREE: Cmnd_Alias      POWER = /sbin/shutdown, /sbin/halt, /sbin/reboot, /sbin/restart GROUPTHREE  ALL = POWER We create a command alias called POWER that contains commands to power off and reboot the machine. We then allow the members of GROUPTHREE to execute these commands. We can also create Run as aliases, which can replace the portion of the rule that specifies to the user to execute the command as: Runas_Alias     WEB = www-data, apache GROUPONE    ALL = (WEB) ALL This will allow anyone who is a member of GROUPONE to execute commands as the www-data user or the apache user. Just keep in mind that later, rules will override previous rules when there is a conflict between the two. There are a number of ways that you can achieve more control over how sudo handles a command. Here are some examples: The updatedb command associated with the mlocate package is relatively harmless. If we want to allow users to execute it with root privileges without having to type a password, we can make a rule like this: GROUPONE    ALL = NOPASSWD: /usr/bin/updatedb NOPASSWD is a tag that means no password will be requested. It has a companion command called PASSWD, which is the default behavior. A tag is relevant for the rest of the rule unless overruled by its twin tag later down the line. For instance, we can have a line like this: GROUPTWO    ALL = NOPASSWD: /usr/bin/updatedb, PASSWD: /bin/kill In this case, a user can run the updatedb command without a password as the root user, but entering the root password will be required for running the kill command. Another helpful tag is NOEXEC, which can be used to prevent some dangerous behavior in certain programs. For example, some programs, such as less, can spawn other commands by typing this from within their interface: !command_to_run This basically executes any command the user gives it with the same permissions that less is running under, which can be quite dangerous. To restrict this, we could use a line like this: username    ALL = NOEXEC: /usr/bin/less We should now have clear understanding of what sudo is and how do we modify and provide access rights using visudo. There are many more things left here. You can check the default /etc/sudoers file, which has a good number of examples, using the visudo command, or you can read the sudoers manual as well. One point to remember is that root privileges are not given to regular users often. It is important for us to understand what these command do when you execute with root privileges. Do not take the responsibility lightly. Learn the best way to use these tools for your use case, and lock down any functionality that is not needed. Reference Now, let's take a look at the major reference used throughout the chapter: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/System_Administrators_Guide/index.html Summary In all we learned about some advanced user management and how to manage users through the command line, along with password aging, quota, exposure to /etc/sudoers, and how to modify them using visudo. User and password management is a regular task that a system administrator performs on servers, and it has a very important role in the overall security of the system. Resources for Article:   Further resources on this subject: SELinux - Highly Secured Web Hosting for Python-based Web Applications [article] A Peek Under the Hood – Facts, Types, and Providers [article] Puppet Language and Style [article]
Read more
  • 0
  • 0
  • 9745
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at ₹800/month. Cancel anytime
article-image-chatgpt-for-natural-language-processing-nlp
Bhavishya Pandit
25 Sep 2023
10 min read
Save for later

ChatGPT for Natural Language Processing (NLP)

Bhavishya Pandit
25 Sep 2023
10 min read
Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights and books. Don't miss out – sign up today!IntroductionIn an era defined by the fusion of technology and human interaction, ChatGPT stands at the forefront as a groundbreaking creation. This marvel of machine learning, developed by OpenAI, has transcended mere algorithms to become a conversational AI that possesses the ability to engage, assist, and inspire. As a professional writer deeply immersed in both the realms of language and artificial intelligence, I am excited to delve into the capabilities of ChatGPT and explore its potential impact on a world increasingly reliant on Natural Language Processing (NLP). In this article, we will not only unveil the astonishing abilities of ChatGPT but also shed light on the burgeoning significance of NLP across diverse industries.Accessing GPT APIThe ChatGPT API provides a streamlined way to integrate the power of ChatGPT into applications and services. It operates through a simple yet effective mechanism: users send a list of messages as input, with each message having a 'role' (system, user, or assistant) and 'content' (the text of the message). The conversation typically begins with a system message to set the AI's behavior, followed by alternating user and assistant messages.The API returns a model-generated message as output, which can be easily extracted from the API response. To access this functionality, developers can obtain API keys through the OpenAI platform. These keys grant access to the API, enabling developers to harness the capabilities of ChatGPT within their applications and projects seamlessly.ChatGPT for various NLP tasks1. Sentiment Analysis with ChatGPTUsing ChatGPT for sentiment analysis is a straightforward yet powerful application. To perform sentiment analysis, you can send a message to ChatGPT with user or assistant roles and ask it to determine the sentiment of a piece of text. Here's an example in Python using the OpenAI Python library:import openai openai.api_key = "YOUR_API_KEY" def analyze_sentiment(text):    response = openai.ChatCompletion.create(        model="gpt-3.5-turbo",        messages=[            {"role": "user", "content": f"Analyze the sentiment of the following text: '{text}'"}        ]    )      sentiment = response['choices'][0]['message']['content']      return sentiment text_to_analyze = "I absolutely love this product!" sentiment_result = analyze_sentiment(text_to_analyze) print(f"Sentiment: {sentiment_result}") Potential Applications:1. Social Media Monitoring: ChatGPT's sentiment analysis can be invaluable for businesses and brands aiming to track public sentiment about their products or services on social media platforms. By analyzing user-generated content, companies can gain real-time insights into how their brand is perceived and promptly respond to both positive and negative feedback.2. Customer Feedback Analysis: ChatGPT can assist in automating the process of analyzing customer reviews and feedback. It can categorize comments as positive, negative, or neutral, helping businesses identify areas for improvement and understand customer sentiment more comprehensively.3. Market Research: Researchers can leverage ChatGPT's sentiment analysis capabilities to process large volumes of text data from surveys, focus groups, or online forums. This aids in identifying emerging trends, gauging public opinion, and making data-driven decisions.By integrating ChatGPT's sentiment analysis into these and other applications, organizations can harness the power of natural language understanding to gain deeper insights into the opinions, emotions, and attitudes of their audience, leading to more informed and effective decision-making.2. Language Translation with ChatGPTChatGPT can be harnessed for language translation tasks with ease. It's a versatile tool for converting text from one language to another. Here's a Python code example demonstrating how to use ChatGPT for language translation:import openai openai.api_key = "YOUR_API_KEY" def translate_text(text, source_language, target_language):    response = openai.ChatCompletion.create(        model="gpt-3.5-turbo",        messages=[            {"role": "user", "content": f"Translate the following text from {source_language} to {target_language}: '{text}'"}        ]    )      translation = response['choices'][0]['message']['content']      return translation source_text = "Hello, how are you?" source_language = "English" target_language = "French" translated_text = translate_text(source_text, source_language, target_language) print(f"Translated Text: {translated_text}") Relevance in Multilingual Content Creation and Internationalization:1. Multilingual Content Creation: In an increasingly globalized world, businesses and content creators need to reach diverse audiences. ChatGPT's language translation capabilities facilitate the creation of multilingual content, enabling companies to expand their market reach and engage with customers in their native languages. This is crucial for marketing campaigns, websites, and product documentation.2. Internationalization: For software and apps aiming to go international, ChatGPT can assist in translating user interfaces and content into multiple languages. This enhances the user experience and makes products more accessible to a global user base.3. Cross-Cultural Communication: ChatGPT can help bridge language barriers in real-time conversations, facilitating cross-cultural communication. This is beneficial in customer support, online chat, and international business negotiations.By leveraging ChatGPT's language translation capabilities, organizations and individuals can enhance their global presence, foster better communication across languages, and tailor their content to a diverse and international audience. This, in turn, can lead to increased engagement, improved user satisfaction, and broader market opportunities.3. Text Summarization with ChatGPTChatGPT can be a valuable tool for generating concise and coherent text summaries from lengthy articles or documents. It leverages its natural language processing capabilities to extract the most important information and present it in a condensed form. Here's a Python code example illustrating how to use ChatGPT for text summarization:import openai openai.api_key = "YOUR_API_KEY" def generate_summary(text, max_tokens=50):    response = openai.ChatCompletion.create(        model="gpt-3.5-turbo",        messages=[            {"role": "user", "content": f"Summarize the following text: '{text}'", "role": "assistant", "content": f"Please summarize the following text to around {max_tokens} tokens:"}        ]    )      summary = response['choices'][0]['message']['content']      return summary document_text = SAMPLE_TEXT summary_result = generate_summary(document_text) print(f"Summary: {summary_result}")Applications in Content Curation and Information Extraction:1. Content Curation: Content creators, marketers, and news aggregators can use ChatGPT to automatically summarize news articles, blog posts, or research papers. This streamlines the process of identifying relevant and interesting content to share with their audience.2. Research and Study: Researchers and students can employ ChatGPT to condense lengthy academic papers or reports into more manageable summaries. This helps in quickly grasping the key findings and ideas within complex documents.3. Business Intelligence: In the corporate world, ChatGPT can be employed to summarize market reports, competitor analyses, and industry trends. This enables executives and decision-makers to stay informed and make strategic choices more efficiently.By integrating ChatGPT's text summarization capabilities into various applications, users can enhance their ability to sift through and distill vast amounts of textual information, ultimately saving time and improving decision-making processes.4. Question Answering with ChatGPTChatGPT excels at answering questions, making it a versatile tool for building chatbots, virtual assistants, and FAQ systems. It can provide informative and context-aware responses to a wide range of queries. Here's a Python code example illustrating how to use ChatGPT for question answering:import openai openai.api_key = "YOUR_API_KEY" def ask_question(question, context):    response = openai.ChatCompletion.create(        model="gpt-3.5-turbo",        messages=[            {"role": "user", "content": f"Context: {context}"},            {"role": "user", "content": f"Question: {question}"}        ]    )      answer = response['choices'][0]['message']['content']      return answer context = "The Eiffel Tower is a famous landmark in Paris, France. It was completed in 1889 and stands at 324 meters tall." question = "When was the Eiffel Tower built?" answer_result = ask_question(question, context) print(f"Answer: {answer_result}")Use in Chatbots, FAQs, and Virtual Assistants:1. Chatbots: ChatGPT can serve as the core intelligence behind chatbots, responding to user inquiries and engaging in natural conversations. Businesses can use chatbots for customer support, lead generation, and interactive marketing, delivering real-time assistance to users.2. FAQ Systems: Implementing ChatGPT in FAQ systems allows users to ask questions in a more natural and conversational manner. It ensures that users receive accurate and context-aware responses from a repository of frequently asked questions.3. Virtual Assistants: Virtual assistants powered by ChatGPT can assist users in various tasks, such as scheduling appointments, providing information, and even helping with language translation or summarization. They can be integrated into websites, applications, or devices to enhance user experiences.By harnessing ChatGPT's question-answering capabilities, organizations can create intelligent and responsive digital agents that deliver efficient and accurate information to users, improving customer satisfaction and user engagement across a wide range of applications.Ethical ConsiderationsAI and NLP technologies, like ChatGPT, raise ethical concerns, primarily concerning bias and misuse. Biases in training data can lead to unfair or discriminatory responses, while misuse can involve generating harmful content or misinformation. To responsibly use ChatGPT, consider:1. Bias Mitigation: Carefully curate and review training data to minimize biases. Implement debiasing techniques and provide guidelines for human reviewers to ensure fairness.2. Transparency: Be transparent about the AI's capabilities and limitations. Avoid giving it false identities or promoting misleading information.3. Content Moderation: Implement strong content moderation to prevent misuse. Regularly monitor and fine-tune the AI's responses to ensure they align with ethical standards.4. User Education: Educate users on the nature of AI-generated content, promoting critical thinking and responsible consumption.By proactively addressing these ethical concerns and adhering to guidelines, we can harness AI and NLP technologies like ChatGPT for positive, inclusive, and responsible outcomes.ConclusionIn conclusion, ChatGPT is a remarkable AI tool that showcases the transformative potential of Natural Language Processing (NLP). Key takeaways include its capabilities in sentiment analysis, language translation, text summarization, question answering, and chatbot development. However, ethical considerations like bias and misuse are critical and must be addressed responsibly. I encourage readers to harness ChatGPT and NLP in their projects, emphasizing transparency, bias mitigation, and responsible usage. By doing so, we can unlock the vast possibilities of these technologies while fostering fairness, accuracy, and positive impact across various domains. Explore, innovate, and shape a future where language and AI empower us all.Author BioBhavishya Pandit is a Data Scientist at Rakuten! He has been extensively exploring GPT to find use cases and build products that solve real-world problems.
Read more
  • 0
  • 0
  • 9744

article-image-design-tools-and-basics
Packt
29 Aug 2013
20 min read
Save for later

Design Tools and Basics

Packt
29 Aug 2013
20 min read
(For more resources related to this topic, see here.) Owning a Makerbot 3D printer means being able to make anything you want at a push of a button, right? 3D printer owners quickly find that while 3D printers have no end of things they can produce, they also are not without their limitations. Designing an object without 3D printing in mind will result in a failed print that more resembles a bird nest or a bowl of spaghetti. Making a 3D printable object requires learning a few rules, some careful planning, and design. But once you know the rules the results can be astounding. 3D printers can even produce things with ease that traditional manufacturing cannot, for example, objects with complex internal geometry that machining cannot touch. There are many places online such as Makerbot's own Thingiverse that hosts a daily growing library of printable objects. Printing out other people's designs is all well and good for a while, but the most exciting part about 3D printing is that it can produce your designs and models. Eventually, learning how to model for 3D printing is a must. Can you learn 3D modeling? If you've ever won a round of Pictionary you've got all the artistic skill it takes to get started. If you've ever gotten past level 1 on Tetris then you've got spatial reasoning. If you've ever played with modeling clay then you know all about designing in three dimensions. Design basics There are some design rules and basic ideas that will be true regardless of the modeling software used. The working of 3D printing 3D printing has come a long way in terms of technology and cost allowing home 3D printers to be a reality. In this process there have been choices that will limit what can be printed. Seeing a 3D printer in action is the best way to learn about the process. Fortunately there are many 3D printing time lapse videos online of printers in action that can be found with a simple search. 3D printers build an object layer-by-layer from the bottom to the top. Plastic filament is heated and extruded, and each layer is built upon the last one. Usually the outside of the object is drawn and sometimes additional shells are added for strength. Then the inside is usually filled with a lattice to save plastic and provide some support for higher layers, however the inside is mostly air. This continues until the object is complete as shown in the following screenshot: Because of this layer-by-layer process, if a design is made so that any part has nothing underneath it, dangling in the air, then the printer will still extrude some plastic to try to print the part which will just dangle from the nozzle and be dragged into the next area where it will build up an ugly mess and ruin the print: Building for supportless prints One way of fixing the dangling object problem is to configure the preparing software to build the model "with supports". This means the slicer will automatically build a support lattice of plastic, up to the dangling part so that it has something to print on. Higher-end printers can actually print with a different material that can be dissolved away, but so far most home printers only use break-away supports. Either way after the print is complete it is left to the user to clean up this support material to extract the desired part. While supports do allow the creation of objects that would be impossible any other way, the supports themselves are a waste of material and often don't remove cleanly leading to a messy bottom surface where they contact the print. If a part is designed needing supports that are hard to remove, such as if they're internal and partially obscured, it can be difficult and frustrating to completely remove the support material (this can be true for even the higher-end 3D printers). The process of removing it may actually damage the print. It is possible and very easy with just the slightest application of cleverness to make designs that are printable without the need for any supports. So the blueprints in this article focus on making designs that print without supports. The limitations imposed by this demands just a little more effort but allow for the teaching of principles that are generally good to know. Designing for dual extruders Some models of Makerbot and other 3D printers have the ability to print in multiple colors at once using two different extruder heads feeding plastic from two different spools. There are some fun prints that come from this process. But as most Makerbots and other brands of home 3D printers do not have dual extruders at this time this article will not explore this process in detail. The basic idea of the process is creating two files that are aligned to print in the same space and combining them in the slicer. Designing supportless – overhangs and bridges When designing for supportless printing the rules are simple: Y prints, H prints okay, T does not print well. Branching out with overhangs It is possible to have the current layer slightly larger than the previous layer provided the overhang is not more than 45 degrees. This is because the current layer will have enough of the previous layer to stick to. Hence a shape like the capital letter Y will successfully print standing up. However, if the overhang is too great or too abrupt the new layer will droop causing a print fail, hence a shape like the capital letter T does not print. (If the T is serif and thus has downward dangling bits, it will fail even worse, as illustrated previously.) So it is important to try to keep overhangs within a 45 degree cone as they go upwards. Building bridges If a part of the print has nothing above it, but has something on either side that it can attach to, then it may be able to bridge the gap. But use caution. The printer makes no special effort in making bridges; they are drawn like any other layer: outline first, then infill. As long as the outline has something to attach to on both sides it should be fine. But if that outline is too complex or contains parts that will print in mid-air, it may not succeed. Being aware of bridges in the design and keeping them simple is the key to successful bridging. Even with a simple bridge some 3D printers need a little bit more calibration to print it well. Hence a shape like the capital letter H will successfully print most of the time. Of course this discussion is purely illustrative of the way overhangs work or fail. In real life if a Y, H, or T needed to be printed the best way to do it would be to lay them down. But for purposes of illustration it still stands that Y prints, H prints okay, T does not. Choosing a modeling tool There are many choices of modeling programs that can be used to produce 3D printable objects. There are many factors including versatility, simplicity, and cost to take into account. A tool with too steep a learning curve can turn off new users to the idea all together. A tool with too limited a set of tools can frustrate a user when they hit the limit. Investing a lot of money into something that doesn't end up going anywhere can be extremely disappointing. So it is important to explore the options. SolidWorks (www.solidworks.com) and other drafting oriented programs can do technical shapes with extreme precision. They include the necessary tools to accurately describe a shape that can be brought into the real work with high fidelity. However these sorts of tools tend to be costly and don't do artistic or organic shapes very well. Their highly technical nature also gives them a steep learning curve. OpenSCAD (www.openscad.org) is free and famous among the people who make 3D printers and can make technically accurate models as well. OpenSCAD also allows the models to be parametric, meaning that by changing a few variables and recalculating a new shape is generated. But OpenSCAD is difficult to use unless the user has a very technical and programmatic mind since the shapes are literally built from lines of code. Zbrush (pixologic.com/zbrush), Sculptris (pixologic.com/sculptris), or Wings3D (www.wings3d.com) are great tools for modeling organic shapes like the kind used in video games or animation. Sculptris and Wings3D are free and are very easy to pick up and use. But these tools lack when precision is necessary. Sketchup (www.sketchup.com) is a great free program with a library of shapes built in ready to import and play with. Its modeling tools are great for precise or architectural models. Sketchup doesn't do organic shapes well either and it can be tricky loading the plug-ins necessary for Sketchup to export their models to something printable. Even then models from Sketchup often have to go through an extensive clean up phase before they'll be ready to print. Autodesk 123D (www.123dapp.com) is not one but a whole suite of free programs designed around 3D modeling with specific focus on 3D printing. There are programs to design creatures or precise shapes. There is even an app for converting pictures of real life objects into 3D models. Some are programs that run in browser, some are downloads and some are apps for Apple devices. It's an eclectic and powerful group of programs. The Autodesk 123D suite's weakness is in its general immaturity. Autodesk is making great efforts to make modeling for 3D printing accessible for everyone but its tools still need to mature somewhat before they'll be ready to explore in depth. Blender (blender.org) is a 3D animation program that features a robust set of modeling tools. Good for artistic and organic shapes, it can also be used when precision is needed. On top all that it is free and open source, so it is still in constant development. If Blender doesn't have a particular feature it is only a matter of time until it will be added. If fact by the time this article is published chances are the version of Blender used to make it will already be out of date, but most of Blender's functionality remains unchanged version-to-version. Blender is also completely customizable so that every feature, from the key strokes used to the overall look, can be changed. The downside of Blender is that its user interface is somewhat unintuitive. This causes Blender to have a famously difficult learning curve. Because of Blender's versatility and availability it is the tool of choice for the beginning 3D designer and this work. Installing Blender This will be the first project in the article. Fortunately downloading and installing Blender is as easy as 1-2-3-4. Go to www.blender.org. Click on the Download link. Choose and download the installer for your system: Windows, Mac, or Linux, 32 bit or 64 bit. Run the installer. The installer will guide the process of loading Blender and adding icons to the system. Windows Blender.org offers installer executable and ZIP files. The zip files are for advanced users who want a portable version of Blender. When in doubt choose the executable since it will set up icons making for easy access. If in doubt whether to use the 32 or 64-bit versions picking the 32-bit will insure compatibility, but it is a good idea to find out what type of system it is being installed on as 64-bit offers significant performance improvements. Windows 7 or greater will confirm that the installer should be run. Click on Yes to assure Windows that it's okay to install Blender. Then the installer will run. The install wizard's defaults are fine for most users. Simply put the mouse over the Next button and click on every button that appears under it. On the second screen read over the Blender Terms of Service and click on I Agree to proceed. Unless you manage your installed programs directories yourself it is best to leave the defaults on the third screen as it is. Then click on Next and the install process will start. When the install process finishes leave the check box check and click on Finish to exit the installer and run Blender. Getting acquainted with Blender When Blender starts up, the Splash Screen can be dismissed by left-clicking outside the screen. The screenshots in this article use a custom color palette for print and a smaller window to minimize wasted screen space. Customizing the color palette will be discussed briefly later but these cosmetic changes will not affect the instructions presented at all. The Blender interface is broken into several different customizable panels to keep things organized. Each panel has resizing widgets in the upper-right and lower-left corners. By clicking on these widgets the panels can be split to add more panels or expanded into the territory of another to collapse panels at the user's preference. However, for now the default panels will be discussed as they provide the most common functionality for beginners. The 3D View panel The main window where things will be happening is the 3D View panel. The largest portion of the 3D view consists of the viewport where most of the work will take place. On the left-hand side of the 3D View panel is a tool bar that consists of tools relevant to the selection and mode. If the tool bar is ever not visible it can be revealed (or hidden again) by pressing Tor by clicking on the plus icon on the right-hand side that will appear when the toolbar is hidden. The specific tools in this bar will be explored as they are needed in the projects. There is another plus icon on the right that will bring up the viewport properties with properties relevant to the current selection or the viewport. This can also be revealed or hidden by pressing the Nkey. Again, the specifics will be explored further as needed. At the bottom of the 3D View panel is the 3D view menu bar with additional options followed by menus and icons related to editing and views. Hovering the mouse over each button will show what they are for. The Outliner panel The Outliner panel contains a hieratical view of all the objects in the scene. Each object can be selected by clicking on their name or the object can be hidden, locked, or excluded from rendering. Rendering means making a high quality picture from a scene for things such as animation or presentations. Doing a proper render includes setting up scene lights, cameras, textures, material properties, and many other functions that will not be explored in this article as it does not do anything that helps produce models for printing. However, exploring this functionality outside of this article can be good when trying to show off the models if printing them is not an option. The Properties panel The Properties panel is broken up into many tabs indicated by small icons. Hovering over the icons will show the name of the tab. For modeling the two tabs that will be used the most are the object and modifier tabs. Specific exploration of the tools contained therein will be done as needed. The Info panel On the top of the Blender windows is the Info panel. On the left-hand side of the Info panel there is an easy to navigate menu similar to the menu in most applications. This menu can be collapsed by clicking on the + button next to it and expanded by clicking the same. On the far right of the Info panel there is data about the current scene. If the data cannot be seen, hover the mouse over the panel, and use the middle scroll wheel or click-and-hold the middle mouse button (pressing the wheel like a button) and moving the mouse to pan the panel until the desired data is visible. The Timeline panel The Timeline panel is only relevant to doing animation and can be effectively ignored or collapsed for the purposes of this article. Because this article is only using a limited subset of Blender's functionality some things such as the Timeline panel could be customized away. However, since it is not the focus of this article to tell the reader how to customize their version of Blender, and because Blender has a much broader application, the screenshots that follow will have the Timeline panel visible. The reader is encouraged to explore Blender's other functionalities such as rendering and animation at their leisure and desire. Proper stance While all of Blender's functions are available from buttons and menus on the screen, typical Blender users rely heavily on hotkeys and shortcuts. Already the T and N keys have been discussed to bring up or collapse the Tools and Properties tabs in the 3D view. For this reason it is recommended that to use Blender have one hand on the mouse and the other hand on the keyboard at all times. This tends to be a common stance for many people but is mentioned for the few for whom learning this will be of great help. Blender customization One of Blender's strengths is its customizability. Almost every feature from the look and color, right down to the keystrokes and hotkeys are used for every action in Blender. Customization is accomplished in the Properties menu accessed from the File | User Preferences menu. The buttons across the top, switch to the various categories. Each category is packed with options. A full exploration of these options is beyond the scope of this article but the reader is encouraged to explore these options and make Blender their own. For instance if the reader is using a setup where a middle mouse button is unavailable, Blender contains an option to emulate the middle mouse button by pressing Alt along with left-click. Other systems may require other accommodations many of which are available in this menu. Setting up for Mac OSX Mac OSX users require special consideration. Blender is made for a three button mouse. If a single button mouse is all that is available, click when the instructions say Left-Mouse Button, use Alt with click for Middle-Mouse Button, and press command with click for Right-Mouse Button. General Blender tips Blender employs some conventions that are unique to its environment and as such getting acquainted with its most common quirks early can avoid frustration. First and perhaps most importantly, Ctrl+ Z for undo works in Blender will undo a multitude of mistakes. Undo in Blender remembers many past steps allowing backing up to a point before a grievous error was made. Remembering this when following along with the blueprints that follow will save the reader much frustration. Next, the location of the mouse pointer is important when using hotkeys. For example the T and N keys for the Tools and Properties tabs do not bring up those tabs if the mouse pointer is not hovering over the 3D View. If the mouse is hovering over a different panel the reaction could be unpredictable. Pressing the A key with the mouse over the 3D View will toggle selection of all objects in the scene. Pressing the A key with the mouse over the Object Tools tab will collapse the expandable menu hiding all the options of that menu. Blender uses the right-click on the mouse to select objects by default. This is perhaps the most counter intuitive thing for first time users, particularly because it will be encountered so frequently. But not everything has been swapped, just object selection. This behavior can of course be customized. If the reader would like to customize selection to the left mouse button then it is left to them to adjust the instructions accordingly. Finally, the relation of Blender units to real life units is not by default defined in Blender. Generally it is just easier to remember that 1 grid point will translate to 1 millimeter in the printed object. As the scale is increased Blender inserts darker grid lines every 10 grid lines by default which correlate to centimeters. So the default cube in the default scene would measure 2 mm on each side, which is less than 1/10 of an inch, which is very small. Suggested shortcuts In the projects in this article, when a new idea is introduced it will first be introduced with detailed steps. Once a process is taught the next time the name of the operation and the shortcut for that process will be all that is given. This does not mean that the keyboard shortcuts are the only way to do an operation but they are often the preferred method for experienced Blender users. The reader is free to accomplish the operation in any way that is comfortable for them. The blueprints This article has been designed to teach 3D printing design in a hands-on approach. A series of projects or blueprints will be presented and each one will introduce new tools and techniques. Each one builds on the last. Despite being a "virtual" process, 3D modeling has a surprisingly muscle memory aspect to it. The movements and processes need to be more than a mechanical process being executed, they need to be practiced so they can be fluid and eventually seamless. To that end the reader is advised not to skip any of the blueprints and follow along actually doing each one. The objects being designed in this article are, most of them, very small so that they can be printed without taking too much time or producing too much waste. The reader can make larger versions if they like but that is left for their own challenge activities. Summary 3D printing is cool. Learning to design your own models is the best way to take full advantage of 3D printing today. This article will teach 3D modeling by a series of hands-on activities so it's a good idea not to skip and actually follow along with each blueprint. While home 3D printers have the capability to do break away supports these are messy and wasteful. It is possible to design things to be able to print without the need of any supports. When designing things for support-less 3D prints remember Y prints, H prints okay, T does not print. Keep outward inclines gradual and no more than 45 degrees to be safe. There are many 3D modeling programs to choose from. Some are expensive, some are free. Some are better for technical works, others do artistic or organic shapes better. Some are easy to learn, some take more practice. This article will use Blender since it is free and open source, has tools for modeling technical and organic shapes and is not too difficult to learn if you learn by doing. Blender can be a bit tricky to get started with since it employs some conventions unique to its environment. Blender can be customized but this article will stick with the defaults so everyone is on the same page. Generally remember that Ctrl+ Z undoes a multiple mistakes and can get Blender back to the state it was before, useful in tutorials to get back on track. The location of the mouse pointer is important when using Blender's hotkeys, which is the best way to learn to use Blender. Blender uses the right-click on mouse for selection by default. Finally, Blender's units translate to real life by 1 Blender grid space = 1 millimeter. Resources for Article: Further resources on this subject: Building Your First Bean in ColdFusion [Article] Designer Friendly Templates [Article] The Trivadis Integration Architecture Blueprint [Article]
Read more
  • 0
  • 0
  • 9734

article-image-oracle-vm-management
Packt
16 Oct 2009
6 min read
Save for later

Oracle VM Management

Packt
16 Oct 2009
6 min read
Before we get to manage the VMs in the Oracle VM Manager, let's take a quick look at the Oracle VM Manager by logging into it. Getting started with Oracle VM Manager In this article, we will perform the following actions while exploring the Oracle VM Manager: Registering an account Logging in to Oracle VM Manager Create a Server Pool After we are done with the Oracle VM Manager installation, we will use one of the following links to log on to the Oracle VM Manager: Within the local machine: http://127.0.0.1:8888/OVS Logging in remotely: http://vmmgr:8888/OVS Here, vmmgr refers to the host name or IP address of your Oracle VM Manager host. How to register an account Registering of an account can be done in several ways. If, during the installation of Oracle VM Manager, we have chosen to configure the default admin account "admin", then we can use this account directly to log on to Oracle's IntraCloud portal we call Oracle VM Manager. We will explain later in detail about the user accounts and why we would need separate accounts for separate roles for fine-grained access control; something that is crucial for security purposes. So let's have a quick look at the three available options: Default installation: This option applies if we have performed the default installation ourselves and have gone ahead to create the account ourselves. Here we have the default administrator role. Request for account creation: Contacting the administrator of Oracle VM Manager is another way to attain an account with the privileges, such as administrator, manager, and user. Create yourself: If we need to conduct basic functions of a common user with operator's role such as creating and using virtual machines, or importing resources, we can create a new account ourselves. However, we will need the administrator to assign us the server pools and groups to our account before we can get started. Here by default we are granted a user role. We will talk more about roles later in this article. Now let's go about registering a new account with Oracle VM Manager. Once on the Oracle VM Manager Login page click on the Register link. We are presented with the following screen. We must enter a Username of our choice and a hard-to-crack password twice. Also, we have to fill in our First Name and Last Name and complete the registration with a valid email address. Click Next: Next, we need to confirm our account details by clicking on the Confirm button. Now our account will be created and a confirmation message is displayed on the Oracle VM Manager Login screen. It should be noted that we will need some Server Pools and groups before we can get started. We will have to ask the administrator to assign us access to those pools and groups. It's time now to login to our newly created account. Logging in to Oracle VM Manager Again we will need to either access the URL locally by typing http://127.0.0.1:8888/OVS or by typing the following: http://hostname:8888/ OVS. If we are accessing the Oracle VM Manager Portal remotely, replace the "hostname" with either the FQDN (Fully Qualified Distinguished Name) if the machine is registered in our DNS or just the hostname of the VM Manager machine. We can login to the portal by simply typing in our Username and Password that we just created. Depending on the role and the server pools that we have been assigned, we will be displayed with the tabs upon the screen as shown in the following table. To change the role, we will need to contact our enterprise domain administrator. Only administrators are allowed to change the roles of accounts. If we forget our password, we can click on Forgot Password and on submitting our account name, the password will be sent to the registered email address that we had provided when we registered the account. The following table discusses the assigned tabs that are displayed for each Oracle VM Manager roles:   Role Grants User Virtual Machines, Resources Administrator Virtual Machines, Resources, Servers, Server Pools, Administration Manager Virtual Machines, Resources, Servers, Server Pools   We can obviously change the roles by editing the Profile (on the upper-right section of the portal). As it can be seen in the following screenshot, we have access to the Virtual Machines pane and the Resources pane. We will continue to add Servers to the pool when logged in as admin. Oracle VM management: Managing Server Pool A Server Pool is logically an autonomous region that contains one or more physical servers and the dynamic nature of such pool and pools of pools makes what we call  an infinite Cloud infrastructure. Currently Oracle has its Cloud portal with Amazon but it is very much viable to have an IntraCloud portal or private Cloud where we can run all sorts of Linux and Windows flavors on our Cloud backbone. It eventually rests on the array of SAN, NAS, or other next generation storage substrate on which the VMs reside. We must ensure that we have the following prerequisites properly checked before creating the Virtual Machines on our IntraCloud Oracle VM. Oracle VM Servers: These are available to deploy as Utility Master, Server Master pool, and Virtual Machine Servers. Repositories: Used for Live Migration or Hot Migration of the VMs and for local storage on the Oracle VM Servers. FQDN/IP address of Oracle VM Servers: It is better to have the Oracle VM Servers known as OracleVM01.AVASTU.COM and OracleVM02.AVASTU. COM. This way you don't have to bother about the IP changes or infrastructural relocation of the IntraCloud to another location. Oracle VM Agent passwords: Needed to access the Oracle VM Servers. Let's now go about exploring the designing process of the Oracle VM. Then we will do the following systematically: Creating the Server Pool Editing Server Pool information Search and retrieval within Server Pool Restoring Server Pool Enabling HA Deleting a Server Pool However, we can carry out these actions only as a Manager or an Administrator. But first let's take a look at the decisions on what type of Server Pools will suit us the best and what the architectural considerations could be around building your Oracle VM farm.
Read more
  • 0
  • 0
  • 9730

article-image-hdfs-and-mapreduce
Packt
17 Jul 2014
11 min read
Save for later

HDFS and MapReduce

Packt
17 Jul 2014
11 min read
(For more resources related to this topic, see here.) Essentials of HDFS HDFS is a distributed filesystem that has been designed to run on top of a cluster of industry standard hardware. The architecture of HDFS is such that there is no specific need for high-end hardware. HDFS is a highly fault-tolerant system and can handle failures of nodes in a cluster without loss of data. The primary goal behind the design of HDFS is to serve large data files efficiently. HDFS achieves this efficiency and high throughput in data transfer by enabling streaming access to the data in the filesystem. The following are the important features of HDFS: Fault tolerance: Many computers working together as a cluster are bound to have hardware failures. Hardware failures such as disk failures, network connectivity issues, and RAM failures could disrupt processing and cause major downtime. This could lead to data loss as well slippage of critical SLAs. HDFS is designed to withstand such hardware failures by detecting faults and taking recovery actions as required. The data in HDFS is split across the machines in the cluster as chunks of data called blocks. These blocks are replicated across multiple machines of the cluster for redundancy. So, even if a node/machine becomes completely unusable and shuts down, the processing can go on with the copy of the data present on the nodes where the data was replicated. Streaming data: Streaming access enables data to be transferred in the form of a steady and continuous stream. This means if data from a file in HDFS needs to be processed, HDFS starts sending the data as it reads the file and does not wait for the entire file to be read. The client who is consuming this data starts processing the data immediately, as it receives the stream from HDFS. This makes data processing really fast. Large data store: HDFS is used to store large volumes of data. HDFS functions best when the individual data files stored are large files, rather than having large number of small files. File sizes in most Hadoop clusters range from gigabytes to terabytes. The storage scales linearly as more nodes are added to the cluster. Portable: HDFS is a highly portable system. Since it is built on Java, any machine or operating system that can run Java should be able to run HDFS. Even at the hardware layer, HDFS is flexible and runs on most of the commonly available hardware platforms. Most production level clusters have been set up on commodity hardware. Easy interface: The HDFS command-line interface is very similar to any Linux/Unix system. The commands are similar in most cases. So, if one is comfortable with the performing basic file actions in Linux/Unix, using commands with HDFS should be very easy. The following two daemons are responsible for operations on HDFS: Namenode Datanode The namenode and datanodes daemons talk to each other via TCP/IP. Configuring HDFS All HDFS-related configuration is done by adding/updating the properties in the hdfs-site.xml file that is found in the conf folder under the Hadoop installation folder. The following are the different properties that are part of the hdfs-site.xml file: dfs.namenode.servicerpc-address: This specifies the unique namenode RPC address in the cluster. Services/daemons such as the secondary namenode and datanode daemons use this address to connect to the namenode daemon whenever it needs to communicate. This property is shown in the following code snippet: <property> <name>dfs.namenode.servicerpc-address</name> <value>node1.hcluster:8022</value> </property> dfs.namenode.http-address: This specifies the URL that can be used to monitor the namenode daemon from a browser. This property is shown in the following code snippet: <property> <name>dfs.namenode.http-address</name> <value>node1.hcluster:50070</value> </property> dfs.replication: This specifies the replication factor for data block replication across the datanode daemons. The default is 3 as shown in the following code snippet: <property> <name>dfs.replication</name> <value>3</value> </property> dfs.blocksize: This specifies the block size. In the following example, the size is specified in bytes (134,217,728 bytes is 128 MB): <property> <name>dfs.blocksize</name> <value>134217728</value> </property> fs.permissions.umask-mode: This specifies the umask value that will be used when creating files and directories in HDFS. This property is shown in the following code snippet: <property> <name>fs.permissions.umask-mode</name> <value>022</value> </property> The read/write operational flow in HDFS To get a better understanding of HDFS, we need to understand the flow of operations for the following two scenarios: A file is written to HDFS A file is read from HDFS HDFS uses a single-write, multiple-read model, where the files are written once and read several times. The data cannot be altered once written. However, data can be appended to the file by reopening it. All files in the HDFS are saved as data blocks. Writing files in HDFS The following sequence of steps occur when a client tries to write a file to HDFS: The client informs the namenode daemon that it wants to write a file. The namenode daemon checks to see whether the file already exists. If it exists, an appropriate message is sent back to the client. If it does not exist, the namenode daemon makes a metadata entry for the new file. The file to be written is split into data packets at the client end and a data queue is built. The packets in the queue are then streamed to the datanodes in the cluster. The list of datanodes is given by the namenode daemon, which is prepared based on the data replication factor configured. A pipeline is built to perform the writes to all datanodes provided by the namenode daemon. The first packet from the data queue is then transferred to the first datanode daemon. The block is stored on the first datanode daemon and is then copied to the next datanode daemon in the pipeline. This process goes on till the packet is written to the last datanode daemon in the pipeline. The sequence is repeated for all the packets in the data queue. For every packet written on the datanode daemon, a corresponding acknowledgement is sent back to the client. If a packet fails to write onto one of the datanodes, the datanode daemon is removed from the pipeline and the remainder of the packets is written to the good datanodes. The namenode daemon notices the under-replication of the block and arranges for another datanode daemon where the block could be replicated. After all the packets are written, the client performs a close action, indicating that the packets in the data queue have been completely transferred. The client informs the namenode daemon that the write operation is now complete. The following diagram shows the data block replication process across the datanodes during a write operation in HDFS: Reading files in HDFS The following steps occur when a client tries to read a file in HDFS: The client contacts the namenode daemon to get the location of the data blocks of the file it wants to read. The namenode daemon returns the list of addresses of the datanodes for the data blocks. For any read operation, HDFS tries to return the node with the data block that is closest to the client. Here, closest refers to network proximity between the datanode daemon and the client. Once the client has the list, it connects the closest datanode daemon and starts reading the data block using a stream. After the block is read completely, the connection to datanode is terminated and the datanode daemon that hosts the next block in the sequence is identified and the data block is streamed. This goes on until the last data block for that file is read. The following diagram shows the read operation of a file in HDFS: Understanding the namenode UI Hadoop provides web interfaces for each of its services. The namenode UI or the namenode web interface is used to monitor the status of the namenode and can be accessed using the following URL: http://<namenode-server>:50070/ The namenode UI has the following sections: Overview: The general information section provides basic information of the namenode with options to browse the filesystem and the namenode logs. The following is the screenshot of the Overview section of the namenode UI: The Cluster ID parameter displays the identification number of the cluster. This number is same across all the nodes within the cluster. A block pool is a set of blocks that belong to a single namespace. The Block Pool Id parameter is used to segregate the block pools in case there are multiple namespaces configured when using HDFS federation. In HDFS federation, multiple namenodes are configured to scale the name service horizontally. These namenodes are configured to share datanodes amongst themselves. We will be exploring HDFS federation in detail a bit later. Summary: The following is the screenshot of the cluster's summary section from the namenode web interface: Under the Summary section, the first parameter is related to the security configuration of the cluster. If Kerberos (the authorization and authentication system used in Hadoop) is configured, the parameter will show as Security is on. If Kerberos is not configured, the parameter will show as Security is off. The next parameter displays information related to files and blocks in the cluster. Along with this information, the heap and non-heap memory utilization is also displayed. The other parameters displayed in the Summary section are as follows: Configured Capacity: This displays the total capacity (storage space) of HDFS. DFS Used: This displays the total space used in HDFS. Non DFS Used: This displays the amount of space used by other files that are not part of HDFS. This is the space used by the operating system and other files. DFS Remaining: This displays the total space remaining in HDFS. DFS Used%: This displays the total HDFS space utilization shown as percentage. DFS Remaining%: This displays the total HDFS space remaining shown as percentage. Block Pool Used: This displays the total space utilized by the current namespace. Block Pool Used%: This displays the total space utilized by the current namespace shown as percentage. As you can see in the preceding screenshot, in this case, the value matches that of the DFS Used% parameter. This is because there is only one namespace (one namenode) and HDFS is not federated. DataNodes usages% (Min, Median, Max, stdDev): This displays the usages across all datanodes in the cluster. This helps administrators identify unbalanced nodes, which may occur when data is not uniformly placed across the datanodes. Administrators have the option to rebalance the datanodes using a balancer. Live Nodes: This link displays all the datanodes in the cluster as shown in the following screenshot: Dead Nodes: This link displays all the datanodes that are currently in a dead state in the cluster. A dead state for a datanode daemon is when the datanode daemon is not running or has not sent a heartbeat message to the namenode daemon. Datanodes are unable to send heartbeats if there exists a network connection issue between the machines that host the datanode and namenode daemons. Excessive swapping on the datanode machine causes the machine to become unresponsive, which also prevents the datanode daemon from sending heartbeats. Decommissioning Nodes: This link lists all the datanodes that are being decommissioned. Number of Under-Replicated Blocks: This represents the number of blocks that have not replicated as per the replication factor configured in the hdfs-site.xml file. Namenode Journal Status: The journal status provides location information of the fsimage file and the state of the edits logfile. The following screenshot shows the NameNode Journal Status section: NameNode Storage: The namenode storage table provides the location of the fsimage file along with the type of the location. In this case, it is IMAGE_AND_EDITS, which means the same location is used to store the fsimage file as well as the edits logfile. The other types of locations are IMAGE, which stores only the fsimage file and EDITS, which stores only the edits logfile. The following screenshot shows the NameNode Storage information:
Read more
  • 0
  • 0
  • 9726
article-image-low-level-c-practices
Packt
21 Oct 2012
14 min read
Save for later

Low-level C# Practices

Packt
21 Oct 2012
14 min read
Working with generics Visual Studio 2005 included .NET version 2.0 which included generics. Generics give developers the ability to design classes and methods that defer the specification of specific parts of a class or method's specification until declaration or instantiation. Generics offer features previously unavailable in .NET. One benefit to generics, that is potentially the most common, is for the implementation of collections that provide a consistent interface to collections of different data types without needing to write specific code for each data type. Constraints can be used to restrict the types that are supported by a generic method or class, or can guarantee specific interfaces. Limits of generics Constraints within generics in C# are currently limited to a parameter-less constructor, interfaces, or base classes, or whether or not the type is a struct or a class (value or reference type). This really means that code within a generic method or type can either be constructed or can make use of methods and properties. Due to these restrictions types within generic types or methods cannot have operators. Writing sequence and iterator members Visual Studio 2005 and C# 2.0 introduced the yield keyword. The yield keyword is used within an iterator member as a means to effectively implement an IEnumerable interface without needing to implement the entire IEnumerable interface. Iterator members are members that return a type of IEnumerable or IEnumerable<T>, and return individual elements in the enumerable via yield return, or deterministically terminates the enumerable via yield break. These members can be anything that can return a value, such as methods, properties, or operators. An iterator that returns without calling yield break has an implied yield break, just as a void method has an implied return. Iterators operate on a sequence but process and return each element as it is requested. This means that iterators implement what is known as deferred execution. Deferred execution is when some or all of the code, although reached in terms of where the instruction pointer is in relation to the code, hasn't entirely been executed yet. Iterators are methods that can be executed more than once and result in a different execution path for each execution. Let's look at an example: public static IEnumerable<DateTime> Iterator() { Thread.Sleep(1000); yield return DateTime.Now; Thread.Sleep(1000); yield return DateTime.Now; Thread.Sleep(1000); yield return DateTime.Now; }   The Iterator method returns IEnumerable which results in three DateTime values. The creation of those three DateTime values is actually invoked at different times. The Iterator method is actually compiled in such a way that a state machine is created under the covers to keep track of how many times the code is invoked and is implemented as a special IEnumerable<DateTime> object. The actual invocation of the code in the method is done through each call of the resulting IEnumerator. MoveNext method. The resulting IEnumerable is really implemented as a collection of delegates that are executed upon each invocation of the MoveNext method, where the state, in the simplest case, is really which of the delegates to invoke next. It's actually more complicated than that, especially when there are local variables and state that can change between invocations and is used across invocations. But the compiler takes care of all that. Effectively, iterators are broken up into individual bits of code between yield return statements that are executed independently, each using potentially local shared data. What are iterators good for other than a really cool interview question? Well, first of all, due to the deferred execution, we can technically create sequences that don't need to be stored in memory all at one time. This is often useful when we want to project one sequence into another. Couple that with a source sequence that is also implemented with deferred execution, we end up creating and processing IEnumerables (also known as collections) whose content is never all in memory at the same time. We can process large (or even infinite) collections without a huge strain on memory. For example, if we wanted to model the set of positive integer values (an infinite set) we could write an iterator method shown as follows: static IEnumerable<BigInteger> AllThePositiveIntegers() { var number = new BigInteger(0); while (true) yield return number++; }   We can then chain this iterator with another iterator, say something that gets all of the positive squares: static IEnumerable<BigInteger> AllThePostivieIntegerSquares( IEnumerable<BigInteger> sourceIntegers) { foreach(var value in sourceIntegers) yield return value*value; }   Which we could use as follows: foreach(var value in AllThePostivieIntegerSquares(AllThePositiveIntegers())) Console.WriteLine(value);   We've now effectively modeled two infi nite collections of integers in memory. Of course, our AllThePostiveIntegerSquares method could just as easily be used with fi nite sequences of values, for example: foreach (var value in AllThePostivieIntegerSquares( Enumerable.Range(0, int.MaxValue) .Select(v => new BigInteger(v)))) Console.WriteLine(value);   In this example we go through all of the positive Int32 values and square each one without ever holding a complete collection of the set of values in memory. As we see, this is a useful method for composing multiple steps that operate on, and result in, sequences of values. We could have easily done this without IEnumerable<T>, or created an IEnumerator class whose MoveNext method performed calculations instead of navigating an array. However, this would be tedious and is likely to be error-prone. In the case of not using IEnumerable<T>, we'd be unable to operate on the data as a collection with things such as foreach. Context: When modeling a sequence of values that is either known only at runtime, or each element can be reliably calculated at runtime. Practice: Consider using an iterator. Working with lambdas Visual Studio 2008 introduced C# 3.0 . In this version of C# lambda expressions were introduced. Lambda expressions are another form of anonymous functions. Lambdas were added to the language syntax primarily as an easier anonymous function syntax for LINQ queries. Although you can't really think of LINQ without lambda expressions, lambda expressions are a powerful aspect of the C# language in their own right. They are concise expressions that use implicitly-typed optional input parameters whose types are implied through the context of their use, rather than explicit de fi nition as with anonymous methods. Along with C# 3.0 in Visual Studio 2008, the .NET Framework 3.5 was introduced which included many new types to support LINQ expressions, such as Action<T> and Func<T>. These delegates are used primarily as definitions for different types of anonymous methods (including lambda expressions). The following is an example of passing a lambda expression to a method that takes a Func<T1, T2, TResult> delegate and the two arguments to pass along to the delegate: ExecuteFunc((f, s) => f + s, 1, 2);   The same statement with anonymous methods: ExecuteFunc(delegate(int f, int s) { return f + s; }, 1, 2);   It's clear that the lambda syntax has a tendency to be much more concise, replacing the delegate and braces with the "goes to" operator (=>). Prior to anonymous functions, member methods would need to be created to pass as delegates to methods. For example: ExecuteFunc(SomeMethod, 1, 2);   This, presumably, would use a method named SomeMethod that looked similar to: private static int SomeMethod(int first, int second) { return first + second; }   Lambda expressions are more powerful in the type inference abilities, as we've seen from our examples so far. We need to explicitly type the parameters within anonymous methods, which is only optional for parameters in lambda expressions. LINQ statements don't use lambda expressions exactly in their syntax. The lambda expressions are somewhat implicit. For example, if we wanted to create a new collection of integers from another collection of integers, with each value incremented by one, we could use the following LINQ statement: var x = from i in arr select i + 1;   The i + 1 expression isn't really a lambda expression, but it gets processed as if it were first converted to method syntax using a lambda expression: var x = arr.Select(i => i + 1);   The same with an anonymous method would be: var x = arr.Select(delegate(int i) { return i + 1; });   What we see in the LINQ statement is much closer to a lambda expression. Using lambda expressions for all anonymous functions means that you have more consistent looking code. Context: When using anonymous functions. Practice: Prefer lambda expressions over anonymous methods. Parameters to lambda expressions can be enclosed in parentheses. For example: var x = arr.Select((i) => i + 1);   The parentheses are only mandatory when there is more than one parameter: var total = arr.Aggregate(0, (l, r) => l + r);   Context: When writing lambdas with a single parameter. Practice: Prefer no parenthesis around the parameter declaration. Sometimes when using lambda expressions, the expression is being used as a delegate that takes an argument. The corresponding parameter in the lambda expression may not be used within the right-hand expression (or statements). In these cases, to reduce the clutter in the statement, it's common to use the underscore character (_) for the name of the parameter. For example: task.ContinueWith(_ => ProcessSecondHalfOfData());   The task.ContinueWith method takes an Action <Task> delegate. This means the previous lambda expression is actually given a task instance (the antecedent Task). In our example, we don't use that task and just perform some completely independent operation. In this case, we use (_) to not only signify that we know we don't use that parameter, but also to reduce the clutter and potential name collisions a little bit. Context: When writing lambda expression that take a single parameter but the parameter is not used. Practice: Use underscore (_) for the name of the parameter. There are two types of lambda expressions. So far, we've seen expression lambdas. Expression lambdas are a single expression on the right-hand side that evaluates to a value or void. There is another type of lambda expression called statement lambdas. These lambdas have one or more statements and are enclosed in braces. For example: task.ContinueWith(_ => { var value = 10; value += ProcessSecondHalfOfData(); ProcessSomeRandomValue(value); });   As we can see, statement lambdas can declare variables, as well as have multiple statements. Working with extension methods Along with lambda expressions and iterators, C# 3.0 brought us extension methods. These static methods (contained in a static class whose first argument is modified with the this modifier) were created for LINQ so IEnumerable types could be queried without needing to add copious amounts of methods to the IEnumerable interface. An extension method has the basic form of: public static class EnumerableExtensions { public static IEnumerable<int> IntegerSquares( this IEnumerable<int> source) { return source.Select(value => value * value); } }   As stated earlier, extension methods must be within a static class, be a static method, and the first parameter must be modified with the this modifier. Extension methods extend the available instance methods of a type. In our previous example, we've effectively added an instance member to IEnumerable<int> named IntegerSquares so we get a sequence of integer values that have been squared. For example, if we created an array of integer values, we will have added a Cubes method to that array that returns a sequence of the values cubed. For example: var values = new int[] {1, 2, 3}; foreach (var v in values.Cubes()) { Console.WriteLine(v); }   Having the ability to create new instance methods that operate on any public members of a specific type is a very powerful feature of the language. This, unfortunately, does not come without some caveats. Extension methods suffer inherently from a scoping problem. The only scoping that can occur with these methods is the namespaces that have been referenced for any given C# source file. For example, we could have two static classes that have two extension methods named Cubes. If those static classes are in the same namespace, we'd never be able to use those extensions methods as extension methods because the compiler would never be able to resolve which one to use. For example: public static class IntegerEnumerableExtensions { public static IEnumerable<int> Squares( this IEnumerable<int> source) { return source.Select(value => value * value); } public static IEnumerable<int> Cubes( this IEnumerable<int> source) { return source.Select(value => value * value * value); } } public static class EnumerableExtensions { public static IEnumerable<int> Cubes( this IEnumerable<int> source) { return source.Select(value => value * value * value); } }   If we tried to use Cubes as an extension method, we'd get a compile error, for example: var values = new int[] {1, 2, 3}; foreach (var v in values.Cubes()) { Console.WriteLine(v); }   This would result in error CS0121: The call is ambiguous between the following methods or properties. To resolve the problem, we'd need to move one (or both) of the classes to another namespace, for example: namespace Integers { public static class IntegerEnumerableExtensions { public static IEnumerable<int> Squares( this IEnumerable<int> source) { return source.Select(value => value*value); } public static IEnumerable<int> Cubes( this IEnumerable<int> source) { return source.Select(value => value*value*value); } } } namespace Numerical { public static class EnumerableExtensions { public static IEnumerable<int> Cubes( this IEnumerable<int> source) { return source.Select(value => value*value*value); } } }   Then, we can scope to a particular namespace to choose which Cubes to use: Context: When considering extension methods, due to potential scoping problems. Practice: Use extension methods sparingly. Context: When designing extension methods. Practice: Keep all extension methods that operate on a specific type in their own class. Context: When designing classes to contain methods to extend a specific type, TypeName. Practice: Consider naming the static class TypeNameExtensions. Context: When designing classes to contain methods to extend a specific type, in order to scope the extension methods. Practice: Consider placing the class in its own namespace. Generally, there isn't much need to use extension methods on types that you own. You can simply add an instance method to contain the logic that you want to have. Where extension methods really shine is for effectively creating instance methods on interfaces. Typically, when code is necessary for shared implementations of interfaces, an abstract base class is created so each implementation of the interface can derive from it to implement these shared methods. This is a bit cumbersome in that it uses the one-and-only inheritance slot in C#, so an interface implementation would not be able to derive or extend any other classes. Additionally, there's no guarantee that a given interface implementation will derive from the abstract base and runs the risk of not being able to be used in the way it was designed. Extension methods get around this problem by being entirely independent from the implementation of an interface while still being able to extend it. One of the most notable examples of this might be the System.Linq.Enumerable class introduced in .NET 3.5. The static Enumerable class almost entirely consists of extension methods that extend IEnumerable. It is easy to develop the same sort of thing for our own interfaces. For example, say we have an ICoordinate interface to model a three-dimensional position in relation to the Earth's surface: namespace ConsoleApplication { using Numerical; internal class Program { private static void Main(string[] args) { var values = new int[] {1, 2, 3}; foreach (var v in values.Cubes()) { Console.WriteLine(v); } } } }   We could create a static class to contain extension methods to provide shared functionality between any implementation of ICoordinate. For example: public interface ICoordinate { /// <summary>North/south degrees from equator.</summary> double Latitude { get; set; } /// <summary>East/west degrees from meridian.</summary> double Longitude { get; set; } /// <summary>Distance from sea level in meters.</summary> double Altitude { get; set; } }   Context: When designing interfaces that require shared code. Practice: Consider providing extension methods instead of abstract base implementations.
Read more
  • 0
  • 0
  • 9723

article-image-software-defined-data-center
Packt
14 Nov 2016
33 min read
Save for later

The Software-defined Data Center

Packt
14 Nov 2016
33 min read
In this article by Valentin Hamburger, author of the book Building VMware Software-Defined Data Centers, we are introduced and briefed about the software-defined data center (SDDC) that has been introduced by VMware, to further describe the move to a cloud like IT experience. The term software-defined is the important bit of information. It basically means that every key function in the data center is performed and controlled by software, instead of hardware. This opens a whole new way of operating, maintaining but also innovating in a modern data center. (For more resources related to this topic, see here.) But how does a so called SDDC look like – and why is a whole industry pushing so hard towards its adoption? This question might also be a reason why you are reading this article, which is meant to provide a deeper understanding of it and give practical examples and hints how to build and run such a data center. Meanwhile it will also provide the knowledge of mapping business challenges with IT solutions. This is a practice which becomes more and more important these days. IT has come a long way from a pure back office, task oriented role in the early days, to a business relevant asset, which can help organizations to compete with their competition. There has been a major shift from a pure infrastructure provider role to a business enablement function. Today, most organizations business is just as good as their internal IT agility and ability to innovate. There are many examples in various markets where a whole business branch was built on IT innovations such as Netflix, Amazon Web Services, Uber, Airbnb – just to name a few. However, it is unfair to compare any startup with a traditional organization. A startup has one application to maintain and they have to build up a customer base. A traditional organization has a proven and wide customer base and many applications to maintain. So they need to adapt their internal IT to become a digital enterprise, with all the flexibility and agility of a startup, but also maintaining the trust and control over their legacy services. This article will cover the following points: Why is there a demand for SDDC in IT What is SDDC Understand the business challenges and map it to SDDC deliverables The relation of a SDDC and an internal private cloud Identify new data center opportunities and possibilities Become a center of innovation to empower your organizations business The demand for change Today organizations face different challenges in the market to stay relevant. The biggest move was clearly introduced by smartphones and tablets. It was not just a computer in a smaller device, they changed the way IT is delivered and consumed by end users. These devices proved that it can be simple to consume and install applications. Just search in an app store – choose what you like – use it as long as you like it. If you do not need it any longer, simply remove it. All with very simplistic commands and easy to use gestures. More and more people relying on IT services by using a smartphone as their terminal to almost everything. These devices created a demand for fast and easy application and service delivery. So in a way, smartphones have not only transformed the whole mobile market, they also transformed how modern applications and services are delivered from organizations to their customers. Although it would be quite unfair to compare a large enterprise data center with an app store or enterprise service delivery with any app installs on a mobile device, there are startups and industries which rely solely on the smartphone as their target for services, such as Uber or WhatsApp. On the other side, smartphone apps also introduce a whole new way of delivering IT services, since any company never knows how many people will use the app simultaneously. But in the backend they still have to use web servers and databases to continuously provide content and data for these apps. This also introduces a new value model for all other companies. People start to judge a company by the quality of their smartphone apps available. Also people started to migrate to companies which might offer a better smartphone integration as the previous one used. This is not bound to a single industry, but affects a broad spectrum of industries today such as the financial industry, car manufacturers, insurance groups, and even food retailers, just to name a few. A classic data center structure might not be ideal for quick and seamless service delivery. These architectures are created by projects to serve a particular use case for a couple of years. An example of this bigger application environments are web server farms, traditional SAP environments, or a data warehouse. Traditionally these were designed with an assumption about their growth and use. Special project teams have set them up across the data center pillars, as shown in the following figure. Typically, those project teams separate after such the application environment has been completed. All these pillars in the data center are required to work together, but every one of them also needs to mind their own business. Mostly those different divisions also have their own processes which than may integrate in a data center wide process. There was a good reason to structure a data center in this way, the simple fact that nobody can be an expert for every discipline. Companies started to create groups to operate certain areas in a data center, each building their own expertise for their own subject. This was evolving and became the most applied model for IT operations within organizations. Many, if not all, bigger organizations have adopted this approach and people build their careers on this definitions. It served IT well for decades and ensured that each party was adding its best knowledge to any given project. However, this setup has one flaw, it has not been designed for massive change and scale. The bigger these divisions get, the slower they can react to request from other groups in the data center. This introduces a bi-directional issue – since all groups may grow in a similar rate, the overall service delivery time might also increase exponentially. Unfortunately, this also introduces a cost factor when it comes to service deployments across these pillars. Each new service, an organization might introduce or develop, will require each area of IT to contribute. Traditionally, this is done by human hand overs from one department to the other. Each of these hand overs will delay the overall project time or service delivery time, which is also often referred to as time to market. It reflects the needed time interval from the request of a new service to its actual delivery. It is important to mention that this is a level of complexity every modern organization has to deal with, when it comes to application deployment today. The difference between organizations might be in the size of the separate units, but the principle is always the same. Most organizations try to bring their overall service delivery time down to be quicker and more agile. This is often related to business reasons as well as IT cost reasons. In some organizations the time to deliver a brand new service from request to final roll out may take 90 working days. This means – a requestor might wait 18 weeks or more than four and a half month from requesting a new business service to its actual delivery. Do not forget that this reflects the complete service delivery – over all groups until it is ready for production. Also, after these 90 days the requirement of the original request might have changed which would lead into repeating the entire process. Often a quicker time to market is driven by the lines of business (LOB) owners to respond to a competitor in the market, who might already deliver their services faster. This means that today's IT has changed from a pure internal service provider to a business enabler supporting its organization to fight the competition with advanced and innovative services. While this introduces a great chance to the IT department to enable and support their organizations business, it also introduces a threat at the same time. If the internal IT struggles to deliver what the business is asking for, it may lead to leverage shadow IT within the organization. The term shadow IT describes a situation where either the LOBs of an organization or its application developers have grown so disappointed with the internal IT delivery times, that they actually use an external provider for their requirements. This behavior is not agreed with the IT security and can lead to heavy business or legal troubles. This happens more often than one might expect, and it can be as simple as putting some internal files on a public cloud storage provider. These services grant quick results. It is as simple as Register – Download – Use. They are very quick in enrolling new users and sometimes provide a limited use for free. The developer or business owner might not even be aware that there is something non-compliant going on while using this services. So besides the business demand for a quicker service delivery and the security aspect, there an organizations IT department has now also the pressure of staying relevant. But SDDC can provide much more value to the IT than just staying relevant. The automated data center will be an enabler for innovation and trust and introduce a new era of IT delivery. It can not only provide faster service delivery to the business, it can also enable new services or offerings to help the whole organization being innovative for their customers or partners. Business challenges—the use case Today's business strategies often involve a digital delivery of services of any kind. This implies that the requirements a modern organization has towards their internal IT have changed drastically. Unfortunately, the business owners and the IT department tend to have communication issues in some organizations. Sometimes they even operate completely disconnected from each other, as if each of them where their own small company within the organization. Nevertheless, a lot of data center automation projects are driven by enhanced business requirements. In some of these cases, the IT department has not been made aware of what these business requirements look like, or even what the actual business challenges are. Sometimes IT just gets as little information as: We are doing cloud now. This is a dangerous simplification since, the use case is key when it comes to designing and identifying the right solution to solve the organizations challenges. It is important to get the requirements from both sides, the IT delivery side as well as the business requirements and expectations. Here is a simple example how a use case might be identified and mapped to technical implementation. The business view John works as a business owner in an insurance company. He recognizes that their biggest competitor in the market started to offer a mobile application to their clients. The app is simple and allows to do online contract management and tells the clients which products the have enrolled as well as rich information about contract timelines and possible consolidation options. He asks his manager to start a project to also deliver such an application to their customers. Since it is only a simple smartphone application, he expects that it's development might take a couple of weeks and than they can start a beta phase. To be competitive he estimates that they should have something useable for their customers within maximum of 5 months. Based on these facts, he got approval from his manager to request such a product from the internal IT. The IT view Tom is the data center manager of this insurance company. He got informed that the business wants to have a smartphone application to do all kinds of things for the new and existing customers. He is responsible to create a project and bring all necessary people on board to support this project and finally deliver the service to the business. The programming of the app will be done by an external consulting company. Tom discusses a couple of questions regarding this request with his team: How many users do we need to serve? How much time do we need to create this environment? What is the expected level of availability? How much compute power/disk space might be required? After a round of brainstorming and intense discussion, the team still is quite unsure how to answer these questions. For every question there is a couple of variables the team cannot predict. Will only a few of their thousands of users adopt to the app, what if they undersize the middleware environment? What if the user adoption rises within a couple of days, what if it lowers and the environment is over powered and therefor the cost is too high? Tom and his team identified that they need a dynamic solution to be able to serve the business request. He creates a mapping to match possible technical capabilities to the use case. After this mapping was completed, he is using it to discuss with his CIO if and how it can be implemented. Business challenge Question IT capability Easy to use app to win new customers/keep existing How many users do we need to server? Dynamic scale of an environment based on actual performance demand. How much time do we need to create this environment? To fulfill the expectations the environment needs to be flexible. Start small – scale big. What is the expected level of availability? Analytics and monitoring over all layers. Including possible self healing approach. How much compute power/disk space might be required? Create compute nodes based on actual performance requirements on demand. Introduce a capacity on demand model for required resources. Given this table, Tom revealed that with their current data center structure it is quite difficult to deliver what the business is asking for. Also, he got a couple of requirements from other departments, which are going in a similar direction. Based on these mappings, he identified that they need to change their way of deploying services and applications. They will need to use a fair amount of automation. Also, they have to span these functionalities across each data center department as a holistic approach, as shown in the following diagram: In this example, Tom actually identified a very strong use case for SDDC in his company. Based on the actual business requirements of a "simple" application, the whole IT delivery of this company needs to adopt. While this may sound like pure fiction, these are the challenges modern organizations need to face today. It is very important to identify the required capabilities for the entire data center and not just for a single department. You will also have to serve the legacy applications and bring them onto the new model. Therefore it is important to find a solution, which is serving the new business case as well as the legacy applications either way. In the first stage of any SDDC introduction in an organization, it is key to keep always an eye on the big picture. Tools to enable SDDC There is a basic and broadly accepted declaration of what a SDDC needs to offer. It can be considered as the second evolutionary step after server virtualization. It offers an abstraction layer from the infrastructure components such as compute, storage, and network by using automation and tools as such as a self service catalog In a way, it represents a virtualization of the whole data center with the purpose to simplify the request and deployment of complex services. Other capabilities of an SDDC are: Automated infrastructure/service consumption Policy based services and applications deployment Changes to services can be made easily and instantly All infrastructure layers are automated (storage, network, and compute) No human intervention is needed for infrastructure/service deployment High level of standardization is used Business logic is for chargeback or show back functionality All of the preceding points define a SDDC technically. But it is important to understand that a SDDC is considered to solve the business challenges of the organization running it. That means based on the actual business requirements, each SDDC will serve a different use case. Of course there is a main setup you can adopt and roll out – but it is important to understand your organizations business challenges in order to prevent any planning or design shortcomings. Also, to realize this functionality, SDDC needs a couple of software tools. These are designed to work together to deliver a seamless environment. The different parts can be seen like gears in a watch where each gear has an equally important role to make the clockwork function correctly. It is important to remember this when building your SDDC, since missing on one part can make another very complex or even impossible afterwards. This is a list of VMware tools building a SDDC: vRealize Business for Cloud vRealize Operations Manager vRealize Log Insight vRealize Automation vRealize Orchestrator vRealize Automation Converged Blueprint vRealize Code Stream VMware NSX VMware vSphere vRealize Business for Cloud is a charge back/show back tool. It can be used to track cost of services as well as the cost of a whole data center. Since the agility of a SDDC is much higher than for a traditional data center, it is important to track and show also the cost of adding new services. It is not only important from a financial perspective, it also serves as a control mechanism to ensure users are not deploying uncontrolled services and leaving them running even if they are not required anymore. vRealize Operations Manager is serving basically two functionalities. One is to help with the troubleshooting and analytics of the whole SDDC platform. It has an analytics engine, which applies machine learning to the behavior of its monitored components. The other important function is capacity management. It is capable of providing what-if analysis and informs about possible shortcomings of resources way before they occur. These functionalities also use the machine learning algorithms and get more accurate over time. This becomes very important in an dynamic environment where on-demand provisioning is granted. vRealize Log Insight is a unified log management. It offers rich functionality and can search and profile a large amount of log files in seconds. It is recommended to use it as a universal log endpoint for all components in your SDDC. This includes all OSes as well as applications and also your underlying hardware. In an event of error, it is much simpler to have a central log management which is easy searchable and delivers an outcome in seconds. vRealize Automation (vRA) is the base automation tool. It is providing the cloud portal to interact with your SDDC. The portal it provides offers the business logic such as service catalogs, service requests, approvals, and application life cycles. However, it relies strongly on vRealize Orchestrator for its technical automation part. vRA can also tap into external clouds to extend the internal data center. Extending a SDDC is mostly referred to as hybrid cloud. There are a couple of supported cloud offerings vRA can manage. vRealize Orchestrator (vRO) is providing the workflow engine and the technical automation part of the SDDC. It is literally the orchestrator of your new data center. vRO can be easily bound together with vRA to form a very powerful automation suite, where anything with an application programming interface (API) can be integrated. Also it is required to integrate third-party solutions into your deployment workflows, such as configuration management database (CMDB), IP address management (IPAM), or ticketing systems via IT service management (ITSM). vRealize Automation Converged Blueprint was formally known as vRealize Automation Application Services and is an add-on functionality to vRA, which takes care of application installations. It can be used with pre-existing scripts (like Windows PowerShell or Bash on Linux) – but also with variables received from vRA. This makes it very powerful when it comes to on demand application installations. This tool can also make use of vRO to provide even better capabilities for complex application installations. vRealize Code Stream is an addition to vRA and serves specific use cases in the DevOps area of the SDDC. It can be used with various development frameworks such as Jenkins. Also it can be used as a tool for developers to build and operate their own software test, QA and deployment environment. Not only can the developer build these separate stages, the migration from one stage into another can also be fully automated by scripts. This makes it a very powerful tool when it comes to stage and deploy modern and traditional applications within the SDDC. VMware NSX is the network virtualization component. Given the complexity some applications/services might introduce, NSX will provide a good and profound solution to help solving it. The challenges include: Dynamic network creation Microsegmentation Advanced security Network function virtualization VMware vSphere is mostly the base infrastructure and used as the hypervisor for server virtualization. You are probably familiar with vSphere and its functionalities. However, since the SDDC is introducing a change to you data center architecture, it is recommended to re-visit some of the vSphere functionalities and configurations. By using the full potential of vSphere it is possible to save effort when it comes to automation aspects as well as the service/application deployment part of the SDDC. This represents your toolbox required to build the platform for an automated data center. All of them will bring tremendous value and possibilities, but they also will introduce change. It is important that this change needs to be addressed and is a part of the overall SDDC design and installation effort. Embrace the change. The implementation journey While a big part of this article focuses on building and configuring the SDDC, it is important to mention that there are also non-technical aspects to consider. Creating a new way of operating and running your data center will always involve people. It is important to also briefly touch this part of the SDDC. Basically there are three major players when it comes to a fundamental change in any data center, as shown in the following image: Basically there are three major topics relevant for every successful SDDC deployment. Same as for the tools principle, these three disciplines need to work together in order to enable the change and make sure that all benefits can be fully leveraged. These three categories are: People Process Technology The process category Data center processes are as established and settled as IT itself. Beginning with the first operator tasks like changing tapes or starting procedures up to highly sophisticated processes to ensure that the service deployment and management is working as expected they have already come a long way. However, some of these processes might not be fit for purpose anymore, once automation is applied to a data center. To build a SDDC it is very important to revisit data center processes and adopt them to work with the new automation tasks. The tools will offer integration points into processes, but it is equally important to remove bottle necks for the processes as well. However, keep in mind that if you automate a bad process, the process will still be bad – but fully automated. So it is also necessary to re-visit those processes so that they can become slim and effective as well. Remember Tom, the data center manager. He has successfully identified that they need a SDDC to fulfill the business requirements and also did a use case to IT capabilities mapping. While this mapping is mainly talking about what the IT needs to deliver technically, it will also imply that the current IT processes need to adopt to this new delivery model. The process change example in Tom's organization If the compute department works on a service involving OS deployment, they need to fill out an Excel sheet with IP addresses and server names and send it to the networking department. The network admins will ensure that there is no double booking by reserving the IP address and approve the requested host name. After successfully proving the uniqueness of this data, name and IP gets added to the organizations DNS server. The manual part of this process is not longer feasible once the data center enters the automation era – imagine that every time somebody orders a service involving a VM/OS deploy, the network department gets an e-mail containing the Excel with the IP and host name combination. The whole process will have to stop until this step is manually finished. To overcome this, the process has to be changed to use an automated solution for IPAM. The new process has to track IP and host names programmatically to ensure there is no duplication within the entire data center. Also, after successfully checking the uniqueness of the data, it has to be added to the Domain Name System (DNS). While this is a simple example on one small process, normally there is a large number of processes involved which need to be re-viewed for a fully automated data center. This is a very important task and should not be underestimated since it can be a differentiator for success or failure of an SDDC. Think about all other processes in place which are used to control the deploy/enable/install mechanics in your data center. Here is a small example list of questions to ask regarding established processes: What is our current IPAM/DNS process? Do we need to consider a CMDB integration? What is our current ticketing process? (ITSM) What is our process to get resources from network, storage, and compute? What OS/VM deployment process is currently in place? What is our process to deploy an application (hand overs, steps, or departments involved)? What does our current approval process look like? Do we need a technical approval to deliver a service? Do we need a business approval to deliver a service? What integration process do we have for a service/application deployment? DNS, Active Directory (AD), Dynamic Host Configuration Protocol (DHCP), routing, Information Technology Infrastructure Library (ITIL), and so on Now for the approval question, normally these are an exception for the automation part, since approvals are meant to be manual in the first place (either technical or business). If all the other answers to this example questions involve human interaction as well, consider to change these processes to be fully automated by the SDDC. Since human intervention creates waiting times, it has to be avoided during service deployments in any automated data center. Think of it as the robotic construction bands todays car manufacturers are using. The processes they have implemented, developed over ages of experience, are all designed to stop the band only in case of an emergency. The same comes true for the SDDC – try to enable the automated deployment through your processes, stop the automation only in case of an emergency. Identifying processes is the simple part, changing them is the tricky part. However, keep in mind that this is an all new model of IT delivery, therefore there is no golden way of doing it. Once you have committed to change those processes, keep monitoring if they truly fulfill their requirement. This leads to another process principle in the SDDC: Continual Service Improvement (CSI). Re-visit what you have changed from time to time and make sure that those processes are still working as expected, if they don't, change them again. The people category Since every data center is run by people, it is important to also consider that a change of technology will also impact those people. There are some claims that a SDDC can be run with only half of the staff or save a couple of employees since all is automated. The truth is, a SDDC will transform IT roles in a data center. This means that some classic roles might vanish, while others will be added by this change. It is unrealistic to say that you can run an automated data center with half the staff than before. But it is realistic to say that your staff can concentrate on innovation and development instead of working a 100% to keep the lights on. And this is the change an automated data center introduces. It opens up the possibilities to evolve into a more architecture and design focused role for current administrators. The people example in Tom's organization Currently there are two admins in the compute department working for Tom. They are managing and maintaining the virtual environment, which is largely VMware vSphere. They are creating VMs manually, deploying an OS by a network install routine (which was a requirement for physical installs – so they kept the process) and than handing the ready VMs over to the next department to finish installing the service they are meant for. Recently they have experienced a lot of demand for VMs and each of them configures 10 to 12 VMs per day. Given this, they cannot concentrate on other aspects of their job, like improving OS deployments or the hand over process. At a first look it seems like the SDDC might replace these two employees since the tools will largely automate their work. But that is like saying a jackhammer will replace a construction worker. Actually their roles will shift to a more architectural aspect. They need to come up with a template for OS installations and an improvement how to further automate the deployment process. Also they might need to add new services/parts to the SDDC in order to fulfill the business needs continuously. So instead of creating all the VMs manually, they are now focused on designing a blueprint, able to be replicated as easy and efficient as possible. While their tasks might have changed, their workforce is still important to operate and run the SDDC. However, given that they focus on design and architectural tasks now, they also have the time to introduce innovative functions and additions to the data center. Keep in mind that an automated data center affects all departments in an IT organization. This means that also the tasks of the network and storage as well as application and database teams will change. In fact, in a SDDC it is quite impossible to still operate the departments disconnected from each other since a deployment will affect all of them. This also implies that all of these departments will have admins shifting to higher-level functions in order to make the automation possible. In the industry, this shift is also often referred to as Operational Transformation. This basically means that not only the tools have to be in place, you also have to change the way how the staff operates the data center. In most cases organizations decide to form a so-called center of excellence (CoE) to administer and operate the automated data center. This virtual group of admins in a data center is very similar to project groups in traditional data centers. The difference is that these people should be permanently assigned to the CoE for a SDDC. Typically you might have one champion from each department taking part in this virtual team. Each person acts as an expert and ambassador for their department. With this principle, it can be ensured that decisions and overlapping processes are well defined and ready to function across the departments. Also, as an ambassador, each participant should advertise the new functionalities within their department and enable their colleagues to fully support the new data center approach. It is important to have good expertise in terms of technology as well as good communication skills for each member of the CoE. The technology category This is the third aspect of the triangle to successfully implement a SDDC in your environment. Often this is the part where people spend most of their attention, sometimes by ignoring one of the other two parts. However, it is important to note that all three topics need to be equally considered. Think of it like a three legged chair, if one leg is missing it can never stand. The term technology does not necessarily only refer to new tools required to deploy services. It also refers to already established technology, which has to be integrated with the automation toolset (often referred to as third-party integration). This might be your AD, DHCP server, e-mail system, and so on. There might be technology which is not enabling or empowering the data center automation, so instead of only thinking about adding tools, there might also be tools to be removed or replaced. This is a normal IT lifecycle task and has been gone through many iterations already. Think of things like a fax machine or the telex – you might not use them anymore, they have been replaced by e-mail and messaging. The technology example in Tom's organization The team uses some tools to make their daily work easier when it comes to new service deployments. One of the tools is a little graphical user interface to quickly add content to AD. The admins use it to insert the host name, Organizational Unit as well as creating the computer account with it. This was meant to save admin time, since they don't have to open all the various menus in the AD configuration to accomplish these tasks. With the automated service delivery, this has to be done programmatically. Once a new OS is deployed it has to be added to the AD including all requirements by the deployment tool. Since AD offers an API this can be easily automated and integrated into the deployment automation. Instead of painfully integrating the graphical tool, this is now done directly by interfacing the organizations AD, ultimately replacing the old graphical tool. The automated deployment of a service across the entire data center requires a fair amount of communication. Not in a traditional way, but machine-to-machine communication leveraging programmable interfaces. Using such APIs is another important aspect of the applied data center technologies. Most of today's data center tools, from backup all the way up to web servers, do come with APIs. The better the API is documented, the easier the integration into the automation tool. In some cases you might need the vendors to support you with the integration of their tools. If you have identified a tool in the data center, which does not offer any API or even command-line interface (CLI) option at all, try to find a way around this software or even consider replacing it with a new tool. APIs are the equivalent of hand overs in the manual world. The better the communication works between tools, the faster and easier the deployment will be completed. To coordinate and control all this communication, you will need far more than scripts to run. This is a task for an orchestrator, which can run all necessary integration workflows from a central point. This orchestrator will act like a conductor for a big orchestra. It will form the backbone of your SDDC. Why are these three topics so important? The technology aspect closes the triangle and brings the people and the processes parts together. If the processes are not altered to fit the new deployment methods – automation will be painful and complex to implement. If the deployment stops at some point, since the processes require manual intervention, the people will have to fill in this gap. This means that they now have new roles, but also need to maintain some of their old tasks to keep the process running. By introducing such an unbalanced implementation of an automated data center, the workload for people can actually increase, while the service delivery times may not dramatically decrease. This may lead to an avoidance of the automated tasks, since the manual intervention might seen as faster by individual admins. So it is very important to accept all three aspects as the main part of the SDDC implementation journey. They all need to be addressed equally and thoughtfully to unveil the benefits and improvements an automated data center has to offer. However, keep in mind that this truly is a journey. A SDDC is not implemented in days but in month. Given this, also the implementation team in the data center has this time to adopt themselves and their process to this new way of delivering IT services. Also all necessary departments and their lead needs to be involved in this procedure. A SDDC implementation is always a team effort. Additional possibilities and opportunities All the previews mentioned topics serve the sole goal to install and use the SDDC within your data center. However, once you have the SDDC running the real fun begins since you can start to introduce additional functionalities impossible for any traditional data center. Lets just briefly touch on some of the possibilities from an IT view. The self-healing data center This is a concept where the automatic deployment of services is connected to a monitoring system. Once the monitoring system detects that a service or environment may be facing constraints, it can automatically trigger an additional deployment for this service to increase the throughput. While this is application dependent, for infrastructure services this can become quite handy. Think of ESXi host auto deployments if compute power is becoming a constraint, or data store deployments if disk space is running low. If this automation is acting to aggressive for your organization, it can be used with an approval function. Once the monitoring detects a shortcoming it will ask for approval to fix it with a deployment action. Instead of getting an e-mail from your monitoring system that there is a constraint identified, you get an e-mail with the constraint and the resolving action. All you need to do is to approve the action. The self-scaling data center A similar principle is to use a capacity management tool to predict the growth of your environment. If it approaches a trigger, the system can automatically generate an order letter, containing all needed components to satisfy the growing capacity demands. This can than be sent to finance or the purchasing management for approval and before you even get into any capacity constraints, the new gear might be available and ready to run. However, consider the regular turnaround time for ordering hardware, which might affect how far in the future you have to set the trigger for such functionality. Both of this opportunities are more than just nice to haves, they enable your data center to be truly flexible and proactive. Due to the fact that a SDDC is offering a high amount of agility, it will also need some self-monitoring to stay flexible and useable and to fulfill unpredicted demand. Summary In this article we discussed the main principles and declarations of an SDDC. It provided an overview of the opportunities and possibilities this new data center architecture provides. Also, it covered the changes which will be introduced by this new approach. Finally it discussed the implementation journey and its involvement with people, processes and technology. Resources for Article: Further resources on this subject: VM, It Is Not What You Think! [article] Introducing vSphere vMotion [article] Creating a VM using VirtualBox - Ubuntu Linux [article]
Read more
  • 0
  • 0
  • 9714

article-image-redis-autosuggest
Packt
18 Sep 2014
8 min read
Save for later

Redis in Autosuggest

Packt
18 Sep 2014
8 min read
In this article by Arun Chinnachamy, the author of Redis Applied Design Patterns, we are going to see how to use Redis to build a basic autocomplete or autosuggest server. Also, we will see how to build a faceting engine using Redis. To build such a system, we will use sorted sets and operations involving ranges and intersections. To summarize, we will focus on the following topics in this article: (For more resources related to this topic, see here.) Autocompletion for words Multiword autosuggestion using a sorted set Faceted search using sets and operations such as union and intersection Autosuggest systems These days autosuggest is seen in virtually all e-commerce stores in addition to a host of others. Almost all websites are utilizing this functionality in one way or another from a basic website search to programming IDEs. The ease of use afforded by autosuggest has led every major website from Google and Amazon to Wikipedia to use this feature to make it easier for users to navigate to where they want to go. The primary metric for any autosuggest system is how fast we can respond with suggestions to a user's query. Usability research studies have found that the response time should be under a second to ensure that a user's attention and flow of thought are preserved. Redis is ideally suited for this task as it is one of the fastest data stores in the market right now. Let's see how to design such a structure and use Redis to build an autosuggest engine. We can tweak Redis to suit individual use case scenarios, ranging from the simple to the complex. For instance, if we want only to autocomplete a word, we can enable this functionality by using a sorted set. Let's see how to perform single word completion and then we will move on to more complex scenarios, such as phrase completion. Word completion in Redis In this section, we want to provide a simple word completion feature through Redis. We will use a sorted set for this exercise. The reason behind using a sorted set is that it always guarantees O(log(N)) operations. While it is commonly known that in a sorted set, elements are arranged based on the score, what is not widely acknowledged is that elements with the same scores are arranged lexicographically. This is going to form the basis for our word completion feature. Let's look at a scenario in which we have the words to autocomplete: jack, smith, scott, jacob, and jackeline. In order to complete a word, we need to use n-gram. Every word needs to be written as a contiguous sequence. n-gram is a contiguous sequence of n items from a given sequence of text or speech. To find out more, check http://en.wikipedia.org/wiki/N-gram. For example, n-gram of jack is as follows: j ja jac jack$ In order to signify the completed word, we can use a delimiter such as * or $. To add the word into a sorted set, we will be using ZADD in the following way: > zadd autocomplete 0 j > zadd autocomplete 0 ja > zadd autocomplete 0 jac > zadd autocomplete 0 jack$ Likewise, we need to add all the words we want to index for autocompletion. Once we are done, our sorted set will look as follows: > zrange autocomplete 0 -1 1) "j" 2) "ja" 3) "jac" 4) "jack$" 5) "jacke" 6) "jackel" 7) "jackeli" 8) "jackelin" 9) "jackeline$" 10) "jaco" 11) "jacob$" 12) "s" 13) "sc" 14) "sco" 15) "scot" 16) "scott$" 17) "sm" 18) "smi" 19) "smit" 20) "smith$" Now, we will use ZRANK and ZRANGE operations over the sorted set to achieve our desired functionality. To autocomplete for ja, we have to execute the following commands: > zrank autocomplete jac 2 zrange autocomplete 3 50 1) "jack$" 2) "jacke" 3) "jackel" 4) "jackeli" 5) "jackelin" 6) "jackeline$" 7) "jaco" 8) "jacob$" 9) "s" 10) "sc" 11) "sco" 12) "scot" 13) "scott$" 14) "sm" 15) "smi" 16) "smit" 17) "smith$" Another example on completing smi is as follows: zrank autocomplete smi 17 zrange autocomplete 18 50 1) "smit" 2) "smith$" Now, in our program, we have to do the following tasks: Iterate through the results set. Check if the word starts with the query and only use the words with $ as the last character. Though it looks like a lot of operations are performed, both ZRANGE and ZRANK are O(log(N)) operations. Therefore, there should be virtually no problem in handling a huge list of words. When it comes to memory usage, we will have n+1 elements for every word, where n is the number of characters in the word. For M words, we will have M(avg(n) + 1) records where avg(n) is the average characters in a word. The more the collision of characters in our universe, the less the memory usage. In order to conserve memory, we can use the EXPIRE command to expire unused long tail autocomplete terms. Multiword phrase completion In the previous section, we have seen how to use the autocomplete for a single word. However, in most real world scenarios, we will have to deal with multiword phrases. This is much more difficult to achieve as there are a few inherent challenges involved: Suggesting a phrase for all matching words. For instance, the same manufacturer has a lot of models available. We have to ensure that we list all models if a user decides to search for a manufacturer by name. Order the results based on overall popularity and relevance of the match instead of ordering lexicographically. The following screenshot shows the typical autosuggest box, which you find in popular e-commerce portals. This feature improves the user experience and also reduces the spell errors: For this case, we will use a sorted set along with hashes. We will use a sorted set to store the n-gram of the indexed data followed by getting the complete title from hashes. Instead of storing the n-grams into the same sorted set, we will store them in different sorted sets. Let's look at the following scenario in which we have model names of mobile phones along with their popularity: For this set, we will create multiple sorted sets. Let's take Apple iPhone 5S: ZADD a 9 apple_iphone_5s ZADD ap 9 apple_iphone_5s ZADD app 9 apple_iphone_5s ZADD apple 9 apple_iphone_5s ZADD i 9 apple_iphone_5s ZADD ip 9 apple_iphone_5s ZADD iph 9 apple_iphone_5s ZADD ipho 9 apple_iphone_5s ZADD iphon 9 apple_iphone_5s ZADD iphone 9 apple_iphone_5s ZADD 5 9 apple_iphone_5s ZADD 5s 9 apple_iphone_5s HSET titles apple_iphone_5s "Apple iPhone 5S" In the preceding scenario, we have added every n-gram value as a sorted set and created a hash that holds the original title. Likewise, we have to add all the titles into our index. Searching in the index Now that we have indexed the titles, we are ready to perform a search. Consider a situation where a user is querying with the term apple. We want to show the user the five best suggestions based on the popularity of the product. Here's how we can achieve this: > zrevrange apple 0 4 withscores 1) "apple_iphone_5s" 2) 9.0 3) "apple_iphone_5c" 4) 6.0 As the elements inside the sorted set are ordered by the element score, we get the matches ordered by the popularity which we inserted. To get the original title, type the following command: > hmget titles apple_iphone_5s 1) "Apple iPhone 5S" In the preceding scenario case, the query was a single word. Now imagine if the user has multiple words such as Samsung nex, and we have to suggest the autocomplete as Samsung Galaxy Nexus. To achieve this, we will use ZINTERSTORE as follows: > zinterstore samsung_nex 2 samsung nex aggregate max ZINTERSTORE destination numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE SUM|MIN|MAX] This computes the intersection of sorted sets given by the specified keys and stores the result in a destination. It is mandatory to provide the number of input keys before passing the input keys and other (optional) arguments. For more information about ZINTERSTORE, visit http://redis.io/commands/ZINTERSTORE. The previous command, which is zinterstore samsung_nex 2 samsung nex aggregate max, will compute the intersection of two sorted sets, samsung and nex, and stores it in another sorted set, samsung_nex. To see the result, type the following commands: > zrevrange samsung_nex 0 4 withscores 1) samsung_galaxy_nexus 2) 7 > hmget titles samsung_galaxy_nexus 1) Samsung Galaxy Nexus If you want to cache the result for multiword queries and remove it automatically, use an EXPIRE command and set expiry for temporary keys. Summary In this article, we have seen how to perform autosuggest and faceted searches using Redis. We have also understood how sorted sets and sets work. We have also seen how Redis can be used as a backend system for simple faceting and autosuggest system and make the system ultrafast. Further resources on this subject: Using Redis in a hostile environment (Advanced) [Article] Building Applications with Spring Data Redis [Article] Implementing persistence in Redis (Intermediate) [Article] Resources used for creating the article: Credit for the featured tiger image: Big Cat Facts - Tiger
Read more
  • 0
  • 0
  • 9713
article-image-interact-with-hbase-using-hbase-shell-tutorial
Amey Varangaonkar
25 Jun 2018
9 min read
Save for later

How to interact with HBase using HBase shell [Tutorial]

Amey Varangaonkar
25 Jun 2018
9 min read
HBase is among the top five most popular and widely-deployed NoSQL databases. It is used to support critical production workloads across hundreds of organizations. It is supported by multiple vendors (in fact, it is one of the few databases that is multi-vendor), and more importantly has an active and diverse developer and user community. In this article, we see how to work with the HBase shell in order to efficiently work on the massive amounts of data. The following excerpt is taken from the book '7 NoSQL Databases in a Week' authored by Aaron Ploetz et al. Working with the HBase shell The best way to get started with understanding HBase is through the HBase shell. Before we do that, we need to first install HBase. An easy way to get started is to use the Hortonworks sandbox. You can download the sandbox for free from https://hortonworks.com/products/sandbox/. The sandbox can be installed on Linux, Mac and Windows. Follow the instructions to get this set up. On any cluster where the HBase client or server is installed, type hbase shell to get a prompt into HBase: hbase(main):004:0> version 1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug 4 18:41:44 UTC 2016 This tells you the version of HBase that is running on the cluster. In this instance, the HBase version is 1.1.2, provided by a particular Hadoop distribution, in this case HDP 2.3.6: hbase(main):001:0> help HBase Shell, version 1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug 4 18:41:44 UTC 2016 Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command. Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group. This provides the set of operations that are possible through the HBase shell, which includes DDL, DML, and admin operations. hbase(main):001:0> create 'sensor_telemetry', 'metrics' 0 row(s) in 1.7250 seconds => Hbase::Table - sensor_telemetry This creates a table called sensor_telemetry, with a single column family called metrics. As we discussed before, HBase doesn't require column names to be defined in the table schema (and in fact, has no provision for you to be able to do so): hbase(main):001:0> describe 'sensor_telemetry' Table sensor_telemetry is ENABLED sensor_telemetry COLUMN FAMILIES DESCRIPTION {NAME => 'metrics', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE =>'0'} 1 row(s) in 0.5030 seconds This describes the structure of the sensor_telemetry table. The command output indicates that there's a single column family present called metrics, with various attributes defined on it. BLOOMFILTER indicates the type of bloom filter defined for the table, which can either be a bloom filter of the ROW type, which probes for the presence/absence of a given row key, or of the ROWCOL type, which probes for the presence/absence of a given row key, col-qualifier combination. You can also choose to have BLOOMFILTER set to None. The BLOCKSIZE configures the minimum granularity of an HBase read. By default, the block size is 64 KB, so if the average cells are less than 64 KB, and there's not much locality of reference, you can lower your block size to ensure there's not more I/O than necessary, and more importantly, that your block cache isn't wasted on data that is not needed. VERSIONS refers to the maximum number of cell versions that are to be kept around: hbase(main):004:0> alter 'sensor_telemetry', {NAME => 'metrics', BLOCKSIZE => '16384', COMPRESSION => 'SNAPPY'} Updating all regions with the new schema... 1/1 regions updated. Done. 0 row(s) in 1.9660 seconds Here, we are altering the table and column family definition to change the BLOCKSIZE to be 16 K and the COMPRESSION codec to be SNAPPY: hbase(main):004:0> version 1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug 4 18:41:44 UTC 2016 hbase(main):005:0> describe 'sensor_telemetry' Table sensor_telemetry is ENABLED sensor_telemetry COLUMN FAMILIES DESCRIPTION {NAME => 'metrics', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '16384', REPLICATION_SCOPE => '0'} 1 row(s) in 0.0410 seconds This is what the table definition now looks like after our ALTER table statement. Next, let's scan the table to see what it contains: hbase(main):007:0> scan 'sensor_telemetry' ROW COLUMN+CELL 0 row(s) in 0.0750 seconds No surprises, the table is empty. So, let's populate some data into the table: hbase(main):007:0> put 'sensor_telemetry', '/94555/20170308/18:30', 'temperature', '65' ERROR: Unknown column family! Valid column names: metrics:* Here, we are attempting to insert data into the sensor_telemetry table. We are attempting to store the value '65' for the column qualifier 'temperature' for a row key '/94555/20170308/18:30'. This is unsuccessful because the column 'temperature' is not associated with any column family. In HBase, you always need the row key, the column family and the column qualifier to uniquely specify a value. So, let's try this again: hbase(main):008:0> put 'sensor_telemetry', '/94555/20170308/18:30', 'metrics:temperature', '65' 0 row(s) in 0.0120 seconds Ok, that seemed to be successful. Let's confirm that we now have some data in the table: hbase(main):009:0> count 'sensor_telemetry' 1 row(s) in 0.0620 seconds => 1 Ok, it looks like we are on the right track. Let's scan the table to see what it contains: hbase(main):010:0> scan 'sensor_telemetry' ROW COLUMN+CELL /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810397402,value=65 1 row(s) in 0.0190 seconds This tells us we've got data for a single row and a single column. The insert time epoch in milliseconds was 1501810397402. In addition to a scan operation, which scans through all of the rows in the table, HBase also provides a get operation, where you can retrieve data for one or more rows, if you know the keys: hbase(main):011:0> get 'sensor_telemetry', '/94555/20170308/18:30' COLUMN CELL metrics:temperature timestamp=1501810397402, value=65 OK, that returns the row as expected. Next, let's look at the effect of cell versions. As we've discussed before, a value in HBase is defined by a combination of Row-key, Col-family, Col-qualifier, Timestamp. To understand this, let's insert the value '66', for the same row key and column qualifier as before: hbase(main):012:0> put 'sensor_telemetry', '/94555/20170308/18:30', 'metrics:temperature', '66' 0 row(s) in 0.0080 seconds Now let's read the value for the row key back: hbase(main):013:0> get 'sensor_telemetry', '/94555/20170308/18:30' COLUMN CELL metrics:temperature timestamp=1501810496459, value=66 1 row(s) in 0.0130 seconds This is in line with what we expect, and this is the standard behavior we'd expect from any database. A put in HBase is the equivalent to an upsert in an RDBMS. Like an upsert, put inserts a value if it doesn't already exist and updates it if a prior value exists. Now, this is where things get interesting. The get operation in HBase allows us to retrieve data associated with a particular timestamp: hbase(main):015:0> get 'sensor_telemetry', '/94555/20170308/18:30', {COLUMN => 'metrics:temperature', TIMESTAMP => 1501810397402} COLUMN CELL metrics:temperature timestamp=1501810397402,value=65 1 row(s) in 0.0120 seconds   We are able to retrieve the old value of 65 by providing the right timestamp. So, puts in HBase don't overwrite the old value, they merely hide it; we can always retrieve the old values by providing the timestamps. Now, let's insert more data into the table: hbase(main):028:0> put 'sensor_telemetry', '/94555/20170307/18:30', 'metrics:temperature', '43' 0 row(s) in 0.0080 seconds hbase(main):029:0> put 'sensor_telemetry', '/94555/20170306/18:30', 'metrics:temperature', '33' 0 row(s) in 0.0070 seconds Now, let's scan the table back: hbase(main):030:0> scan 'sensor_telemetry' ROW COLUMN+CELL /94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956, value=33 /94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941,value=67 3 row(s) in 0.0310 seconds We can also scan the table in reverse key order: hbase(main):031:0> scan 'sensor_telemetry', {REVERSED => true} ROW COLUMN+CELL /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67 /94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 /94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956,value=33 3 row(s) in 0.0520 seconds What if we wanted all the rows, but in addition, wanted all the cell versions from each row? We can easily retrieve that: hbase(main):032:0> scan 'sensor_telemetry', {RAW => true, VERSIONS => 10} ROW COLUMN+CELL /94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956, value=33 /94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67 /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810496459, value=66 /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810397402, value=65 Here, we are retrieving all three values of the row key /94555/20170308/18:30 in the scan result set. HBase scan operations don't need to go from the beginning to the end of the table; you can optionally specify the row to start scanning from and the row to stop the scan operation at: hbase(main):034:0> scan 'sensor_telemetry', {STARTROW => '/94555/20170307'} ROW COLUMN+CELL /94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 /94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67 2 row(s) in 0.0550 seconds HBase also provides the ability to supply filters to the scan operation to restrict what rows are returned by the scan operation. It's possible to implement your own filters, but there's rarely a need to. There's a large collection of filters that are already implemented: hbase(main):033:0> scan 'sensor_telemetry', {ROWPREFIXFILTER => '/94555/20170307'} ROW COLUMN+CELL /94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 1 row(s) in 0.0300 seconds This returns all the rows whose keys have the prefix /94555/20170307: hbase(main):033:0> scan 'sensor_telemetry', { FILTER => SingleColumnValueFilter.new( Bytes.toBytes('metrics'), Bytes.toBytes('temperature'), CompareFilter::CompareOp.valueOf('EQUAL'), BinaryComparator.new(Bytes.toBytes('66')))} The SingleColumnValueFilter can be used to scan a table and look for all rows with a given column value. We saw how fairly easy it is to work with your data in HBase using the HBase shell. If you found this excerpt useful, make sure you check out the book 'Seven NoSQL Databases in a Week', to get more hands-on information about HBase and the other popular NoSQL databases out there today. Read More Level Up Your Company’s Big Data with Mesos 2018 is the year of graph databases. Here’s why. Top 5 NoSQL Databases
Read more
  • 0
  • 0
  • 9711

article-image-chatgpt-for-ab-testing-in-marketing-campaigns
Valentina Alto
22 Sep 2023
5 min read
Save for later

ChatGPT for A/B Testing in Marketing Campaigns

Valentina Alto
22 Sep 2023
5 min read
Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!This article is an excerpt from the book, Modern Generative AI with ChatGPT and OpenAI Models, by Valentina Alto. Master core data architecture design concepts and Azure Data & AI services to gain a cloud data and AI architect’s perspective to developing end-to-end solutions.IntroductionIn the ever-evolving landscape of digital marketing, staying competitive and meeting customer expectations is paramount. This article explores the revolutionary potential of ChatGPT in enhancing multiple aspects of marketing. From refining A/B testing strategies to elevating SEO optimization techniques and harnessing sentiment analysis for measuring customer satisfaction, ChatGPT emerges as a pivotal tool. A/B testing for marketing comparisonAnother interesting field where ChatGPT can assist marketers is A/B testing.A/B testing in marketing is a method of comparing two different versions of a marketing campaign, advertisement, or website to determine which one performs better. In A/B testing, two variations of the same campaign or element are created, with only one variable changed between the two versions. The goal is to see which version generates more clicks, conversions, or other desired outcomes.An example of A/B testing might be testing two versions of an email campaign, using different subject lines, or testing two versions of a website landing page, with different call-to-action buttons. By measuring the response rate of each version, marketers can determine which version performs better and make data-driven decisions about which version to use going forward.A/B testing allows marketers to optimize their campaigns and elements for maximum effectiveness, leading to better results and a higher return on investment.Since this method involves the process of generating many variations of the same content, the generative power of ChatGPT can definitely assist in that.Let’s consider the following example. I’m promoting a new product I developed: a new, light and thin climbing harness for speed climbers. I’ve already done some market research and I know my niche audience. I also know that one great channel of communication for that audience is publishing on an online climbing blog, of which most climbing gyms’ members are fellow readers.My goal is to create an outstanding blog post to share the launch of this new harness, and I want to test two different versions of it in two groups. The blog post I’m about to publish and that I want to be the object of my A/B testing is the following:Figure – An example of a blog post to launch climbing gearHere, ChatGPT can help us on two levels:The first level is that of rewording the article, using different keywords or different attention grabbing slogans. To do so, once this post is provided as context, we can ask ChatGPT to work on the article and slightly change some elements:Figure – New version of the blog post generated by ChatGPTAs per my request, ChatGPT was able to regenerate only those elements I asked for (title, subtitle, and closing sentence) so that I can monitor the effectiveness of those elements by monitoring the reaction of the two audience groups.The second level is working on the design of the web page, namely, changing the collocation of the image rather than the position of the buttons. For this purpose, I created a simple web page for the blog post published in the climbing blog (you can find the code in the book’s GitHub repository at https://github.com/PacktPublishing/The-Ultimate-Guideto-ChatGPT-and-OpenAI/tree/main/Chapter%207%20-%20ChatGPT%20 for%20Marketers/Code):Figure  – Sample blog post published on the climbing blogWe can directly feed ChatGPT with the HTML code and ask it to change some layout elements, such as the position of the buttons or their wording. For example, rather than Buy Now, a reader might be more gripped by an I want one! button.So, lets feed ChatGPT with the HTML source code:Figure – ChatGPT changing HTML codeLet’s see what the output looks like:Figure – New version of the websiteAs you can see, ChatGPT only intervened at the button level, slightly changing their layout, position, color, and wording.Indeed, inspecting the source code of the two versions of the web pages, we can see how it differs in the button sections:Figure – Comparison between the source code of the two versions of the websiteConclusionChatGPT is a valuable tool for A/B testing in marketing. Its ability to quickly generate different versions of the same content can reduce the time to market of new campaigns. By utilizing ChatGPT for A/B testing, you can optimize your marketing strategies and ultimately drive better results for your business.Author BioValentina Alto graduated in 2021 in data science. Since 2020, she has been working at Microsoft as an Azure solution specialist, and since 2022, she has been focusing on data and AI workloads within the manufacturing and pharmaceutical industry. She has been working closely with system integrators on customer projects to deploy cloud architecture with a focus on modern data platforms, data mesh frameworks, IoT and real-time analytics, Azure Machine Learning, Azure Cognitive Services (including Azure OpenAI Service), and Power BI for dashboarding. Since commencing her academic journey, she has been writing tech articles on statistics, machine learning, deep learning, and AI in various publications and has authored a book on the fundamentals of machine learning with Python.
Read more
  • 0
  • 0
  • 9700
Modal Close icon
Modal Close icon