
How-To Tutorials

7019 Articles
Packt
31 Mar 2010
10 min read

Introduction to Railo Open Source

What is Railo?

Railo is an open source Java application server that implements CFML (ColdFusion Markup Language), the tag-based language from Adobe's commercial product “ColdFusion.” Its performance is excellent, and it includes features that significantly increase productivity. Railo is a relative newcomer, but has been making some impressive ripples in the industry lately. This article is a primer on some of the critical advantages of Railo and why it is worth a serious look for web application development.

Isn’t ColdFusion dead?

A few years back, an article was published naming 10 technologies that were dead or dying, and to many people's surprise, ColdFusion was on that list. That caused a lot of waves. One thing about CFML developers: they are passionate about their programming language! ColdFusion has seen moderate success in specific vertical markets, and has been notably well accepted by the US Government. Compared with the dominant development languages, CFML never seemed to find real favor with the masses. That has changed quite a bit since ColdFusion was re-engineered to run entirely on Java, and since the arrival a few years ago of Adobe Flex, which integrates Flash and ColdFusion. Adobe's ColdFusion product integrates so well with Flex that it has spawned new interest.

One of the largest complaints about Adobe ColdFusion has always been the price. It’s been my experience that CFML developers consider themselves to be industry peers of LAMP (Linux, Apache, MySQL, PHP) developers, who use all open source tools. The majority of LAMP developers consider their skills much higher than those of CFML developers. This has only fed the fury, over the years, of CFML developers who claim that purchasing ColdFusion pays for itself quickly because CFML is so much more productive. Now along comes Railo, offering a free and open source solution to the CFML developers' dreams.
Not only is it free, it also performs very well, is stable, and is updated reasonably frequently. This is good news for CFML, which is, in my opinion, highly underrated, mostly due to poor marketing and pricing over the years. CFML is actually quite a powerful and surprisingly productive language, and was designed to be a RAD (Rapid Application Development) tool. It has grown into a significantly better product, and certainly deserves more respect than it has had. But enough about CFML; let’s talk about why I find Railo so impressive and what distinguishes it from the competition.

What can you do with Railo?

Perhaps the best way to answer this is to ask, “What CAN'T you do with Railo?” The CFML language is essentially a big Java tag library. CFML has grown into an impressive library over the years, and Railo supports everything in mainstream use that Adobe's product supports. (There are some differences in support as both Railo and Adobe release new versions of their products.) The core features of Railo's language provide easy-to-learn tags for everything from database queries to sending dynamic email messages to scripting connections with FTP and Amazon S3 storage. Pretty much anything you can do with PHP you can do with Railo. Here's the catch: generally speaking, it takes less time to implement a solution using CFML than it does with PHP, ASP.NET or pure Java.

Use CFML for the basics; extend using Java

While Railo gives you a LOT of built-in functions, the truth is that it is Java under the hood. All the tags and functions ultimately get compiled and run as Java byte code. The language is well designed, however, so that you can mix and match your CFML and Java code.
For instance, if you wanted to read in a text file, you could use the built-in tag CFFILE:

    <cffile action="read" file="c:\web\message.txt" variable="strContent"></cffile>

This reads in the contents of the text file and stores them in the specified variable. To display that content in the web browser, you would output it like so:

    <cfoutput>#strContent#</cfoutput>

To illustrate how Java can be used directly in your code, a similar task can be done using Java objects instead of the built-in CFML tags, like so:

    <cfobject type="java" class="java.io.FileReader" action="create" name="myFileReader">
    <cfset myBufferedReader = createObject("java", "java.io.BufferedReader").init(myFileReader.init("c:\web\message.txt"))>
    <cfset strContent = myBufferedReader.readLine()>
    <cfset myBufferedReader.close()>
    <cfoutput>#strContent#</cfoutput>

These two small pieces of code pursue the same goal (as written, the Java version reads only the first line of the file; looping over readLine() would read the rest). My point is that the CFML language isn't limited to just CFML: you can instantiate and use any Java object anywhere within your code. This makes the language incredibly flexible, since you can use the CFML tags for quick and easy tasks, and use Java for heavy lifting where needed.

Deployment and Development Environments

All versions of Railo can be downloaded as an “express,” “server” or “custom” deployment. The express edition is extremely easy for developers to get up and running, and usually involves just decompressing a zip file onto your local system and starting it up. The server package comes with Caucho Resin, a very high performance Java application server. (Side note: some of the tools included with Resin are pretty impressive as well, including their all-Java implementation of PHP!) The custom deployment package is for launching Railo on other Java servlet containers like Tomcat or WebLogic. Setting up Railo on a production server wasn't difficult. Granted, it is a bit more involved than installing RPMs of your favorite PHP version, but documentation was easily found on Railo's site and on other sites found through Google.
Like Adobe's product, Railo comes with web administration tools to manage the server and application-specific settings and resources. This is a big step up from the PHP and Linux world, where you normally need to configure a lot of your application's settings (data sources, for example) in configuration files. The Railo administrator goes a few steps beyond Adobe as well, and makes context-specific administration consoles available, so individual applications and websites can define their own sandboxed data sources, virtual mappings, and more. This is a really nice touch, and has been a requested feature for a long time.

Where Railo Shines

I have already reviewed some of the reasons why Railo is impressive. Aside from being a very powerful RAD tool, with performance that rivals or beats Adobe, Railo distinguishes itself further with some impressive features.

Virtual File Systems and Mappings

As developers, we have all had to deal with managing remote or compressed files at one time or another. This feature in Railo does in a few mouse clicks what would otherwise take hundreds of lines of code. Railo lets you map remote file systems, like FTP, drive shares, and even Amazon S3 buckets, and assign them to a virtual path in your application! This means that you can use the simple built-in functions for file manipulation, and treat those files as if they were sitting right on the local file system. The support goes even further, and lets you map Java jar files and .zip files, so you can dynamically reference and run code sitting inside compressed archives. Setting up new mappings is a point-and-click affair in the Railo administrator, or can be done programmatically.

Application Distribution and Source Code Security

The Java world has always been a step (alright, several steps) ahead of web application developers in packaging and distribution of applications.
Many developers have their own home-grown methods for deploying a site, and many web development applications, like Dreamweaver, have an FTP-based method of deployment. Ultimately, it usually means handing over unprotected source code. CFML development has been the same way (yes, Adobe did have a way to compile .cfm templates, but my research shows it is both clumsy to use and not very popular). Railo brings “Java world” package deployment to CFML development. You can compile a whole application to Java byte code, compress it to a jar file and deploy it on any other Railo server. Railo is even smart enough to let you map a remote jar file on an FTP site and run it as a local web application. This means you have all the tools you need to deploy web applications without exposing your source.

Built-in AMF Support for Flex/Flash Applications

Since Adobe open-sourced their BlazeDS AMF tools, Railo has integrated them, making an easy-to-use system that “just works” with Flash applications.

Inter-Application Integration, PDF and Video Manipulation

CFML already has great capability for integrating with a huge number of database systems, and can be expanded to use any of the huge number of open source Java projects. Railo can be used to talk to Amazon Web Services, like EC2 and S3, for cloud computing applications. Railo also has built-in features for file conversions, such as dynamically generating PDFs, and for programmatic editing and format conversion of digital video. A few simple lines of code can convert your video files to different formats and extract thumbnails for web previews, and then you could have them dropped on Amazon S3 to be served from the cloud. Very cool stuff, and it is worth looking at some of the examples on the Railo website. Code that uses these features looks quite simple, and it is amazing that Railo makes them look like child’s play, but there is serious inter-system integration going on behind the scenes.
Railo makes it very easy to add these capabilities to any web application.

Infinitely Expandable with Java

As mentioned above, it is easy to invoke Java classes from within CFML pages. Since Railo itself runs in a Java container, any classes or code from the Java world can be integrated and used with a Railo application.

My Experience Building a Railo Project

My company has used ColdFusion for several projects. One of our commercial products is built on it and was originally designed for Adobe ColdFusion. Our product does a lot of heavy lifting with databases, internationalization, document format conversions, PDF previews and a lot more. Early in 2009 we did a complete conversion of the source to be compatible with Railo. There were only minor areas where our code needed to change, and most of them were in custom Java code we had written, which simply needed to be updated to be compatible with Railo's Java libraries. The pleasant surprise came when we were done and noticed a significant performance increase running on Railo.

Summary

In summary, I have been very impressed with Railo. It is community-driven; the people at Railo are responsive and truly care about the developer community, and the product really delivers what it claims. They have provided an application development platform that is both industry compatible and innovative. I think all seasoned web application developers will be able to appreciate what Railo has to offer. I believe that such powerful integration, done so easily with only a few lines of code, will draw a lot of attention. This is definitely a technology you should keep an eye on.
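For comparison with the file-reading example earlier in this article, here is a minimal plain-Java sketch of what the Java-object approach boils down to on the JVM. It is only an illustration: the class name FileReaderSketch, the method readAll and the default file name message.txt are assumptions made for this sketch and are not part of Railo or of the original example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical helper class; the names here are illustrative only.
class FileReaderSketch {

    // Reads the whole file into a String, line by line, using the same
    // java.io.FileReader class that the CFML example instantiates.
    static String readAll(String path) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumed default path; pass your own file as the first argument.
        String path = args.length > 0 ? args[0] : "message.txt";
        System.out.print(readAll(path));
    }
}
```

Seen side by side, the single CFFILE tag replaces the reader setup, the loop and the cleanup, which is the productivity argument the article makes for using CFML for the basics and Java only where it is needed.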

Packt
05 Feb 2010
4 min read

Modeling, Shading, Texturing, Lighting, and Compositing a Soda Can in Blender 2.49: Part 1

I wanted to base this article on the latest version of Blender (2.5), but I will hold off until everyone gets comfortable with it; who knows, in one of my upcoming articles we might delve into an introduction to the new version. But for now, let’s be courteous enough to use the fully functional 2.49 version of Blender. If you don’t have it right now, I suggest you head over to http://www.blender.org/download/get-blender/ and grab your own copy. You might also want a copy of the latest GIMP from http://www.gimp.org/downloads/.

REQUIREMENTS:

  • Skill level: Intermediate
  • Blender 2.49b (stable)
  • GIMP 2.6.8

INTRODUCTION:

Basically, we’ll use Blender’s modeling tools, material indexes, powerful texture system, basic UV unwrapping, some lighting techniques, and of course the node compositor that is built into Blender. I dedicate this article to my family and the whole Blender community, who have been very supportive of me during my past years of struggle and learning. It used to be just a wish that someday I might get the hang of this application as much as I did with GIMP, and finally, somehow, it did happen.

REFERENCE PHASE:

Before we even begin modeling and firing up Blender itself, let’s get ourselves some decent reference images on which to base our model. Anything will do; it depends entirely on your tastes and preferences. Doing a quick Google search, here are some that I found:

MODELING PHASE:

After carefully studying the shape and size of our reference soda cans, we can now proceed and start creating the base shape for the entire process. I think this might be a good time to say this line: “Fire up Blender!” Depending on your User Defaults and Preferences, your startup screen might look a bit different from mine, and your default object in the scene might be different too.
If you have objects other than a cube in your scene, kindly delete them first, since we’re only going to use the cube as our starting point. If you don’t have one right now, go ahead and add it from the Spacebar > Add > Mesh > Cube menu.

Adding a Cube to the Scene

You might have wondered why a Cube and not a Cylinder. It’s because we don’t want to work on extra polygons; just a few points will do. We will use some of Blender’s Modifiers to add contours and interpolation between points to achieve smooth curves on the segments. With our cube in the scene now, go ahead and select it (Right Mouse Button [RMB]), then press CTRL+2 on your keyboard to add a Subsurf Modifier to the selected object, or click the Editing (F9) button, scroll until you see the Modifiers tab, click Add Modifier and finally choose Subsurf. This will add a new modifier to our current stack.

Adding a Subsurf Modifier

After doing this, adjust some of the Subsurf options. Go ahead and change the Render Levels value to 3. If you wish, you could also change the Levels value to 3, so that what you see in your viewport is what you get in the render, but at the cost of a bit of a slowdown in your viewport (depending on how powerful your computer is). But despite adding a Subdivision Surface/Subsurf modifier to our Cube, why does it still look polygonal? That is because, by default, the interpolation between neighboring faces is set to Solid, which is why we see this sharp-edged transition between faces. To make it smoother, click the Editing (F9) button, scroll until you see the Links and Materials tab, then click Set Smooth; or, in Edit Mode, press W on your keyboard to bring up the Specials Menu and choose Set Smooth. Voila!
Smoothing out the Geometry

After this step, go to front view by pressing Numpad 1 on your numeric keypad and go to Edit Mode by pressing TAB, or by choosing it from the Mode dropdown at the bottom of your 3D view. Select the top-most four vertices and move them 1 Blender Unit up along the Z-axis. While moving them, hold down the Ctrl key to constrain the movement to increments of 1, and press Z on your keyboard to constrain the movement to the Z-axis only.

Moving the Top-most Vertices along Z
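To put a rough number on the viewport slowdown mentioned for higher Subsurf levels, the following small Java sketch counts faces per level. It assumes the modifier's Catmull-Clark subdivision type and a mesh made only of quads, in which case every level splits each quad into four; the class and method names are hypothetical and used only for this illustration.

```java
// Hypothetical illustration of how quickly Subsurf levels multiply geometry.
class SubsurfFaceCount {

    // With Catmull-Clark subdivision of an all-quad mesh,
    // each level turns every quad into four smaller quads.
    static long quadsAfter(long baseQuads, int levels) {
        long quads = baseQuads;
        for (int i = 0; i < levels; i++) {
            quads *= 4;
        }
        return quads;
    }

    public static void main(String[] args) {
        long cubeFaces = 6; // the default cube has six quad faces
        for (int level = 0; level <= 3; level++) {
            System.out.println("Level " + level + ": " + quadsAfter(cubeFaces, level) + " faces");
        }
        // prints 6, 24, 96 and 384 faces for levels 0 to 3
    }
}
```

Under these assumptions, a six-face cube at level 3 is already 384 faces, which is why keeping the viewport Levels value lower than the Render Levels value can be a reasonable trade-off on slower machines.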

Packt
15 Sep 2015
16 min read

Java Hibernate Collections, Associations, and Advanced Concepts

In this article by Yogesh Prajapati and Vishal Ranapariya, the authors of the book Java Hibernate Cookbook, you will find a guide to the following recipes:

  • Working with a first-level cache
  • One-to-one mapping using a common join table
  • Persisting Map

Working with a first-level cache

Without caching, every query we execute through Hibernate hits the database. As this process may be very expensive, Hibernate provides the facility to cache objects within a certain boundary. The basic actions performed in each database transaction are as follows:

  1. The request reaches the database server via the network.
  2. The database server processes the query in the query plan.
  3. The database server executes the processed query.
  4. The database server returns the result to the querying application through the network.
  5. The application processes the results.

This process is repeated every time we request a database operation, even for a simple or small query. It is always costly to hit the database for the same records multiple times. Sometimes, we also face some delay in receiving the results because of network routing issues. Other factors may contribute to the delay, but network routing issues play a major role in this cycle. To overcome this issue, the database uses a mechanism that stores the result of a query that is executed repeatedly, and uses this result again when the data is requested using the same query. These operations are done on the database side. Hibernate provides a built-in caching mechanism of its own, known as the first-level cache (L1 cache). The following are some properties of the first-level cache:

  • It is enabled by default. We cannot disable it even if we want to.
  • The scope of the first-level cache is limited to a particular Session object; other Session objects cannot access it.
  • All cached objects are destroyed once the session is closed.
  • If we request an object, Hibernate returns it from the cache if it is found there; otherwise, a database call is initiated.
  • We can use Session.evict(Object object) to remove a single object from the session cache.
  • The Session.clear() method is used to clear all the cached objects from the session.

Getting ready

Let's take a look at how the L1 cache works.

Creating the classes

For this recipe, we will create an Employee class and also insert some records into the table:

Source file: Employee.java

    @Entity
    @Table
    public class Employee {

        @Id
        @GeneratedValue
        private long id;

        @Column(name = "name")
        private String name;

        // getters and setters

        @Override
        public String toString() {
            return "Employee: "
                + "\n\t Id: " + this.id
                + "\n\t Name: " + this.name;
        }
    }

Creating the tables

Use the following script to create the employee table if the hibernate.hbm2ddl.auto configuration property is not set to create:

    CREATE TABLE `employee` (
      `id` bigint(20) NOT NULL AUTO_INCREMENT,
      `name` varchar(255) DEFAULT NULL,
      PRIMARY KEY (`id`)
    );

We will assume that two records are already inserted, as shown in the following employee table:

    id   name
    1    Yogesh
    2    Aarush

Now, let's take a look at some scenarios that show how the first-level cache works.

How to do it…

Here is the code to see how caching works.
In the code, we will load employee#1 and employee#2 once; after that, we will try to load the same employees again and see what happens:

Code

    System.out.println("\nLoading employee#1...");
    /* Line 2 */ Employee employee1 = (Employee) session.load(Employee.class, new Long(1));
    System.out.println(employee1.toString());

    System.out.println("\nLoading employee#2...");
    /* Line 6 */ Employee employee2 = (Employee) session.load(Employee.class, new Long(2));
    System.out.println(employee2.toString());

    System.out.println("\nLoading employee#1 again...");
    /* Line 10 */ Employee employee1_dummy = (Employee) session.load(Employee.class, new Long(1));
    System.out.println(employee1_dummy.toString());

    System.out.println("\nLoading employee#2 again...");
    /* Line 15 */ Employee employee2_dummy = (Employee) session.load(Employee.class, new Long(2));
    System.out.println(employee2_dummy.toString());

Output

    Loading employee#1...
    Hibernate: select employee0_.id as id0_0_, employee0_.name as name0_0_ from Employee employee0_ where employee0_.id=?
    Employee:
         Id: 1
         Name: Yogesh

    Loading employee#2...
    Hibernate: select employee0_.id as id0_0_, employee0_.name as name0_0_ from Employee employee0_ where employee0_.id=?
    Employee:
         Id: 2
         Name: Aarush

    Loading employee#1 again...
    Employee:
         Id: 1
         Name: Yogesh

    Loading employee#2 again...
    Employee:
         Id: 2
         Name: Aarush

How it works…

Here, we loaded Employee#1 and Employee#2, as shown in Lines 2 and 6 respectively, and printed both. It's clear from the output that Hibernate hits the database to load Employee#1 and Employee#2, because at startup no object is cached in the session. In Line 10, we tried to load Employee#1 again. This time, Hibernate did not hit the database but simply used the cached object, because Employee#1 was already loaded and the object was still in the session. The same thing happened with Employee#2.
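The hit-once-then-reuse behaviour described above can be sketched in plain Java. This is a toy model under stated assumptions, not Hibernate's implementation: SessionCacheSketch, the LongFunction used as a stand-in for the database, and the databaseHits counter are all hypothetical names introduced for this illustration, and unlike Session.evict(Object object) the sketch evicts by id to stay short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Toy session-scoped cache; all names are hypothetical and illustrative only.
class SessionCacheSketch<T> {

    private final Map<Long, T> cache = new HashMap<>();
    private final LongFunction<T> database; // stand-in for a real database lookup
    private int databaseHits = 0;

    SessionCacheSketch(LongFunction<T> database) {
        this.database = database;
    }

    // Returns the cached object if present; otherwise "hits the database" and caches the result.
    T load(long id) {
        T cached = cache.get(id);
        if (cached != null) {
            return cached;
        }
        databaseHits++;
        T loaded = database.apply(id);
        cache.put(id, loaded);
        return loaded;
    }

    // Removes one object from the cache (keyed by id in this sketch).
    void evict(long id) {
        cache.remove(id);
    }

    // Removes every object from the cache.
    void clear() {
        cache.clear();
    }

    int databaseHits() {
        return databaseHits;
    }

    public static void main(String[] args) {
        SessionCacheSketch<String> session = new SessionCacheSketch<>(id -> "Employee#" + id);

        session.load(1);
        session.load(2);
        session.load(1); // served from the cache
        session.load(2); // served from the cache
        System.out.println("Hits after repeated loads: " + session.databaseHits()); // 2

        session.evict(1);
        session.load(1); // hits the "database" again
        System.out.println("Hits after evict: " + session.databaseHits()); // 3

        session.clear();
        session.load(1);
        session.load(2);
        System.out.println("Hits after clear: " + session.databaseHits()); // 5
    }
}
```

Because the map lives inside one SessionCacheSketch instance, a second instance would start empty, which mirrors the property that one Session cannot see another Session's first-level cache.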
Hibernate stores an object in the cache only when one of the following operations is completed:

  • Save
  • Update
  • Get
  • Load
  • List

There's more…

In the previous section, we took a look at how caching works. Now, we will discuss the two methods used to remove cached objects from the session:

  • evict(Object object): This method removes a particular object from the session
  • clear(): This method removes all the objects from the session

evict(Object object)

This method is used to remove a particular object from the session. Once it is invoked, the object is no longer available in the session, and the next request for the object hits the database:

Code

    System.out.println("\nLoading employee#1...");
    /* Line 2 */ Employee employee1 = (Employee) session.load(Employee.class, new Long(1));
    System.out.println(employee1.toString());

    /* Line 5 */ session.evict(employee1);
    System.out.println("\nEmployee#1 removed using evict(…)...");

    System.out.println("\nLoading employee#1 again...");
    /* Line 9 */ Employee employee1_dummy = (Employee) session.load(Employee.class, new Long(1));
    System.out.println(employee1_dummy.toString());

Output

    Loading employee#1...
    Hibernate: select employee0_.id as id0_0_, employee0_.name as name0_0_ from Employee employee0_ where employee0_.id=?
    Employee:
         Id: 1
         Name: Yogesh

    Employee#1 removed using evict(…)...

    Loading employee#1 again...
    Hibernate: select employee0_.id as id0_0_, employee0_.name as name0_0_ from Employee employee0_ where employee0_.id=?
    Employee:
         Id: 1
         Name: Yogesh

Here, we loaded Employee#1, as shown in Line 2. This object was then cached in the session, but we explicitly removed it from the session cache in Line 5. So, loading Employee#1 again hits the database.

clear()

This method is used to remove all the cached objects from the session cache. Once it is invoked, they are no longer available in the session, and the next request for each object hits the database:

Code

    System.out.println("\nLoading employee#1...");
    /* Line 2 */ Employee employee1 = (Employee) session.load(Employee.class, new Long(1));
    System.out.println(employee1.toString());

    System.out.println("\nLoading employee#2...");
    /* Line 6 */ Employee employee2 = (Employee) session.load(Employee.class, new Long(2));
    System.out.println(employee2.toString());

    /* Line 9 */ session.clear();
    System.out.println("\nAll objects removed from session cache using clear()...");

    System.out.println("\nLoading employee#1 again...");
    /* Line 13 */ Employee employee1_dummy = (Employee) session.load(Employee.class, new Long(1));
    System.out.println(employee1_dummy.toString());

    System.out.println("\nLoading employee#2 again...");
    /* Line 17 */ Employee employee2_dummy = (Employee) session.load(Employee.class, new Long(2));
    System.out.println(employee2_dummy.toString());

Output

    Loading employee#1...
    Hibernate: select employee0_.id as id0_0_, employee0_.name as name0_0_ from Employee employee0_ where employee0_.id=?
    Employee:
         Id: 1
         Name: Yogesh

    Loading employee#2...
    Hibernate: select employee0_.id as id0_0_, employee0_.name as name0_0_ from Employee employee0_ where employee0_.id=?
    Employee:
         Id: 2
         Name: Aarush

    All objects removed from session cache using clear()...

    Loading employee#1 again...
    Hibernate: select employee0_.id as id0_0_, employee0_.name as name0_0_ from Employee employee0_ where employee0_.id=?
    Employee:
         Id: 1
         Name: Yogesh

    Loading employee#2 again...
    Hibernate: select employee0_.id as id0_0_, employee0_.name as name0_0_ from Employee employee0_ where employee0_.id=?
    Employee:
         Id: 2
         Name: Aarush

Here, Lines 2 and 6 show how to load Employee#1 and Employee#2 respectively. In Line 9, we removed all the objects from the session cache using the clear() method.
As a result, loading Employee#1 and Employee#2 again results in a database hit for each, as shown in Lines 13 and 17.

One-to-one mapping using a common join table

In this method, we will use a third table that contains the relationship between the employee and detail tables. In other words, the third table holds the primary key values of both tables to represent the relationship between them.

Getting ready

Use the following scripts and code to create the tables and classes. Here, we use Employee and Detail to show a one-to-one mapping using a common join table.

Creating the tables

Use the following scripts to create the tables if you are not using hibernate.hbm2ddl.auto=create|update.

Use the following script to create the detail table:

    CREATE TABLE `detail` (
      `detail_id` bigint(20) NOT NULL AUTO_INCREMENT,
      `city` varchar(255) DEFAULT NULL,
      PRIMARY KEY (`detail_id`)
    );

Use the following script to create the employee table:

    CREATE TABLE `employee` (
      `employee_id` BIGINT(20) NOT NULL AUTO_INCREMENT,
      `name` VARCHAR(255) DEFAULT NULL,
      PRIMARY KEY (`employee_id`)
    );

Use the following script to create the employee_detail table:

    CREATE TABLE `employee_detail` (
      `detail_id` BIGINT(20) DEFAULT NULL,
      `employee_id` BIGINT(20) NOT NULL,
      PRIMARY KEY (`employee_id`),
      KEY `FK_DETAIL_ID` (`detail_id`),
      KEY `FK_EMPLOYEE_ID` (`employee_id`),
      CONSTRAINT `FK_EMPLOYEE_ID` FOREIGN KEY (`employee_id`) REFERENCES `employee` (`employee_id`),
      CONSTRAINT `FK_DETAIL_ID` FOREIGN KEY (`detail_id`) REFERENCES `detail` (`detail_id`)
    );

Creating the classes

Use the following code to create the classes:

Source file: Employee.java

    @Entity
    @Table(name = "employee")
    public class Employee {

        @Id
        @GeneratedValue
        @Column(name = "employee_id")
        private long id;

        @Column(name = "name")
        private String name;

        @OneToOne(cascade = CascadeType.ALL)
        @JoinTable(
            name = "employee_detail",
            joinColumns = @JoinColumn(name = "employee_id"),
            inverseJoinColumns = @JoinColumn(name = "detail_id")
        )
        private Detail employeeDetail;

        public long getId() { return id; }
        public void setId(long id) { this.id = id; }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }

        public Detail getEmployeeDetail() { return employeeDetail; }
        public void setEmployeeDetail(Detail employeeDetail) { this.employeeDetail = employeeDetail; }

        @Override
        public String toString() {
            return "Employee"
                + "\n Id: " + this.id
                + "\n Name: " + this.name
                + "\n Employee Detail "
                + "\n\t Id: " + this.employeeDetail.getId()
                + "\n\t City: " + this.employeeDetail.getCity();
        }
    }

Source file: Detail.java

    @Entity
    @Table(name = "detail")
    public class Detail {

        @Id
        @GeneratedValue
        @Column(name = "detail_id")
        private long id;

        @Column(name = "city")
        private String city;

        @OneToOne(cascade = CascadeType.ALL)
        @JoinTable(
            name = "employee_detail",
            joinColumns = @JoinColumn(name = "detail_id"),
            inverseJoinColumns = @JoinColumn(name = "employee_id")
        )
        private Employee employee;

        public Employee getEmployee() { return employee; }
        public void setEmployee(Employee employee) { this.employee = employee; }

        public String getCity() { return city; }
        public void setCity(String city) { this.city = city; }

        public long getId() { return id; }
        public void setId(long id) { this.id = id; }

        @Override
        public String toString() {
            return "Employee Detail"
                + "\n Id: " + this.id
                + "\n City: " + this.city
                + "\n Employee "
                + "\n\t Id: " + this.employee.getId()
                + "\n\t Name: " + this.employee.getName();
        }
    }

How to do it…

In this section, we will take a look at how to insert a record step by step.

Inserting a record

Using the following code, we will insert an Employee record with a Detail object:

Code

    Detail detail = new Detail();
    detail.setCity("AHM");

    Employee employee = new Employee();
    employee.setName("vishal");
    employee.setEmployeeDetail(detail);

    Transaction transaction = session.getTransaction();
    transaction.begin();
    session.save(employee);
    transaction.commit();

Output

    Hibernate: insert into detail (city) values (?)
    Hibernate: insert into employee (name) values (?)
    Hibernate: insert into employee_detail (detail_id, employee_id) values (?,?)

Hibernate saves one record in the detail table and one in the employee table, and then inserts a record into the third table, employee_detail, using the primary key column values of the detail and employee tables.

How it works…

From the output, it's clear how this method works. The code is the same as in the other methods of configuring a one-to-one relationship, but here, Hibernate reacts differently. The first two statements of the output insert the records into the detail and employee tables respectively, and the third statement inserts the mapping record into the third table, employee_detail, using the primary key column values of both tables. Let's take a detailed look at the options used in the previous code:

  • @JoinTable: This annotation, written on the Employee class, contains the name="employee_detail" attribute and shows that a new intermediate table is created with the name "employee_detail"
  • joinColumns=@JoinColumn(name="employee_id"): This shows that a reference column is created in employee_detail with the name "employee_id", which is the primary key of the employee table
  • inverseJoinColumns=@JoinColumn(name="detail_id"): This shows that a reference column is created in the employee_detail table with the name "detail_id", which is the primary key of the detail table

Ultimately, the third table, employee_detail, is created with two columns: one is "employee_id" and the other is "detail_id".

Persisting Map

Map is used when we want to persist a collection of key/value pairs where the key is always unique. Some common implementations of java.util.Map are java.util.HashMap, java.util.LinkedHashMap, and so on. For this recipe, we will use java.util.HashMap.

Getting ready

Now, let's assume that we have a scenario where we are going to implement Map<String, String>; here, the String key is the e-mail address label, and the String value is the e-mail address.
For example, we will construct a data structure similar to <"Personal email", "emailaddress2@provider2.com">, <"Business email", "emailaddress1@provider1.com">. Each e-mail address gets an alias (the map key), so we can look up an address by its alias and document it in a more readable form. Whether this fits depends on your requirements; here, it lets us get the business e-mail address simply by using the Business email key.

Use the following code to create the required tables and classes.

Creating tables

Use the following script to create the tables if you are not using hbm2ddl.auto=create|update. These are the same tables that Hibernate would otherwise generate. The email table has a foreign key to employee, so create the employee table first.

Use the following code to create the employee table:

    CREATE TABLE `employee` (
      `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
      `name` VARCHAR(255) DEFAULT NULL,
      PRIMARY KEY (`id`)
    );

Use the following code to create the email table:

    CREATE TABLE `email` (
      `Employee_id` BIGINT(20) NOT NULL,
      `emails` VARCHAR(255) DEFAULT NULL,
      `emails_KEY` VARCHAR(255) NOT NULL DEFAULT '',
      PRIMARY KEY (`Employee_id`,`emails_KEY`),
      KEY `FK5C24B9C38F47B40` (`Employee_id`),
      CONSTRAINT `FK5C24B9C38F47B40` FOREIGN KEY (`Employee_id`) REFERENCES `employee` (`id`)
    );

Creating a class

Source file: Employee.java

    import java.util.Map;

    import javax.persistence.CollectionTable;
    import javax.persistence.Column;
    import javax.persistence.ElementCollection;
    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.Id;
    import javax.persistence.Table;

    @Entity
    @Table(name = "employee")
    public class Employee {

      @Id
      @GeneratedValue
      @Column(name = "id")
      private long id;

      @Column(name = "name")
      private String name;

      // Alias -> e-mail address pairs, stored in the email table
      @ElementCollection
      @CollectionTable(name = "email")
      private Map<String, String> emails;

      public long getId() { return id; }
      public void setId(long id) { this.id = id; }

      public String getName() { return name; }
      public void setName(String name) { this.name = name; }

      public Map<String, String> getEmails() { return emails; }
      public void setEmails(Map<String, String> emails) { this.emails = emails; }

      @Override
      public String toString() {
        return "Employee"
            + "\n\tId: " + this.id
            + "\n\tName: " + this.name
            + "\n\tEmails: " + this.emails;
      }
    }

How to do it…

Here, we will consider how to work with Map and its manipulation operations, such as inserting, retrieving, deleting, and updating.

Inserting a record

Here, we will create one employee record with two e-mail addresses:

Code

    Employee employee = new Employee();
    employee.setName("yogesh");

    Map<String, String> emails = new HashMap<String, String>();
    emails.put("Business email", "emailaddress1@provider1.com");
    emails.put("Personal email", "emailaddress2@provider2.com");
    employee.setEmails(emails);

    session.getTransaction().begin();
    session.save(employee);
    session.getTransaction().commit();

Output

    Hibernate: insert into employee (name) values (?)
    Hibernate: insert into email (Employee_id, emails_KEY, emails) values (?,?,?)
    Hibernate: insert into email (Employee_id, emails_KEY, emails) values (?,?,?)

When the code is executed, it inserts one record into the employee table and two records into the email table. Each record in the email table stores the primary key value of the employee record as a reference.

Retrieving a record

Here, we know that our record is inserted with id 1. So, we will try to get only that record and understand how Map works in our case.

Code

    Employee employee = (Employee) session.get(Employee.class, 1L);
    System.out.println(employee.toString());
    System.out.println("Business email: " + employee.getEmails().get("Business email"));

Output

    Hibernate: select employee0_.id as id0_0_, employee0_.name as name0_0_ from employee employee0_ where employee0_.id=?
    Hibernate: select emails0_.Employee_id as Employee1_0_0_, emails0_.emails as emails0_, emails0_.emails_KEY as emails3_0_ from email emails0_ where emails0_.Employee_id=?
    Employee
        Id: 1
        Name: yogesh
        Emails: {Personal email=emailaddress2@provider2.com, Business email=emailaddress1@provider1.com}
    Business email: emailaddress1@provider1.com

Here, we can easily get a business e-mail address using the Business email key from the map of e-mail addresses.
This is just a simple scenario created to demonstrate how to persist a Map in Hibernate.

Updating a record

Here, we will try to add one more e-mail address to Employee#1:

Code

    Employee employee = (Employee) session.get(Employee.class, 1L);
    Map<String, String> emails = employee.getEmails();
    emails.put("Personal email 1", "emailaddress3@provider3.com");

    session.getTransaction().begin();
    session.saveOrUpdate(employee);
    session.getTransaction().commit();

    System.out.println(employee.toString());

Output

    Hibernate: select employee0_.id as id0_0_, employee0_.name as name0_0_ from employee employee0_ where employee0_.id=?
    Hibernate: select emails0_.Employee_id as Employee1_0_0_, emails0_.emails as emails0_, emails0_.emails_KEY as emails3_0_ from email emails0_ where emails0_.Employee_id=?
    Hibernate: insert into email (Employee_id, emails_KEY, emails) values (?, ?, ?)
    Employee
        Id: 1
        Name: yogesh
        Emails: {Personal email 1=emailaddress3@provider3.com, Personal email=emailaddress2@provider2.com, Business email=emailaddress1@provider1.com}

Here, we added a new e-mail address with the Personal email 1 key and the value emailaddress3@provider3.com.

Deleting a record

Here again, we will try to delete the records of Employee#1 using the following code:

Code

    Employee employee = new Employee();
    employee.setId(1);

    session.getTransaction().begin();
    session.delete(employee);
    session.getTransaction().commit();

Output

    Hibernate: delete from email where Employee_id=?
    Hibernate: delete from employee where id=?

While deleting the object, Hibernate deletes the child records (here, the e-mail addresses) as well.

How it works…

Here again, we need to understand the table structures created by Hibernate: Hibernate creates a composite primary key in the email table using two fields, Employee_id and emails_KEY.

Summary

In this article, you familiarized yourself with recipes such as working with a first-level cache, one-to-one mapping using a common join table, and persisting a Map.

Packt
22 Oct 2014
10 min read

Implementing Stacks using JavaScript

In this article by Loiane Groner, author of the book Learning JavaScript Data Structures and Algorithms, we will discuss stacks.

A stack is an ordered collection of items that follows the LIFO (short for Last In First Out) principle. New items are added and existing items are removed at the same end. That end of the stack is known as the top, and the opposite end is known as the base. The newest elements are near the top, and the oldest elements are near the base.

We have several examples of stacks in real life, for example, a pile of books or a stack of trays in a cafeteria or food court.

A stack is also used by compilers in programming languages and by computer memory to store variables and method calls.

Creating a stack

We are going to create our own class to represent a stack. Let's start from the basics and declare our class:

    function Stack() {
      //properties and methods go here
    }

First, we need a data structure that will store the elements of the stack. We can use an array to do this:

    var items = [];

Next, we need to declare the methods available for our stack:

  • push(element(s)): This adds a new item (or several items) to the top of the stack.
  • pop(): This removes the top item from the stack. It also returns the removed element.
  • peek(): This returns the top element from the stack. The stack is not modified (it does not remove the element; it only returns the element for information purposes).
  • isEmpty(): This returns true if the stack does not contain any elements and false if the size of the stack is bigger than 0.
  • clear(): This removes all the elements of the stack.
  • size(): This returns how many elements the stack contains. It is similar to the length property of an array.

The first method we will implement is the push method.
This method will be responsible for adding new elements to the stack with one very important detail: we can only add new items to the top of the stack, meaning at the end of the stack. The push method is represented as follows:

    this.push = function(element){
      items.push(element);
    };

As we are using an array to store the elements of the stack, we can use the push method from the JavaScript array class.

Next, we are going to implement the pop method. This method will be responsible for removing items from the stack. As the stack uses the LIFO principle, the last item that we added is the one that is removed. For this reason, we can use the pop method from the JavaScript array class. The pop method is represented as follows:

    this.pop = function(){
      return items.pop();
    };

With the push and pop methods being the only methods available for adding and removing items from the stack, the LIFO principle will apply to our own Stack class.

Now, let's implement some additional helper methods for our class. If we would like to know what the last item added to our stack was, we can use the peek method. This method will return the item from the top of the stack:

    this.peek = function(){
      return items[items.length-1];
    };

As we are using an array to store the items internally, we can obtain the last item from the array using length - 1. For example, if the stack holds three items, the length of the internal array is 3 and the last position used in the internal array is 2. As a result, length - 1 (3 - 1) is 2.

The next method is the isEmpty method, which returns true if the stack is empty (no item has been added) and false otherwise:

    this.isEmpty = function(){
      return items.length === 0;
    };

Using the isEmpty method, we can simply verify whether the length of the internal array is 0.

Similar to the length property from the array class, we can also implement length for our Stack class.
For collections, we usually use the term "size" instead of "length". And again, as we are using an array to store the items internally, we can simply return its length:

    this.size = function(){
      return items.length;
    };

Finally, we are going to implement the clear method. The clear method simply empties the stack, removing all its elements. The simplest way of implementing this method is as follows:

    this.clear = function(){
      items = [];
    };

An alternative implementation would be calling the pop method until the stack is empty.

And we are done! Our Stack class is implemented. Just to make our lives easier during the examples, to help us inspect the contents of our stack, let's implement a helper method called print that is going to output the content of the stack on the console:

    this.print = function(){
      console.log(items.toString());
    };

And now we are really done!

The complete Stack class

Let's take a look at how our Stack class looks after its full implementation:

    function Stack() {

      var items = [];

      this.push = function(element){
        items.push(element);
      };

      this.pop = function(){
        return items.pop();
      };

      this.peek = function(){
        return items[items.length-1];
      };

      this.isEmpty = function(){
        return items.length === 0;
      };

      this.size = function(){
        return items.length;
      };

      this.clear = function(){
        items = [];
      };

      this.print = function(){
        console.log(items.toString());
      };
    }

Using the Stack class

Before we dive into some examples, we need to learn how to use the Stack class. The first thing we need to do is instantiate the Stack class we just created.
Next, we can verify whether it is empty (the output is true because we have not added any elements to our stack yet):

    var stack = new Stack();
    console.log(stack.isEmpty()); //outputs true

Next, let's add some elements to it (let's push the numbers 5 and 8; you can add any element type to the stack):

    stack.push(5);
    stack.push(8);

If we call the peek method, the output will be the number 8 because it was the last element that was added to the stack:

    console.log(stack.peek()); // outputs 8

Let's also add another element:

    stack.push(11);
    console.log(stack.size()); // outputs 3
    console.log(stack.isEmpty()); //outputs false

We added the element 11. If we call the size method, it will give the output as 3, because we have three elements in our stack (5, 8, and 11). Also, if we call the isEmpty method, the output will be false (we have three elements in our stack). Finally, let's add another element:

    stack.push(15);

After these push operations, the stack holds 5, 8, 11, and 15, with 5 at the base and 15 at the top.

Next, let's remove two elements from the stack by calling the pop method twice:

    stack.pop();
    stack.pop();
    console.log(stack.size()); // outputs 2
    stack.print(); // outputs 5,8

Before we called the pop method twice, our stack had four elements in it. After the execution of the pop method two times, the stack now has only two elements: 5 and 8. Each call removed the element at the top: first 15, then 11.

Decimal to binary

Now that we know how to use the Stack class, let's use it to solve some Computer Science problems.

You are probably already aware of the decimal base. However, binary representation is very important in Computer Science as everything in a computer is represented by binary digits (0 and 1). Without the ability to convert back and forth between decimal and binary numbers, it would be a little bit difficult to communicate with a computer.
To convert a decimal number to a binary representation, we can divide the number by 2 (binary is the base 2 number system) until the division result is 0, keeping the remainder of each division. As an example, we will convert the number 10 into binary digits: 10 / 2 = 5 with remainder 0, 5 / 2 = 2 with remainder 1, 2 / 2 = 1 with remainder 0, and 1 / 2 = 0 with remainder 1. Reading the remainders from last to first gives 1010.

This conversion is one of the first things you learn in college (Computer Science classes). The following is our algorithm:

    function divideBy2(decNumber){

      var remStack = new Stack(),
          rem,
          binaryString = '';

      while (decNumber > 0){ //{1}
        rem = Math.floor(decNumber % 2); //{2}
        remStack.push(rem); //{3}
        decNumber = Math.floor(decNumber / 2); //{4}
      }

      while (!remStack.isEmpty()){ //{5}
        binaryString += remStack.pop().toString();
      }

      return binaryString;
    }

In this code, while the division result is not zero (line {1}), we get the remainder of the division (mod) and push it to the stack (lines {2} and {3}), and finally, we update the number that will be divided by 2 (line {4}). An important observation: JavaScript has a numeric data type, but it does not distinguish integers from floating points. For this reason, we need to use the Math.floor function to obtain only the integer value from the division operations. And finally, we pop the elements from the stack until it is empty, concatenating the elements that were removed from the stack into a string (line {5}). Because the stack returns the remainders in reverse order of insertion, the most significant digit comes out first.

We can try the previous algorithm and output its result on the console using the following code:

    console.log(divideBy2(233));
    console.log(divideBy2(10));
    console.log(divideBy2(1000));

We can easily modify the previous algorithm to make it work as a converter from decimal to any base.
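To check the binary conversion described above end to end, here is a minimal, self-contained sketch. It re-declares a reduced Stack (only push, pop, and isEmpty) so that it runs on its own, and it adds a hypothetical helper, remainders, which is not part of the article and only exposes the order in which the remainders are pushed.

```javascript
// Reduced array-backed stack, mirroring the article's Stack class.
function Stack() {
  var items = [];
  this.push = function (element) { items.push(element); };
  this.pop = function () { return items.pop(); };
  this.isEmpty = function () { return items.length === 0; };
}

// Same algorithm as the article's divideBy2.
function divideBy2(decNumber) {
  var remStack = new Stack(),
      binaryString = '';
  while (decNumber > 0) {
    remStack.push(decNumber % 2);          // remainder is the next binary digit
    decNumber = Math.floor(decNumber / 2); // keep only the integer quotient
  }
  while (!remStack.isEmpty()) {
    binaryString += remStack.pop().toString(); // last remainder comes out first
  }
  return binaryString;
}

// Hypothetical helper (an assumption for illustration): remainders in push order.
function remainders(decNumber) {
  var result = [];
  while (decNumber > 0) {
    result.push(decNumber % 2);
    decNumber = Math.floor(decNumber / 2);
  }
  return result;
}

console.log(remainders(10).join(' ')); // 0 1 0 1
console.log(divideBy2(10));            // 1010
console.log(divideBy2(233));           // 11101001
console.log(divideBy2(1000));          // 1111101000
```

Note that, as written, divideBy2(0) returns an empty string because the first loop never runs; a caller that needs "0" for that input would have to handle it separately.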
Instead of dividing the decimal number by 2, we can pass the desired base as an argument to the method and use it in the divisions, as shown in the following algorithm:

    function baseConverter(decNumber, base){

      var remStack = new Stack(),
          rem,
          baseString = '',
          digits = '0123456789ABCDEF'; //{6}

      while (decNumber > 0){
        rem = Math.floor(decNumber % base);
        remStack.push(rem);
        decNumber = Math.floor(decNumber / base);
      }

      while (!remStack.isEmpty()){
        baseString += digits[remStack.pop()]; //{7}
      }

      return baseString;
    }

There is one more thing we need to change. In the conversion from decimal to binary, the remainders will be 0 or 1; in the conversion from decimal to octal, the remainders will be from 0 to 7; but in the conversion from decimal to hexadecimal, the remainders can be 0 to 9 plus the letters A to F (values 10 to 15). For this reason, we need to convert these values as well (lines {6} and {7}).

We can use the previous algorithm and output its result on the console as follows:

    console.log(baseConverter(100345, 2));
    console.log(baseConverter(100345, 8));
    console.log(baseConverter(100345, 16));

Summary

In this article, we learned about the stack data structure. We implemented our own class that represents a stack, and we learned how to add and remove elements from it using the push and pop methods. We also covered a very famous example of how to use a stack.
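As a closing check on the base converter above, the following self-contained sketch runs it for bases 2, 8, and 16 and compares each result with JavaScript's built-in Number.prototype.toString(radix). The reduced Stack and the matchesBuiltIn helper are assumptions added for illustration; they are not part of the article's code.

```javascript
// Reduced array-backed stack, mirroring the article's Stack class.
function Stack() {
  var items = [];
  this.push = function (element) { items.push(element); };
  this.pop = function () { return items.pop(); };
  this.isEmpty = function () { return items.length === 0; };
}

// Same algorithm as the article's baseConverter (bases 2 to 16).
function baseConverter(decNumber, base) {
  var remStack = new Stack(),
      baseString = '',
      digits = '0123456789ABCDEF';
  while (decNumber > 0) {
    remStack.push(decNumber % base);          // remainder is in the range 0..base-1
    decNumber = Math.floor(decNumber / base); // keep only the integer quotient
  }
  while (!remStack.isEmpty()) {
    baseString += digits[remStack.pop()];     // map 10..15 to A..F
  }
  return baseString;
}

// Hypothetical helper: cross-check against the built-in radix conversion.
function matchesBuiltIn(decNumber, base) {
  return baseConverter(decNumber, base) === decNumber.toString(base).toUpperCase();
}

console.log(baseConverter(100345, 2));   // 11000011111111001
console.log(baseConverter(100345, 8));   // 303771
console.log(baseConverter(100345, 16));  // 187F9
console.log(matchesBuiltIn(100345, 16)); // true
```

The digits string only covers values up to 15, so this sketch is limited to bases 2 through 16; a larger base would need a longer lookup string.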

Packt
19 Aug 2014
8 min read

BPMS Components

In this article by Mariano Nicolas De Maio, the author of jBPM6 Developer Guide, we will look into the various components of a Business Process Management (BPM) system.

BPM systems are pieces of software created with the sole purpose of guiding your processes through the BPM cycle. They were originally monolithic systems in charge of every aspect of a process, where processes had to be laboriously migrated from visual representations to executable definitions. They've come a long way from there, but we usually still picture that same old model when a system that runs all your business processes is mentioned. Nowadays, nothing is further from the truth.

Modern BPM systems are not monolithic environments; they're coordination agents. If a task is finished, they will know what to do next. If a decision needs to be made regarding the next step, they manage it. If a group of tasks can be concurrent, they turn them into parallel tasks. If a process's execution is efficient, about 0.1 percent of the processing time will be spent in the process engine and 99.9 percent on tasks in external systems. This is because they perform no heavy execution internally, only delegation to other systems. Also, they will be able to do this from nothing but a specific diagram for each process and specific connectors to external components.

To make this possible, they need to provide us with a structure and a set of tools. We'll start defining these to understand how the internal mechanisms of BPM systems work and, specifically, how jBPM6 implements them.

Components of a BPMS

All big systems become manageable when we divide their complexities into smaller pieces, which makes them easier to understand and implement.
BPM systems apply this by placing each function in a separate module and interconnecting the modules within a special structure.

[Figure: BPMS' internal structure (as implemented in jBPM6)]

Each component in this structure resolves one particular function inside the BPMS architecture, and we'll see a detailed explanation of each one of them.

The execution node

The execution node, as seen from a black box perspective, is the component that receives the process definitions (a description of each step that must be followed; from here on, we'll just refer to them as processes). Then, it executes all the necessary steps in the established way, keeping track of each step, variable, and decision that has to be taken in each process's execution (we'll start calling these process instances).

The execution node is composed of a set of low-level modules: the semantic module and the process engine.

The semantic module

The semantic module is in charge of defining the semantics of each specific language, that is, what each word means and how it will be translated to the internal structures that the process engine can execute. It consists of a series of parsers to understand different languages. It is flexible enough to allow you to extend and support multiple languages; it also allows the user to change the way already defined languages are interpreted for special use cases. It is a common component of most BPM systems, and in jBPM6, it allows you to add extensions of the process interpretations to the module. This is so that you can add your own language parsers, and define your very own text-based process definition language or extend existing ones.

The process engine

The process engine is the module that is in charge of the actual execution of our business processes.
It creates new process instances and keeps track of their state and their internal steps. Its job is to expose methods to inject process definitions and to create, start, and continue our process instances.

Understanding how the process engine works internally is a very important task for the people involved in BPM's stage 4, that is, runtime. This is where different configurations can be used to improve performance, integrate with other systems, and provide fault tolerance, clustering, and many other functionalities.

Process Engine structure

In the case of jBPM6, process definitions and process instances have similar structures but completely different objectives. A process definition only describes the steps a process should follow and its internal structure, keeping track of all the parameters it should have. A process instance, on the other hand, carries all of the information of one execution of a process, has a strategy for handling each step of the process, and keeps track of all its actual internal values.

Process definition structures

These structures are static representations of our business processes. However, from the process engine's internal perspective, these representations are far from the actual process structure that the engine is prepared to handle. In order for the engine to get those structures generated, it requires the previously described semantic module to transform those representations into the required object structure.

Using a process modeler, business analysts can draw business processes by dragging and dropping different activities from the modeler palette. For jBPM6, there is a web-based modeler designed to draw Scalable Vector Graphics (SVG) files; this is a type of image file that stores the image information as XML text, which is later transformed into valid BPMN2 files.
Note that BPMN2 and jBPM6 are not tied to each other. On one hand, the BPMN2 standard can be used by other process engine providers such as Activiti or Oracle BPM Suite. On the other hand, because of the semantic module, jBPM6 could easily work with other parsers to translate virtually any form of textual representation of a process to its internal structures.

In the internal structures, we have a root component (called Process in our case, which is finally implemented in a class called RuleFlowProcess) that will contain all the steps that are represented inside the process definition. From the jBPM6 perspective, you can manually create these structures using nothing but the objects provided by the engine. Inside the jBPM6-Quickstart project, you will find a code snippet doing exactly this in the createProcessDefinition() method of the ProgrammedProcessExecutionTest class:

    //Process Definition
    RuleFlowProcess process = new RuleFlowProcess();
    process.setId("myProgramaticProcess");

    //Start Task
    StartNode startTask = new StartNode();
    startTask.setId(1);

    //Script Task
    ActionNode scriptTask = new ActionNode();
    scriptTask.setId(2);
    DroolsAction action = new DroolsAction();
    action.setMetaData("Action", new Action() {
      @Override
      public void execute(ProcessContext context) throws Exception {
        System.out.println("Executing the Action!!");
      }
    });
    scriptTask.setAction(action);

    //End Task
    EndNode endTask = new EndNode();
    endTask.setId(3);

    //Adding the connections to the nodes and the nodes to the processes
    new ConnectionImpl(startTask, "DROOLS_DEFAULT", scriptTask, "DROOLS_DEFAULT");
    new ConnectionImpl(scriptTask, "DROOLS_DEFAULT", endTask, "DROOLS_DEFAULT");
    process.addNode(startTask);
    process.addNode(scriptTask);
    process.addNode(endTask);

Using this code, we can manually create the object structures that represent a process containing three components: a start node, a script node, and an end node.
In this case, this simple process is in charge of executing a simple action. The start and end tasks simply specify a sequence.

Even if this is a correct way to create a process definition, it is not the recommended one (unless you're making a low-level functionality test). Real-world, complex processes are better off being designed in a process modeler, with visual tools, and exported to standard representations such as BPMN 2.0. The output in both cases is the same: a process object that the jBPM6 runtime can understand. This hand-built definition is enough for our purposes while we analyze how process instance structures are created and executed.

Process instance structures

Process instances represent the running processes and all the information being handled by them. Every time you want to start a process execution, the engine will create a process instance. Each particular instance will keep track of all the activities that are being created by its execution.

In jBPM6, the structure is very similar to that of the process definitions, with one root structure (the ProcessInstance object) in charge of keeping all the information and NodeInstance objects to keep track of live nodes. The following code shows a simplification of the methods of the ProcessInstance implementation:

    public class RuleFlowProcessInstance implements ProcessInstance {
      public RuleFlowProcess getRuleFlowProcess() { ... }
      public long getId() { ... }
      public void start() { ... }
      public int getState() { ... }
      public void setVariable(String name, Object value) { ... }
      public Collection<NodeInstance> getNodeInstances() { ... }
      public Object getVariable(String name) { ... }
    }

After creating the instance, the engine calls the start() method of ProcessInstance. This method looks up the StartNode of the process and triggers it.
Depending on the execution path and how the different nodes connect to each other, other nodes will get triggered until the instance reaches a safe state, where the execution of the process is either completed or awaiting external data.

You can access the internal parameters of the process instance through the getVariable and setVariable methods. They provide local information from the particular process instance scope.

Summary

In this article, we saw the basic components required to set up a BPM system. With these components in place, we are ready to explore, in more detail, the structure and working of a BPM system.

Merlyn Shelley
08 Sep 2023
11 min read

AI_Distilled #16: Baidu's Ernie Chatbot, OpenAI's ChatGPT in Education, Meta's FACET Dataset, FMOps or LLMOps, Qualcomm's AI Focus, InteRecAgent, Liquid Neural Networks

👋 Hello,

"Artificial intelligence is one of the most profound things we're working on as humanity. It is more profound than fire or electricity." -Sundar Pichai, Google CEO

Pichai's AI-fire analogy signifies a transformative era; AI and ML will revolutionize education, medicine, and more, reshaping human progress. OpenAI has begun promoting the use of ChatGPT in education, which shouldn't really come as a surprise as students the world over have been experimenting with the technology. Get ready to dive into the latest AI developments in this edition, AI_Distilled #16, including Baidu launching Ernie chatbot following Chinese government approval, X's privacy policy revealing a plan to use public data for AI training, Meta releasing the FACET dataset to evaluate AI model fairness, Google's new Multislice for scalable AI training on cloud TPUs, and Qualcomm's focus on AI and auto amidst NVIDIA's chip dominance.

Watch out also for our handpicked collection of fresh AI, GPT, and LLM-focused secret knowledge and tutorials from around the web covering Liquid Neural Networks, Serverless Machine Learning with Amazon Redshift ML, implementing effective guardrails for LLMs, navigating generative AI with FMOps and LLMOps, and using Microsoft's new AI compiler quartet.

What do you think of this issue and our newsletter? Please consider taking the short survey to share your thoughts and you will get a free PDF of "The Applied Artificial Intelligence Workshop" eBook upon completion.

Writer's Credit: Special shout-out to Vidhu Jain for their valuable contribution to this week's newsletter content!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

⚡ TechWave: AI/GPT News & Analysis

Meta Releases FACET Dataset to Evaluate AI Model Fairness: Meta has launched FACET (FAirness in Computer Vision EvaluaTion), a dataset designed to assess the fairness of AI models used for image and video classification, including identifying people.
Comprising 32,000 images with 50,000 labeled individuals, FACET includes demographic and physical attributes, allowing for deep evaluations of biases against various classes. Despite previous concerns about Meta's responsible AI practices, the company claims FACET is more comprehensive than previous bias benchmarks. However, concerns have been raised about the dataset's origins and the compensation of annotators. Meta has also released a web-based dataset explorer tool for FACET, along with a paper describing the dataset.

Baidu Launches Ernie Chatbot Following Chinese Government Approval: Chinese tech giant Baidu has unveiled its chatbot, Ernie Bot, after receiving government clearance, along with other AI firms. Ernie Bot is now accessible for download via app stores or Baidu's website. Similar to its rival, ChatGPT, users can engage Ernie Bot for queries, market analysis assistance, marketing slogan ideas, and document summaries. While it's accessible globally, registration requires a Chinese number, and the app is only in Chinese on US Android and iOS stores. Baidu has also introduced a plug-in market for Ernie Bot, which quickly garnered over 1 million users within 19 hours of launch. CEO Robin Li expressed plans for further AI-native apps aimed at exploring generative AI's core abilities.

Google Introduces Multislice for Scalable AI Training on Cloud TPUs: Google has unveiled Multislice, a comprehensive large-scale training technology that facilitates straightforward, cost-effective, and nearly linear scaling to tens of thousands of Cloud Tensor Processing Unit (TPU) chips. Traditionally, a training run was restricted to a single slice, which meant a maximum of 3072 TPU v4 chips could be used. With Multislice, training can span multiple slices across pods through data center networking, eliminating these limitations.
This innovation offers benefits such as efficient scaling for massive models, enhanced developer productivity, automatic compiler optimizations, and cost-efficiency. It promises to revolutionize AI infrastructure by enabling near-linear scaling for AI supercomputing.

OpenAI Promotes Use of ChatGPT in Education: OpenAI is encouraging educators to utilize ChatGPT in classrooms. The company showcased six educators, primarily at the university level, using ChatGPT for various purposes, such as role-playing in debates, aiding translation for English-as-a-second-language students, and fact-checking. Despite some schools banning ChatGPT due to concerns about academic integrity, OpenAI believes it can be a valuable tool in education. However, it emphasizes the importance of maintaining human oversight in the assessment process.

X's Privacy Policy Reveals Plan to Use Public Data for AI Training: In an update to its privacy policy, X (formerly Twitter) has informed users that it will now collect biometric data, job histories, and education backgrounds. However, another section of the policy reveals a broader plan: X intends to utilize the data it gathers, along with publicly available information, to train its machine learning and AI models. This revelation has attracted attention, particularly due to the connection with X owner Elon Musk's ambitions in the AI market through his company xAI. Musk confirmed the privacy policy change, emphasizing that only public data, not private messages, would be used for AI training.

Qualcomm's Focus on AI and Auto Amidst NVIDIA's Chip Dominance: As NVIDIA takes the lead as the world's largest fabless chip company, Qualcomm is strategically positioning itself in the AI realm. The company has unveiled in-vehicle generative AI capabilities, expanded into two-wheelers, and forged a partnership with Amazon Web Services.
Qualcomm's CEO, Cristiano Amon, believes that generative AI, currently reliant on cloud resources, will transition to local execution, enhancing performance and cost-efficiency. Diversification is also a priority, with Qualcomm's chips powering various smart devices, especially in the automotive sector. Amid uncertainty about its future relationship with Apple, Qualcomm aims to maintain its dominance through innovations in AI and auto tech.

InteRecAgent, A Fusion of Language Models and Recommender Systems Introduced: Researchers from the University of Science and Technology of China, in collaboration with Microsoft Research Asia, have introduced InteRecAgent, a cutting-edge framework. This innovation seeks to combine the interactive capabilities of LLMs with the domain-specific precision of traditional recommender systems. Recommender systems play a vital role in various digital domains, but they often struggle with versatile interactions. On the other hand, LLMs excel in conversations but lack domain-specific knowledge. InteRecAgent introduces the "Candidate Memory Bus" to streamline recommendations for LLMs and a "Plan-first Execution with Dynamic Demonstrations" strategy for effective tool interaction.

adidas Utilizes AI and NVIDIA RTX for Photorealistic 3D Content: Sportswear giant adidas is partnering with Covision Media, an Italian startup, to revolutionize their online shopping experience. Covision employs AI and NVIDIA RTX technology to develop 3D scanners that allow businesses to create digital twins of their products with stunning realism. This technology can quickly generate 3D scans, capturing textures, colors, and geometry, resulting in lifelike images. adidas is among the first to adopt this technology for automating and scaling e-commerce content production, enhancing their Virtual Try-On feature and replacing traditional product photography with computer-generated content.
🔮 Expert Insights from Packt Community

Serverless Machine Learning with Amazon Redshift ML - By Debu Panda, Phil Bates, Bhanu Pittampally, Sumeet Joshi

Data analysts and developers use Redshift data with machine learning (ML) models for tasks such as predicting customer behavior. Amazon Redshift ML streamlines this process using familiar SQL commands. A conundrum arises when attempting to decipher these data silos – a formidable challenge that hampers the derivation of meaningful insights essential for organizational clarity. Adding to this complexity, security and performance considerations typically prevent business analysts from accessing data within OLTP systems. The hiccup is that intricate analytical queries weigh down OLTP databases, casting a shadow over their core operations.

Here, the solution is the data warehouse, which is a central hub of curated data, used by business analysts and data scientists to make informed decisions by employing the business intelligence and machine learning tools at their disposal. These users make use of Structured Query Language (SQL) to derive insights from this data trove.

Here’s where Amazon Redshift Serverless comes in. It’s a key option within Amazon Redshift, a well-managed cloud data warehouse offered by Amazon Web Services (AWS). With cloud-based ease, Amazon Redshift Serverless lets you set up your data storage without infrastructure hassles or cost worries. You pay based on what you use for compute and storage. Amazon Redshift Serverless goes beyond convenience, propelling modern data applications that seamlessly connect to the data lake.

The above content is extracted from the book Serverless Machine Learning with Amazon Redshift ML written by Debu Panda, Phil Bates, Bhanu Pittampally, Sumeet Joshi and published in Aug 2023. To get a glimpse of the book's contents, make sure to read the free chapter provided here, or if you want to unlock the full Packt digital library free for 7 days, try signing up now!
🌟 Secret Knowledge: AI/LLM Resources

Understanding Liquid Neural Networks: A Primer on AI Advancements: In this post, you'll learn how liquid neural networks are transforming the AI landscape. These networks, inspired by the human brain, offer a unique and creative approach to problem-solving. They excel in complex tasks such as weather prediction, stock market analysis, and speech recognition. Unlike traditional neural networks, liquid neural networks require significantly fewer neurons, making them ideal for resource-constrained environments like autonomous vehicles. These networks excel in handling continuous data streams but may not be suitable for static data. They also provide better causality handling and interpretability.

Navigating Generative AI with FMOps and LLMOps: A Practical Guide: In this informative post, you'll gain valuable insights into the world of generative AI and its operationalization using FMOps and LLMOps principles. The authors delve into the challenges businesses face when integrating generative AI into their operations. You'll explore the fundamental differences between traditional MLOps and these emerging concepts. The post outlines the roles various teams play in this process, from data engineers to data scientists, ML engineers, and product owners. The guide provides a roadmap for businesses looking to embrace generative AI.

AI Compiler Quartet: A Breakdown of Cutting-Edge Technologies: Explore Microsoft’s groundbreaking "heavy-metal quartet" of AI compilers: Rammer, Roller, Welder, and Grinder. These compilers address the evolving challenges posed by AI models and hardware. Rammer focuses on optimizing deep neural network (DNN) computations, improving hardware parallel utilization. Roller tackles the challenge of memory partitioning and optimization, enabling faster compilation with good computation efficiency.
Welder optimizes memory access, particularly vital as AI models become more memory-intensive. Grinder addresses complex control flow execution in AI computation. These AI compilers collectively offer innovative solutions for parallelism, compilation efficiency, memory, and control flow, shaping the future of AI model optimization and compilation.

💡 MasterClass: AI/LLM Tutorials

Exploring IoT Data Simulation with ChatGPT and MQTTX: In this comprehensive guide, you'll learn how to harness the power of AI, specifically ChatGPT, and the MQTT client tool, MQTTX, to simulate and generate authentic IoT data streams. Discover why simulating IoT data is crucial for system verification, customer experience enhancement, performance assessment, and rapid prototype design. The article dives into the integration of ChatGPT and MQTTX. Follow the step-by-step guide to create simulation scripts with ChatGPT and efficiently simulate data transmission with MQTTX.

Revolutionizing Real-time Inference: SageMaker Unveils Streaming Support for Generative AI: Amazon SageMaker now offers real-time response streaming, transforming generative AI applications. This new feature enables continuous response streaming to clients, reducing time-to-first-byte and enhancing interactive experiences for chatbots, virtual assistants, and music generators. The post guides you through building a streaming web application using SageMaker real-time endpoints for interactive chat use cases. It showcases deployment options with AWS Large Model Inference (LMI) and Hugging Face Text Generation Inference (TGI) containers, providing a seamless, engaging conversation experience for users.

Implementing Effective Guardrails for Large Language Models: Guardrails are crucial for maintaining trust in LLM applications as they ensure compliance with defined principles.
This guide presents two open-source tools for implementing LLM guardrails: Guardrails AI and NVIDIA NeMo-Guardrails. Guardrails AI offers Python-based validation of LLM responses, using the RAIL specification. It enables developers to define output criteria and corrective actions, with step-by-step instructions for implementation. NVIDIA NeMo-Guardrails introduces Colang, a modeling language for flexible conversational workflows. The guide explains its syntax elements and event-driven design. Comparing the two, Guardrails AI suits simple tasks, while NeMo-Guardrails excels in defining advanced conversational guidelines.

🚀 HackHub: Trending AI Tools

cabralpinto/modular-diffusion: Python library for crafting and training personalized Diffusion Models with PyTorch.
cofactoryai/textbase: Simplified Python chatbot development using NLP and ML with Textbase's on_message function in main.py.
microsoft/BatteryML: Open-source ML tool for battery analysis, aiding researchers in understanding electrochemical processes and predicting battery degradation.
facebookresearch/co-tracker: Swift transformer-based video tracker with Optical Flow, pixel-level tracking, grid sampling, and manual point selection.
explodinggradients/ragas: Framework evaluates Retrieval Augmented Generation pipelines, enhancing LLM context with external data using research-based tools.
Packt
02 Aug 2012
8 min read

OData on Mobile Devices

With the continuous evolution of mobile operating systems, smart mobile devices (such as smartphones or tablets) play increasingly important roles in everyone's daily work and life. The iOS (from Apple Inc., for iPhone, iPad, and iPod Touch devices), Android (from Google) and Windows Phone 7 (from Microsoft) operating systems have shown us the great power and potential of modern mobile systems. In the early days of the Internet, web access was mostly limited to fixed-line devices. However, with the rapid development of wireless network technology (such as 3G), Internet access has become a common feature for mobile or portable devices. Modern mobile OSes, such as iOS, Android, and Windows Phone, all provide rich APIs for network access (especially Internet-based web access). For example, it is quite convenient for mobile developers to create a native iPhone program that uses a network API to access remote RSS feeds from the Internet and present the retrieved data items on the phone screen. To make Internet-based data access and communication more convenient and standardized, we often rely on existing data formats, such as XML or JSON. It is therefore also a good idea to incorporate OData services in mobile application development, so that we can concentrate our effort on the main application logic instead of the details of the underlying data exchange and manipulation. In this article, we will discuss several cases of building OData client applications for various kinds of mobile device platforms. The first four recipes focus on how to deal with OData in applications running on Microsoft Windows Phone 7. They are followed by two recipes that discuss consuming an OData service in mobile applications running on the iOS and Android platforms.
Although this book is .NET developer-oriented, iOS and Android are the most popular and dominant mobile OSes in the market, so I think the last two recipes here would still be helpful (especially when the OData service is built upon WCF Data Service on the server side).

Accessing OData service with OData WP7 client library

What is the best way to consume an OData service in a Windows Phone 7 application? The answer is, by using the OData client library for Windows Phone 7 (OData WP7 client library). Just like the WCF Data Service client library for standard .NET Framework based applications, the OData WP7 client library allows developers to communicate with OData services via strongly typed proxy and entity classes in Windows Phone 7 applications. Also, the latest Windows Phone SDK 7.1 includes the OData WP7 client library and the associated developer tools. In this recipe, we will demonstrate how to use the OData WP7 client library in a standard Windows Phone 7 application.

Getting ready

The sample WP7 application we will build here provides a simple UI for users to view and edit the Categories data by using the Northwind OData service. The application consists of two phone screens, shown in the following screenshot:

Make sure you have installed Windows Phone SDK 7.1 (which contains the OData WP7 client library and tools) on the development machine. You can get the SDK from the following website: http://create.msdn.com/en-us/home/getting_started

The source code for this recipe can be found in the ch05ODataWP7ClientLibrarySln directory.

How to do it...

1. Create a new ASP.NET web application that contains the Northwind OData service.

2. Add a new Windows Phone Application project in the same solution (see the following screenshot).

3. Select Windows Phone OS 7.1 as the Target Windows Phone OS Version in the New Windows Phone Application dialog box (see the following screenshot).

4. Click on the OK button to finish the WP7 project creation. The following screenshot shows the default WP7 project structure created by Visual Studio:

5. Create a new Windows Phone Portrait Page (see the following screenshot) and name it EditCategory.xaml.

6. Create the OData client proxy (against the Northwind OData service) by using the Visual Studio Add Service Reference wizard.

7. Add the XAML content for the MainPage.xaml page (see the following XAML fragment).

    <Grid x:Name="ContentPanel" Grid.Row="1" Margin="12,0,12,0">
        <ListBox x:Name="lstCategories" ItemsSource="{Binding}">
            <ListBox.ItemTemplate>
                <DataTemplate>
                    <Grid>
                        <Grid.ColumnDefinitions>
                            <ColumnDefinition Width="60" />
                            <ColumnDefinition Width="260" />
                            <ColumnDefinition Width="140" />
                        </Grid.ColumnDefinitions>
                        <TextBlock Grid.Column="0" Text="{Binding Path=CategoryID}"
                                   FontSize="36" Margin="5"/>
                        <TextBlock Grid.Column="1" Text="{Binding Path=CategoryName}"
                                   FontSize="36" Margin="5" TextWrapping="Wrap"/>
                        <HyperlinkButton Grid.Column="2" Content="Edit"
                                         HorizontalAlignment="Right"
                                         NavigateUri="{Binding Path=CategoryID, StringFormat='/EditCategory.xaml?ID={0}'}"
                                         FontSize="36" Margin="5"/>
                    </Grid>
                </DataTemplate>
            </ListBox.ItemTemplate>
        </ListBox>
    </Grid>

8. Add the code for loading the Category list in the code-behind file of the MainPage.xaml page (see the following code snippet).

    public partial class MainPage : PhoneApplicationPage
    {
        ODataSvc.NorthwindEntities _ctx = null;
        DataServiceCollection<ODataSvc.Category> _categories = null;
        ......

        private void PhoneApplicationPage_Loaded(object sender, RoutedEventArgs e)
        {
            Uri svcUri = new Uri("http://localhost:9188/NorthwindOData.svc");
            _ctx = new ODataSvc.NorthwindEntities(svcUri);
            _categories = new DataServiceCollection<ODataSvc.Category>(_ctx);
            _categories.LoadCompleted += (o, args) =>
            {
                // Keep requesting pages while the service reports a continuation.
                if (_categories.Continuation != null)
                    _categories.LoadNextPartialSetAsync();
                else
                {
                    // All pages are loaded; bind the list on the UI thread.
                    this.Dispatcher.BeginInvoke(
                        () =>
                        {
                            ContentPanel.DataContext = _categories;
                            ContentPanel.UpdateLayout();
                        }
                    );
                }
            };
            var query = from c in _ctx.Categories select c;
            _categories.LoadAsync(query);
        }
    }

9. Add the XAML content for the EditCategory.xaml page (see the following XAML fragment).

    <Grid x:Name="ContentPanel" Grid.Row="1" Margin="12,0,12,0">
        <StackPanel>
            <TextBlock Text="{Binding Path=CategoryID, StringFormat='Fields of Categories({0})'}"
                       FontSize="40" Margin="5" />
            <Border>
                <StackPanel>
                    <TextBlock Text="Category Name:" FontSize="24" Margin="10" />
                    <TextBox x:Name="txtCategoryName"
                             Text="{Binding Path=CategoryName, Mode=TwoWay}" />
                    <TextBlock Text="Description:" FontSize="24" Margin="10" />
                    <TextBox x:Name="txtDescription"
                             Text="{Binding Path=Description, Mode=TwoWay}" />
                </StackPanel>
            </Border>
            <StackPanel Orientation="Horizontal" HorizontalAlignment="Center">
                <Button x:Name="btnUpdate" Content="Update"
                        HorizontalAlignment="Center" Click="btnUpdate_Click" />
                <Button x:Name="btnCancel" Content="Cancel"
                        HorizontalAlignment="Center" Click="btnCancel_Click" />
            </StackPanel>
        </StackPanel>
    </Grid>

10. Add the code for editing the selected Category item in the code-behind file of the EditCategory.xaml page. In the PhoneApplicationPage_Loaded event, we will load the properties of the selected Category item and display them on the screen (see the following code snippet).

    private void PhoneApplicationPage_Loaded(object sender, RoutedEventArgs e)
    {
        EnableControls(false);
        Uri svcUri = new Uri("http://localhost:9188/NorthwindOData.svc");
        _ctx = new ODataSvc.NorthwindEntities(svcUri);
        // The ID of the selected Category arrives in the page's query string.
        var id = int.Parse(NavigationContext.QueryString["ID"]);
        var query = _ctx.Categories.Where(c => c.CategoryID == id);
        _categories = new DataServiceCollection<ODataSvc.Category>(_ctx);
        _categories.LoadCompleted += (o, args) =>
        {
            if (_categories.Count <= 0)
            {
                MessageBox.Show("Failed to retrieve Category item.");
                NavigationService.GoBack();
            }
            else
            {
                EnableControls(true);
                ContentPanel.DataContext = _categories[0];
                ContentPanel.UpdateLayout();
            }
        };
        _categories.LoadAsync(query);
    }

11. The code for updating changes (against the Category item) is put in the Click event of the Update button (see the following code snippet).

    private void btnUpdate_Click(object sender, RoutedEventArgs e)
    {
        EnableControls(false);
        _ctx.UpdateObject(_categories[0]);
        _ctx.BeginSaveChanges(
            (ar) =>
            {
                // Complete the save on the UI thread so we can navigate or show a message.
                this.Dispatcher.BeginInvoke(
                    () =>
                    {
                        try
                        {
                            var response = _ctx.EndSaveChanges(ar);
                            NavigationService.Navigate(new Uri("/MainPage.xaml", UriKind.Relative));
                        }
                        catch (Exception ex)
                        {
                            MessageBox.Show("Failed to save changes.");
                            EnableControls(true);
                        }
                    }
                );
            },
            null
        );
    }

12. Select the WP7 project and launch it in Windows Phone Emulator (see the following screenshot). Depending on the performance of the development machine, it might take a while to start the emulator. Running a WP7 application in Windows Phone Emulator is very helpful, especially when the phone application needs to access some web services (such as WCF Data Service) hosted on the local machine (via the Visual Studio test web server).

How it works...

Since the OData WP7 client library (and tools) has been installed together with Windows Phone SDK 7.1, we can directly use the Visual Studio Add Service Reference wizard to generate the OData client proxy in Windows Phone applications. The generated OData proxy is the same as what we used in standard .NET applications.
Similarly, all network access code (such as the OData service consumption code in this recipe) has to follow the asynchronous programming pattern in Windows Phone applications.

There's more...

In this recipe, we use the Windows Phone Emulator for testing. If you want to deploy and test your Windows Phone application on a real device, you need to obtain a Windows Phone developer account so as to unlock your Windows Phone device. Refer to the walkthrough App Hub - windows phone developer registration walkthrough, available at http://go.microsoft.com/fwlink/?LinkID=202697
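The LoadCompleted handler in this recipe keeps asking for the next partial set while the collection reports a continuation. As a rough, language-neutral illustration of that paging loop, here is a small Python sketch. It does not call a real OData service: the page layout (a "results" list plus a "next" link), the fake_fetch helper, and the URLs are all assumptions made up for the example.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def iter_all_entities(first_url: str,
                      fetch_page: Callable[[str], Page]) -> Iterator[dict]:
    """Yield every entity, following continuation links until none is left."""
    url: Optional[str] = first_url
    while url is not None:
        page = fetch_page(url)
        for entity in page["results"]:
            yield entity
        # A missing or None "next" link means the server has no more pages.
        url = page.get("next")


# Hypothetical in-memory stand-in for a paged service response.
FAKE_PAGES: Dict[str, Page] = {
    "/Categories": {
        "results": [{"CategoryID": 1}, {"CategoryID": 2}],
        "next": "/Categories?page=2",
    },
    "/Categories?page=2": {
        "results": [{"CategoryID": 3}],
        "next": None,
    },
}


def fake_fetch(url: str) -> Page:
    """Return the canned page for a URL instead of doing network I/O."""
    return FAKE_PAGES[url]


all_categories = list(iter_all_entities("/Categories", fake_fetch))
print([c["CategoryID"] for c in all_categories])
```

The sketch is synchronous for brevity; on Windows Phone the same loop has to be expressed through callbacks, which is why the recipe re-issues the request from inside the LoadCompleted handler.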

Robi Sen
16 Apr 2015
4 min read

Text Mining with R: Part 2

In Part 1, we covered the basics of doing text mining in R by selecting data, preparing it, cleaning it, and then performing various operations on it to visualize that data. In this post we look at a simple use case showing how we can derive real meaning and value from a visualization by seeing how a simple word cloud can help you understand the impact of an advertisement.

Building the document matrix

A common technique in text mining is using a matrix of documents and terms called a document term matrix. A document term matrix is simply a matrix where rows are documents, columns are terms, and each cell holds the number of occurrences of that term within that document. If you reverse the order and have terms as rows and documents as columns, it's called a term document matrix. For example, let's say we have two documents, D1 and D2:

    D1 = "I like cats"
    D2 = "I hate cats"

Then the document term matrix would look like:

          I   like   hate   cats
    D1    1   1      0      1
    D2    1   0      1      1

To make a document term matrix in R for our project, all you need to do is use DocumentTermMatrix() like this:

    tdm <- DocumentTermMatrix(mycorpus)

You can see information on your document term matrix by using print:

    print(tdm)
    <<DocumentTermMatrix (documents: 4688, terms: 18363)>>
    Non-/sparse entries: 44400/86041344
    Sparsity           : 100%
    Maximal term length: 65
    Weighting          : term frequency (tf)

Next, we need to sum up all the values in each term column so that we can derive the frequency of each term's occurrence. We also want to sort those values from highest to lowest. You can use this code:

    m <- as.matrix(tdm)
    v <- sort(colSums(m), decreasing=TRUE)

Next we will use names() to pull each term object's name, which in our case is a word. Then we want to build a data frame from our words associated with their frequency of occurrence. Finally, we want to create our word cloud but leave out any terms that occur fewer than 45 times to reduce clutter in our word cloud. You could also use max.words to limit the total number of words in your word cloud. So your final code should look like this:

    words <- names(v)
    d <- data.frame(word=words, freq=v)
    wordcloud(d$word, d$freq, min.freq=45)

If you run this in RStudio you should see something like the figure, which shows the words with the highest occurrence in our corpus. The wordcloud() function automatically scales the drawn words by their frequency value. From here you can do a lot with your word cloud, including changing the scale, associating color with various values, and much more. You can read more about wordcloud here.

While word clouds are often used on the web for things like blogs, news sites, and other similar use cases, they have real value for data analysis beyond just being visual indicators that help users find terms of interest. For example, if you look at the word cloud we generated you will notice that one of the most popular terms mentioned in tweets is chocolate. Doing a short inspection of our CSV document for the term chocolate, we find a lot of people mentioning the word in a variety of contexts, but one of the most common is in relation to a specific Super Bowl ad. For example, here is a tweet:

    Alexalabesky 41673.39 Chocolate chips and peanut butter 0 0 0 Unknown Unknown Unknown Unknown Unknown

This appeared after the airing of this advertisement from Butterfinger. So even with this simple R code we can generate real meaning from social media, which here is the measurable impact of an advertisement during the Super Bowl.

Summary

In this post we looked at a simple use case showing how we can derive real meaning and value from a visualization by seeing how a simple word cloud can help you understand the impact of an advertisement.
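To make the document term matrix and the column sums from this post more concrete, here is a small Python sketch that rebuilds the D1/D2 example by hand. It is only an illustration and not the tm package's implementation: it assumes plain whitespace tokenization with no lowercasing or stop-word removal, and the variable names are made up for the example.

```python
from collections import Counter

# The two toy documents from the example in the post.
documents = {"D1": "I like cats", "D2": "I hate cats"}

# Assumption: split on whitespace only; a real pipeline would also clean the text.
tokenized = {name: text.split() for name, text in documents.items()}

# Vocabulary in first-seen order: these are the matrix columns.
terms = []
for tokens in tokenized.values():
    for token in tokens:
        if token not in terms:
            terms.append(token)

# One row per document, one count per term.
dtm = {
    name: [Counter(tokens)[term] for term in terms]
    for name, tokens in tokenized.items()
}

# Column sums, the counterpart of colSums(m) in the R code.
totals = {term: sum(row[i] for row in dtm.values())
          for i, term in enumerate(terms)}

# Highest frequency first, the counterpart of sort(..., decreasing=TRUE).
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

print(terms)
print(dtm)
print(ranked)
```

The column order here follows first appearance (I, like, cats, hate), so it differs from the table in the post, but the counts per document and per term are the same.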
About the author Robi Sen, CSO at Department 13, is an experienced inventor, serial entrepreneur, and futurist whose dynamic twenty-plus year career in technology, engineering, and research has led him to work on cutting edge projects for DARPA, TSWG, SOCOM, RRTO, NASA, DOE, and the DOD. Robi also has extensive experience in the commercial space, including the co-creation of several successful start-up companies. He has worked with companies such as UnderArmour, Sony, CISCO, IBM, and many others to help build out new products and services. Robi specializes in bringing his unique vision and thought process to difficult and complex problems allowing companies and organizations to find innovative solutions that they can rapidly operationalize or go to market with.

Jean Jung
19 Jan 2017
7 min read

Background jobs on Django with Celery

While building web applications, you usually need to run some operations in the background to improve application performance, or because a job really needs to run outside of the application environment. In both cases, if you are on Django, you are in good hands because you have Celery, the Distributed Task Queue written in Python. Celery is a tiny but complete project. You can find more information on the project page. In this post, we will see how easy it is to integrate Celery with an existing project, and although we are focusing on Django here, creating a standalone Celery worker is a very similar process.

Installing Celery

The first step is to install Celery. If you already have it, please move on to the next section! Like every good Python package, Celery can be installed with pip. You can install it just by entering:

    pip install celery

Choosing a message broker

The second step is about choosing a message broker to act as the job queue. Celery can talk with a great variety of brokers; the main ones are:

  • RabbitMQ
  • Redis ¹
  • Amazon SQS ²

Check for support on other brokers here. If you're already using any of these brokers for other purposes, choose it as your primary option. In this section there is nothing more you have to do. Celery is very transparent and does not require any source modification to move from one broker to another, so feel free to try more than one after we end here. OK, let's move on, but first do not forget to look at the little notes below.

¹: For Redis (a great choice in my opinion), you have to install the celery[redis] package.
²: Celery has great features like web monitoring that do not work with this broker.

Celery worker entrypoint

When running Celery on a directory it will search for a file called celery.py, which is the application entrypoint, where the configs are loaded and the application object resides.
Working with Django, this file is commonly stored in the project directory, along with the settings.py file; your file structure should look like this:

    your_project_name
        your_project_name
            __init__.py
            settings.py
            urls.py
            wsgi.py
            celery.py
        your_app_name
            __init__.py
            models.py
            views.py
            ….

The settings read by that file will be in the same settings.py file that Django uses. At this point we can take a look at the celery.py file example from the official documentation. This code is basically the same for every project; just replace proj with your project name and save that file. Each part is described in the file comments.

    from __future__ import absolute_import, unicode_literals
    import os
    from celery import Celery

    # set the default Django settings module for the 'celery' program.
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'proj.settings')

    app = Celery('proj')

    # Using a string here means the worker doesn't have to serialize
    # the configuration object to child processes.
    # - namespace='CELERY' means all celery-related configuration keys
    #   should have a `CELERY_` prefix.
    app.config_from_object('django.conf:settings', namespace='CELERY')

    # Load task modules from all registered Django app configs.
    # This is not required, but as you can have more than one app
    # with tasks it's better to do the autoload than declaring all tasks
    # in this same file.
    app.autodiscover_tasks()

Settings

By default, Celery depends only on the broker_url setting to work. As we've seen in the previous section, your settings will be stored alongside the Django ones but with the 'CELERY_' prefix. The broker_url format is as follows:

    CELERY_BROKER_URL = 'broker://[[user]:[password]@]host[:port[/resource]]'

Here, broker is an identifier that specifies the chosen broker, like amqp or redis; user and password are the credentials for the service, if needed; host and port are the address of the service; and resource is a broker-specific path to the component resource.
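To see how the pieces of the broker URL format map onto a concrete URL, here is a small sketch that splits a URL with Python's standard urllib.parse module. This only illustrates the format; it is not how Celery itself parses the setting, and the describe_broker_url helper, the RabbitMQ host name, and the credentials are made up for the example.

```python
from urllib.parse import urlsplit


def describe_broker_url(url: str) -> dict:
    """Split a broker URL into the parts named in the format string."""
    parts = urlsplit(url)
    return {
        "broker": parts.scheme,        # e.g. amqp or redis
        "user": parts.username,        # None when no credentials are given
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,            # int, or None when omitted
        "resource": parts.path.lstrip("/") or None,
    }


# A local Redis using database 0, with no credentials.
redis_parts = describe_broker_url("redis://localhost:6379/0")

# A hypothetical RabbitMQ server with credentials and a virtual host.
amqp_parts = describe_broker_url(
    "amqp://guest:secret@rabbit.example.com:5672/myvhost"
)

print(redis_parts)
print(amqp_parts)
```

For the Redis URL the resource part is the database number, while for the hypothetical RabbitMQ URL it is the virtual host, which is what "broker-specific path" means in practice.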
For example, if you've chosen a local Redis as your broker, your broker URL will be ¹:

    CELERY_BROKER_URL = 'redis://localhost:6379/0'

¹: Considering a default Redis installation with the database 0 being used.

Doing this we have a functioning Celery worker. How lucky! It's so simple! But wait, what about the tasks? How do we write and execute them? Let's see.

Creating and running tasks

Because of the superpowers Celery has, it can autoload tasks from Django app directories as we've seen before; you just have to declare your app tasks in a file called tasks.py in the app dir:

    your_project_name
        your_project_name
            __init__.py
            settings.py
            urls.py
            wsgi.py
            celery.py
        your_app_name
            __init__.py
            models.py
            views.py
            tasks.py
            ….

In that file you just need to put functions decorated with the celery.shared_task decorator. So suppose we want to do a background mailer; the source will be like this:

    from __future__ import absolute_import, unicode_literals
    from celery import shared_task
    from django.core.mail import send_mail

    @shared_task
    def mailer(subject, message, recipient_list, from_email='default@admin.com'):
        # send_mail expects (subject, message, from_email, recipient_list).
        send_mail(subject, message, from_email, recipient_list)

Then in the Django application, at any place where you have to send an e-mail in the background, just do the following:

    from __future__ import absolute_import
    from app.tasks import mailer
    ….
    def send_email_to_user(request):
        if request.user:
            mailer.delay('Alert Foo', 'The foo message', [request.user.email])

delay is probably the most used way to submit a job to a Celery worker, but it is not the only one. Check this reference to see what is possible to do. There are many features like task chaining, future scheduling, and more!

As you may have noticed, in a great majority of the files, we have used the from __future__ import absolute_import statement. This is very important, mainly with Python 2, because of the way Celery serializes messages to post tasks on brokers.
You need to follow the same convention when creating and using tasks, as otherwise the namespace of the task will differ and the task will not get executed. The absolute_import feature forces you to use absolute imports, so you will avoid these problems. Check this link for more information.

Running the worker

If you take the source code above, put everything in the right place, and run the Django development server to test your background jobs, they will not work! Wait. This is because you don't have a Celery worker started yet. To start it, cd to the project main directory (the same one where you run python manage.py runserver, for example) and run:

    celery -A your_project_name worker -l info

Replace your_project_name with your project and info with the desired log level. Keep this process running, start the Django server, and yes, now you can see that everything works!

Where to go now?

Explore the Celery documentation and see all the available features, caveats, and help you can get from it. There is also an example project on the Celery GitHub page that you can use as a template for new projects or as a guide to add Celery to your existing project.

Summary

We've seen how to install and configure Celery to run alongside a new or existing Django project. We explored some of the broker options we have, and how simple it is to change between them. There are some hints about brokers that don't offer all of the features Celery has. We have seen an example of a mailer task, and how it was created and called from the Django application. Finally, I provided instructions to start the worker to get things done.

References

[1] - Django project documentation
[2] - Celery project documentation
[3] - Redis project page
[4] - RabbitMQ project page
[5] - Amazon SQS page

About the author

Jean Jung is a Brazilian developer passionate about technology. He is currently a system analyst at EBANX, an international payment processing company for Latin America.
He's very interested in Python and artificial intelligence, specifically machine learning, compilers, and operating systems. As a hobby, he's always looking for IoT projects with Arduino.

Packt
13 Apr 2016
18 min read

Cluster Computing Using Scala

In this article, Vytautas Jančauskas, the author of the book Scientific Computing with Scala, explains how to write software to be run on distributed computing clusters. We will learn the MPJ Express library here.

Very often when dealing with intense data processing tasks and simulations of physical phenomena, there comes a time when no matter how many CPU cores and how much memory your workstation has, it is not enough. At times like these, you will want to turn to supercomputing clusters for help. These distributed computing environments consist of many nodes (each node being a separate computer) connected into a computer network using specialized high-bandwidth and low-latency connections (or, if you are on a budget, standard Ethernet hardware is often enough). These computers usually utilize a network filesystem allowing each node to see the same files. They communicate using messaging libraries, such as MPI. Your program will run on separate computers and utilize the message passing framework to exchange data via the computer network.

Using MPJ Express for distributed computing

MPJ Express is a message passing library for distributed computing. It works with programming languages that run on the Java Virtual Machine (JVM), so we can use it from Scala. It is similar in functionality and programming interface to MPI. If you know MPI, you will be able to use MPJ Express in pretty much the same way. The differences specific to Scala are explained in this section. We will start with how to install it. For further reference, visit the MPJ Express website given here: http://mpj-express.org/

Setting up and running MPJ Express

The steps to set up and run MPJ Express are as follows:

First, download MPJ Express from the following link. The version at the time of this writing is 0.44. http://mpj-express.org/download.php

Unpack the archive and refer to the included README file for installation instructions.
Currently, you have to set MPJ_HOME to the folder you unpacked the archive to and add the bin folder in that archive to your path. For example, if you are a Linux user using bash as your shell, you can add the following two lines to your .bashrc file (the file is in your home directory at /home/yourusername/.bashrc):

    export MPJ_HOME=/home/yourusername/mpj
    export PATH=$MPJ_HOME/bin:$PATH

Here, mpj is the folder into which you extracted the archive downloaded from the MPJ Express website. If you are using a different system, you will have to do the equivalent of the above for your system to use MPJ Express.

We will want to use MPJ Express with Scala Build Tool (SBT), which we used previously to build and run all of our programs. Create the following directory structure:

    scalacluster/
      lib/
      project/
        plugins.sbt
      build.sbt

I have chosen to name the project folder scalacluster here, but you can call it whatever you want. Copy the contents of the lib folder from the mpj directory to the lib folder of the project. The .jar files in the lib folder will then be accessible to your program. Finally, create empty build.sbt and plugins.sbt files.

Let's now write and run a simple "Hello, World!" program to test our setup:

    import mpi._

    object MPJTest {
      def main(args: Array[String]): Unit = {
        MPI.Init(args)
        val me: Int = MPI.COMM_WORLD.Rank()
        val size: Int = MPI.COMM_WORLD.Size()
        println("Hello, World, I'm <" + me + ">")
        MPI.Finalize()
      }
    }

This should be familiar to everyone who has ever used MPI. First, we import everything from the mpi package. Then, we initialize MPJ Express by calling MPI.Init; the arguments to MPJ Express are taken from the command-line arguments you enter when running the program. The MPI.COMM_WORLD.Rank() function returns the rank of the MPJ process. A rank is a unique identifier used to distinguish processes from one another. Ranks are used when you want different processes to do different things.
A common pattern is to use the process with rank 0 as the master process and the processes with other ranks as workers. Then, you can use the process's rank to decide what action to take in the program. We also determine how many MPJ processes were launched by checking MPI.COMM_WORLD.Size(). For now, our program simply prints each process's rank.

Next, we want to run it. If you don't have a distributed computing cluster readily available, don't worry. You can test your programs locally on your desktop or laptop. The same program will work without changes on clusters as well. To run programs written using MPJ Express, you have to use the mpjrun.sh script. This script will be available to you if you have added the bin folder of the MPJ Express archive to your PATH as described in the section on installing MPJ Express. The mpjrun.sh script sets up the environment for your MPJ Express processes and starts them.

The mpjrun.sh script takes a .jar file, so we need to create one. Unfortunately for us, this cannot easily be done using the sbt package command in the directory containing our program. This worked previously because we used the Scala runtime to execute our programs. MPJ Express uses Java. The problem is that the .jar package created with sbt package does not include Scala's standard library. We need what is called a fat .jar, one that contains all the dependencies within itself. One way of generating it is to use a plugin for SBT called sbt-assembly. The website for this plugin is given here: https://github.com/sbt/sbt-assembly

There is a simple way of adding the plugin for use in our project. Remember the project/plugins.sbt file we created? All you need to do is add the following line to it (the line may be different for different versions of the plugin, so consult the website):

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

Now, add the following to the build.sbt file you created:

    lazy val root = (project in file(".")).
      settings(
        name := "mpjtest",
        version := "1.0",
        scalaVersion := "2.11.7"
      )

Then, execute the sbt assembly command from the shell to build the .jar file. If you are using the preceding build.sbt file and the folder you put the program and build.sbt in is /home/you/cluster, the file will be written to the following path:

    /home/you/cluster/target/scala-2.11/mpjtest-assembly-1.0.jar

Now, you can run the mpjtest-assembly-1.0.jar file as follows:

    $ mpjrun.sh -np 4 -jar target/scala-2.11/mpjtest-assembly-1.0.jar
    MPJ Express (0.44) is started in the multicore configuration
    Hello, World, I'm <0>
    Hello, World, I'm <2>
    Hello, World, I'm <3>
    Hello, World, I'm <1>

The -np argument specifies how many processes to run. Since we specified -np 4, four processes will be started by the script. The order of the "Hello, World" messages can differ on your system since the precise order of execution of different processes is undetermined. If you got output similar to the one shown here, then congratulations, you have done the majority of the work needed to write and deploy applications using MPJ Express.

Using Send and Recv

MPJ Express processes can communicate using Send and Recv. These methods are arguably the simplest and easiest to understand mode of operation, and probably also the most error prone. We will look at these two first. The following are the signatures for the Send and Recv methods:

    public void Send(java.lang.Object buf, int offset, int count,
                     Datatype datatype, int dest, int tag) throws MPIException

    public Status Recv(java.lang.Object buf, int offset, int count,
                       Datatype datatype, int source, int tag) throws MPIException

Both of these calls are blocking. This means that after calling Send, your process can block (will not execute the instructions following it) until a corresponding Recv is called by another process. Likewise, Recv will block the process until a corresponding Send happens.
By corresponding, we mean that the dest argument of Send holds the receiver's rank and the source argument of Recv holds the sender's rank. The two calls are enough to implement many complicated communication patterns. However, they are prone to various problems such as deadlocks. They are also quite difficult to debug, since you have to make sure that each Send has the correct corresponding Recv and vice versa. The parameters for Send and Recv are basically the same. The meanings of those parameters are summarized in the following table:

| Argument    | Type             | Description |
|-------------|------------------|-------------|
| buf         | java.lang.Object | It has to be a one-dimensional Java array. When using it from Scala, use a Scala array, which maps one-to-one to a Java array. |
| offset      | int              | The position in the array at which the data you want to pass starts. |
| count       | int              | The number of items of the array you want to pass. |
| datatype    | Datatype         | The type of data in the array. Can be one of the following: MPI.BYTE, MPI.CHAR, MPI.SHORT, MPI.BOOLEAN, MPI.INT, MPI.LONG, MPI.FLOAT, MPI.DOUBLE, MPI.OBJECT, MPI.LB, MPI.UB, and MPI.PACKED. |
| dest/source | int              | Either the destination to send the message to or the source to get the message from. You use the rank of the process to identify sources and destinations. |
| tag         | int              | Used to tag the message. Can be used to introduce different message types. Can be ignored for most common applications. |

Let's look at a simple program using these calls for communication. We will implement a simple master/worker communication pattern:

    import mpi._
    import scala.util.Random

    object MPJTest {
      def main(args: Array[String]): Unit = {
        MPI.Init(args)
        val me: Int = MPI.COMM_WORLD.Rank()
        val size: Int = MPI.COMM_WORLD.Size()
        if (me == 0) {

Here, we use an if statement to identify who we are based on our rank. Since each process gets a unique rank, this allows us to determine what action should be taken.
In our case, we assigned the role of the master to the process with rank 0 and the role of a worker to processes with other ranks:

          for (i <- 1 until size) {
            val buf = Array(Random.nextInt(100))
            MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, i, 0)
            println("MASTER: Dear <" + i + "> please do work on " + buf(0))
          }

We iterate over the workers, which have ranks from 1 up to one less than the number of processes you passed to the mpjrun.sh script. Let's say that number is four. This gives us one master process and three worker processes. So, each process with a rank from 1 to 3 will get a randomly generated number. We have to put that number in an array even though it is a single number. This is because both the Send and Recv methods expect an array as their first argument. We then use the Send method to send the data. We specified the array as argument buf, an offset of 0, a count of 1, type MPI.INT, the for loop index as the destination, and 0 as the tag. This means that each of our three worker processes will receive a (most probably) different number:

          for (i <- 1 until size) {
            val buf = Array(0)
            MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, i, 0)
            println("MASTER: Dear <" + i + "> thanks for the reply, which was " + buf(0))
          }

Finally, we collect the results from the workers. For this, we iterate over the worker ranks and use the Recv method on each one of them. We print the result we got from the worker, and this concludes the master's part. We now move on to the workers:

        } else {
          val buf = Array(0)
          MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, 0, 0)
          println("<" + me + ">: " + "Understood, doing work on " + buf(0))
          buf(0) = buf(0) * buf(0)
          MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, 0, 0)
          println("<" + me + ">: " + "Reporting back")
        }

The workers' code is identical for all of them.
They receive a message from the master, calculate the square of it, and send it back:

        MPI.Finalize()
      }
    }

After you run the program, the results should be akin to the following, which I got when running this program on my system:

    MASTER: Dear <1> please do work on 71
    MASTER: Dear <2> please do work on 12
    MASTER: Dear <3> please do work on 55
    <1>: Understood, doing work on 71
    <1>: Reporting back
    MASTER: Dear <1> thanks for the reply, which was 5041
    <3>: Understood, doing work on 55
    <2>: Understood, doing work on 12
    <2>: Reporting back
    MASTER: Dear <2> thanks for the reply, which was 144
    MASTER: Dear <3> thanks for the reply, which was 3025
    <3>: Reporting back

Sending Scala objects in MPJ Express messages

Sometimes, the types provided by MPJ Express for use in the Send and Recv methods are not enough. You may want to send your MPJ Express processes a Scala object. A very realistic example of this would be to send an instance of a Scala case class. Case classes can be used to construct more complicated data types consisting of several different basic types. A simple example is a two-dimensional vector consisting of x and y coordinates. This can be sent as a simple array, but more complicated classes can't. For example, you may want to use a case class such as the one shown here. It has two attributes of type String and one attribute of type Int:

    scala> case class Person(name: String, surname: String, age: Int)
    defined class Person

    scala> val a = Person("Name", "Surname", 25)
    a: Person = Person(Name,Surname,25)

So what do we do with a data type like this? The simplest answer is to serialize it. Serializing converts an object to a stream of characters or a string that can be sent over the network (or stored in a file, among other things) and later deserialized to get the original object back.

A simple way of serializing is to use a format such as XML or JSON. This can be done automatically using a pickling library.
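For comparison, the JVM's built-in serialization can do the same job without any extra dependency. The following sketch is not from the original article: it round-trips the Person case class through a byte array using java.io.ObjectOutputStream and java.io.ObjectInputStream, and the helper names toBytes and fromBytes are made up for illustration. A byte array produced this way is a one-dimensional array, so it could presumably be passed to Send with the MPI.BYTE datatype listed earlier.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Same shape as the Person case class in the text; case classes are Serializable.
case class Person(name: String, surname: String, age: Int)

object SerializationSketch {
  // Hypothetical helper: turn any serializable object into a byte array.
  def toBytes(obj: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.toByteArray
  }

  // Hypothetical helper: rebuild the object from bytes produced by toBytes.
  def fromBytes[T](data: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(data))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val a = Person("Name", "Surname", 25)
    val wire = toBytes(a)            // what a sender would put in the message
    val b = fromBytes[Person](wire)  // what a receiver would reconstruct
    println(b == a)                  // case class equality compares the fields
  }
}
```

The binary format is not human readable, which is one reason a text format such as JSON can be more convenient while debugging.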
Pickling is a term that comes from the Python programming language. It is the automatic conversion of an arbitrary object into a string representation that can later be converted back to get the original object. The reconstructed object will behave the same way as it did before conversion. This allows one to store arbitrary objects in files, for example. There is a pickling library available for Scala as well. You can of course do serialization in several different ways (for example, using the powerful support for XML available in Scala). We will use the pickling library that is available from the following website for this example: https://github.com/scala/pickling

You can install it by adding the following line to your build.sbt file:

    libraryDependencies += "org.scala-lang.modules" %% "scala-pickling" % "0.10.1"

After doing that, use the following import statements to enable easy pickling in your projects:

    scala> import scala.pickling.Defaults._
    import scala.pickling.Defaults._

    scala> import scala.pickling.json._
    import scala.pickling.json._

Here, you can see how you can then easily use this library to pickle and unpickle arbitrary objects without the use of annoying boilerplate code:

    scala> val pklA = a.pickle
    pklA: pickling.json.pickleFormat.PickleType = JSONPickle({
      "$type": "Person",
      "name": "Name",
      "surname": "Surname",
      "age": 25
    })

    scala> val unpklA = pklA.unpickle[Person]
    unpklA: Person = Person(Name,Surname,25)

Let's see how this would work in an application using MPJ Express for message passing.
A program using pickling to send a case class instance in a message is given here:

    import mpi._
    import scala.pickling.Defaults._
    import scala.pickling.json._

    case class ArbitraryObject(a: Array[Double], b: Array[Int], c: String)

Here, we have chosen to define a fairly complex case class, consisting of two arrays of different types and a string:

    object MPJTest {
      def main(args: Array[String]): Unit = {
        MPI.Init(args)
        val me: Int = MPI.COMM_WORLD.Rank()
        val size: Int = MPI.COMM_WORLD.Size()
        if (me == 0) {
          val obj = ArbitraryObject(Array(1.0, 2.0, 3.0), Array(1, 2, 3), "Hello")
          val pkl = obj.pickle.value.toCharArray
          MPI.COMM_WORLD.Send(pkl, 0, pkl.size, MPI.CHAR, 1, 0)

In the preceding bit of code, we create an instance of our case class. We then pickle it to JSON and get the string representation of said JSON with the value method. However, to send it in an MPJ message, we need to convert it to a one-dimensional array of one of the supported types. Since it is a string, we convert it to a char array. This is done using the toCharArray method:

        } else if (me == 1) {
          val buf = new Array[Char](1000)
          MPI.COMM_WORLD.Recv(buf, 0, 1000, MPI.CHAR, 0, 0)
          val msg = buf.mkString
          val obj = msg.unpickle[ArbitraryObject]

On the receiving end, we get the raw char array, convert it back to a string using the mkString method, and then unpickle it using unpickle[T]. This returns an instance of the case class that we can use like any other instance of a case class. Functionally, it is the same object that was sent to us:

          println(msg)
          println(obj.c)
        }
        MPI.Finalize()
      }
    }

The following is the result of running the preceding program. It prints out the JSON representation of our object, and also shows that we can access the attributes of said object by printing the c attribute.
    MPJ Express (0.44) is started in the multicore configuration
    {
      "$type": "ArbitraryObject",
      "a": [
        1.0,
        2.0,
        3.0
      ],
      "b": [
        1,
        2,
        3
      ],
      "c": "Hello"
    }
    Hello

You can use this method to send arbitrary objects in an MPJ Express message. However, this is just one of many ways of doing this. As mentioned previously, another way is to use an XML representation. XML support is strong in Scala, and you can use it to serialize objects as well. This will usually require you to add some boilerplate code to your program to serialize to XML. The method discussed earlier has the advantage of requiring no boilerplate code.

Non-blocking communication

So far, we have examined only blocking (or synchronous) communication between two processes. This means that the process is blocked (its execution is halted) until the Send or Recv method has completed successfully. This is simple to understand and enough for most cases. The problem with synchronous communication is that you have to be very careful, otherwise deadlocks may occur. Deadlocks are situations in which processes wait for each other to release a resource first. The Mexican standoff and the dining philosophers problem are well-known illustrations of deadlock. The point is that if you are unlucky, you may end up with a program that is seemingly stuck and you don't know why. Using non-blocking communication allows you to avoid these problems most of the time. If you think you may be at risk of deadlocks, you will probably want to use it. The signatures for the primary methods used in asynchronous communication are given here:

    Request Isend(java.lang.Object buf, int offset, int count,
                  Datatype datatype, int dest, int tag)

Isend works similarly to its Send counterpart. The main differences are that it does not block (the program continues execution after the call rather than waiting for a corresponding receive), and that it returns a Request object.
This object is used to check the status of your send request, block until it is complete if required, and so on:

    Request Irecv(java.lang.Object buf, int offset, int count,
                  Datatype datatype, int src, int tag)

Irecv is again the same as Recv, only non-blocking, and it returns a Request object used to handle your receive request. The operation of these methods can be seen in action in the following example:

    import mpi._

    object MPJTest {
      def main(args: Array[String]): Unit = {
        MPI.Init(args)
        val me: Int = MPI.COMM_WORLD.Rank()
        val size: Int = MPI.COMM_WORLD.Size()
        if (me == 0) {
          val requests = for (i <- 0 until 10) yield {
            val buf = Array(i * i)
            MPI.COMM_WORLD.Isend(buf, 0, 1, MPI.INT, 1, 0)
          }
          // Make sure every pending send has completed before finalizing.
          requests.foreach(_.Wait())
        } else if (me == 1) {
          for (i <- 0 until 10) {
            Thread.sleep(1000)
            val buf = Array[Int](0)
            val request = MPI.COMM_WORLD.Irecv(buf, 0, 1, MPI.INT, 0, 0)
            request.Wait()
            println("RECEIVED: " + buf(0))
          }
        }
        MPI.Finalize()
      }
    }

This is a very simplistic example used simply to demonstrate the basics of using the asynchronous message passing methods. First, the process with rank 0 sends 10 messages to the process with rank 1 using Isend. Since Isend does not block, the loop finishes quickly and the messages it sent are buffered until they are retrieved using Irecv. The sender keeps the returned Request objects and waits on them so that no send is still pending when MPI.Finalize() is called. The second process (the one with rank 1) waits for one second before retrieving each message. This is to demonstrate the asynchronous nature of these methods. The messages are in the buffer waiting to be retrieved. Therefore, Irecv can be used at your leisure when convenient. The Wait() method of the Request object that Irecv returns has to be used to retrieve results. The Wait() method blocks until the message is successfully received from the buffer.

Summary

Extremely computationally intensive programs are usually parallelized and run on supercomputing clusters. These clusters consist of multiple networked computers. Communication between these computers is usually done using messaging libraries such as MPI.
These allow you to pass data between processes running on different machines in an efficient manner. In this article, you have learned how to use MPJ Express, an MPI-like library for the JVM. We saw how to carry out process-to-process communication using both blocking and non-blocking calls. The most important MPJ Express primitives for this were covered, and example programs using them were given.
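One detail of the pickling receiver shown earlier is worth a small sketch. The receiver allocates a 1000-char buffer and calls mkString on all of it, so the resulting string also contains the unused trailing '\u0000' characters. Assuming you know how many chars were actually received (in MPI-style APIs the Status object returned by Recv is the usual place to get that count, but treat this as an assumption to check against your MPJ Express version), you can build the string from just that prefix. The decode helper below is hypothetical and uses no MPJ Express calls; System.arraycopy merely stands in for what Recv would write into the buffer.

```scala
object BufferTrimSketch {
  // Hypothetical helper: decode only the first `count` chars of a receive buffer.
  def decode(buf: Array[Char], count: Int): String = new String(buf, 0, count)

  def main(args: Array[String]): Unit = {
    val payload = "{ \"c\": \"Hello\" }".toCharArray
    val buf = new Array[Char](1000)                        // oversized buffer, filled with '\u0000'
    System.arraycopy(payload, 0, buf, 0, payload.length)   // stands in for Recv filling the buffer
    println(buf.mkString.length)                           // length of the whole buffer, padding included
    println(decode(buf, payload.length))                   // only the payload
  }
}
```

Sending the length as a separate small message first, or sizing the buffer from it, is another option; which one fits best depends on your protocol.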
Packt
09 May 2017
23 min read

Active Directory Domain Services 2016

In this article, Dishan Francis, the author of the book Mastering Active Directory, looks at AD DS 2016 features, privileged access management, and time-based group memberships.

Microsoft released Active Directory Domain Services 2016 at a very interesting time in technology. Today, identity infrastructure requirements for enterprises are challenging: most companies use cloud services for their operations (Software as a Service, SaaS), and many have moved infrastructure workloads to public clouds.

AD DS 2016 features

Active Directory Domain Services (AD DS) improvements are bound to its forest and domain functional levels. Upgrading the operating system, or adding domain controllers that run Windows Server 2016 to an existing AD infrastructure, does not upgrade the forest and domain functional levels. In order to use or test the new AD DS 2016 features, you need to have the forest and domain functional levels set to Windows Server 2016. The highest forest and domain functional levels you can run in your identity infrastructure depend on the lowest domain controller version running. For example, if you have a Windows Server 2008 domain controller in your infrastructure, then even if you add a Windows Server 2016 domain controller, the domain and forest functional levels have to stay at Windows Server 2008 until the last Windows Server 2008 domain controller is demoted from the infrastructure.

Privileged access management

Privileged access management (PAM) has been one of the most discussed topics around identity management in presentations, tech shows, IT forums, IT groups, blogs, and meetings for the last few years (since 2014). It became a trending topic especially after the Windows Server 2016 previews were released. Over the last year, I travelled to many countries and cities and was involved in many presentations and discussions about PAM.

First of all, this is not a feature that you can enable with a few clicks.
It is a combination of many technologies and methodologies that come together to form a workflow or, in other words, a way of living for administrators. AD DS 2016 includes features and capabilities that support PAM in infrastructure, but it is not the only thing. This is one of the greatest challenges I see with this new way of thinking and new way of working. Replacing a product is easy, but changing a process is more complicated and challenging.

I started my career with one of the largest North American hosting companies around 2003. I was a system administrator at that time, and one of my tasks was to identify hacking attempts and prevent workloads from getting compromised. In order to do that, I had to review a lot of logs on different systems. Around that time, most attacks by individuals or groups were aimed at putting their names on websites to prove that they could hack them. The average number of hacking attempts per server was around 20 to 50 per day. Some colocation customers were even running their websites and workloads without any protection (even though this was not recommended). But as time went by, the number of attempts increased dramatically year by year, and we started to talk about hundreds of thousands of attempts per day. The following graph is taken from the latest Symantec Internet Security Threat Report (2016), and it confirms that the number of web-based attacks increased by more than 117% from 2014.

[Figure: Web attacks blocked per month. Source: Symantec Internet Security Threat Report (2016)]

It is not only the numbers that changed; the purpose of the attacks changed too. As I said, in the earlier days it was script kiddies who were after fame. Later, as users started to use more and more online services, the purpose of attacks shifted to financial gain. Attackers started to focus on websites that store credit card information. In the last 10 years, I have had to change my credit card 4 times because my credit card information was exposed along with the websites I had used it with.
These types of attacks still happen in the industry.

When considering the types of threats after 2012, most things changed. Instead of fame or money, attackers started to target identities. In earlier days, the data about a person was held in different formats. For example, when I walked into my medical center 15 years ago, before I saw the doctor, the administration staff had to go and find the file with my name on it. They had a number of racks filled with files and papers, which included patient records, treatment history, test reports, and so on. But now things have changed: when I walk in, no one in administration needs to worry about the file. The doctor can see all my records on his computer screen with a few clicks. So, the data is being transformed into a digital format. More and more data about people is moving into digital formats. In that health system, I become an identity, and my identity is attached to the data and also to certain privileges.

Think about your bank's online banking system. You have your own username and password to type in when you log in to the portal. So, you have your own identity in the bank's system. Once you log in, you can access all your accounts, transfer money, and make payments. The bank has granted some privileges to your identity. With your privileges, you cannot look into your neighbor's bank account. But your bank manager can view your account and your neighbor's account too. That means the privileges attached to the bank manager's identity are different. The amount of data that can be retrieved from a system depends on the privileges of the identity. Not only that, some of these identities are integrated with different systems. Industries use different systems for their operations. It can be an email system, a CMS, or a billing system. Each of these systems holds data.
To make operations smooth, these systems are integrated with one identity infrastructure that provides a single sign-on experience instead of using a different identity for each and every application. This makes identities more and more powerful within any system. For an attacker, what is worth more: focusing on one system, or targeting an identity that is attached to data and privileges on many different systems? Which one can do more damage? If the targeted identity has more privileged access to the systems, it is a total disaster.

Is it all about usernames, passwords, or admin accounts? No, it's not; identities can do more damage than that. Usernames and passwords just make it easy. Just think about the recent world-famous cyber-attacks. Back in July 2015, a group called The Impact Team threatened to expose user account information from the Ashley Madison dating site if its parent company, Avid Life Media, didn't shut down the Ashley Madison and Established Men websites completely. In the Ashley Madison hack, was it the financial value that made it so dangerous? No, it was the identities that damaged people's lives. Simply exposing the names was enough to humiliate someone. It ruined families, and children lost their parents' love and care. It proves that it's not only about the permissions attached to an identity; individual identities themselves matter more in the modern big data phenomenon.

It has only been a few months since the US presidential election, and by now we can see how much news can be made with a single tweet. No special privileges were needed to send a tweet; it was the identity that made that tweet important. On the other hand, if that Twitter account got hacked and someone tweeted something fake on behalf of the actual person who owns it, what kind of damage could it do to the whole world? In order to do that, would anyone need to hack Jack Dorsey's account?
The value of an individual identity can be more powerful than that of the Twitter CEO. According to the latest reports, the majority of the information exposed by identity attacks consists of people's names, addresses, medical reports, and government identity numbers.

[Figure. Source: Symantec Internet Security Threat Report (2016)]

Attacks targeting identities are rising day by day. The following graph shows the number of identities exposed, compared to the number of incidents.

[Figure. Source: Symantec Internet Security Threat Report (2016)]

In December 2015, there were only 11 incidents, yet 195 million identities were exposed. It shows how much damage these types of attacks can do.

Each and every time this kind of attack happens, the most common answers from engineers are "Those attacks were so sophisticated", "It was too complex to identify", "They were so clever", and "It was a zero-day attack". Is that really true?

Zero-day attacks exploit system bugs and errors that are unknown to vendors. The latest reports show that the average time of exposure is less than 7 days and that it takes 1 day to release a patch.

[Figure. Source: Symantec Internet Security Threat Report (2016)]

The Microsoft Security Intelligence Report Volume 21 (January through June, 2016) contains the following figure, which explains the complexity of the vulnerabilities. It clearly shows that the majority of the vulnerabilities are of low complexity to exploit. High-complexity vulnerabilities are still less than 5% of total vulnerability disclosures. It shows that attackers are still after the low-hanging fruit.

[Figure. Source: Microsoft Security Intelligence Report Volume 21, January through June, 2016]

Microsoft Active Directory is the leading identity infrastructure solution. With all this constant news about identity breaches, the Microsoft Active Directory name also appears. Then people start to question why Microsoft can't fix it. But if you analyse these problems, it's obvious that just providing a technology-rich product is not enough to solve these issues.
With each and every new server operating system version, Microsoft releases a new Active Directory version. Every time, it contains new features to improve identity infrastructure security. But when I work on Active Directory-related projects, I see that a majority of engineers do not even follow the security best practices defined for an Active Directory version that is 10 years old.

Think about a car race. Its categories are usually based on engine power. It can be 1800cc, 2000cc, or more. In the race, most of the time it's the same models from the same manufacturer. If it's the same manufacturer and the same engine capacity, how can one win and the other lose? It's the car tuning and the driving skills that decide the winner and the loser. If Active Directory Domain Services 2016 could fix all the identity threats, that would be really good, but providing a product or technology doesn't seem to have worked so far. That's why we need to change the way we think about identity infrastructure security. We should not forget that we are fighting against human adversaries. The tactics, methods, and approaches they use change every day. The products we use do not receive such frequent updates, but we can limit attackers' ability to execute an attack on the infrastructure by understanding the fundamentals and using products, technologies, and workflows to prevent it.

Before we move on to identity theft prevention mechanisms, let's look at a typical identity infrastructure attack. The Microsoft tiered administration model is based on three tiers. All these identity attacks start by gaining some kind of access to the identity infrastructure and then moving laterally until the attackers have the keys to the kingdom, which are the domain admin or enterprise administrator credentials. Then they have full ownership of the entire identity infrastructure. As the preceding diagram shows, the first step in an identity attack is to get some kind of access to the system. Attackers do not target a domain admin or enterprise admin account first.
Getting access to a typical user account is much easier than getting access to a domain admin account. All they need is some kind of beachhead. For this, the most common attack technique is still to send out a phishing email. It's typical that someone will still fall for it and click on it. Now they have some sort of access to your identity infrastructure, and the next step is to start moving laterally to gain more privileges.

How many of you have completely eliminated local administrator accounts in your infrastructure? I'm sure the answer will be almost none. Users frequently ask for software installations and system-level modifications on their systems, and most of the time engineers end up assigning local administrator privileges. If the compromised account happens to be a local administrator, it becomes extremely easy to move to the next level. If not, the attackers will make systems misbehave. Then who will come to the rescue? It's the super-powered IT help desk people. In lots of organizations, IT help desk engineers are domain administrators, or at least local administrators on the systems. So, once they receive the call about a misbehaving computer, they connect over RDP or log in locally using the privileged account. When you log in interactively or over a standard RDP session, your credentials end up on that machine. If the attacker is running any password harvesting tool, it's extremely easy to capture the credentials. You may wonder how a compromised typical user account can execute such programs. But Windows operating systems do not prevent users from running applications in their own user context. Windows will not allow such a user to change system-level settings, but it will still allow them to run scripts or user-level executables.

Once attackers gain access to some identity in the organization, the next level of privileges to own is Tier 1. This is where the application administrator, data administrator, and SaaS application administrator accounts live.
In today's infrastructures, we have too many administrators. Primarily, we have domain admins and enterprise administrators, and then we have local administrators. Different applications running on the infrastructure have their own administrators, such as Exchange administrators, SQL administrators, and SharePoint administrators. Other third-party applications, such as a CMS or a billing portal, may have their own administrators. If you are using cloud services or SaaS applications, they bring another set of administrators. Are we really aware of the activities happening on these accounts? Mostly, engineers worry only about protecting domain admin accounts and forget about the other kinds of administrators in the infrastructure. Some of these administrator roles can do more damage to a business than a domain admin. These applications and services decentralize management in the organization.

In order to move laterally with privileges, attackers only need to log into a machine or server where these administrators have logged in. The Local Security Authority Subsystem Service (LSASS) stores credentials in its memory for active Windows sessions. This saves users from having to enter credentials for each and every service they access. It also stores Kerberos tickets. This allows attackers to perform a pass-the-hash attack and retrieve locally stored credentials. Decentralized management of admin accounts makes this process easier. There are features and security best practices that can be used to prevent pass-the-hash attacks in an identity infrastructure.

Another problem with these types of accounts is that once they become service admin accounts, they eventually become domain admin or enterprise admin accounts. I have seen engineers create service accounts and, when they can't figure out the exact permissions required for the program, add them to the Domain Admins group as an easy fix.
It's not only an infrastructure attack that can expose such credentials. Service admins are attached to the application too, so a compromise of the application can also expose the identities. In such a scenario, it is easier for attackers to gain the keys to the kingdom.

Tier 0 is where the domain admins and enterprise admins operate. This is the ultimate goal of an identity infrastructure attack. Once attackers obtain access to Tier 0, they own your entire identity infrastructure. The latest reports show that once there is an initial breach, it takes less than 48 hours to gain Tier 0 privileges. According to the same reports, it then takes at least 7 to 8 months to identify the breach, because once attackers have the highest privileges they can create backdoors, clean up logs, and hide forever if needed. The systems we use always treat administrators as trustworthy people. That is no longer a valid assumption in the modern world. How many times do you check system logs to see what your domain admins are doing? Even though engineers look at the logs for other users, the majority rarely check the domain admin accounts. The same applies to internal security breaches too. As I said, most people are good, but you never know. Most of the world-famous identity attacks have already proved that.

When I have discussions with engineers and customers about identity infrastructure security, the following are the common comments I hear:

  • "We have too many administrator accounts"
  • "We do not know how many administrator accounts we have"
  • "We have fast-changing IT teams, so it's hard to manage permissions"
  • "We do not have visibility over administrator account activities"
  • "If there is an identity infrastructure breach or attempt, how do we identify it?"

The answer to all of these is PAM. As I said in the beginning, this is not one product. It's a workflow and a new way of working. The main components of this process are as follows:

  • Apply pass-the-hash prevention features to the existing identity infrastructure.
  • Install Microsoft Advanced Threat Analytics to monitor domain controller traffic and identify potential identity infrastructure threats in real time.
  • Install and configure Microsoft Identity Manager 2016. This product manages privileged access to an existing Active Directory forest by providing task-based, time-limited privileged access.

What does it have to do with AD DS 2016?

AD DS 2016 now allows time-based group membership, which makes this whole process possible. Users are added to groups with a TTL value, and once it expires, the user is removed from the group automatically. For example, let's assume your CRM application has administrator rights assigned to the CRM Admin security group. The users in this group only log into the system once a month to do some maintenance. But the admin rights for the members of that group remain in place around the clock for the other 29 days. This gives attackers plenty of opportunity to try to gain access to the privileged accounts during that time. If the admin rights can be limited to the day they are needed, isn't that more useful? Then we know that for the majority of days in the month, the CRM application is not at risk of being compromised through an account in the CRM Admin group.

What is the logic behind PAM?

The PAM product is built on the Just-In-Time (JIT) administration concept. Back in 2014, Microsoft released a PowerShell toolkit that allows Just-Enough-Administration (JEA). Let's assume you are running a web server in your infrastructure. As part of its operation, every month you need to collect some logs to make a report. You have already set up a PowerShell script for it. Someone in your team needs to log into the system and run it. In order to do that, administrative privileges are required. Using JEA, it is possible to assign the user just the permissions required to run that particular program. In that way, the user doesn't need to be added to the Domain Admins group.
The user will not be allowed to run any other program with the assigned permissions, and the permissions will not apply to any other computer either. JIT administration is bound to time. Users have the required privileges only when they need them. Users do not hold privileged access rights all the time.

PAM operations can be divided into four major steps:

Source: https://docs.microsoft.com/en-gb/microsoft-identity-manager/pam/privileged-identity-management-for-active-directory-domain-services

Prepare: The first step is to identify the privileged access groups in your existing Active Directory forest and start to remove users from them. You may also need to make certain changes in your application infrastructure to support this setup. For example, if you assign privileged access to user accounts instead of security groups (in applications or services), that will need to change. The next step is to set up equivalent groups in the bastion forest without any members. When you set up MIM, it uses a bastion forest to manage privileged access in the existing Active Directory forest. This is a special forest, and it cannot be used for other infrastructure operations. This forest runs at a minimum of the Windows Server 2012 R2 Active Directory forest functional level. When an identity infrastructure is compromised and attackers gain access to Tier 0, they can hide their activities for months or years. How can we be sure our existing identity infrastructure is not already compromised? If we implement PAM in the same forest, it will not achieve its core targets. Also, domain upgrades are painful and need time and budget. But because of the bastion forest, this solution can be applied to your existing identity infrastructure with minimal changes.

Protect: The next step is to set up a workflow for authentication and authorization. Define how users can request privileged access when it is required. It can be via the MIM portal or an existing support portal (integrated with the MIM REST API).
It is possible to set up the system to use multi-factor authentication (MFA) during this request process to prevent any unauthorized activity. It is also important to define how the requests will be handled. It can be an automatic or a manual approval process.

Operate: Once a privileged access request is approved, the user account is added to the security group in the bastion forest. The group itself has a SID value. In both forests, the group has exactly the same SID value. Therefore, the application or service will not see a difference between the two groups in the two different forests. Once the permission is granted, it is only valid for the time defined by the authorization policy. Once it reaches the time limit, the user account is removed from the security group automatically.

Monitor: PAM provides visibility over privileged access requests. Each and every request and event is recorded, and it is possible to review them and also generate reports for audit purposes. This helps to fine-tune the process and also to identify potential threats.

Let's see how it really works. REBELADMIN CORP. uses a CRM system for its operations. The application has an administrator role, and the REBELADMIN/CRMAdmins security group is assigned to it. Any member of that group has administrator privileges in the application. Recently, PAM has been introduced at REBELADMIN CORP. As an engineer, I have identified REBELADMIN/CRMAdmins as a privileged group and am going to protect it using PAM. The first step is to remove the members of the REBELADMIN/CRMAdmins group. After that, I set up the same group in the bastion forest. Not only is the name the same, but both groups also have the same SID value, 1984.

User Dennis used to be a member of the REBELADMIN/CRMAdmins group and ran the monthly report. At the end of the month, he tried to run it and found that he does not have the required permissions. The next step for him is to request the required permission via the MIM portal.
According to the policies, as part of the request, the system wants Dennis to use MFA. Once Dennis verifies the PIN, the request is logged in the portal. As the administrator, I receive an alert about the request and log into the system to review it. It's a legitimate request, and I approve his access to the system for 8 hours. The system then automatically adds the user account for Dennis to the BASTION/CRMAdmins group. This group has the same SID value as the production group. Therefore, a member of the BASTION/CRMAdmins group is treated as an administrator by the CRM application. This group membership carries a TTL value too. Once 8 hours have passed since the approval, Dennis's account is automatically removed from the BASTION/CRMAdmins group. In this process, we didn't add any member to the production security group, REBELADMIN/CRMAdmins. So, the production forest stays untouched and protected.

The most important thing to understand here is that the legacy approach to identity protection is no longer valid. We are up against human adversaries. Identity is our new perimeter in the infrastructure, and to protect it we need to understand how adversaries attack it and stay a step ahead. The new PAM with AD DS 2016 is a new approach in the right direction.

Time based group memberships

Time-based group membership is part of that broader topic. It allows administrators to assign temporary group membership, which is expressed by a Time-To-Live (TTL) value. This value is added to the Kerberos ticket. This is also called the Expiring Links feature. When a user is assigned a temporary group membership, the lifetime of the Kerberos ticket-granting ticket (TGT) issued at login will be equal to the lowest TTL value the user has. For example, let's assume you granted user A a temporary membership of the Domain Admins group. It is only valid for 60 minutes. But the user logged in 50 minutes after the original assignment and has only 10 minutes left as a member of the Domain Admins group.
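The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the rule described above, not how a domain controller is implemented: the function names are hypothetical, and the 10-hour default TGT lifetime is an assumption standing in for whatever the domain's Kerberos policy defines.

```python
from datetime import datetime, timedelta

# Assumption: 10 hours stands in for the normal TGT lifetime; the real
# value comes from the domain's Kerberos policy.
DEFAULT_TGT_LIFETIME = timedelta(hours=10)


def remaining_ttl(granted_at, ttl, now):
    """Time left on a temporary group membership (never negative)."""
    left = granted_at + ttl - now
    return max(left, timedelta(0))


def tgt_lifetime(memberships, now, default=DEFAULT_TGT_LIFETIME):
    """Lifetime of a TGT issued at `now`: the lowest remaining TTL among
    the still-active temporary memberships, capped by the default."""
    active = [remaining_ttl(granted, ttl, now) for granted, ttl in memberships]
    active = [left for left in active if left > timedelta(0)]
    return min(active + [default])


granted = datetime(2017, 1, 1, 9, 0)
memberships = [(granted, timedelta(minutes=60))]  # Domain Admins for 60 minutes
login = granted + timedelta(minutes=50)           # user A logs in 50 minutes later

print(tgt_lifetime(memberships, login))           # 0:10:00
```

If the user logs in after the membership has expired, the sketch falls back to the default lifetime, because no temporary membership is active any more.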
Based on that, the domain controller will issue a TGT that is valid for only 10 minutes for user A.

This feature is not enabled by default, because it requires the forest functional level to be Windows Server 2016. Also, once this feature is enabled, it cannot be disabled.

Let's see how it works in the real world. I have a Windows domain controller installed, and it is running at the Windows Server 2016 forest functional level. This can be verified using the following PowerShell command:

    Get-ADForest | fl Name,ForestMode

Then we need to enable the Expiring Links feature. It can be enabled using the following command:

    Enable-ADOptionalFeature 'Privileged Access Management Feature' -Scope ForestOrConfigurationSet -Target rebeladmin.com

Here, rebeladmin.com can be replaced with your own FQDN.

I have a user called Adam Curtiss to whom I need to assign Domain Admins group membership for 60 minutes. The following command lists the current members of the Domain Admins group:

    Get-ADGroupMember "Domain Admins"

The next step is to add the user Adam Curtiss to the Domain Admins group for 60 minutes:

    Add-ADGroupMember -Identity 'Domain Admins' -Members 'acurtiss' -MemberTimeToLive (New-TimeSpan -Minutes 60)

Once it has run, we can verify the TTL value remaining for the group membership using the following command:

    Get-ADGroup 'Domain Admins' -Property member -ShowMemberTimeToLive

Once I log in as the user and list the Kerberos tickets, the TGT shows a renewal time of less than 60 minutes, because I logged in a few minutes after the membership was granted. Once the TGT renewal comes, the user will no longer be a member of the Domain Admins group.

Summary

In this article, we looked at the new features and enhancements that come with AD DS 2016. One of the biggest improvements is Microsoft's new approach to PAM. This is not just a feature that can be enabled via AD DS; it is one part of a broader solution. It helps to protect identity infrastructures from adversaries, as traditional techniques and technologies are no longer valid against rising threats.

Resources for Article:

Further resources on this subject:

  • Deploying and Synchronizing Azure Active Directory [article]
  • How to Recover from an Active Directory Failure [article]
  • Active Directory migration [article]

Packt
29 Jul 2011
3 min read

Integrating phpList 2 with WordPress

Prerequisites for this WordPress tutorial

For this tutorial, we'll make the following assumptions:

  • We already have a working instance of WordPress (version 3.x)
  • Our phpList site is accessible through HTTP / HTTPS from our WordPress site

Installing and configuring the phpList Integration plugin

Download the latest version of Jesse Heap's phpList Integration plugin from http://wordpress.org/extend/plugins/phplist-form-integration/, unpack it, and upload the contents to your wp-content/plugins/ directory in WordPress. Activate the plugin from within your WordPress dashboard. Under the Settings menu, click on the new PHPlist link to configure the plugin.

General Settings

Under the General Settings heading, enter the URL to your phpList installation, as well as an admin username/password combination. Enter the ID and name of at least one list that you want to allow your WordPress users to subscribe to.

Why does the plugin require my admin login and password?

The admin login and password are used to bypass the confirmation e-mail that would normally be sent to a subscriber. Effectively, the plugin "logs into" phpList as the administrator and then subscribes the user, bypassing confirmation. If you don't want to bypass confirmation e-mails, then you don't need to enter your username and password.

Form Settings

The plugin will work with this section unmodified. However, let's imagine that we also want to capture the subscriber's name. We already have an attribute in phpList called first name, so change the first field label to First Name and the Text Field ID to first name (the same as our phpList attribute name).

Adding a phpList Integration page

The plugin will replace the HTML comment <!--phplist form--> with the generated phpList form. Let's say we wanted our phpList form to show up at http://ourblog.com/signup.
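The placeholder mechanism described above amounts to a string substitution. The following Python sketch only illustrates the idea; it is not the plugin's code (the plugin runs inside WordPress), and the function name and the form markup are hypothetical.

```python
PLACEHOLDER = "<!--phplist form-->"  # the HTML comment the plugin looks for


def render_page(page_html, form_html):
    """Swap the placeholder comment for the form markup.

    Pages that do not contain the comment are returned unchanged.
    """
    return page_html.replace(PLACEHOLDER, form_html)


page = "<h1>Signup</h1>\n<!--phplist form-->\n<p>Thanks for subscribing!</p>"
form = '<form method="post"><input name="email"></form>'  # hypothetical markup

print(render_page(page, form))
```

Because the marker is an HTML comment, a page viewed without the plugin simply shows nothing at that position.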
Create a new WordPress page called Signup, add the content you want to be displayed, and then click on the HTML tab to edit the HTML source. You will see the HTML source of your page displayed. Insert the text "<!--phplist form-->" where you want the form to be displayed and save the page.

HTML comments

The "<!--some text-->" syntax designates an HTML comment, which is not displayed when the HTML is processed by the browser / viewer. This means that you won't see your comment when you view your page in Visual mode.

Once the page has been updated, click on the View page link to display the page in WordPress. The subscribe form will be inserted in the page at the location where you added the comment.

Adding a phpList Integration widget

Instead of a dedicated page to sign up new subscribers, you may want to use a sidebar widget, so that the subscription options can show up on multiple pages on your WordPress site. To add the phpList integration widget, go to your WordPress site's Appearance option and go to the Widgets page. Drag the PHPList Integration widget to your preferred widget location (these vary depending on your theme). You can change the Title of the widget before you click on Close to finish. Now that you've added the PHPList Integration widget to the widget area, your sign up form will be displayed on all WordPress pages that include that widget area.

Further resources on this subject:

  • Integrating phpList 2 with Drupal
  • phpList 2 E-mail Campaign Manager: Personalizing E-mail Body
  • Tcl: Handling Email
  • Email, Languages, and JFile with Joomla!

Packt
10 Feb 2015
36 min read

Hive in Hadoop

In this article, Garry Turkington and Gabriele Modena, the authors of the book Learning Hadoop 2, explain how MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. It does require a different mindset and some training and experience on the model of breaking processing analytics into a series of map and reduce steps. There are several products that are built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This article will explore the other most common abstraction implemented atop Hadoop: SQL.

In this article, we will cover the following topics:

  • What the use cases for SQL on Hadoop are and why it is so popular
  • HiveQL, the SQL dialect introduced by Apache Hive
  • Using HiveQL to perform SQL-like analysis of the Twitter dataset
  • How HiveQL can approximate common features of relational databases such as joins and views

Why SQL on Hadoop

So far we have seen how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL.

Back in 2008, Facebook released Hive, the first widely used implementation of SQL on Hadoop. Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user.
This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who's familiar with SQL can use Hive.

The result of combining these attributes is that HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored on HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of User Defined Functions, enabling the base SQL dialect to be customized with business-specific functionality.

Other SQL-on-Hadoop solutions

Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. There are others, but we will mostly discuss Hive and Impala as they have been the most successful.

While introducing the core features and capabilities of SQL on Hadoop however, we will give examples using Hive; even though Hive and Impala share many SQL features, they also have numerous differences. We don't want to constantly have to caveat each new feature with exactly how it is supported in Hive compared to Impala. We'll generally be looking at aspects of the feature set that are common to both, but if you use both products, it's important to read the latest release notes to understand the differences.

Prerequisites

Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this article. We'll create a modified version of a former Pig script as the main functionality for this. The script in this article assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS.
The full source code is at https://github.com/learninghadoop2/book-examples/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows:

    -- load JSON data
    tweets = load '$inputDir' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

    -- Tweets
    tweets_tsv = foreach tweets {
      generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str',
        (chararray)$0#'text' as text,
        (chararray)$0#'in_reply_to',
        (boolean)$0#'retweeted' as is_retweeted,
        (chararray)$0#'user'#'id_str' as user_id,
        (chararray)$0#'place'#'id' as place_id;
    }
    store tweets_tsv into '$outputDir/tweets' using PigStorage('\u0001');

    -- Places
    needed_fields = foreach tweets {
      generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str' as id_str,
        $0#'place' as place;
    }
    place_fields = foreach needed_fields {
      generate
        (chararray)place#'id' as place_id,
        (chararray)place#'country_code' as co,
        (chararray)place#'country' as country,
        (chararray)place#'name' as place_name,
        (chararray)place#'full_name' as place_full_name,
        (chararray)place#'place_type' as place_type;
    }
    -- keep only places with a country code, then de-duplicate
    filtered_places = filter place_fields by co != '';
    unique_places = distinct filtered_places;
    store unique_places into '$outputDir/places' using PigStorage('\u0001');

    -- Users
    users = foreach tweets {
      generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str' as id_str,
        $0#'user' as user;
    }
    user_fields = foreach users {
      generate
        (chararray)CustomFormatToISO(user#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)user#'id_str' as user_id,
        (chararray)user#'location' as user_location,
        (chararray)user#'name' as user_name,
        (chararray)user#'description' as user_description,
        (int)user#'followers_count' as followers_count,
        (int)user#'friends_count' as friends_count,
        (int)user#'favourites_count' as favourites_count,
        (chararray)user#'screen_name' as screen_name,
        (int)user#'listed_count' as listed_count;
    }
    unique_users = distinct user_fields;
    store unique_users into '$outputDir/users' using PigStorage('\u0001');

Run this script as follows:

    $ pig -f extract_for_hive.pig -param inputDir=<json input> -param outputDir=<output path>

The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to the Unicode value U+0001, also known as Ctrl+A. This is often used as a separator in Hive tables and will be particularly useful to us as our tweet data could contain tabs in other fields.

Overview of Hive

We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. In this example, and in the remainder of the article, we will assume that queries are typed into the shell that can be invoked by executing the hive command.

Recently a client called Beeline also became available and will likely be the preferred CLI client in the near future.

When importing any new data into Hive, there is generally a three-stage process:

  • Create the specification of the table into which the data is to be imported
  • Import the data into the created table
  • Execute HiveQL queries against the table

Most of the HiveQL statements are direct analogues to similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this article, but if you need a refresher, there are numerous good online learning resources.

Hive gives a structured query view of our data, and to enable that, we must first define the specification of the table's columns and import the data into the table before we can execute any queries.
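The choice of separator made in the Pig script above can be illustrated with a small Python sketch. The sample values are made up for illustration; the point is only that a field containing a tab survives a U+0001-delimited round trip, while the same record split on tabs gains a spurious column.

```python
SEP = "\u0001"  # Ctrl+A, the separator passed to PigStorage

# Hypothetical tweet record; the text field contains a tab character
fields = ["2014-06-01T10:00:00Z", "123", "text with a\ttab inside", "42"]

line = SEP.join(fields)

# Splitting on Ctrl+A recovers exactly the original four fields
print(len(line.split(SEP)))                 # 4

# The same record written and read with tabs is split into five pieces
print(len("\t".join(fields).split("\t")))   # 5
```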
A table specification is generated using a CREATE statement that specifies the table name, the name and types of its columns, and some metadata about how the table is stored:

    CREATE table tweets (
      created_at string,
      tweet_id string,
      text string,
      in_reply_to string,
      retweeted boolean,
      user_id string,
      place_id string
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\u0001'
    STORED AS TEXTFILE;

The statement creates a new table tweets defined by a list of names for columns in the dataset and their data type. We specify that fields are delimited by the Unicode U+0001 character and that the format used to store data is TEXTFILE.

Data can be imported from a location in HDFS, tweets/, into Hive using the LOAD DATA statement:

    LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets;

By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later.

Once data has been imported into Hive, we can run queries against it. For instance:

    SELECT COUNT(*) FROM tweets;

The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase.

If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer.
The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times.

The nature of Hive tables

Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point.

Both the CREATE TABLE and LOAD DATA statements do not truly create concrete table data as such; instead, they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table but actually residing on HDFS. Even though the HiveQL statements refer to a specific table structure, it is Hive's responsibility to generate code that correctly maps this to the actual on-disk format in which the data files are stored.

This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong.

Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises.

Hive architecture

Until version 2, Hadoop was primarily a batch system. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later.
Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster either to native MapReduce in Hadoop 1 or to the MapReduce Application Master running on YARN in Hadoop 2.

Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password.

HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2.

HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2.

Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively.

In the examples we saw before and in the remainder of this article, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with Beeline. For compatibility and maturity reasons, Beeline being relatively new, both tools are available on Cloudera and most other major distributions. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded version with the following command:

    $ beeline -u jdbc:hive2://

Data types

HiveQL supports many of the common data types provided by standard database systems.
These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. We can distinguish Hive data types into the following five broad categories:

  • Numeric: tinyint, smallint, int, bigint, float, double, and decimal
  • Date and time: timestamp and date
  • String: string, varchar, and char
  • Collections: array, map, struct, and uniontype
  • Misc: boolean, binary, and NULL

DDL statements

HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. SHOW [DATABASES, TABLES, VIEWS] displays the databases currently available within a data warehouse and which table and view metadata is present within the database currently in use:

    CREATE DATABASE twitter;
    SHOW databases;
    USE twitter;
    SHOW TABLES;

The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception.

Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports unicode characters in column names.
Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally.

The EXTERNAL keyword specifies that the table exists in resources outside Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The following code uses both the EXTERNAL keyword and the LOCATION clause:

    CREATE EXTERNAL TABLE tweets (
      created_at string,
      tweet_id string,
      text string,
      in_reply_to string,
      retweeted boolean,
      user_id string,
      place_id string
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\u0001'
    STORED AS TEXTFILE
    LOCATION '${input}/tweets';

This table will be created in the metastore, but the data will not be copied into the /user/hive/warehouse directory.

Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse.

The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we can create a view to isolate retweets from other messages, as follows:

    CREATE VIEW retweets
    COMMENT 'Tweets that have been retweeted'
    AS SELECT * FROM tweets WHERE retweeted = true;

Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views.

The DROP TABLE statement removes both the metadata and the data of a table that Hive manages. When dropping an EXTERNAL table, or a view with DROP VIEW, only metadata will be removed and the actual data files will not be affected.

Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment or to add and replace columns. When adding columns, it is important to remember that only metadata will be changed and not the dataset itself.
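The effect of a metadata-only change can be sketched outside Hive. The following Python snippet is an illustration only: it assumes a reader that splits each line on the \u0001 delimiter, matches fields to columns by position, and returns NULL (None) for missing trailing fields. The column names and the sample row are hypothetical.

```python
# Sketch: how a positional, delimited-text reader maps fields to columns.
# Assumption: rows are split on the \u0001 delimiter and matched to the
# schema by position, with missing trailing fields read as NULL (None).

DELIMITER = "\u0001"


def read_row(line, columns):
    """Pair each column name with the field found at the same position."""
    fields = line.split(DELIMITER)
    # Pad short rows so that columns without data come back as None.
    fields += [None] * (len(columns) - len(fields))
    return dict(zip(columns, fields))


# A row written when the table had three columns (hypothetical data).
old_line = DELIMITER.join(["42", "hello world", "7"])

old_schema = ["tweet_id", "text", "user_id"]
# Hypothetical ALTER TABLE that places a new column in the middle.
new_schema = ["tweet_id", "lang", "text", "user_id"]

before = read_row(old_line, old_schema)
after = read_row(old_line, new_schema)

print(before)  # values line up with their columns
print(after)   # 'lang' receives the old text, 'text' the old user_id
```

Appending new columns at the end of the table avoids this shift, because old rows then simply read the new columns as NULL.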
Because only the metadata changes, if we were to add a column in the middle of the table which didn't exist in older files, then while selecting from older data, we might get wrong values in the wrong columns. This is because we would be reading old files with the new format.

Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view.

File formats and storage

The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH.

Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause.

Hive currently uses the following FileFormat classes to read and write HDFS files:

TextInputFormat and HiveIgnoreKeyTextOutputFormat: These read/write data in plain text file format
SequenceFileInputFormat and SequenceFileOutputFormat: These read/write data in the Hadoop SequenceFile format

Additionally, the following SerDe classes can be used to serialize and deserialize data:

MetadataTypedColumnsetSerDe: This will read/write delimited records such as CSV or tab-separated records
ThriftSerDe and DynamicSerDe: These will read/write Thrift objects

JSON

As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe JSON SerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules.
We can use either module to load JSON tweets without any need for preprocessing and just define a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde.

As with any third-party module, we load the SerDe JARs into Hive with the following code:

    ADD JAR json-serde-1.3-jar-with-dependencies.jar;

Then, we issue the usual create statement, as follows:

    CREATE EXTERNAL TABLE tweets (
      contributors string,
      coordinates struct <
        coordinates: array <float>,
        type: string>,
      created_at string,
      entities struct <
        hashtags: array <struct <
          indices: array <tinyint>,
          text: string>>,
      …
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION 'tweets';

With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'.

Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary.

In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen name and description fields of a user object with the following code:

    SELECT user.screen_name, user.description FROM tweets_json LIMIT 10;

Avro

AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format.
Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose.

This dataset was created using Pig's AvroStorage class, which generated the following schema:

    {
      "type":"record",
      "name":"record",
      "fields": [
        {"name":"topic","type":["null","int"]},
        {"name":"source","type":["null","int"]},
        {"name":"rank","type":["null","float"]}
      ]
    }

The table structure is captured in an Avro record, which contains header information (a name and optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string.

For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and this is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use this schema in a Hive table definition.
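As an aside, the union convention can be checked with a few lines of Python. This is a minimal sketch that uses only the standard json module rather than an Avro library; the helper name describe_fields is hypothetical, and the schema text is the one generated by AvroStorage above.

```python
import json

# The schema generated by AvroStorage, reproduced as a JSON string.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "record",
  "fields": [
    {"name": "topic",  "type": ["null", "int"]},
    {"name": "source", "type": ["null", "int"]},
    {"name": "rank",   "type": ["null", "float"]}
  ]
}
"""


def describe_fields(schema):
    """Return {field name: (non-null types, nullable?)} for a record schema."""
    described = {}
    for field in schema["fields"]:
        field_type = field["type"]
        # A JSON array in the "type" position is an Avro union.
        branches = field_type if isinstance(field_type, list) else [field_type]
        nullable = "null" in branches
        value_types = [b for b in branches if b != "null"]
        described[field["name"]] = (value_types, nullable)
    return described


fields = describe_fields(json.loads(SCHEMA_JSON))
for name, (value_types, nullable) in fields.items():
    print(name, value_types, "nullable" if nullable else "required")
```

Each field resolves to a single value type plus a nullable flag, which is how the columns later appear in Hive as plain int and float.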
With this definition, we can now create a Hive table that uses this schema for its table specification, as follows:

    CREATE EXTERNAL TABLE tweets_pagerank
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    WITH SERDEPROPERTIES ('avro.schema.literal'='{
      "type":"record",
      "name":"record",
      "fields": [
        {"name":"topic","type":["null","int"]},
        {"name":"source","type":["null","int"]},
        {"name":"rank","type":["null","float"]}
      ]
    }')
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '${data}/ch5-pagerank';

Then, look at the following table definition from within Hive (note also that HCatalog):

    DESCRIBE tweets_pagerank;
    OK
    topic     int      from deserializer
    source    int      from deserializer
    rank      float    from deserializer

In the DDL, we told Hive that data is stored in Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro schema embedded in avro.schema.literal.

Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. Create the preceding schema in a file called pagerank.avsc (.avsc is the standard file extension for Avro schemas). Then place it on HDFS; we prefer to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property: WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc').

If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields.
Within Hive, on the Cloudera CDH5 VM:

    ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar;

We can also use this table like any other. For instance, we can query the data to select the user and topic pairs with a high PageRank:

    SELECT source, topic FROM tweets_pagerank WHERE rank >= 0.9;

Columnar stores

Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats.

If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in a SequenceFile, each full row and all its columns will be read from disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest.

Traditional relational databases also store data on a row basis; a type of database called columnar changed this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns would be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries.

Queries

Tables can be queried using the familiar SELECT … FROM statement. The WHERE clause allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records.
For instance, the following code returns the top 10 most prolific users in the dataset:

    SELECT user_id, COUNT(*) AS cnt
    FROM tweets
    GROUP BY user_id
    ORDER BY cnt DESC
    LIMIT 10;

The output is as follows:

    NULL          7091
    1332188053    4
    959468857     3
    1367752118    3
    362562944     3
    58646041      3
    2375296688    3
    1468188529    3
    37114209      3
    2385040940    3

We can improve the readability of the hive output by setting the following:

    SET hive.cli.print.header=true;

This will instruct hive, though not beeline, to print column names as part of the output. You can add the command to the .hiverc file, usually found in the root of the executing user's home directory, to have it apply to all hive CLI sessions.

HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into hive using external tables.

We first create a user table to store user data, as follows:

    CREATE EXTERNAL TABLE user (
      created_at string,
      user_id string,
      `location` string,
      name string,
      description string,
      followers_count bigint,
      friends_count bigint,
      favourites_count bigint,
      screen_name string,
      listed_count bigint
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\u0001'
    STORED AS TEXTFILE
    LOCATION '${input}/users';

We then create a place table to store location data, as follows:

    CREATE EXTERNAL TABLE place (
      place_id string,
      country_code string,
      country string,
      `name` string,
      full_name string,
      place_type string
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\u0001'
    STORED AS TEXTFILE
    LOCATION '${input}/places';

We can use the JOIN operator to display the names of the 10 most prolific users, as follows:

    SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt
    FROM tweets
    JOIN user ON user.user_id = tweets.user_id
    GROUP BY tweets.user_id, user.user_id, user.name
    ORDER BY cnt DESC
    LIMIT 10;

Only equality, outer, and left (semi) joins are supported in Hive.
Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid counting such duplicate entries, each user should appear only once on the user side of the join. We can rewrite the previous query as follows:

    SELECT tweets.user_id, u.name, COUNT(*) AS cnt
    FROM tweets
    JOIN (SELECT user_id, name FROM user GROUP BY user_id, name) u
      ON u.user_id = tweets.user_id
    GROUP BY tweets.user_id, u.name
    ORDER BY cnt DESC
    LIMIT 10;

Instead of directly joining the user table, we execute a subquery, as follows:

    SELECT user_id, name FROM user GROUP BY user_id, name;

The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries, historically only permitting a subquery in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause also.

HiveQL is a rich, ever-evolving language, a full exposition of which is beyond the scope of this article. A description of its query and DDL capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

Structuring Hive tables for given workloads

Often Hive isn't used in isolation; instead, tables are created with particular workloads in mind, or Hive needs to be invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios.

Partitioning a table

With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning.

When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS.

It's important to understand the cardinality of the partition column.
With too few distinct values, the benefits are reduced as the files are still very large. If there are too many values, then queries might need a large number of files to be scanned to access all the required data. Perhaps the most common partition key is one based on date. We could, for example, partition our user table from earlier based on the created_at column, that is, the date the user was first registered. Note that since partitioning a table by definition affects its file structure, we create this table now as a non-external one, as follows:

    CREATE TABLE partitioned_user (
      created_at string,
      user_id string,
      `location` string,
      name string,
      description string,
      followers_count bigint,
      friends_count bigint,
      favourites_count bigint,
      screen_name string,
      listed_count bigint
    )
    PARTITIONED BY (created_at_date string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\u0001'
    STORED AS TEXTFILE;

To load data into a partition, we can explicitly give a value for the partition into which to insert the data, as follows:

    INSERT INTO TABLE partitioned_user
    PARTITION (created_at_date = '2014-01-01')
    SELECT
      created_at, user_id, location, name, description,
      followers_count, friends_count, favourites_count,
      screen_name, listed_count
    FROM user;

This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables:

    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    SET hive.exec.max.dynamic.partitions.pernode=5000;

The first two statements enable all partitions (nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node.
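As a rough mental model of dynamic partitioning, the Python sketch below routes each row to a directory named after its partition key value. It is an illustration only, not Hive's implementation: the warehouse path, the sample rows, and the simplified to_date stand-in (which assumes timestamps shaped like '2014-04-01 10:15:00') are all assumptions.

```python
from collections import defaultdict

# Assumption: the table lives under this (hypothetical) warehouse path.
TABLE_DIR = "/user/hive/warehouse/default/partitioned_user"


def to_date(timestamp):
    """Stand-in for a date conversion: keep the YYYY-MM-DD prefix.

    Assumes timestamps shaped like '2014-04-01 10:15:00'.
    """
    return timestamp[:10]


def route_rows(rows):
    """Group rows by the partition directory their key value maps to."""
    partitions = defaultdict(list)
    for row in rows:
        key_value = to_date(row["created_at"])
        directory = f"{TABLE_DIR}/created_at_date={key_value}"
        partitions[directory].append(row)
    return dict(partitions)


# Illustrative rows, not taken from the article's dataset.
rows = [
    {"user_id": "1", "created_at": "2014-04-01 10:15:00"},
    {"user_id": "2", "created_at": "2014-04-01 18:40:00"},
    {"user_id": "3", "created_at": "2014-04-02 09:00:00"},
]

layout = route_rows(rows)
for directory, members in sorted(layout.items()):
    print(directory, [r["user_id"] for r in members])
```

One directory is created per distinct key value, which is why the cardinality of the partition column matters: every extra distinct value means another directory and at least one more file.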
We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row:

    INSERT INTO TABLE partitioned_user
    PARTITION (created_at_date)
    SELECT
      created_at, user_id, location, name, description,
      followers_count, friends_count, favourites_count,
      screen_name, listed_count,
      to_date(created_at) AS created_at_date
    FROM user;

Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause.

Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table. In the preceding code, we use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string.

Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/key=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-01.

If data is added directly to the filesystem, for instance by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition:

    ALTER TABLE <table_name> ADD PARTITION (<key>='<value>') LOCATION '<location>';

To add metadata for all partitions not currently present in the metastore, we can use the following statement:

    MSCK REPAIR TABLE <table_name>;

On EMR, this is equivalent to executing the following statement:

    ALTER TABLE <table_name> RECOVER PARTITIONS;

Notice that both statements will also work with EXTERNAL tables.

Overwriting and updating data

Partitioning is also useful when we need to update a portion of a table.
Normally a statement of the following form will replace all the data for the destination table:

    INSERT OVERWRITE TABLE <table> …

If OVERWRITE is replaced with INTO, then each INSERT statement will add additional data to the table. Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched.

If we perform an INSERT OVERWRITE statement (or a LOAD DATA … OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by using the OVERWRITE keyword in our previous INSERT statement.

We can also add caveats to the SELECT statement. Say, for example, we only wanted to update data for a certain month:

    INSERT OVERWRITE TABLE partitioned_user
    PARTITION (created_at_date)
    SELECT
      created_at, user_id, location, name, description,
      followers_count, friends_count, favourites_count,
      screen_name, listed_count,
      to_date(created_at) AS created_at_date
    FROM user
    WHERE to_date(created_at) BETWEEN '2014-03-01' AND '2014-03-31';

Bucketing and sorting

Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure.
Let's create bucketed versions of our tweets and user tables; note the additional CLUSTERED BY and SORTED BY clauses in the following CREATE TABLE statements:

    CREATE TABLE bucketed_tweets (
      tweet_id string,
      text string,
      in_reply_to string,
      retweeted boolean,
      user_id string,
      place_id string
    )
    PARTITIONED BY (created_at string)
    CLUSTERED BY (user_id) INTO 64 BUCKETS
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\u0001'
    STORED AS TEXTFILE;

    CREATE TABLE bucketed_user (
      user_id string,
      `location` string,
      name string,
      description string,
      followers_count bigint,
      friends_count bigint,
      favourites_count bigint,
      screen_name string,
      listed_count bigint
    )
    PARTITIONED BY (created_at string)
    CLUSTERED BY (user_id) SORTED BY (name) INTO 64 BUCKETS
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\u0001'
    STORED AS TEXTFILE;

Note that we changed the tweets table to also be partitioned. Hive does not strictly require a bucketed table to be partitioned, but the two mechanisms are often combined.

Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table:

    SET hive.enforce.bucketing=true;

Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT…SELECT… syntax to populate the bucketed table.

When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause.

One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example.
So, for example, any query of the following form would be vastly improved:

    SET hive.optimize.bucketmapjoin=true;
    SELECT …
    FROM bucketed_user u
    JOIN bucketed_tweets t ON u.user_id = t.user_id;

With the join being performed on the column used to bucket the table, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id values in both tables. When determining which rows to match against, only those in the corresponding bucket need to be compared, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second.

Sampling data

Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size.

For a non-bucketed table, you can sample using a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table:

    SELECT MAX(friends_count)
    FROM user TABLESAMPLE(BUCKET 2 OUT OF 64 ON name);

In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column.
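The bucket arithmetic used in this section can be sketched in Python. The snippet assumes that a row's bucket is its hash value modulo the number of buckets, with buckets numbered from 1; this matches the bucket numbers quoted above but is a simplification of Hive's actual hash function. The helper names and the stand-in hash values are hypothetical.

```python
def bucket_number(key_hash, num_buckets):
    """1-based bucket for a non-negative hash value: (hash mod N) + 1."""
    return key_hash % num_buckets + 1


def covering_buckets(bucket, coarse, fine, hashes=range(10_000)):
    """Buckets of a `fine`-bucket table that hold the keys found in
    `bucket` of a `coarse`-bucket table clustered on the same column."""
    return sorted({bucket_number(h, fine) for h in hashes
                   if bucket_number(h, coarse) == bucket})


# Stand-in hash values are just the integers 0..9999; a real table would
# hash the CLUSTERED BY column.
print(covering_buckets(3, 32, 64))  # [3, 35]
```

Because 64 is a multiple of 32, every key in one bucket of the coarser table falls into exactly two predictable buckets of the finer one, so only those buckets need to be compared during the join.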
Hive will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function. Though successful, this is highly inefficient as the full table needs to be scanned to generate the required subset of data.

If we sample on a bucketed table and ensure the number of buckets sampled is a factor or a multiple of the number of buckets in the table, then Hive will only read the buckets in question. For example:

    SELECT MAX(friends_count)
    FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);

In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition.

A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as a percentage of the table, as an absolute data size, or as a number of rows (in each block). The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data, or 100 rows per split, respectively:

    TABLESAMPLE(0.5 PERCENT)
    TABLESAMPLE(1G)
    TABLESAMPLE(100 ROWS)

If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input format and file formats that are supported.

Writing scripts

We can place Hive commands in a file and run them with the -f option in the hive CLI utility:

    $ cat show_tables.hql
    show tables;
    $ hive -f show_tables.hql

We can parameterize HiveQL statements by means of the hiveconf mechanism. This lets a script refer to a variable by name at the point it is used, with the value supplied when the script is invoked.
For example:

    $ cat show_tables2.hql
    show tables like '${hiveconf:TABLENAME}';
    $ hive -hiveconf TABLENAME=user -f show_tables2.hql

The variable can also be set within the Hive script or an interactive session:

    SET TABLENAME=user;

The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows:

    $ cat show_tables3.hql
    show tables like '${hivevar:TABLENAME}';
    $ hive -hivevar TABLENAME=user -f show_tables3.hql

Or we can set the variable interactively:

    SET hivevar:TABLENAME=user;

Summary

In this article, we learned that in its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world.

HiveQL is an implementation of SQL on Hadoop and was the primary focus of this article. In regard to HiveQL and its implementations, we covered the following topics:

How HiveQL provides a logical model atop data stored in HDFS in contrast to relational databases where the table structure is enforced in advance
How HiveQL offers the ability to extend its core set of operators with user-defined code and how this contrasts to the Pig UDF mechanism
The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez

Resources for Article:

Further resources on this subject:
Big Data Analysis [Article]
Understanding MapReduce [Article]
Amazon DynamoDB - Modelling relationships, Error handling [Article]
Packt
29 Jun 2011
8 min read

Manifest Assurance: Security and Android Permissions for Flash

Setting application permissions with the Android Manifest file

When users choose to install an application on Android, they are always presented with a warning about which permissions the application will have within their particular system. From Internet access to full Geolocation, Camera, or External Storage permissions, the user is explicitly told what rights the application will have on their system. If it seems as though the application is asking for more permissions than necessary, the user will usually refuse the install and look for another application to perform the task they need. It is very important to only require the permissions your application truly needs, or else users might be suspicious of you and the applications you make available.

How to do it...

There are three ways in which we can modify the Android Manifest file to set application permissions for compiling our application with Adobe AIR.

Using Flash Professional:

Within an AIR for Android project, open the Properties panel and click the little wrench icon next to Player selection. The AIR for Android Settings dialog window will appear. You will be presented with a list of permissions to either enable or disable for your application. Check only the ones your application will need and click OK when finished.

Using Flash Builder:

When first setting up your AIR for Android project in Flash Builder, define everything required in the Project Location area, and click Next. You are now in the Mobile Settings area of the New Flex Mobile Project dialog. Click the Permissions tab, making sure that Google Android is the selected platform. You will be presented with a list of permissions to either enable or disable for your application. Check only the ones your application will need and continue along with your project setup.

To modify any of these permissions after you've begun developing the application, simply open the AIR descriptor file and edit it as detailed in the following sections.
Using a simple text editor:

Find the AIR descriptor file in your project. It is normally named something like {MyProject}-app.xml and resides at the project root. Browse the file for a node named <android>; within this node will be another called <manifestAdditions>, which holds a child node called <manifest>. This section of the document contains everything we need to set permissions for our Android application.

All we need to do is either comment out or remove those particular permissions that our application does not require. For instance, this application needs Internet, External Storage, and Camera access. Every other permission node is commented out using the standard XML comment syntax of <!-- <comment here> -->:

    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE"/>
    <!--<uses-permission android:name="android.permission.READ_PHONE_STATE"/>-->
    <!--<uses-permission android:name="android.permission.ACCESS_FINE_LOCATION"/>-->
    <!--<uses-permission android:name="android.permission.DISABLE_KEYGUARD"/>-->
    <!--<uses-permission android:name="android.permission.WAKE_LOCK"/>-->
    <uses-permission android:name="android.permission.CAMERA"/>
    <!--<uses-permission android:name="android.permission.RECORD_AUDIO"/>-->
    <!--<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE"/>-->
    <!--<uses-permission android:name="android.permission.ACCESS_WIFI_STATE"/>-->

How it works...

The permissions you define within the AIR descriptor file will be used to create an Android Manifest file to be packaged within the .apk produced by the tool used to compile the project. These permissions restrict and enable the application, once installed on a user's device, and also alert the user as to which activities and resources the application will be given access to prior to installation. It is very important to provide only the permissions necessary for an application to perform the expected tasks once installed upon a device.
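One way to double-check which permissions are actually enabled is to parse the manifest fragment and list the uses-permission nodes that are not commented out. The following Python sketch does this with the standard library; the trimmed fragment, the helper name enabled_permissions, and the use of Python itself are illustrative assumptions and not part of the AIR tooling.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"

# A trimmed, hypothetical <manifest> fragment in the style shown above.
MANIFEST = f"""
<manifest xmlns:android="{ANDROID_NS}">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE"/>
    <!--<uses-permission android:name="android.permission.WAKE_LOCK"/>-->
    <uses-permission android:name="android.permission.CAMERA"/>
    <!--<uses-permission android:name="android.permission.RECORD_AUDIO"/>-->
</manifest>
"""


def enabled_permissions(manifest_xml):
    """Return the short names of permissions that are not commented out."""
    # The default parser discards XML comments, so commented-out
    # permission nodes never appear in the tree.
    root = ET.fromstring(manifest_xml.strip())
    names = []
    for node in root.iter("uses-permission"):
        full_name = node.get(f"{{{ANDROID_NS}}}name")
        names.append(full_name.rsplit(".", 1)[-1])
    return names


permissions = enabled_permissions(MANIFEST)
print(permissions)  # ['INTERNET', 'WRITE_EXTERNAL_STORAGE', 'CAMERA']
```

A check like this could run before packaging to confirm that the enabled list matches what the application is expected to request.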
The following is a list of the possible permissions for the Android manifest document:

ACCESS_COARSE_LOCATION: Allows the Geolocation class to access Wi-Fi and triangulated cell tower location data.
ACCESS_FINE_LOCATION: Allows the Geolocation class to make use of the device GPS sensor.
ACCESS_NETWORK_STATE: Allows an application to access the network state through the NetworkInfo class.
ACCESS_WIFI_STATE: Allows an application to access the Wi-Fi state through the NetworkInfo class.
CAMERA: Allows an application to access the device camera.
INTERNET: Allows the application to access the Internet and perform data transfer requests.
READ_PHONE_STATE: Allows the application to mute audio when a phone call is in effect.
RECORD_AUDIO: Allows microphone access to the application to record or monitor audio data.
WAKE_LOCK: Allows the application to prevent the device from going to sleep using the SystemIdleMode class. (Must be used alongside DISABLE_KEYGUARD.)
DISABLE_KEYGUARD: Allows the application to prevent the device from going to sleep using the SystemIdleMode class. (Must be used alongside WAKE_LOCK.)
WRITE_EXTERNAL_STORAGE: Allows the application to write to external memory. This memory is normally a device SD card.

Preventing the device screen from dimming

The Android operating system will dim, and eventually turn off, the device screen after a certain amount of time has passed. It does this to preserve battery life, as the display is the primary power drain on a device. For most applications, if a user is interacting with the interface, that interaction will prevent the screen from dimming. However, if your application does not involve user interaction for lengthy periods of time, yet the user is looking at or reading something upon the display, it would make sense to prevent the screen from dimming.

How to do it...

There are two settings in the AIR descriptor file that can be changed to ensure the screen does not dim.
We will also modify properties of our application to complete this recipe:

Find the AIR descriptor file in your project. It is normally named something like {MyProject}-app.xml and resides at the project root.

Browse the file for a node named <android>. Within this node will be another called <manifestAdditions>, which holds a child node called <manifest>. This section of the document contains everything we need to set permissions for our Android application.

All we need to do is make sure the following two nodes are present within this section of the descriptor file. Note that both of these permissions must be enabled to allow the application to control the system through the SystemIdleMode class. Uncomment them if necessary.

    <uses-permission android:name="android.permission.WAKE_LOCK" />
    <uses-permission android:name="android.permission.DISABLE_KEYGUARD" />

Within our application, we will import the following classes:

    import flash.desktop.NativeApplication;
    import flash.desktop.SystemIdleMode;
    import flash.display.Sprite;
    import flash.display.StageAlign;
    import flash.display.StageScaleMode;
    import flash.text.TextField;
    import flash.text.TextFormat;

Declare a TextField and TextFormat pair to trace out messages to the user:

    private var traceField:TextField;
    private var traceFormat:TextFormat;

Now, we will set the system idle mode for our application by assigning the SystemIdleMode.KEEP_AWAKE constant to the NativeApplication.nativeApplication.systemIdleMode property:

    protected function setIdleMode():void {
        NativeApplication.nativeApplication.systemIdleMode = SystemIdleMode.KEEP_AWAKE;
    }

We will, at this point, continue to set up our TextField, apply a TextFormat, and add it to the DisplayList.
Here, we create a method to perform all of these actions for us:

    protected function setupTraceField():void {
        traceFormat = new TextFormat();
        traceFormat.bold = true;
        traceFormat.font = "_sans";
        traceFormat.size = 24;
        traceFormat.align = "left";
        traceFormat.color = 0xCCCCCC;
        traceField = new TextField();
        traceField.defaultTextFormat = traceFormat;
        traceField.selectable = false;
        traceField.multiline = true;
        traceField.wordWrap = true;
        traceField.mouseEnabled = false;
        traceField.x = 20;
        traceField.y = 20;
        traceField.width = stage.stageWidth - 40;
        traceField.height = stage.stageHeight - traceField.y;
        addChild(traceField);
    }

Here, we simply output the currently assigned system idle mode String to our TextField, letting the user know that the device will not be going to sleep:

    protected function checkIdleMode():void {
        traceField.text = "System Idle Mode: " + NativeApplication.nativeApplication.systemIdleMode;
    }

When the application is run on a device, the System Idle Mode will be set and the results traced out to our display. The user can leave the device unattended for as long as necessary and the screen will not dim or lock. In the following example, this application was allowed to run for five minutes without user intervention:

How it works...

Two things must be done to get this to work correctly, and both are absolutely necessary. First, we have to be sure the application has the correct permissions through the Android Manifest file. Allowing the application permissions for WAKE_LOCK and DISABLE_KEYGUARD within the AIR descriptor file will do this for us. The second part involves setting the NativeApplication.nativeApplication.systemIdleMode property to keepAwake. This is best accomplished through use of the SystemIdleMode.KEEP_AWAKE constant. Ensuring that these conditions are met will enable the application to keep the device display lit and prevent Android from locking the device after it has been idle.

Packt
27 Nov 2014
6 min read

Dealing with Upstream Proxies

This article is written by Akash Mahajan, the author of Burp Suite Essentials.

We know that setting up Mozilla Firefox with the FoxyProxy Standard add-on to create a selective, pattern-based forwarding process allows us to ensure that only white-listed traffic from our browser reaches Burp. Burp also allows us to set this up through its own configuration options. Think of it like this: less traffic reaching Burp ensures that Burp is dealing with legitimate traffic, and its filters are working on ensuring that we remain within our scope.

As a security professional testing web applications, you hear and read about scope everywhere. Many times, we are expected to test only parts of an application, and usually, the scope is limited by domain, subdomain, folder name, and even certain filenames. Burp gives a nice, simple-to-use interface to add, edit, and remove targets from the scope.

Dealing with upstream proxies and SOCKS proxies

Sometimes, the application that we need to test lies inside some corporate network. The clients give access to a specific IP address that is white-listed in the corporate firewall. At other times, we work inside the client location, but we must go through an internal proxy to get access to the staging site for testing. In all such cases and more, we need to be able to add an additional proxy that Burp can send data to before it reaches our target. In some cases, this proxy can be the one that the browser requires to reach the intranet or even the Internet. Since we would like to intercept all the browser traffic and Burp has become the proxy for the browser, we need to chain the proxies by configuring that same proxy in Burp.

Types of proxies supported by Burp

We can configure additional proxies by navigating to Options | Connections. If you look carefully, the upstream proxy rule editor looks like the FoxyProxy add-on proxy window.
That is not surprising as both of them operate with URL patterns. We can add the target as the destination that requires a proxy to reach. Most standard proxies that support authentication are supported in Burp. Out of these, NTLM flavors are regularly found in networks with the Microsoft Active Directory infrastructure. The usage is straightforward. Add the destination and the other details that should be provided to you by the network administrators.

Working with SOCKS proxies

SOCKS proxies are another common form of proxy in use. The most popular SOCKS-based proxy is Tor, which allows your entire browser traffic, including DNS lookups, to occur at the proxy end. Since the SOCKS proxy protocol works by taking all the traffic through it, the destination server sees the IP address of the SOCKS proxy rather than our own. You can give this a whirl by running the Tor browser bundle (http://www.torproject.org/projects/torbrowser.html.en). Once the Tor browser bundle is running successfully, just add the following values in the SOCKS proxy settings of Burp. Make sure you check Use SOCKS proxy after adding the correct values. Have a look at the following screenshot:

Using SSH tunneling as a SOCKS proxy

Using SSH tunneling as a SOCKS proxy is quite useful when we want to give a white-listed IP address to a firewall administrator to access an application. So, the scenario here requires you to have access to a GNU/Linux server with a static IP address, which you can connect to using Secure Shell (SSH). In a Mac OS X or GNU/Linux shell, the following command will start a local SOCKS proxy:

    ssh -D 12345 user@hostname.com

Once you are successfully logged in to your server, leave the session open so that Burp can keep using it. Now add localhost as SOCKS proxy host and 12345 as SOCKS proxy port, and you are good to go. In Windows, if we use a command-line SSH client that comes with GNU, the process remains the same.
Otherwise, if you are a PuTTY fan, let's see how we can configure the same thing in it. In PuTTY, follow these steps to get the SSH tunnel working, which will be our SOCKS proxy:

Start PuTTY and click on SSH and then on Tunnels.
Here, add a new forwarded port. Give it the value of 12345. Under Destination, there is a bunch of radio buttons; choose Auto and Dynamic, and then click on the Add button:
Once this is set, connect to the server.
Add the values localhost and 12345 in the Host and Port fields, respectively, in the Burp options for the SOCKS proxy.

You can verify that your traffic is going through the SOCKS proxy by visiting any site that gives you your external IP address. I personally use my own web page for that, http://akashm.com/ip.php; you might want to try http://icanhazip.com or http://whatismyip.com.

Burp allows maximum connectivity with upstream and SOCKS proxies to make our job easier. With upstream proxies, URL patterns let us choose which proxy is used for which destination. SOCKS proxies, due to their nature, take all the traffic and send it to another computer, so we can't choose which URLs to use them for. But this allows a simple-to-use workflow to test applications that are behind corporate firewalls and need to white-list our static IP before allowing access.

Setting up Burp to be a proxy server for other devices

So far, we have run Burp on our computer. This is good enough when we want to intercept the traffic of browsers running on our computer. But what if we would like to intercept traffic from our television or from our iOS or Android devices? Currently, in the default configuration, Burp has started one listener on an internal (loopback) interface on port number 8080. We can start multiple listeners on different ports and interfaces. We can do this in the Options subtab under the Proxy tab. Note that this is different from the main Options tab.
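For the SSH tunnel set up earlier, the external IP check can also be scripted instead of done in the browser. The following is a minimal shell sketch, not part of the original article. It talks to the SOCKS proxy directly with curl rather than going through Burp, so it only confirms that the tunnel itself works. The function names compare_ips and check_socks_tunnel are hypothetical, and the host, port, and URL defaults simply mirror the values used in the text.

```shell
#!/bin/sh
# Sketch: check whether requests sent through the local SOCKS proxy
# leave from a different external IP address than direct requests.
# Defaults mirror the article (ssh -D 12345, http://icanhazip.com).
SOCKS_HOST="${SOCKS_HOST:-localhost}"
SOCKS_PORT="${SOCKS_PORT:-12345}"
IP_ECHO_URL="${IP_ECHO_URL:-http://icanhazip.com}"

# Compare two already-fetched addresses.
# $1 = IP seen without the proxy, $2 = IP seen through the proxy.
compare_ips() {
    if [ -z "$2" ]; then
        echo "proxy unreachable"
    elif [ "$1" = "$2" ]; then
        echo "same address: traffic is NOT using the tunnel"
    else
        echo "tunnel in use: exit address is $2"
    fi
}

# Fetch both addresses with curl, then compare them.
check_socks_tunnel() {
    direct_ip=$(curl --silent --max-time 10 "$IP_ECHO_URL")
    # --socks5-hostname also resolves DNS names on the proxy side.
    proxied_ip=$(curl --silent --max-time 10 \
        --socks5-hostname "$SOCKS_HOST:$SOCKS_PORT" "$IP_ECHO_URL")
    compare_ips "$direct_ip" "$proxied_ip"
}
```

With the SSH session open, running check_socks_tunnel should report the server's static IP address as the exit address. If it reports the same address as the direct request, the proxy host or port is probably wrong.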
We can add more than one proxy listener at the same time by following these steps:

Click on the Add button under Proxy Listeners.
Enter a port number. It can be the same 8080, but if that is confusing, you can use 8081.
Specify an interface and choose your LAN IP address.
Once you click on OK, click on Running, and now you have started an external listener for Burp:

You can enter the LAN IP address and the port number you chose as the proxy server on your mobile device, and all HTTP traffic will get intercepted by Burp. Have a look at the following screenshot:

Summary

In this article, you learned how to use a SOCKS proxy server, especially in an SSH tunnel scenario. You also learned how simple it is to create multiple listeners for Burp, which allows other devices in the network to send their HTTP traffic to the Burp interception proxy.