How-To Tutorials


31 Mar 2010

10 min read

Introduction to Railo Open Source


What is Railo? Railo is an open source Java application server that implements CFML (ColdFusion Markup Language), a tag based language from Adobe's commercial product “ColdFusion.” Its performance is excellent, and it includes features that significantly increase productivity. Railo is a relative newcomer, but has been making some impressive ripples in the industry lately. This article is a primer on some of the critical advantages of Railo and why it is worth a serious look for web application development. Isn’t ColdFusion dead? A few years back, an article was published naming 10 technologies that were dead or dying, and to many people's surprise, ColdFusion was in that list. That caused a lot of waves. One thing about CFML developers – they are passionate about their programming language! ColdFusion has seen moderate success in specific vertical markets, but has been notably well accepted by the US Government. In comparison to dominant development languages, CFML never seemed to find real favor with the masses. Since ColdFusion was re-engineered to run entirely on Java, and with the arrival of Adobe Flex a few years ago which integrates Flash and ColdFusion, this has changed quite a bit. Adobe's ColdFusion product integrates so well with Flex that it has spawned new interest. One of the largest complaints about Adobe ColdFusion has always been the price. It’s been my experience that CFML developers consider themselves to be industry peers of LAMP (Linux, Apache, MySQL, PHP) developers, who use all open source tools. The majority of LAMP developers consider their skills much higher than that of CFML developers. This has only fed the fury over the years of CFML developers who claim that the investment in purchasing ColdFusion is a quick return on investment since CFML is so much more productive. Now along comes Railo, offering a free and open source solution to the CFML developers' dreams. 
Not only is it free, but it also performs very well, is stable, and is updated reasonably frequently. This is good news for CFML, which is, in my opinion, highly underrated, mostly due to poor marketing and sales price points over the years. CFML is actually quite a powerful and surprisingly productive language, and was designed to be a RAD (Rapid Application Development) tool. It has grown into a significantly better product, and certainly does deserve more respect than it has had. But enough about CFML, let’s talk about why I find Railo so impressive and what distinguishes it from the competition. What can you do with Railo? Perhaps the best way to answer this is to say, “What CAN'T you do with Railo?” The CFML language is essentially a big Java tag library. CFML has grown into an impressive library over the years and Railo supports everything that Adobe's product supports that is in mainstream use. (Support differs somewhat at any given time, as both Railo and Adobe release new versions of their products.) The core features of Railo's language provide easy-to-learn tags for everything from database queries to sending dynamic email messages to scripting connections with FTP and Amazon S3 storage. Pretty much anything you can do with PHP you can do with Railo. Here's the catch – generally speaking, it takes less time to implement a solution using CFML than it does with PHP, ASP.NET or pure Java. Use CFML for the basics; Extend using Java. While Railo gives you a LOT of built-in functions, the real truth of the situation is that it is Java under the hood. All the tags and functions ultimately get compiled and run as Java byte code. The language is well designed, however, so that you can mix and match your CFML and Java code.
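To make the "Java under the hood" point concrete, here is a minimal plain-Java sketch of reading a text file with java.io.FileReader, the same class that CFML code can reach directly. The class name ReadMessage and the helper readAll are illustrative names only; they are not part of Railo or CFML, and this is not a claim about how Railo implements its own file tags.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadMessage {

    // Reads the whole file into a String using the platform default charset.
    // (Hypothetical helper, written only for this illustration.)
    static String readAll(String path) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            int ch;
            while ((ch = reader.read()) != -1) {
                content.append((char) ch);
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // The file path is supplied by the caller on the command line
        System.out.println(readAll(args[0]));
    }
}
```

Compared with a single CFML tag, the Java version spells out the reader, the loop, and the cleanup, which is roughly the productivity trade-off the article describes.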
For instance, if you wanted to read in a text file, you can use the built-in tag CFFILE:

    <cffile action="read" file="c:webmessage.txt" variable="strContent"></cffile>

This reads in the contents of the text file and stores it in the specified variable. To display that content in the web browser, you would output it like so:

    <cfoutput>#strContent#</cfoutput>

To illustrate how Java can be used directly in your code, a similar task can be done using Java objects instead of the built-in CFML tags like so:

    <!--- Reads only the first line of the file, to keep the example short --->
    <cfobject type="Java" class="java.io.FileReader" action="Create" name="myFileReader">
    <cfset reader = createObject("java", "java.io.BufferedReader").init(myFileReader.init("c:webmessage.txt"))>
    <cfset strContent = reader.readLine()>
    <cfset reader.close()>
    <cfoutput>#strContent#</cfoutput>

For a one-line file, these two small pieces of code achieve the same goal. My point is that the CFML language isn't limited to just CFML; you can instantiate and use any Java object anywhere within your code. This makes the language incredibly flexible, since you can use the CFML tags for quick and easy tasks, and use Java for heavy lifting where needed.

Deployment and Development Environments All versions of Railo can be downloaded either as an “express,” “server” or “custom” deployment. The express edition is extremely easy for developers to get up and running and usually involves just decompressing a zip file onto your local system and starting it up. The server package comes along with Caucho Resin, a very high performance Java application server. (Side note – some of the tools included with Resin are pretty impressive as well, including their all-Java implementation of PHP!) The custom deployment package is for launching Railo on other Java servlet containers like Tomcat or WebLogic. Setting up Railo on a production server wasn't difficult. Granted, it is a bit more involved than installing RPMs of your favorite PHP version, but documentation was easily found on Railo's site and other sites found through Google.
Like Adobe's product, Railo comes with web administration tools to manage the server and application-specific settings and resources. This is a big step up from the PHP and Linux world, where you normally need to configure a lot of your application's settings (data sources for example) in configuration files. The Railo administrator goes a few steps beyond Adobe as well, and makes context specific administration consoles available, so individual applications and websites can define their own sandboxed data sources, virtual mappings, and more. This is a really nice touch, and has been a requested feature for a long time. Where Railo Shines I have already reviewed some of the reasons why Railo is impressive. Aside from being a very powerful RAD, with performance that rivals or beats Adobe, Railo distinguishes itself further with some impressive features. Virtual File systems and Mappings As developers, we have all had to deal with managing remote or compressed files at one time or another. This feature in Railo does in a few mouse clicks what takes hundreds of lines of code. Railo lets you map remote file systems, like FTP, drive shares, and even Amazon S3 buckets and assign them to a virtual path in your application! This means that you can use the simple built in functions for file manipulation, and treat those files as if they were sitting right on the local file system. The support goes even further, and lets you map Java jar files and .zip files, so you can dynamically reference and run code sitting inside compressed archives. Setting up new mappings is a point-and-click affair in the Railo administrator or can be done programmatically. Application Distribution and Source Code Security The Java world has always been a step (alright, several steps) ahead of web application developers in packaging and distribution of applications. 
Many developers have their own home-grown methods for deploying a site and many web development applications, like Dreamweaver, have an FTP based method of deployment. Ultimately, it usually means handing over unprotected source code. CFML development has been the same way (yes, Adobe did have a way to compile .cfm templates, but my research shows it is both clumsy to use and not very popular). Railo brings “Java world” package deployment to CFML development. You can compile a whole application to Java byte code, compress it to a jar file and deploy it on any other Railo server. Railo is even smart enough to let you map a remote jar file on an FTP site and run it as a local web application. This means you have all the tools you need to deploy web applications and not expose your source. Built in AMF Support for Flex/Flash Applications Since Adobe open-sourced their BlazeDS AMF tools, Railo has integrated them making an easy to use system that “just works” with Flash applications. Inter-Application Integration, PDF and Video Manipulation CFML already has great capability for integrating with a huge number of database systems and can be expanded to use any of the huge number of open source Java projects. Railo can be used to talk to Amazon Web Services, like EC2 and S3 for cloud computing applications. Railo also has built in features for file conversions, such as dynamically generating PDFs, and programmatic editing and format conversions of digital video. A few simple lines of code can convert your video files to different formats, extract thumbnails for web previews, and then you could have them dropped on Amazon S3 to be served from the cloud. Very cool stuff, and worth looking at some of the examples on the Railo website. As you look over code that uses these features, it looks quite simple and it is amazing that Railo makes them look like child’s play, but there is serious inter-system integration going on behind the scenes. 
Railo makes it so very easy to add these capabilities to any web application. Infinitely Expandable with Java As mentioned above, it is easy to invoke Java classes from within CFML pages. Since Railo itself runs in a Java container, that means that any classes or code from the Java world can be integrated and used with a Railo application. My Experience Building a Railo Project My company has used ColdFusion for several projects. One of our commercial products is built on it and was originally designed for Adobe ColdFusion. Our product does a lot of heavy lifting with databases, internationalization, document format conversions, PDF previews and a lot more. Early in 2009 we did a complete conversion of the source to be compatible with Railo. There were only minor areas where our code needed to change, and most of them were in custom Java code that we wrote, which simply needed to be updated to be compatible with Railo's Java libraries. The pleasant surprise came when we were done and noticed a significant performance increase running on Railo. Summary In summary, I have been very impressed with Railo. It is community-driven; the people at Railo are responsive and truly care about the developer community, and the product really delivers what it claims. They have provided an application development platform that is both industry compatible and innovative. I think all seasoned web application developers will be able to appreciate what Railo has to offer. I believe that such powerful integration done so easily with only a few lines of code will draw a lot of attention. This is definitely a technology you should keep an eye on.



Packt

05 Feb 2010

4 min read

Modeling, Shading, Texturing, Lighting, and Compositing a Soda Can in Blender 2.49: Part 1


I wanted to base this article on the latest version of Blender (2.5), but I will hold off until everyone gets comfortable with it; who knows, in one of my upcoming articles we might delve into an introduction to the new version. But for now, let’s be courteous enough to use the fully functional 2.49 version of Blender. If you don’t have it right now, I suggest you head over to http://www.blender.org/download/get-blender/ and grab your own copy. You might also want to have a copy of the latest GIMP from http://www.gimp.org/downloads/. REQUIREMENTS: Skill level: Intermediate; Blender 2.49b (stable); GIMP 2.6.8. INTRODUCTION: So basically, we’ll use Blender’s modeling tools, material indexes, powerful texture system, basic UV unwrapping, some lighting techniques, and of course the node compositor which is built into Blender. I dedicate this article to my family and the whole Blender community who have been very supportive of me during my past years of struggle and learning. It was once just a wish that someday I might get the hang of using this application as well as I did with GIMP, and finally, somehow, it did happen. REFERENCE PHASE: Before we even begin modeling and firing up Blender itself, let’s get ourselves some decent reference images to base our model on. Anything will do; it depends entirely on your tastes and preferences. Doing a quick Google search, here are some that I found: [reference images not included] MODELING PHASE: After carefully studying the shape and size of our reference soda cans, we can now proceed and start creating our base shape for the entire process. I think this might be a good time to say this line, “Fire up Blender!” Depending on your User Defaults and Preferences, your startup screen might look a bit different from mine and your default object on the scene might be different too.
If you have objects other than a cube on your scene, kindly, delete them first since we’re only going to use the cube as our starting point. So if you don’t have one right now, go ahead and add it from the Spacebar > Add > Mesh > Cube menu. Adding a Cube to the Scene You might have wondered why a Cube and not a Cylinder. It’s because we don’t want to work on some extra polygons, just a few points will do. And we would be using some of Blender’s Modifiers to add contours and interpolations in between points to achieve smooth curves on the segments. With our cube on the scene now, go ahead and select it (Right Mouse Button [RMB]), then press CTRL+ 2 on your keyboard to add a Subsurf Modifier on the selected object or click the Editing (F9) button and scroll until you see the Modifiers tab then click Add Modifier and finally choose Subsurf. This will add a new modifier on our current stack. Adding a Subsurf Modifier After doing this, modify some of the subsurf options accordingly. Go ahead and change the Render Levels value to 3, or if you wish to, you could also change the Levels value to 3 such that what you see in your viewport is what you get on the render, but at the cost of a bit of a slowdown on your viewport (depending on the power your computer has). But still, despite adding a Subdivision Surface/Subsurf modifier on our Cube, why does it look polygonal still? That is because by default, the faces’ interpolation around the neighboring ones is set to Solid, that’s why we see this sharp edged transition in between faces. To make it smoother, just go ahead and click on the Editing(F9) button and scroll until you see the Links and Materials tab then click Set Smooth, or in Edit Mode, press W on your keyboard to bring up the Specials Menu and choose Set Smooth. Voila! 
Smoothing out the Geometry After this step, go to front view by pressing Numpad 1 on your numeric keypad and go to Edit Mode by pressing TAB, or choosing it from the Mode dropdown on the bottom of your 3D view. Select the top-most four vertices and move them 1 Blender Unit up along the Z-axis, do this by holding down the Ctrl key to constrain your movements on increments of 1, then press Z on your keyboard to constrain your movement on the z-axis only and not elsewhere. Moving the Top-most Vertices along Z




Packt

15 Sep 2015

16 min read

Java Hibernate Collections, Associations, and Advanced Concepts




Packt

22 Oct 2014

10 min read

Implementing Stacks using JavaScript



Packt

19 Aug 2014

8 min read

BPMS Components


In this article by Mariano Nicolas De Maio, the author of jBPM6 Developer Guide, we will look into the various components of a Business Process Management (BPM) system. BPM systems are pieces of software created with the sole purpose of guiding your processes through the BPM cycle. They were originally monolithic systems in charge of every aspect of a process, where they had to be heavily migrated from visual representations to executable definitions. They've come a long way from there, but we usually relate them to the same old picture in our heads when a system that runs all your business processes is mentioned. Nowadays, nothing is further from the truth. Modern BPM Systems are not monolithic environments; they're coordination agents. If a task is finished, they will know what to do next. If a decision needs to be made regarding the next step, they manage it. If a group of tasks can be concurrent, they turn them into parallel tasks. If a process's execution is efficient, they will perform the processing 0.1 percent of the time in the process engine and 99.9 percent of the time on tasks in external systems. This is because they will have no heavy executions within, only derivations to other systems. Also, they will be able to do this from nothing but a specific diagram for each process and specific connectors to external components. In order to empower us to do so, they need to provide us with a structure and a set of tools that we'll start defining to understand how BPM systems' internal mechanisms work, and specifically, how jBPM6 implements these tools. Components of a BPMS All big systems become manageable when we divide their complexities into smaller pieces, which makes them easier to understand and implement.
BPM systems apply this by dividing each function in a different module and interconnecting them within a special structure that (in the case of jBPM6) looks something like the following figure: BPMS' internal structure Each component in the preceding figure resolves one particular function inside the BPMS architecture, and we'll see a detailed explanation on each one of them. The execution node The execution node, as seen from a black box perspective, is the component that receives the process definitions (a description of each step that must be followed; from here on, we'll just refer to them as processes). Then, it executes all the necessary steps in the established way, keeping track of each step, variable, and decision that has to be taken in each process's execution (we'll start calling these process instances). The execution node along with its modules are shown in the following figure: The execution node is composed of a set of low-level modules: the semantic module and the process engine. The semantic module The semantic module is in charge of defining each of the specific language semantics, that is, what each word means and how it will be translated to the internal structures that the process engine can execute. It consists of a series of parsers to understand different languages. It is flexible enough to allow you to extend and support multiple languages; it also allows the user to change the way already defined languages are to be interpreted for special use cases. It is a common component of most of the BPMSes out there, and in jBPM6, it allows you to add the extensions of the process interpretations to the module. This is so that you can add your own language parsers, and define your very own text-based process definition language or extend existing ones. The process engine The process engine is the module that is in charge of the actual execution of our business processes. 
It creates new process instances and keeps track of their state and their internal steps. Its job is to expose methods to inject process definitions and to create, start, and continue our process instances. Understanding how the process engine works internally is a very important task for the people involved in BPM's stage 4, that is, runtime. This is where different configurations can be used to improve performance, integrate with other systems, provide fault tolerance, clustering, and many other functionalities. Process Engine structure In the case of jBPM6, process definitions and process instances have similar structures but completely different objectives. Process definitions only show the steps it should follow and the internal structures of the process, keeping track of all the parameters it should have. Process instances, on the other hand, should carry all of the information of each process's execution, and have a strategy for handling each step of the process and keep track of all its actual internal values. Process definition structures These structures are static representations of our business processes. However, from the process engine's internal perspective, these representations are far from the actual process structure that the engine is prepared to handle. In order for the engine to get those structures generated, it requires the previously described semantic module to transform those representations into the required object structure. The following figure shows how this parsing process happens as well as the resultant structure: Using a process modeler, business analysts can draw business processes by dragging-and-dropping different activities from the modeler palette. For jBPM6, there is a web-based modeler designed to draw Scalable Vector Graphics (SVG) files; this is a type of image file that has the particularity of storing the image information using XML text, which is later transformed into valid BPMN2 files. 
Note that BPMN2 and jBPM6 are not tied to each other. On one hand, the BPMN2 standard can be used by other process engine providers such as Activiti or Oracle BPM Suite. Also, because of the semantic module, jBPM6 could easily work with other parsers to translate virtually any form of textual representation of a process to its internal structures. In the internal structures, we have a root component (called Process in our case, which is finally implemented in a class called RuleFlowProcess) that will contain all the steps that are represented inside the process definition. From the jBPM6 perspective, you can manually create these structures using nothing but the objects provided by the engine. Inside the jBPM6-Quickstart project, you will find a code snippet doing exactly this in the createProcessDefinition() method of the ProgrammedProcessExecutionTest class:

    //Process Definition
    RuleFlowProcess process = new RuleFlowProcess();
    process.setId("myProgramaticProcess");
    //Start Task
    StartNode startTask = new StartNode();
    startTask.setId(1);
    //Script Task
    ActionNode scriptTask = new ActionNode();
    scriptTask.setId(2);
    DroolsAction action = new DroolsAction();
    action.setMetaData("Action", new Action() {
        @Override
        public void execute(ProcessContext context) throws Exception {
            System.out.println("Executing the Action!!");
        }
    });
    scriptTask.setAction(action);
    //End Task
    EndNode endTask = new EndNode();
    endTask.setId(3);
    //Adding the connections to the nodes and the nodes to the processes
    new ConnectionImpl(startTask, "DROOLS_DEFAULT", scriptTask, "DROOLS_DEFAULT");
    new ConnectionImpl(scriptTask, "DROOLS_DEFAULT", endTask, "DROOLS_DEFAULT");
    process.addNode(startTask);
    process.addNode(scriptTask);
    process.addNode(endTask);

Using this code, we can manually create the object structures to represent the process shown in the following figure: This process contains three components: a start node, a script node, and an end node.
In this case, this simple process is in charge of executing a simple action. The start and end tasks simply specify a sequence. Even if this is a correct way to create a process definition, it is not the recommended one (unless you're making a low-level functionality test). Real-world, complex processes are better off being designed in a process modeler, with visual tools, and exported to standard representations such as BPMN 2.0. The output in both cases is the same: a process object that the jBPM6 runtime can understand. While we analyze how the process instance structures are created and how they are executed, this will do. Process instance structures Process instances represent the running processes and all the information being handled by them. Every time you want to start a process execution, the engine will create a process instance. Each particular instance will keep track of all the activities that are being created by its execution. In jBPM6, the structure is very similar to that of the process definitions, with one root structure (the ProcessInstance object) in charge of keeping all the information and NodeInstance objects to keep track of live nodes. The following code shows a simplification of the methods of the ProcessInstance implementation:

    public class RuleFlowProcessInstance implements ProcessInstance {
        public RuleFlowProcess getRuleFlowProcess() { ... }
        public long getId() { ... }
        public void start() { ... }
        public int getState() { ... }
        public void setVariable(String name, Object value) { ... }
        public Collection<NodeInstance> getNodeInstances() { ... }
        public Object getVariable(String name) { ... }
    }

After creating the instance, the engine calls its start() method. This method looks up the StartNode of the process and triggers it.
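The split between a static definition and a running instance can be sketched in plain Java without any jBPM dependency. Everything below is hypothetical and deliberately simplified: MiniProcessSketch, Definition, Instance, and the three state constants are illustrative names, not jBPM6 classes, and the sketch only handles a straight chain of nodes rather than the connected graph that the real engine walks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class MiniProcessSketch {

    /** Static description of a process: an ordered chain of named nodes (hypothetical). */
    static class Definition {
        final String id;
        final List<String> nodeNames = new ArrayList<>();
        final Map<String, Consumer<Instance>> actions = new HashMap<>();

        Definition(String id) { this.id = id; }

        /** Appends a node; a null action marks a pure sequence node such as start or end. */
        Definition addNode(String name, Consumer<Instance> action) {
            nodeNames.add(name);
            if (action != null) {
                actions.put(name, action);
            }
            return this;
        }
    }

    /** One running execution of a Definition, with its own variables and state (hypothetical). */
    static class Instance {
        static final int PENDING = 0, ACTIVE = 1, COMPLETED = 2;

        final Definition definition;
        final Map<String, Object> variables = new HashMap<>();
        final List<String> visited = new ArrayList<>();
        int state = PENDING;

        Instance(Definition definition) { this.definition = definition; }

        void setVariable(String name, Object value) { variables.put(name, value); }
        Object getVariable(String name) { return variables.get(name); }

        /** Triggers the first node and follows the chain until the last node has run. */
        void start() {
            state = ACTIVE;
            for (String name : definition.nodeNames) {
                visited.add(name);
                Consumer<Instance> action = definition.actions.get(name);
                if (action != null) {
                    action.accept(this);
                }
            }
            state = COMPLETED;
        }
    }

    public static void main(String[] args) {
        Definition definition = new Definition("myProgramaticProcess")
                .addNode("start", null)
                .addNode("script", i -> i.setVariable("message", "Executing the Action!!"))
                .addNode("end", null);

        Instance instance = new Instance(definition);
        instance.start();
        System.out.println(instance.visited + " " + instance.getVariable("message"));
    }
}
```

The design point carried over from the article is that the definition holds no runtime data: two instances built from the same Definition would each keep their own variables, visited nodes, and state.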
Depending on the execution of the path and how different nodes connect to each other, other nodes will get triggered until they reach a safe state where the execution of the process is completed or awaiting external data. You can access the internal parameters that the process instance has through the getVariable and setVariable methods. They provide local information from the particular process instance scope. Summary In this article, we saw the basic components required to set up a BPM system. With these components in place, we are ready to explore, in more detail, the structure and working of a BPM system.


Merlyn Shelley

08 Sep 2023

11 min read

AI_Distilled #16: Baidu's Ernie Chatbot, OpenAI's ChatGPT in Education, Meta's FACET Dataset, FMOps or LLMOps, Qualcomm's AI Focus, InteRecAgent, Liquid Neural Networks


👋 Hello,

“Artificial intelligence is one of the most profound things we're working on as humanity. It is more profound than fire or electricity.” -Sundar Pichai, Google CEO

Pichai's AI-fire analogy signifies a transformative era; AI and ML will revolutionize education, medicine, and more, reshaping human progress. OpenAI has begun promoting the use of ChatGPT in education, which shouldn’t really come as a surprise as students the world over have been experimenting with the technology. Get ready to dive into the latest AI developments in this edition, AI_Distilled #16, including Baidu launching its Ernie chatbot following Chinese government approval, X's privacy policy revealing a plan to use public data for AI training, Meta releasing the FACET dataset to evaluate AI model fairness, Google’s new Multislice for scalable AI training on Cloud TPUs, and Qualcomm's focus on AI and auto amidst NVIDIA's chip dominance. Watch out also for our handpicked collection of fresh AI, GPT, and LLM-focused secret knowledge and tutorials from around the web covering Liquid Neural Networks, Serverless Machine Learning with Amazon Redshift ML, implementing effective guardrails for LLMs, navigating generative AI with FMOps and LLMOps, and using Microsoft’s new AI compiler quartet.

What do you think of this issue and our newsletter? Please consider taking the short survey to share your thoughts and you will get a free PDF of the eBook “The Applied Artificial Intelligence Workshop” upon completion.

Writer’s Credit: Special shout-out to Vidhu Jain for their valuable contribution to this week’s newsletter content!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

⚡ TechWave: AI/GPT News & Analysis

Meta Releases FACET Dataset to Evaluate AI Model Fairness: Meta has launched FACET (FAirness in Computer Vision EvaluaTion), a dataset designed to assess the fairness of AI models used for image and video classification, including identifying people.
Comprising 32,000 images with 50,000 labeled individuals, FACET includes demographic and physical attributes, allowing for deep evaluations of biases against various classes. Despite previous concerns about Meta's responsible AI practices, the company claims FACET is more comprehensive than previous bias benchmarks. However, concerns have been raised about the dataset's origins and the compensation of annotators. Meta has also released a web-based dataset explorer tool for FACET.

Baidu Launches Ernie Chatbot Following Chinese Government Approval: Chinese tech giant Baidu has unveiled its chatbot, Ernie Bot, after receiving government clearance, along with other AI firms. Ernie Bot is now accessible for download via app stores or Baidu's website. Similar to its rival, ChatGPT, users can engage Ernie Bot for queries, market analysis assistance, marketing slogan ideas, and document summaries. While it's accessible globally, registration requires a Chinese number, and the app is only in Chinese on US Android and iOS stores. Baidu has also introduced a plug-in market for Ernie Bot, which quickly garnered over 1 million users within 19 hours of launch. CEO Robin Li expressed plans for further AI-native apps aimed at exploring generative AI's core abilities.

Google Introduces Multislice for Scalable AI Training on Cloud TPUs: Google has unveiled Multislice, a comprehensive large-scale training technology that facilitates straightforward, cost-effective, and nearly linear scaling to tens of thousands of Cloud Tensor Processing Unit (TPU) chips. Traditionally, a training run was restricted to a single slice, which meant a maximum of 3072 TPU v4 chips could be used. With Multislice, training can span multiple slices across pods through data center networking, eliminating these limitations.
This innovation offers benefits such as efficient scaling for massive models, enhanced developer productivity, automatic compiler optimizations, and cost-efficiency. It promises to revolutionize AI infrastructure by enabling near-linear scaling for AI supercomputing.

OpenAI Promotes Use of ChatGPT in Education: OpenAI is encouraging educators to utilize ChatGPT in classrooms. The company showcased six educators, primarily at the university level, using ChatGPT for various purposes, such as role-playing in debates, aiding translation for English-as-a-second-language students, and fact-checking. Despite some schools banning ChatGPT due to concerns about academic integrity, OpenAI believes it can be a valuable tool in education. However, it emphasizes the importance of maintaining human oversight in the assessment process.

X's Privacy Policy Reveals Plan to Use Public Data for AI Training: In an update to its privacy policy, X (formerly Twitter) has informed users that it will now collect biometric data, job histories, and education backgrounds. However, another section of the policy reveals a broader plan: X intends to utilize the data it gathers, along with publicly available information, to train its machine learning and AI models. This revelation has attracted attention, particularly due to the connection with X owner Elon Musk's ambitions in the AI market through his company xAI. Musk confirmed the privacy policy change, emphasizing that only public data, not private messages, would be used for AI training.

Qualcomm's Focus on AI and Auto Amidst NVIDIA’s Chip Dominance: As NVIDIA takes the lead as the world's largest fabless chip company, Qualcomm is strategically positioning itself in the AI realm. The company has unveiled in-vehicle generative AI capabilities, expanded into two-wheelers, and forged a partnership with Amazon Web Services.
Qualcomm's CEO, Cristiano Amon, believes that generative AI, currently reliant on cloud resources, will transition to local execution, enhancing performance and cost-efficiency. Diversification is also a priority, with Qualcomm's chips powering various smart devices, especially in the automotive sector. Amid uncertainty about its future relationship with Apple, Qualcomm aims to maintain its dominance through innovations in AI and auto tech. InteRecAgent, A Fusion of Language Models and Recommender Systems Introduced: Researchers from the University of Science and Technology of China, in collaboration with Microsoft Research Asia, have introduced InteRecAgent, a cutting-edge framework. This innovation seeks to combine the interactive capabilities of LLMs with the domain-specific precision of traditional recommender systems. Recommender systems play a vital role in various digital domains, but they often struggle with versatile interactions. On the other hand, LLMs excel in conversations but lack domain-specific knowledge. InteRecAgent introduces the "Candidate Memory Bus" to streamline recommendations for LLMs and a "Plan-first Execution with Dynamic Demonstrations" strategy for effective tool interaction. adidas Utilizes AI and NVIDIA RTX for Photorealistic 3D Content: Sportswear giant adidas is partnering with Covision Media, an Italian startup, to revolutionize their online shopping experience. Covision employs AI and NVIDIA RTX technology to develop 3D scanners that allow businesses to create digital twins of their products with stunning realism. This technology can quickly generate 3D scans, capturing textures, colors, and geometry, resulting in lifelike images. adidas is among the first to adopt this technology for automating and scaling e-commerce content production, enhancing their Virtual Try-On feature and replacing traditional product photography with computer-generated content. 
🔮 Expert Insights from Packt Community

Serverless Machine Learning with Amazon Redshift ML - By Debu Panda, Phil Bates, Bhanu Pittampally, Sumeet Joshi

Data analysts and developers use Redshift data with machine learning (ML) models for tasks such as predicting customer behavior. Amazon Redshift ML streamlines this process using familiar SQL commands. A conundrum arises when attempting to decipher these data silos – a formidable challenge that hampers the derivation of meaningful insights essential for organizational clarity. Adding to this complexity, security and performance considerations typically prevent business analysts from accessing data within OLTP systems. The hiccup is that intricate analytical queries weigh down OLTP databases, casting a shadow over their core operations. Here, the solution is the data warehouse, which is a central hub of curated data, used by business analysts and data scientists to make informed decisions by employing the business intelligence and machine learning tools at their disposal. These users make use of Structured Query Language (SQL) to derive insights from this data trove. Here’s where Amazon Redshift Serverless comes in. It’s a key option within Amazon Redshift, a well-managed cloud data warehouse offered by Amazon Web Services (AWS). With cloud-based ease, Amazon Redshift Serverless lets you set up your data storage without infrastructure hassles or cost worries. You pay based on what you use for compute and storage. Amazon Redshift Serverless goes beyond convenience, propelling modern data applications that seamlessly connect to the data lake.

The above content is extracted from the book Serverless Machine Learning with Amazon Redshift ML written by Debu Panda, Phil Bates, Bhanu Pittampally, Sumeet Joshi and published in Aug 2023.
🌟 Secret Knowledge: AI/LLM Resources

Understanding Liquid Neural Networks: A Primer on AI Advancements: In this post, you'll learn how liquid neural networks are transforming the AI landscape. These networks, inspired by the human brain, offer a unique and creative approach to problem-solving. They excel in complex tasks such as weather prediction, stock market analysis, and speech recognition. Unlike traditional neural networks, liquid neural networks require significantly fewer neurons, making them ideal for resource-constrained environments like autonomous vehicles. These networks excel in handling continuous data streams but may not be suitable for static data. They also provide better causality handling and interpretability.

Navigating Generative AI with FMOps and LLMOps: A Practical Guide: In this informative post, you'll gain valuable insights into the world of generative AI and its operationalization using FMOps and LLMOps principles. The authors delve into the challenges businesses face when integrating generative AI into their operations. You'll explore the fundamental differences between traditional MLOps and these emerging concepts. The post outlines the roles various teams play in this process, from data engineers to data scientists, ML engineers, and product owners. The guide provides a roadmap for businesses looking to embrace generative AI.

AI Compiler Quartet: A Breakdown of Cutting-Edge Technologies: Explore Microsoft’s groundbreaking "heavy-metal quartet" of AI compilers: Rammer, Roller, Welder, and Grinder. These compilers address the evolving challenges posed by AI models and hardware. Rammer focuses on optimizing deep neural network (DNN) computations, improving hardware parallel utilization. Roller tackles the challenge of memory partitioning and optimization, enabling faster compilation with good computation efficiency.
Welder optimizes memory access, particularly vital as AI models become more memory-intensive. Grinder addresses complex control flow execution in AI computation. These AI compilers collectively offer innovative solutions for parallelism, compilation efficiency, memory, and control flow, shaping the future of AI model optimization and compilation.

💡 MasterClass: AI/LLM Tutorials

Exploring IoT Data Simulation with ChatGPT and MQTTX: In this comprehensive guide, you'll learn how to harness the power of AI, specifically ChatGPT, and the MQTT client tool, MQTTX, to simulate and generate authentic IoT data streams. Discover why simulating IoT data is crucial for system verification, customer experience enhancement, performance assessment, and rapid prototype design. The article dives into the integration of ChatGPT and MQTTX to streamline data testing. Follow the step-by-step guide to create simulation scripts with ChatGPT and efficiently simulate data transmission with MQTTX.

Revolutionizing Real-time Inference: SageMaker Unveils Streaming Support for Generative AI: Amazon SageMaker now offers real-time response streaming, transforming generative AI applications. This new feature enables continuous response streaming to clients, reducing time-to-first-byte and enhancing interactive experiences for chatbots, virtual assistants, and music generators. The post guides you through building a streaming web application using SageMaker real-time endpoints for interactive chat use cases. It showcases deployment options with AWS Large Model Inference (LMI) and Hugging Face Text Generation Inference (TGI) containers, providing a seamless, engaging conversation experience for users.

Implementing Effective Guardrails for Large Language Models: Guardrails are crucial for maintaining trust in LLM applications as they ensure compliance with defined principles.
This guide presents two open-source tools for implementing LLM guardrails: Guardrails AI and NVIDIA NeMo-Guardrails. Guardrails AI offers Python-based validation of LLM responses, using the RAIL specification. It enables developers to define output criteria and corrective actions, with step-by-step instructions for implementation. NVIDIA NeMo-Guardrails introduces Colang, a modeling language for flexible conversational workflows. The guide explains its syntax elements and event-driven design. Comparing the two, Guardrails AI suits simple tasks, while NeMo-Guardrails excels in defining advanced conversational guidelines.

🚀 HackHub: Trending AI Tools

cabralpinto/modular-diffusion: Python library for crafting and training personalized Diffusion Models with PyTorch.
cofactoryai/textbase: Simplified Python chatbot development using NLP and ML with Textbase's on_message function in main.py.
microsoft/BatteryML: Open-source ML tool for battery analysis, aiding researchers in understanding electrochemical processes and predicting battery degradation.
facebookresearch/co-tracker: Swift transformer-based video tracker with Optical Flow, pixel-level tracking, grid sampling, and manual point selection.
explodinggradients/ragas: Framework evaluates Retrieval Augmented Generation pipelines, enhancing LLM context with external data using research-based tools.


Packt

02 Aug 2012

8 min read

OData on Mobile Devices


With the continuous evolution of mobile operating systems, smart mobile devices (such as smartphones or tablets) play increasingly important roles in everyone's daily work and life. The iOS (from Apple Inc., for iPhone, iPad, and iPod Touch devices), Android (from Google) and Windows Phone 7 (from Microsoft) operating systems have shown us the great power and potential of modern mobile systems. In the early days of the Internet, web access was mostly limited to fixed-line devices. However, with the rapid development of wireless network technology (such as 3G), Internet access has become a common feature for mobile or portable devices. Modern mobile OSes, such as iOS, Android, and Windows Phone have all provided rich APIs for network access (especially Internet-based web access). For example, it is quite convenient for mobile developers to create a native iPhone program that uses a network API to access remote RSS feeds from the Internet and present the retrieved data items on the phone screen. And to make Internet-based data access and communication more convenient and standardized, we often leverage some existing protocols, such as XML or JSON, to help us. Thus, it is also a good idea if we can incorporate OData services in mobile application development so as to concentrate our effort on the main application logic instead of the details about underlying data exchange and manipulation. In this article, we will discuss several cases of building OData client applications for various kinds of mobile device platforms. The first four recipes will focus on how to deal with OData in applications running on Microsoft Windows Phone 7. And they will be followed by two recipes that discuss consuming an OData service in mobile applications running on the iOS and Android platforms. 
Although this book is .NET developer-oriented, since iOS and Android are the most popular and dominating mobile OSes in the market, I think the last two recipes here would still be helpful (especially when the OData service is built upon WCF Data Service on the server side). Accessing OData service with OData WP7 client library What is the best way to consume an OData service in a Windows Phone 7 application? The answer is, by using the OData client library for Windows Phone 7 (OData WP7 client library). Just like the WCF Data Service client library for standard .NET Framework based applications, the OData WP7 client library allows developers to communicate with OData services via strong-typed proxy and entity classes in Windows Phone 7 applications. Also, the latest Windows Phone SDK 7.1 has included the OData WP7 client library and the associated developer tools in it. In this recipe, we will demonstrate how to use the OData WP7 client library in a standard Windows Phone 7 application. Getting ready The sample WP7 application we will build here provides a simple UI for users to view and edit the Categories data by using the Northwind OData service. The application consists of two phone screens, shown in the following screenshot: Make sure you have installed Windows Phone SDK 7.1 (which contains the OData WP7 client library and tools) on the development machine. You can get the SDK from the following website: http://create.msdn.com/en-us/home/getting_started The source code for this recipe can be found in the ch05ODataWP7ClientLibrarySln directory. How to do it... Create a new ASP.NET web application that contains the Northwind OData service. Add a new Windows Phone Application project in the same solution (see the following screenshot). Select Windows Phone OS 7.1 as the Target Windows Phone OS Version in the New Windows Phone Application dialog box (see the following screenshot). Click on the OK button, to finish the WP7 project creation. 
The following screenshot shows the default WP7 project structure created by Visual Studio:

Create a new Windows Phone Portrait Page (see the following screenshot) and name it EditCategory.xaml.

Create the OData client proxy (against the Northwind OData service) by using the Visual Studio Add Service Reference wizard.

Add the XAML content for the MainPage.xaml page (see the following XAML fragment).

    <Grid x:Name="ContentPanel" Grid.Row="1" Margin="12,0,12,0">
        <ListBox x:Name="lstCategories" ItemsSource="{Binding}">
            <ListBox.ItemTemplate>
                <DataTemplate>
                    <Grid>
                        <Grid.ColumnDefinitions>
                            <ColumnDefinition Width="60" />
                            <ColumnDefinition Width="260" />
                            <ColumnDefinition Width="140" />
                        </Grid.ColumnDefinitions>
                        <TextBlock Grid.Column="0" Text="{Binding Path=CategoryID}" FontSize="36" Margin="5"/>
                        <TextBlock Grid.Column="1" Text="{Binding Path=CategoryName}" FontSize="36" Margin="5" TextWrapping="Wrap"/>
                        <HyperlinkButton Grid.Column="2" Content="Edit" HorizontalAlignment="Right" NavigateUri="{Binding Path=CategoryID, StringFormat='/EditCategory.xaml?ID={0}'}" FontSize="36" Margin="5"/>
                    </Grid>
                </DataTemplate>
            </ListBox.ItemTemplate>
        </ListBox>
    </Grid>

Add the code for loading the Category list in the code-behind file of the MainPage.xaml page (see the following code snippet).

    public partial class MainPage : PhoneApplicationPage
    {
        ODataSvc.NorthwindEntities _ctx = null;
        // Category is the entity class generated by the Add Service Reference wizard.
        DataServiceCollection<ODataSvc.Category> _categories = null;
        ......

        private void PhoneApplicationPage_Loaded(object sender, RoutedEventArgs e)
        {
            Uri svcUri = new Uri("http://localhost:9188/NorthwindOData.svc");
            _ctx = new ODataSvc.NorthwindEntities(svcUri);
            _categories = new DataServiceCollection<ODataSvc.Category>(_ctx);
            _categories.LoadCompleted += (o, args) =>
            {
                // Keep requesting pages while the service reports a continuation.
                if (_categories.Continuation != null)
                    _categories.LoadNextPartialSetAsync();
                else
                {
                    // All pages are loaded; bind the list on the UI thread.
                    this.Dispatcher.BeginInvoke(() =>
                    {
                        ContentPanel.DataContext = _categories;
                        ContentPanel.UpdateLayout();
                    });
                }
            };
            var query = from c in _ctx.Categories select c;
            _categories.LoadAsync(query);
        }
    }

Add the XAML content for the EditCategory.xaml page (see the following XAML fragment).

    <Grid x:Name="ContentPanel" Grid.Row="1" Margin="12,0,12,0">
        <StackPanel>
            <TextBlock Text="{Binding Path=CategoryID, StringFormat='Fields of Categories({0})'}" FontSize="40" Margin="5" />
            <Border>
                <StackPanel>
                    <TextBlock Text="Category Name:" FontSize="24" Margin="10" />
                    <TextBox x:Name="txtCategoryName" Text="{Binding Path=CategoryName, Mode=TwoWay}" />
                    <TextBlock Text="Description:" FontSize="24" Margin="10" />
                    <TextBox x:Name="txtDescription" Text="{Binding Path=Description, Mode=TwoWay}" />
                </StackPanel>
            </Border>
            <StackPanel Orientation="Horizontal" HorizontalAlignment="Center">
                <Button x:Name="btnUpdate" Content="Update" HorizontalAlignment="Center" Click="btnUpdate_Click" />
                <Button x:Name="btnCancel" Content="Cancel" HorizontalAlignment="Center" Click="btnCancel_Click" />
            </StackPanel>
        </StackPanel>
    </Grid>

Add the code for editing the selected Category item in the code-behind file of the EditCategory.xaml page. In the PhoneApplicationPage_Loaded event, we will load the properties of the selected Category item and display them on the screen (see the following code snippet).

    private void PhoneApplicationPage_Loaded(object sender, RoutedEventArgs e)
    {
        EnableControls(false);
        Uri svcUri = new Uri("http://localhost:9188/NorthwindOData.svc");
        _ctx = new ODataSvc.NorthwindEntities(svcUri);
        // The ID query string parameter is set by the Edit link on the main page.
        var id = int.Parse(NavigationContext.QueryString["ID"]);
        var query = _ctx.Categories.Where(c => c.CategoryID == id);
        _categories = new DataServiceCollection<ODataSvc.Category>(_ctx);
        _categories.LoadCompleted += (o, args) =>
        {
            if (_categories.Count <= 0)
            {
                MessageBox.Show("Failed to retrieve Category item.");
                NavigationService.GoBack();
            }
            else
            {
                EnableControls(true);
                ContentPanel.DataContext = _categories[0];
                ContentPanel.UpdateLayout();
            }
        };
        _categories.LoadAsync(query);
    }

The code for updating changes (against the Category item) is put in the Click event of the Update button (see the following code snippet).

    private void btnUpdate_Click(object sender, RoutedEventArgs e)
    {
        EnableControls(false);
        _ctx.UpdateObject(_categories[0]);
        _ctx.BeginSaveChanges(
            (ar) =>
            {
                // Complete the save on the UI thread so that we can navigate or show a message.
                this.Dispatcher.BeginInvoke(() =>
                {
                    try
                    {
                        var response = _ctx.EndSaveChanges(ar);
                        NavigationService.Navigate(new Uri("/MainPage.xaml", UriKind.Relative));
                    }
                    catch (Exception ex)
                    {
                        MessageBox.Show("Failed to save changes.");
                        EnableControls(true);
                    }
                });
            },
            null
        );
    }

Select the WP7 project and launch it in Windows Phone Emulator (see the following screenshot). Depending on the performance of the development machine, it might take a while to start the emulator. Running a WP7 application in Windows Phone Emulator is very helpful, especially when the phone application needs to access some web services (such as WCF Data Service) hosted on the local machine (via the Visual Studio test web server).

How it works...

Since the OData WP7 client library (and tools) has been installed together with Windows Phone SDK 7.1, we can directly use the Visual Studio Add Service Reference wizard to generate the OData client proxy in Windows Phone applications. And the generated OData proxy is the same as what we used in standard .NET applications.
Similarly, all network access code (such as the OData service consumption code in this recipe) has to follow the asynchronous programming pattern in Windows Phone applications.

There's more...

In this recipe, we use the Windows Phone Emulator for testing. If you want to deploy and test your Windows Phone application on a real device, you need to obtain a Windows Phone developer account so as to unlock your Windows Phone device. Refer to the walkthrough App Hub - windows phone developer registration walkthrough, available at http://go.microsoft.com/fwlink/?LinkID=202697
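The MainPage code-behind in this recipe keeps loading partial result sets until the service stops reporting a continuation. As a language-neutral illustration of that paging loop, here is a minimal Python sketch. It is not the OData WP7 client library: load_all, fake_fetch, FAKE_PAGES, and the "results"/"next" keys are hypothetical names chosen for this example, and the pages are served from an in-memory dictionary instead of a real service.

```python
# Hypothetical in-memory stand-in for a paged OData feed.
# Each page holds some results and a link to the next page (or None).
FAKE_PAGES = {
    "/Categories": {
        "results": ["Beverages", "Condiments"],
        "next": "/Categories?$skiptoken=2",
    },
    "/Categories?$skiptoken=2": {
        "results": ["Confections"],
        "next": None,
    },
}


def fake_fetch(url):
    """Return the page stored for this URL (assumption: no network access)."""
    return FAKE_PAGES[url]


def load_all(fetch_page, first_url):
    """Follow continuation links until the service reports no further page."""
    items = []
    url = first_url
    while url is not None:
        page = fetch_page(url)
        items.extend(page["results"])
        # Same decision as checking Continuation in the LoadCompleted handler:
        # a next link means another partial set must be requested.
        url = page.get("next")
    return items


categories = load_all(fake_fetch, "/Categories")
print(categories)
```

The sketch runs synchronously for brevity; on Windows Phone the same loop is spread across LoadCompleted callbacks because network calls must be asynchronous.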


Robi Sen

16 Apr 2015

4 min read

Text Mining with R: Part 2


In Part 1, we covered the basics of doing text mining in R by selecting data, preparing it, cleaning it, and then performing various operations on it to visualize that data. In this post we look at a simple use case showing how we can derive real meaning and value from a visualization by seeing how a simple word cloud can help you understand the impact of an advertisement.

Building the document matrix

A common technique in text mining is using a matrix of documents and terms called a document term matrix. A document term matrix is simply a matrix where columns are terms, rows are documents, and each cell holds the number of times that term occurs in that document. If you reverse the order and have terms as rows and documents as columns, it’s called a term document matrix. For example, let’s say we have two documents, D1 and D2:

    D1 = "I like cats"
    D2 = "I hate cats"

Then the document term matrix would look like:

         I   like   hate   cats
    D1   1   1      0      1
    D2   1   0      1      1

For our project, to make a document term matrix in R all you need to do is use DocumentTermMatrix() like this:

    tdm <- DocumentTermMatrix(mycorpus)

You can see information on your document term matrix by using print like:

    print(tdm)
    <<DocumentTermMatrix (documents: 4688, terms: 18363)>>
    Non-/sparse entries: 44400/86041344
    Sparsity           : 100%
    Maximal term length: 65
    Weighting          : term frequency (tf)

Next we need to sum up all the values in each term column so that we can derive the frequency of each term's occurrence. We also want to sort those values from highest to lowest. You can use this code:

    m <- as.matrix(tdm)
    v <- sort(colSums(m), decreasing=TRUE)

Next we will use names() to pull each term's name, which in our case is a word. Then we want to build a dataframe that associates our words with their frequency of occurrence. Finally we want to create our word cloud but remove any terms that occur fewer than 45 times to reduce clutter in our word cloud.
You could also use max.words to limit the total number of words in your word cloud. So your final code should look like this:

    words <- names(v)
    d <- data.frame(word=words, freq=v)
    wordcloud(d$word, d$freq, min.freq=45)

If you run this in RStudio you should see something like the figure, which shows the words with the highest occurrence in our corpus. The wordcloud function automatically scales the drawn words by their frequency value. From here you can do a lot with your word cloud, including changing the scale, associating color with various values, and much more. You can read more about wordcloud here.

While word clouds are often used on the web for things like blogs, news sites, and other similar use cases, they have real value for data analysis beyond just visual indicators for users to find terms of interest. For example, if you look at the word cloud we generated you will notice that one of the most popular terms mentioned in tweets is chocolate. Doing a short inspection of our CSV document for the term chocolate, we find a lot of people mentioning the word in a variety of contexts, but one of the most common is in relation to a specific Super Bowl ad. For example, here is a tweet:

    Alexalabesky 41673.39 Chocolate chips and peanut butter 0 0 0 Unknown Unknown Unknown Unknown Unknown

This appeared after the airing of this advertisement from Butterfinger. So even with this simple R code we can extract real meaning from social media, in this case the measurable impact of an advertisement during the Super Bowl.

Summary

In this post we looked at a simple use case showing how we can derive real meaning and value from a visualization by seeing how a simple word cloud can help you understand the impact of an advertisement.
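To make the document term matrix idea from this post concrete outside of R, here is a minimal Python sketch that builds the matrix for the two example documents and then mirrors the sum, sort, and minimum-frequency steps. It is an illustration only, not the tm or wordcloud implementation: document_term_matrix and term_frequencies are hypothetical helper names, tokens are lowercased and split on whitespace, and columns appear in order of first appearance, so the column order differs from the table shown earlier.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (terms, rows): one row of term counts per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Vocabulary in order of first appearance across all documents.
    terms = []
    for tokens in tokenized:
        for tok in tokens:
            if tok not in terms:
                terms.append(tok)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] for t in terms])
    return terms, rows


def term_frequencies(terms, rows, min_freq=1):
    """Sum each term column, drop rare terms, and sort from highest to lowest."""
    totals = [sum(col) for col in zip(*rows)]
    pairs = [(t, n) for t, n in zip(terms, totals) if n >= min_freq]
    return sorted(pairs, key=lambda p: p[1], reverse=True)


terms, rows = document_term_matrix(["I like cats", "I hate cats"])
freqs = term_frequencies(terms, rows, min_freq=2)
print(terms)
print(rows)
print(freqs)
```

The min_freq argument plays the same role as min.freq=45 in the wordcloud call: terms below the threshold are left out before anything is drawn.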
About the author Robi Sen, CSO at Department 13, is an experienced inventor, serial entrepreneur, and futurist whose dynamic twenty-plus year career in technology, engineering, and research has led him to work on cutting edge projects for DARPA, TSWG, SOCOM, RRTO, NASA, DOE, and the DOD. Robi also has extensive experience in the commercial space, including the co-creation of several successful start-up companies. He has worked with companies such as UnderArmour, Sony, CISCO, IBM, and many others to help build out new products and services. Robi specializes in bringing his unique vision and thought process to difficult and complex problems allowing companies and organizations to find innovative solutions that they can rapidly operationalize or go to market with.


Jean Jung

19 Jan 2017

7 min read

Background jobs on Django with Celery


When building web applications, you usually need to run some operations in the background, either to improve the application's performance or because a job really needs to run outside of the application environment. In both cases, if you are on Django, you are in good hands because you have Celery, the Distributed Task Queue written in Python. Celery is a tiny but complete project. You can find more information on the project page. In this post, we will see how easy it is to integrate Celery with an existing project, and although we are focusing on Django here, creating a standalone Celery worker is a very similar process.

Installing Celery

The first step is to install Celery. If you already have it, please move on to the next section. Like every good Python package, Celery can be installed with pip. You can install it just by entering:

    pip install celery

Choosing a message broker

The second step is choosing a message broker to act as the job queue. Celery can talk with a great variety of brokers; the main ones are:

    RabbitMQ
    Redis ¹
    Amazon SQS ²

Check for support on other brokers here. If you’re already using any of these brokers for other purposes, choose it as your primary option. In this section there is nothing more you have to do. Celery is very transparent and does not require any source modification to move from one broker to another, so feel free to try more than one after we finish here. OK, let’s move on, but first do not forget to look at the little notes below.

¹: For Redis (a great choice in my opinion), you have to install the celery[redis] package.
²: Celery has great features like web monitoring that do not work with this broker.

Celery worker entrypoint

When running Celery in a directory, it will search for a file called celery.py, which is the application entrypoint, where the configs are loaded and the application object resides.
Working with Django, this file is commonly stored in the project directory, along with the settings.py file; your file structure should look like this:

    your_project_name
        your_project_name
            __init__.py
            settings.py
            urls.py
            wsgi.py
            celery.py
        your_app_name
            __init__.py
            models.py
            views.py
            ….

The settings read by that file will be in the same settings.py file that Django uses. At this point we can take a look at the celery.py file example from the official documentation. This code is basically the same for every project; just replace proj with your project name and save that file. Each part is described in the file comments.

    from __future__ import absolute_import, unicode_literals
    import os
    from celery import Celery

    # set the default Django settings module for the 'celery' program.
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'proj.settings')

    app = Celery('proj')

    # Using a string here means the worker doesn't have to serialize
    # the configuration object to child processes.
    # - namespace='CELERY' means all celery-related configuration keys
    #   should have a `CELERY_` prefix.
    app.config_from_object('django.conf:settings', namespace='CELERY')

    # Load task modules from all registered Django app configs.
    # This is not required, but as you can have more than one app
    # with tasks it's better to do the autoload than declaring all tasks
    # in this same file.
    app.autodiscover_tasks()

Settings

By default, Celery depends only on the broker_url setting to work. As we’ve seen in the previous section, your settings will be stored alongside the Django ones but with the 'CELERY_' prefix. The broker_url format is as follows:

    CELERY_BROKER_URL = 'broker://[[user]:[password]@]host[:port[/resource]]'

Here, broker is an identifier that specifies the chosen broker, like amqp or redis; user and password are the credentials for the service, if needed; host and port are the address of the service; and resource is a broker-specific path to the component resource.
For example, if you’ve chosen a local Redis as your broker, your broker URL will be:

    CELERY_BROKER_URL = 'redis://localhost:6379/0' ¹

¹: Assuming a default Redis installation with database 0 being used.

With this in place, we have a functioning Celery worker. How lucky! It’s so simple! But wait, what about the tasks? How do we write and execute them? Let’s see.

Creating and running tasks

Because of the superpowers Celery has, it can autoload tasks from Django app directories as we’ve seen before; you just have to declare your app tasks in a file called tasks.py in the app directory:

    your_project_name
        your_project_name
            __init__.py
            settings.py
            urls.py
            wsgi.py
            celery.py
        your_app_name
            __init__.py
            models.py
            views.py
            tasks.py
            ….

In that file you just need to put functions decorated with the celery.shared_task decorator. So suppose we want to write a background mailer; the source will look like this:

    from __future__ import absolute_import, unicode_literals
    from celery import shared_task
    from django.core.mail import send_mail

    @shared_task
    def mailer(subject, message, recipient_list, from_email='default@admin.com'):
        # `from` is a reserved word in Python, so the sender argument is named from_email.
        # send_mail expects (subject, message, from_email, recipient_list).
        send_mail(subject, message, from_email, recipient_list)

Then in the Django application, anywhere you have to send an e-mail in the background, just do the following:

    from __future__ import absolute_import
    from your_app_name.tasks import mailer
    ….

    def send_email_to_user(request):
        # Anonymous users have no e-mail address, so only queue the job for logged-in users.
        if request.user.is_authenticated:
            mailer.delay('Alert Foo', 'The foo message', [request.user.email])

delay is probably the most used way to submit a job to a Celery worker, but it is not the only one. Check this reference to see what is possible. There are many features like task chaining, scheduling tasks for the future, and more! As you may have noticed, in the great majority of the files we have used the from __future__ import absolute_import statement. This is very important, mainly with Python 2, because of the way Celery serializes messages to post tasks on brokers.
You need to follow the same convention when creating and using tasks, as otherwise the namespace of the task will differ and the task will not get executed. The absolute_import statement forces you to use absolute imports, so you will avoid these problems. Check this link for more information.

Running the worker

If you take the source code above, put everything in the right place, and run the Django development server to test your background jobs, they will not work! Wait. This is because you don’t have a Celery worker started yet. To start it, cd to the project's main directory (the same one from which you run python manage.py runserver, for example) and run:

    celery -A your_project_name worker -l info

Replace your_project_name with your project and info with the desired log level. Keep this process running, start the Django server, and yes. Now you can see that everything works!

Where to go now?

Explore the Celery documentation and see all the available features, caveats, and help you can get from it. There is also an example project on the Celery GitHub page that you can use as a template for new projects or as a guide to add Celery to your existing project.

Summary

We’ve seen how to install and configure Celery to run alongside a new or existing Django project. We explored some of the broker options we have, and how simple it is to change between them. There are some hints about brokers that don’t offer all of the features Celery has. We have seen an example of a mailer task, and how it was created and called from the Django application. Finally, I provided instructions to start the worker to get things done.

About the author

Jean Jung is a Brazilian developer passionate about technology. He is currently a system analyst at EBANX, an international payment processing company for Latin America.
He's very interested in Python and artificial intelligence, specifically machine learning, compilers and operating systems. As a hobby, he's always looking for IoT projects with Arduino.
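Returning to the broker URL format described in the Settings section, the following minimal sketch splits a URL into the pieces named there (broker, user, password, host, port, resource) using only the Python standard library. It illustrates the format and is not how Celery itself parses the setting; describe_broker_url is a hypothetical helper, and the amqp URL with its credentials, host name, and vhost is made-up sample data.

```python
from urllib.parse import urlsplit


def describe_broker_url(url):
    """Split broker://[[user]:[password]@]host[:port[/resource]] into its parts."""
    parts = urlsplit(url)
    return {
        "broker": parts.scheme,        # e.g. redis or amqp
        "user": parts.username,        # None when no credentials are given
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,            # integer, or None when omitted
        "resource": parts.path.lstrip("/") or None,
    }


# The local Redis example from this post: database 0 on the default port.
redis_info = describe_broker_url("redis://localhost:6379/0")

# Hypothetical RabbitMQ URL with credentials and a virtual host.
amqp_info = describe_broker_url("amqp://guest:secret@rabbit.example.com:5672/myvhost")

print(redis_info)
print(amqp_info)
```

A helper like this can be handy for logging which broker a worker is about to connect to without printing the password.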


Packt

13 Apr 2016

18 min read

Cluster Computing Using Scala


In this article, Vytautas Jančauskas, the author of the book Scientific Computing with Scala, explains how to write software that runs on distributed computing clusters. We will learn the MPJ Express library here.

Very often when dealing with intense data processing tasks and simulations of physical phenomena, there comes a time when no matter how many CPU cores and how much memory your workstation has, it is not enough. At times like these, you will want to turn to supercomputing clusters for help. These distributed computing environments consist of many nodes (each node being a separate computer) connected into a computer network using specialized high-bandwidth, low-latency connections (or, if you are on a budget, standard Ethernet hardware is often enough). These computers usually utilize a network filesystem, allowing each node to see the same files. They communicate using messaging libraries, such as MPI. Your program will run on separate computers and utilize the message passing framework to exchange data via the computer network.

Using MPJ Express for distributed computing

MPJ Express is a message passing library for distributed computing. It works in programming languages that run on the Java Virtual Machine (JVM), so we can use it from Scala. It is similar in functionality and programming interface to MPI. If you know MPI, you will be able to use MPJ Express pretty much the same way. The differences specific to Scala are explained in this section. We will start with how to install it. For further reference, visit the MPJ Express website given here: http://mpj-express.org/

Setting up and running MPJ Express

The steps to set up and run MPJ Express are as follows:

First, download MPJ Express from the following link. The version at the time of this writing is 0.44. http://mpj-express.org/download.php

Unpack the archive and refer to the included README file for installation instructions.
Currently, you have to set MPJ_HOME to the folder you unpacked the archive to and add the bin folder in that archive to your path. For example, if you are a Linux user using bash as your shell, you can add the following two lines to your .bashrc file (the file is in your home directory at /home/yourusername/.bashrc):

    export MPJ_HOME=/home/yourusername/mpj
    export PATH=$MPJ_HOME/bin:$PATH

Here, mpj is the folder into which you extracted the archive downloaded from the MPJ Express website. If you are using a different system, you will have to do the equivalent for your system to use MPJ Express.

We will want to use MPJ Express with Scala Build Tool (SBT), which we used previously to build and run all of our programs. Create the following directory structure:

    scalacluster/
      lib/
      project/
        plugins.sbt
      build.sbt

I have chosen to name the project folder scalacluster here, but you can call it whatever you want. Copy the contents of the lib folder from the mpj directory to the lib folder of the project. The .jar files in the lib folder will then be accessible to your program. Finally, create empty build.sbt and plugins.sbt files.

Let's now write and run a simple "Hello, World!" program to test our setup:

    import mpi._

    object MPJTest {
      def main(args: Array[String]): Unit = {
        MPI.Init(args)
        val me: Int = MPI.COMM_WORLD.Rank()
        val size: Int = MPI.COMM_WORLD.Size()
        println("Hello, World, I'm <" + me + ">")
        MPI.Finalize()
      }
    }

This should be familiar to everyone who has ever used MPI. First, we import everything from the mpi package. Then, we initialize MPJ Express by calling MPI.Init; the arguments to MPJ Express are passed from the command-line arguments you enter when running the program. The MPI.COMM_WORLD.Rank() function returns the rank of the MPJ process. A rank is a unique identifier used to distinguish processes from one another. Ranks are used when you want different processes to do different things.
A common pattern is to use the process with rank 0 as the master process and the processes with other ranks as workers. Then, you can use the rank of the process to decide what action to take in the program. We also determine how many MPJ processes were launched by checking MPI.COMM_WORLD.Size. Our program will simply print the rank of each process for now.

We will want to run it. If you don't have a distributed computing cluster readily available, don't worry. You can test your programs locally on your desktop or laptop. The same program will work without changes on clusters as well. To run programs written using MPJ Express, you have to use the mpjrun.sh script. This script will be available to you if you have added the bin folder of the MPJ Express archive to your PATH as described in the section on installing MPJ Express. The mpjrun.sh script will set up the environment for your MPJ Express processes and start them.

The mpjrun.sh script takes a .jar file, so we need to create one. Unfortunately for us, this cannot easily be done using the sbt package command in the directory containing our program. This worked previously because we used the Scala runtime to execute our programs, whereas MPJ Express uses Java. The problem is that the .jar package created with sbt package does not include Scala's standard library. We need what is called a fat .jar, one that contains all the dependencies within itself. One way of generating it is to use a plugin for SBT called sbt-assembly. The website for this plugin is given here: https://github.com/sbt/sbt-assembly

There is a simple way of adding the plugin to our project. Remember the project/plugins.sbt file we created? All you need to do is add the following line to it (the line may be different for different versions of the plugin; consult the website):

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

Now, add the following to the build.sbt file you created:

    lazy val root = (project in file(".")).
      settings(
        name := "mpjtest",
        version := "1.0",
        scalaVersion := "2.11.7"
      )

Then, execute the sbt assembly command from the shell to build the .jar file. If you are using the preceding build.sbt file and the folder containing the program and build.sbt is /home/you/cluster, the file will be placed at:

    /home/you/cluster/target/scala-2.11/mpjtest-assembly-1.0.jar

Now, you can run the mpjtest-assembly-1.0.jar file as follows:

    $ mpjrun.sh -np 4 -jar target/scala-2.11/mpjtest-assembly-1.0.jar
    MPJ Express (0.44) is started in the multicore configuration
    Hello, World, I'm <0>
    Hello, World, I'm <2>
    Hello, World, I'm <3>
    Hello, World, I'm <1>

The -np argument specifies how many processes to run. Since we specified -np 4, four processes will be started by the script. The order of the "Hello, World" messages can differ on your system, since the precise order of execution of different processes is undetermined. If you got output similar to the one shown here, then congratulations, you have done the majority of the work needed to write and deploy applications using MPJ Express.

Using Send and Recv

MPJ Express processes can communicate using Send and Recv. These methods are arguably the simplest and easiest to understand mode of operation, but probably also the most error prone. We will look at these two first. The following are the signatures for the Send and Recv methods:

    public void Send(java.lang.Object buf, int offset, int count, Datatype datatype, int dest, int tag) throws MPIException
    public Status Recv(java.lang.Object buf, int offset, int count, Datatype datatype, int source, int tag) throws MPIException

Both of these calls are blocking. This means that after calling Send, your process will block (will not execute the instructions following it) until a corresponding Recv is called by another process. Likewise, Recv will block the process until a corresponding Send happens.
By corresponding, we mean that the dest and source arguments of the calls have the values corresponding to the receiver's and sender's ranks, respectively. The two calls will be enough to implement many complicated communication patterns. However, they are prone to various problems such as deadlocks. Also, they are quite difficult to debug, since you have to make sure that each Send has the correct corresponding Recv and vice versa. The parameters for Send and Recv are basically the same. The meanings of those parameters are summarized in the following table:

Argument | Type | Description
buf | java.lang.Object | It has to be a one-dimensional Java array. When using it from Scala, use a Scala array, which maps one-to-one to a Java array.
offset | int | The index in the array at which the data you want to pass starts.
count | int | The number of items of the array you want to pass.
datatype | Datatype | The type of data in the array. Can be one of the following: MPI.BYTE, MPI.CHAR, MPI.SHORT, MPI.BOOLEAN, MPI.INT, MPI.LONG, MPI.FLOAT, MPI.DOUBLE, MPI.OBJECT, MPI.LB, MPI.UB, and MPI.PACKED.
dest/source | int | Either the destination to send the message to or the source to get the message from. You use the rank of the process to identify sources and destinations.
tag | int | Used to tag the message. Can be used to introduce different message types. Can be ignored for most common applications.

Let's look at a simple program using these calls for communication. We will implement a simple master/worker communication pattern:

    import mpi._
    import scala.util.Random

    object MPJTest {
      def main(args: Array[String]): Unit = {
        MPI.Init(args)
        val me: Int = MPI.COMM_WORLD.Rank()
        val size: Int = MPI.COMM_WORLD.Size()
        if (me == 0) {

Here, we use an if statement to identify who we are based on our rank. Since each process gets a unique rank, this allows us to determine what action should be taken.
In our case, we assigned the role of the master to the process with rank 0 and the role of a worker to processes with other ranks:

          for (i <- 1 until size) {
            val buf = Array(Random.nextInt(100))
            MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, i, 0)
            println("MASTER: Dear <" + i + "> please do work on " + buf(0))
          }

We iterate over the workers, whose ranks run from 1 up to one less than the number of processes you passed to the mpjrun.sh script. Let's say that number is four. This gives us one master process and three worker processes. So, each process with a rank from 1 to 3 will get a randomly generated number. We have to put that number in an array even though it is a single number. This is because both the Send and Recv methods expect an array as their first argument. We then use the Send method to send the data. We specified the array as the buf argument, an offset of 0, a count of 1, the type MPI.INT, the for loop index as the destination, and 0 as the tag. This means that each of our three worker processes will receive a (most probably) different number:

          for (i <- 1 until size) {
            val buf = Array(0)
            MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, i, 0)
            println("MASTER: Dear <" + i + "> thanks for the reply, which was " + buf(0))
          }

Finally, we collect the results from the workers. For this, we iterate over the worker ranks and use the Recv method on each one of them. We print the result we got from the worker, and this concludes the master's part. We now move on to the workers:

        } else {
          val buf = Array(0)
          MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, 0, 0)
          println("<" + me + ">: " + "Understood, doing work on " + buf(0))
          buf(0) = buf(0) * buf(0)
          MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, 0, 0)
          println("<" + me + ">: " + "Reporting back")
        }

The workers' code is identical for all of them.
They receive a message from the master, calculate the square of it, and send it back:

        MPI.Finalize()
      }
    }

After you run the program, the results should be akin to the following, which I got when running this program on my system:

    MASTER: Dear <1> please do work on 71
    MASTER: Dear <2> please do work on 12
    MASTER: Dear <3> please do work on 55
    <1>: Understood, doing work on 71
    <1>: Reporting back
    MASTER: Dear <1> thanks for the reply, which was 5041
    <3>: Understood, doing work on 55
    <2>: Understood, doing work on 12
    <2>: Reporting back
    MASTER: Dear <2> thanks for the reply, which was 144
    MASTER: Dear <3> thanks for the reply, which was 3025
    <3>: Reporting back

Sending Scala objects in MPJ Express messages

Sometimes, the types provided by MPJ Express for use in the Send and Recv methods are not enough. You may want to send your MPJ Express processes a Scala object. A very realistic example of this would be to send an instance of a Scala case class. These can be used to construct more complicated data types consisting of several different basic types. A simple example is a two-dimensional vector consisting of x and y coordinates. This can be sent as a simple array, but more complicated classes can't. For example, you may want to use a case class such as the one shown here. It has two attributes of type String and one attribute of type Int. So what do we do with a data type like this? The simplest answer to that problem is to serialize it. Serializing converts an object to a stream of characters or a string that can be sent over the network (or stored in a file, among other things) and later deserialized to get the original object back:

    scala> case class Person(name: String, surname: String, age: Int)
    defined class Person

    scala> val a = Person("Name", "Surname", 25)
    a: Person = Person(Name,Surname,25)

A simple way of serializing is to use a format such as XML or JSON. This can be done automatically using a pickling library.
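Before turning to a library, it may help to see what serialization amounts to when written by hand. The following is only a minimal sketch under stated assumptions: the names ManualSerialization, serialize, and deserialize are hypothetical and belong neither to MPJ Express nor to any pickling library, and the '|' separator is assumed never to occur inside a name or surname.

```scala
// Illustrative only: hand-written serialization of a small case class.
// ManualSerialization, serialize and deserialize are hypothetical names,
// not part of MPJ Express or of any pickling library.
case class Person(name: String, surname: String, age: Int)

object ManualSerialization {
  // Assumption: '|' never occurs inside a name or surname.
  private val Sep = '|'

  // Flatten the object into one string, then into a char array, which is
  // the kind of one-dimensional array a message could carry as MPI.CHAR.
  def serialize(p: Person): Array[Char] =
    Seq(p.name, p.surname, p.age.toString).mkString(Sep.toString).toCharArray

  // Rebuild the object from the received characters.
  def deserialize(chars: Array[Char]): Person =
    new String(chars).split(Sep) match {
      case Array(name, surname, age) => Person(name, surname, age.toInt)
      case other =>
        throw new IllegalArgumentException("expected 3 fields, got " + other.length)
    }
}
```

Calling ManualSerialization.deserialize on the result of ManualSerialization.serialize gives back an equal Person. Every new class would need its own pair of functions like these, which is the repetitive work an automatic approach is meant to remove.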
Pickling is a term that comes from the Python programming language. It is the automatic conversion of an arbitrary object into a string representation that can later be converted back to get the original object. The reconstructed object will behave the same way as it did before conversion. This allows one to store arbitrary objects in files, for example. There is a pickling library available for Scala as well. You can of course do serialization in several different ways (for example, using the powerful support for XML available in Scala). We will use the pickling library that is available from the following website for this example: https://github.com/scala/pickling

You can install it by adding the following line to your build.sbt file:

    libraryDependencies += "org.scala-lang.modules" %% "scala-pickling" % "0.10.1"

After doing that, use the following import statements to enable easy pickling in your projects:

    scala> import scala.pickling.Defaults._
    import scala.pickling.Defaults._

    scala> import scala.pickling.json._
    import scala.pickling.json._

Here, you can see how you can then easily use this library to pickle and unpickle arbitrary objects without annoying boilerplate code:

    scala> val pklA = a.pickle
    pklA: pickling.json.pickleFormat.PickleType = JSONPickle({
      "$type": "Person",
      "name": "Name",
      "surname": "Surname",
      "age": 25
    })

    scala> val unpklA = pklA.unpickle[Person]
    unpklA: Person = Person(Name,Surname,25)

Let's see how this would work in an application using MPJ Express for message passing.
A program using pickling to send a case class instance in a message is given here:

    import mpi._
    import scala.pickling.Defaults._
    import scala.pickling.json._

    case class ArbitraryObject(a: Array[Double], b: Array[Int], c: String)

Here, we have chosen to define a fairly complex case class, consisting of two arrays of different types and a string:

    object MPJTest {
      def main(args: Array[String]): Unit = {
        MPI.Init(args)
        val me: Int = MPI.COMM_WORLD.Rank()
        val size: Int = MPI.COMM_WORLD.Size()
        if (me == 0) {
          val obj = ArbitraryObject(Array(1.0, 2.0, 3.0), Array(1, 2, 3), "Hello")
          val pkl = obj.pickle.value.toCharArray
          MPI.COMM_WORLD.Send(pkl, 0, pkl.size, MPI.CHAR, 1, 0)

In the preceding bit of code, we create an instance of our case class. We then pickle it to JSON and get the string representation of said JSON with the value method. However, to send it in an MPJ message, we need to convert it to a one-dimensional array of one of the supported types. Since it is a string, we convert it to a char array. This is done using the toCharArray method:

        } else if (me == 1) {
          val buf = new Array[Char](1000)
          MPI.COMM_WORLD.Recv(buf, 0, 1000, MPI.CHAR, 0, 0)
          val msg = buf.mkString
          val obj = msg.unpickle[ArbitraryObject]

On the receiving end, we get the raw char array, convert it back to a string using the mkString method, and then unpickle it using unpickle[T]. This returns an instance of the case class that we can use like any other instance of a case class. Functionally, it is the same object that was sent to us:

          println(msg)
          println(obj.c)
        }
        MPI.Finalize()
      }
    }

The following is the result of running the preceding program. It prints out the JSON representation of our object, and also shows that we can access the attributes of said object by printing the c attribute.
    MPJ Express (0.44) is started in the multicore configuration
    {
      "$type": "ArbitraryObject",
      "a": [ 1.0, 2.0, 3.0 ],
      "b": [ 1, 2, 3 ],
      "c": "Hello"
    }
    Hello

You can use this method to send arbitrary objects in an MPJ Express message. However, this is just one of many ways of doing this. As mentioned previously, another way is to use an XML representation. XML support is strong in Scala, and you can use it to serialize objects as well. This will usually require you to add some boilerplate code to your program to serialize to XML. The method discussed earlier has the advantage of requiring no boilerplate code.

Non-blocking communication

So far, we examined only blocking (or synchronous) communication between two processes. This means that the process is blocked (its execution is halted) until the Send or Recv method has completed successfully. This is simple to understand and enough for most cases. The problem with synchronous communication is that you have to be very careful, otherwise deadlocks may occur. Deadlocks are situations in which processes wait for each other to release a resource first. The Mexican standoff and the dining philosophers problem are well-known illustrations of deadlock. The point is that if you are unlucky, you may end up with a program that is seemingly stuck and you don't know why. Using non-blocking communication allows you to avoid these problems most of the time. If you think you may be at risk of deadlocks, you will probably want to use it. The signatures for the primary methods used in asynchronous communication are given here:

    Request Isend(java.lang.Object buf, int offset, int count, Datatype datatype, int dest, int tag)

Isend works similarly to its Send counterpart. The main differences are that it does not block (the program continues execution after the call rather than waiting for a corresponding receive), and that it returns a Request object.
This object is used to check the status of your send request, block until it is complete if required, and so on:

    Request Irecv(java.lang.Object buf, int offset, int count, Datatype datatype, int src, int tag)

Irecv is again the same as Recv, except that it is non-blocking and returns a Request object used to handle your receive request. The operation of these methods can be seen in action in the following example:

    import mpi._

    object MPJTest {
      def main(args: Array[String]): Unit = {
        MPI.Init(args)
        val me: Int = MPI.COMM_WORLD.Rank()
        val size: Int = MPI.COMM_WORLD.Size()
        if (me == 0) {
          val requests = for (i <- 0 until 10) yield {
            val buf = Array(i * i)
            MPI.COMM_WORLD.Isend(buf, 0, 1, MPI.INT, 1, 0)
          }
        } else if (me == 1) {
          for (i <- 0 until 10) {
            Thread.sleep(1000)
            val buf = Array[Int](0)
            val request = MPI.COMM_WORLD.Irecv(buf, 0, 1, MPI.INT, 0, 0)
            request.Wait()
            println("RECEIVED: " + buf(0))
          }
        }
        MPI.Finalize()
      }
    }

This is a very simplistic example used simply to demonstrate the basics of using the asynchronous message passing methods. First, the process with rank 0 will send 10 messages to the process with rank 1 using Isend. Since Isend does not block, the loop will finish quickly and the messages it sent will be buffered until they are retrieved using Irecv. The second process (the one with rank 1) will wait for one second before retrieving each message. This is to demonstrate the asynchronous nature of these methods. The messages are in the buffer waiting to be retrieved. Therefore, Irecv can be used at your leisure when convenient. The Wait() method of the Request object that Irecv returns has to be used to retrieve results. The Wait() method blocks until the message is successfully received from the buffer.

Summary

Extremely computationally intensive programs are usually parallelized and run on supercomputing clusters. These clusters consist of multiple networked computers. Communication between these computers is usually done using messaging libraries such as MPI.
These allow you to pass data between processes running on different machines in an efficient manner. In this article, you have learned how to use MPJ Express, an MPI-like library for the JVM. We saw how to carry out process-to-process communication, both blocking and non-blocking. The most important MPJ Express primitives were covered, and example programs using them were given.

Further resources on this subject:

- Differences in style between Java and Scala code [article]
- Getting Started with JavaFX [article]
- Integrating Scala, Groovy, and Flex Development with Apache Maven [article]


Packt

09 May 2017

23 min read

Active Directory Domain Services 2016


Packt

29 Jul 2011

3 min read

Integrating phpList 2 with WordPress


Prerequisites for this WordPress tutorial

For this tutorial, we'll make the following assumptions:

- We already have a working instance of WordPress (version 3.x)
- Our phpList site is accessible through HTTP / HTTPS from our WordPress site

Installing and configuring the phpList Integration plugin

Download the latest version of Jesse Heap's phpList Integration plugin from http://wordpress.org/extend/plugins/phplist-form-integration/, unpack it, and upload the contents to your wp-content/plugins/ directory in WordPress. Activate the plugin from within your WordPress dashboard. Then, under the Settings menu, click on the new PHPlist link to configure the plugin.

General Settings

Under the General Settings heading, enter the URL to your phpList installation, as well as an admin username/password combination. Enter the ID and name of at least one list that you want to allow your WordPress users to subscribe to.

Why does the plugin require my admin login and password? The admin login and password are used to bypass the confirmation e-mail that would normally be sent to a subscriber. Effectively, the plugin "logs into" phpList as the administrator and then subscribes the user, bypassing confirmation. If you don't want to bypass confirmation e-mails, then you don't need to enter your username and password.

Form Settings

The plugin will work with this section unmodified. However, let's imagine that we also want to capture the subscriber's name. We already have an attribute in phpList called first name, so change the first field label to First Name and the Text Field ID to first name (the same as our phpList attribute name).

Adding a phpList Integration page

The plugin replaces a special HTML comment with the generated phpList form. Let's say we wanted our phpList form to show up at http://ourblog.com/signup.
Create a new WordPress page called Signup, add the content you want to be displayed, and then click on the HTML tab to edit the HTML source. You will see the HTML source of your page displayed. Insert the plugin's HTML comment where you want the form to be displayed and save the page.

HTML comments

The "<!-- -->" syntax designates an HTML comment, which is not displayed when the HTML is processed by the browser / viewer. This means that you won't see your comment when you view your page in Visual mode.

Once the page has been updated, click on the View page link to display the page in WordPress. The subscribe form will be inserted in the page at the location where you added the comment.

Adding a phpList Integration widget

Instead of a dedicated page to sign up new subscribers, you may want to use a sidebar widget, so that the subscription options can show up on multiple pages on your WordPress site. To add the phpList integration widget, go to your WordPress site's Appearance option and go to the Widgets page. Drag the PHPList Integration widget to your preferred widget location (these vary depending on your theme). You can change the Title of the widget before you click on Close to finish. Now that you've added the PHPList Integration widget to the widget area, your sign up form will be displayed on all WordPress pages that include that widget area.

Further resources on this subject:

- Integrating phpList 2 with Drupal
- phpList 2 E-mail Campaign Manager: Personalizing E-mail Body
- Tcl: Handling Email
- Email, Languages, and JFile with Joomla!


Packt

10 Feb 2015

36 min read

Hive in Hadoop


In this article, Garry Turkington and Gabriele Modena, the authors of the book Learning Hadoop 2, explain how MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. It does require a different mindset and some training and experience in breaking processing analytics into a series of map and reduce steps. There are several products built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This article will explore the other most common abstraction implemented atop Hadoop: SQL.

In this article, we will cover the following topics:

- What the use cases for SQL on Hadoop are and why it is so popular
- HiveQL, the SQL dialect introduced by Apache Hive
- Using HiveQL to perform SQL-like analysis of the Twitter dataset
- How HiveQL can approximate common features of relational databases such as joins and views

Why SQL on Hadoop

So far we have seen how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL. Back in 2008, Facebook released Hive, the first widely used implementation of SQL on Hadoop. Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user.
This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who's familiar with SQL can use Hive. As a result, HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored on HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of User Defined Functions, enabling the base SQL dialect to be customized with business-specific functionality.

Other SQL-on-Hadoop solutions

Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. There are others, but we will mostly discuss Hive and Impala as they have been the most successful. While introducing the core features and capabilities of SQL on Hadoop, however, we will give examples using Hive: even though Hive and Impala share many SQL features, they also have numerous differences, and we don't want to constantly have to caveat each new feature with exactly how it is supported in Hive compared to Impala. We'll generally be looking at aspects of the feature set that are common to both, but if you use both products, it's important to read the latest release notes to understand the differences.

Prerequisites

Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this article. We'll create a modified version of a former Pig script as the main functionality for this. The script in this article assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS.
The full source code is at https://github.com/learninghadoop2/book-examples/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows:

    -- load JSON data
    tweets = load '$inputDir' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

    -- Tweets
    tweets_tsv = foreach tweets {
      generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str',
        (chararray)$0#'text' as text,
        (chararray)$0#'in_reply_to',
        (boolean)$0#'retweeted' as is_retweeted,
        (chararray)$0#'user'#'id_str' as user_id,
        (chararray)$0#'place'#'id' as place_id;
    }
    store tweets_tsv into '$outputDir/tweets' using PigStorage('\u0001');

    -- Places
    needed_fields = foreach tweets {
      generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str' as id_str,
        $0#'place' as place;
    }
    place_fields = foreach needed_fields {
      generate
        (chararray)place#'id' as place_id,
        (chararray)place#'country_code' as co,
        (chararray)place#'country' as country,
        (chararray)place#'name' as place_name,
        (chararray)place#'full_name' as place_full_name,
        (chararray)place#'place_type' as place_type;
    }
    filtered_places = filter place_fields by co != '';
    unique_places = distinct filtered_places;
    store unique_places into '$outputDir/places' using PigStorage('\u0001');

    -- Users
    users = foreach tweets {
      generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str' as id_str,
        $0#'user' as user;
    }
    user_fields = foreach users {
      generate
        (chararray)CustomFormatToISO(user#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)user#'id_str' as user_id,
        (chararray)user#'location' as user_location,
        (chararray)user#'name' as user_name,
        (chararray)user#'description' as user_description,
        (int)user#'followers_count' as followers_count,
        (int)user#'friends_count' as friends_count,
        (int)user#'favourites_count' as favourites_count,
        (chararray)user#'screen_name' as screen_name,
        (int)user#'listed_count' as listed_count;
    }
    unique_users = distinct user_fields;
    store unique_users into '$outputDir/users' using PigStorage('\u0001');

Run this script as follows:

    $ pig -f extract_for_hive.pig -param inputDir=<json input> -param outputDir=<output path>

The preceding code writes data into three separate delimited text files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to the Unicode character U+0001, also known as Ctrl+A. This is often used as a separator in Hive tables and will be particularly useful to us, as our tweet data could contain tabs in other fields.

Overview of Hive

We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. In this example, and in the remainder of the article, we will assume that queries are typed into the shell that can be invoked by executing the hive command. Recently, a client called Beeline also became available and will likely be the preferred CLI client in the near future. When importing any new data into Hive, there is generally a three-stage process:

1. Create the specification of the table into which the data is to be imported
2. Import the data into the created table
3. Execute HiveQL queries against the table

Most of the HiveQL statements are direct analogues to similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this article, but if you need a refresher, there are numerous good online learning resources. Hive gives a structured query view of our data, and to enable that, we must first define the specification of the table's columns and import the data into the table before we can execute any queries.
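The reason for choosing the Ctrl+A separator in the Pig script above can be illustrated with a small sketch. It is written in Scala rather than Pig or HiveQL, it is not part of the book's code, and the object name DelimiterSketch and the sample record are made up for illustration. It shows how a tab inside a text field produces an extra field when tab is the separator, while a Ctrl+A separator leaves the fields intact.

```scala
// Illustrative sketch: why a Ctrl+A field separator is safer than a tab
// when a field such as the tweet text may itself contain tabs.
// DelimiterSketch and the sample record are hypothetical.
object DelimiterSketch {
  val CtrlA: Char = '\u0001'

  // A made-up record: tweet id, tweet text containing a tab, user id.
  val record: Seq[String] = Seq("123", "hello\tworld", "42")

  // Join the fields into one line using the given separator.
  def write(fields: Seq[String], sep: Char): String =
    fields.mkString(sep.toString)

  // Split a line back into fields on the given separator.
  def read(line: String, sep: Char): Array[String] =
    line.split(sep)
}
```

Writing and reading the sample record with a tab separator yields four fields instead of three, because the tab inside the text is mistaken for a field boundary. With DelimiterSketch.CtrlA the three original fields come back unchanged.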
A table specification is generated using a CREATE statement that specifies the table name, the names and types of its columns, and some metadata about how the table is stored:

    CREATE table tweets (
      created_at string,
      tweet_id string,
      text string,
      in_reply_to string,
      retweeted boolean,
      user_id string,
      place_id string
    ) ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\u0001'
    STORED AS TEXTFILE;

The statement creates a new table tweets defined by a list of names for columns in the dataset and their data types. We specify that fields are delimited by the Unicode U+0001 character and that the format used to store data is TEXTFILE. Data can be imported from the HDFS location tweets/ into Hive using the LOAD DATA statement:

    LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets;

By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later. Once data has been imported into Hive, we can run queries against it. For instance:

    SELECT COUNT(*) FROM tweets;

The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase. If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer.
The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times.

The nature of Hive tables

Although Hive moves the data file into its warehouse directory, it does not actually process the input data into rows at that point. Neither the CREATE TABLE nor the LOAD DATA statement truly creates concrete table data as such; instead, they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table but actually residing on HDFS. Even though the HiveQL statements refer to a specific table structure, it is Hive's responsibility to generate code that correctly maps this to the actual on-disk format in which the data files are stored. This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises.

Hive architecture

Until version 2, Hadoop was primarily a batch system. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later.
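One way to observe this compilation step is Hive's EXPLAIN statement, which prints the plan Hive would generate for a query without running it. The following is a minimal sketch that assumes the tweets table created earlier; the stages and operators it reports depend on the Hive version and the execution engine in use.

```sql
-- Show the plan Hive generates for the count query; the query itself is not run.
EXPLAIN
SELECT COUNT(*) FROM tweets;

-- EXPLAIN EXTENDED prints the same plan with additional detail.
EXPLAIN EXTENDED
SELECT COUNT(*) FROM tweets;
```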
Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster, either to native MapReduce in Hadoop 1 or to the MapReduce Application Master running on YARN in Hadoop 2. Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password.

HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2. HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2.

Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively. In the examples we saw before and in the remainder of this article, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with Beeline. Because Beeline is relatively new, both tools are available on Cloudera and most other major distributions for compatibility and maturity reasons. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded mode with the following command:

$ beeline -u jdbc:hive2://

Data types

HiveQL supports many of the common data types provided by standard database systems.
These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. We can group Hive data types into the following five broad categories:

- Numeric: tinyint, smallint, int, bigint, float, double, and decimal
- Date and time: timestamp and date
- String: string, varchar, and char
- Collections: array, map, struct, and uniontype
- Misc: boolean, binary, and NULL

DDL statements

HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. SHOW [DATABASES, TABLES, VIEWS] displays the databases currently available within a data warehouse and which table and view metadata is present within the database currently in use:

CREATE DATABASE twitter; SHOW DATABASES; USE twitter; SHOW TABLES;

The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception. Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports Unicode characters in column names.
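To make the data types and the CREATE TABLE options above more concrete, here is a small sketch. The user_profile table and its columns are hypothetical and are not part of the dataset used in this article; because no ROW FORMAT clause is given, the table relies on Hive's default delimiters.

```sql
-- Hypothetical table mixing primitive and collection types.
-- IF NOT EXISTS skips creation, without an error, if the table already exists.
CREATE TABLE IF NOT EXISTS user_profile (
  user_id   string,
  hashtags  array<string>,
  counters  map<string, bigint>,
  place     struct<country_code:string, full_name:string>
)
STORED AS TEXTFILE;

-- Collection elements are addressed by index, by key, or by field name.
SELECT user_id,
       hashtags[0],
       counters['followers'],
       place.country_code
FROM user_profile
LIMIT 10;
```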
Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally. The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code:

CREATE EXTERNAL TABLE tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' STORED AS TEXTFILE LOCATION '${input}/tweets';

This table will be created in the metastore, but the data will not be copied into the /user/hive/warehouse directory. Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse.

The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we can create a view to isolate retweets from other messages, as follows:

CREATE VIEW retweets COMMENT 'Tweets that have been retweeted' AS SELECT * FROM tweets WHERE retweeted = true;

Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views.

The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected.

Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment or to add and replace columns. When adding columns, it is important to remember that only metadata will be changed and not the dataset itself.
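As a sketch of the ALTER TABLE forms just mentioned, the statements below rename a column and append a new one to the tweets table. The new names (in_reply_to_id and lang) are hypothetical and chosen only for illustration.

```sql
-- Rename a column; the type must be restated even when it does not change.
ALTER TABLE tweets CHANGE COLUMN in_reply_to in_reply_to_id string
  COMMENT 'Renamed for illustration';

-- Append a new column at the end of the column list.
ALTER TABLE tweets ADD COLUMNS (lang string COMMENT 'Hypothetical column');

-- Inspect the resulting column list.
DESCRIBE tweets;
```

Neither statement touches the files already stored on HDFS.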
This means that if we were to add a column in the middle of the table which didn't exist in older files, then while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format. Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view.

File formats and storage

The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH. Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause.

Hive currently uses the following FileFormat classes to read and write HDFS files:

- TextInputFormat and HiveIgnoreKeyTextOutputFormat: These read/write data in plain text file format
- SequenceFileInputFormat and SequenceFileOutputFormat: These read/write data in the Hadoop SequenceFile format

Additionally, the following SerDe classes can be used to serialize and deserialize data:

- MetadataTypedColumnsetSerDe: This will read/write delimited records such as CSV or tab-separated records
- ThriftSerDe and DynamicSerDe: These will read/write Thrift objects

JSON

As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe JSON SerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules.
We can use either module to load JSON tweets without any need for preprocessing and just define a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde. As with any third-party module, we load the SerDe JAR into Hive with the following code:

ADD JAR json-serde-1.3-jar-with-dependencies.jar;

Then, we issue the usual create statement, as follows:

CREATE EXTERNAL TABLE tweets ( contributors string, coordinates struct < coordinates: array <float>, type: string>, created_at string, entities struct < hashtags: array <struct < indices: array <tinyint>, text: string>>, … ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE LOCATION 'tweets';

With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'.

Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary.

In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen name and description fields of a user object with the following code:

SELECT user.screen_name, user.description FROM tweets_json LIMIT 10;

Avro

AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format.
Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose. This dataset was created using Pig's AvroStorage class, which generated the following schema: { "type":"record", "name":"record", "fields": [ {"name":"topic","type":["null","int"]}, {"name":"source","type":["null","int"]}, {"name":"rank","type":["null","float"]} ] } The table structure is captured in an Avro record, which contains header information (a name and optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string. For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and this is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use the following schema. 
With this definition, we can now create a Hive table that uses this schema for its table specification, as follows: CREATE EXTERNAL TABLE tweets_pagerank ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type":"record", "name":"record", "fields": [ {"name":"topic","type":["null","int"]}, {"name":"source","type":["null","int"]}, {"name":"rank","type":["null","float"]} ] }') STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '${data}/ch5-pagerank'; Then, look at the following table definition from within Hive (note also that HCatalog): DESCRIBE tweets_pagerank; OK topic int from deserializer source int from deserializer rank float from deserializer In the DDL, we told Hive that data is stored in Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro schema embedded in avro.schema.literal. Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. Create the preceding schema in a file called pagerank.avsc—this is the standard file extension for Avro schemas. Then place it on HDFS; we prefer to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc'). If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. Within Hive, on the Cloudera CDH5 VM: ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar; We can also use this table like any other. 
For instance, we can query the data to select the user and topic pairs with a high PageRank:

SELECT source, topic FROM tweets_pagerank WHERE rank >= 0.9;

Columnar stores

Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats. If a table is defined with very many columns, it is not unusual for any given query to process only a small subset of these columns. But even in a SequenceFile, each full row and all its columns will be read from disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest. Traditional relational databases also store data on a row basis; columnar databases changed this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns will be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries.

Queries

Tables can be queried using the familiar SELECT … FROM statement. The WHERE clause allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records.
For instance, the following code returns the top 10 most prolific users in the dataset:

SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt DESC LIMIT 10;

The following are the top 10 most prolific users in the dataset:

NULL 7091
1332188053 4
959468857 3
1367752118 3
362562944 3
58646041 3
2375296688 3
1468188529 3
37114209 3
2385040940 3

We can improve the readability of the Hive output by setting the following:

SET hive.cli.print.header=true;

This will instruct hive, though not beeline, to print column names as part of the output. You can add the command to the .hiverc file usually found in the root of the executing user's home directory to have it apply to all hive CLI sessions.

HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into Hive using external tables. We first create a user table to store user data, as follows:

CREATE EXTERNAL TABLE user ( created_at string, user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' STORED AS TEXTFILE LOCATION '${input}/users';

We then create a place table to store location data, as follows:

CREATE EXTERNAL TABLE place ( place_id string, country_code string, country string, `name` string, full_name string, place_type string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' STORED AS TEXTFILE LOCATION '${input}/places';

We can use the JOIN operator to display the names of the 10 most prolific users, as follows:

SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt FROM tweets JOIN user ON user.user_id = tweets.user_id GROUP BY tweets.user_id, user.user_id, user.name ORDER BY cnt DESC LIMIT 10;

Only equality joins, outer joins, and left semi joins are supported in Hive.
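The other supported join types can be sketched in the same way. The examples below assume the tweets, user, and place tables defined above; they are illustrative and not part of the original walkthrough.

```sql
-- LEFT OUTER JOIN: keep every tweet, with NULL where no place matches.
SELECT t.tweet_id, p.full_name
FROM tweets t
LEFT OUTER JOIN place p ON t.place_id = p.place_id
LIMIT 10;

-- LEFT SEMI JOIN: users that have at least one tweet. Columns of the
-- right-hand table (tweets) may only be referenced in the ON clause.
SELECT u.user_id, u.name
FROM user u
LEFT SEMI JOIN tweets t ON u.user_id = t.user_id
LIMIT 10;
```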
Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets table. We can rewrite the previous query as follows:

SELECT tweets.user_id, u.name, COUNT(*) AS cnt FROM tweets JOIN (SELECT user_id, name FROM user GROUP BY user_id, name) u ON u.user_id = tweets.user_id GROUP BY tweets.user_id, u.name ORDER BY cnt DESC LIMIT 10;

Instead of directly joining the user table, we execute a subquery, as follows:

SELECT user_id, name FROM user GROUP BY user_id, name;

The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries, historically only permitting a subquery in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause also.

HiveQL is an ever-evolving rich language, a full exposition of which is beyond the scope of this article. A description of its query and DDL capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

Structuring Hive tables for given workloads

Often Hive isn't used in isolation; instead, tables are created with particular workloads in mind, or Hive needs to be invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios.

Partitioning a table

With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning. When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS. It's important to understand the cardinality of the partition column.
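A quick way to gauge that cardinality before committing to a partition key is to count the distinct values the candidate key would take. This sketch assumes the user table from earlier and a date derived from created_at with Hive's to_date function; whether to_date can parse the stored created_at strings depends on their format.

```sql
-- Number of partitions a date-based key would produce.
SELECT COUNT(DISTINCT to_date(created_at)) AS num_partitions
FROM user;

-- Rows per candidate partition, largest first, to spot skew.
SELECT to_date(created_at) AS created_at_date, COUNT(*) AS cnt
FROM user
GROUP BY to_date(created_at)
ORDER BY cnt DESC
LIMIT 20;
```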
With too few distinct values, the benefits are reduced as the files are still very large. If there are too many values, then queries might need a large number of files to be scanned to access all the required data. Perhaps the most common partition key is one based on date. We could, for example, partition our user table from earlier based on the created_at column, that is, the date the user was first registered. Note that since partitioning a table by definition affects its file structure, we create this table now as a non-external one, as follows:

CREATE TABLE partitioned_user ( created_at string, user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) PARTITIONED BY (created_at_date string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' STORED AS TEXTFILE;

To load data into a partition, we can explicitly give a value for the partition into which to insert the data, as follows:

INSERT INTO TABLE partitioned_user PARTITION (created_at_date = '2014-01-01') SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count FROM user;

This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables:

SET hive.exec.dynamic.partition = true; SET hive.exec.dynamic.partition.mode = nonstrict; SET hive.exec.max.dynamic.partitions.pernode=5000;

The first two statements enable all partitions (nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node.
We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row:

INSERT INTO TABLE partitioned_user PARTITION (created_at_date) SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) AS created_at_date FROM user;

Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause. Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table. In the preceding code we use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string.

Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/key=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-01.

If data is added directly to the filesystem, for instance by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition:

ALTER TABLE <table_name> ADD PARTITION (<key>='<value>') LOCATION '<location>';

To add metadata for all partitions not currently present in the metastore, we can use the MSCK REPAIR TABLE <table_name>; statement. On EMR, this is equivalent to executing the following statement:

ALTER TABLE <table_name> RECOVER PARTITIONS;

Notice that both statements will also work with EXTERNAL tables.

Overwriting and updating data

Partitioning is also useful when we need to update a portion of a table.
Normally a statement of the following form will replace all the data for the destination table:

INSERT OVERWRITE TABLE <table>…

If OVERWRITE is omitted (that is, INSERT INTO is used), then each INSERT statement will add additional data to the table. Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched. If we perform an INSERT OVERWRITE statement (or a LOAD DATA statement with the OVERWRITE keyword) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement. We can also add caveats to the SELECT statement. Say, for example, we only wanted to update data for a certain month:

INSERT OVERWRITE TABLE partitioned_user PARTITION (created_at_date) SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) AS created_at_date FROM user WHERE to_date(created_at) BETWEEN '2014-03-01' AND '2014-03-31';

Bucketing and sorting

Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure.
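Before moving on to bucketing, the partition-level behavior described above can be sketched as follows. The example assumes the partitioned_user table created earlier; the date value is hypothetical.

```sql
-- Replace the contents of one partition; other partitions are left untouched.
INSERT OVERWRITE TABLE partitioned_user
PARTITION (created_at_date = '2014-03-15')
SELECT created_at, user_id, `location`, name, description,
       followers_count, friends_count, favourites_count,
       screen_name, listed_count
FROM user
WHERE to_date(created_at) = '2014-03-15';

-- List the partitions known to the metastore.
SHOW PARTITIONS partitioned_user;

-- Filtering on the partition column lets Hive read only that partition's files.
SELECT COUNT(*)
FROM partitioned_user
WHERE created_at_date = '2014-03-15';
```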
Let's create bucketed versions of our tweets and user tables; note the additional CLUSTERED BY and SORTED BY clauses in the following CREATE TABLE statements:

CREATE TABLE bucketed_tweets ( tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) PARTITIONED BY (created_at string) CLUSTERED BY (user_id) INTO 64 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' STORED AS TEXTFILE;

CREATE TABLE bucketed_user ( user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) PARTITIONED BY (created_at string) CLUSTERED BY (user_id) SORTED BY (name) INTO 64 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' STORED AS TEXTFILE;

Note that we changed the tweets table to also be partitioned; when a table is both partitioned and bucketed, the buckets are created within each partition.

Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table:

SET hive.enforce.bucketing=true;

Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT…SELECT… syntax to populate the bucketed table. When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause.

One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example.
So, for example, any query of the following form would be vastly improved:

SET hive.optimize.bucketmapjoin=true; SELECT … FROM bucketed_user u JOIN bucketed_tweets t ON u.user_id = t.user_id;

With the join being performed on the column used to bucket the table, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id values in both tables. When determining which rows to match, only those in the corresponding bucket need to be compared, not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second.

Sampling data

Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size.

For a non-bucketed table, you can sample using a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table:

SELECT MAX(friends_count) FROM user TABLESAMPLE(BUCKET 2 OUT OF 64 ON name);

In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column.
It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function. Though successful, this is highly inefficient as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure the number of buckets sampled is equal to, or a multiple or factor of, the number of buckets in the table, then Hive will only read the buckets in question. For example:

SELECT MAX(friends_count) FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);

In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition.

A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as a percentage of the table, as an absolute data size, or as a number of rows (in each block). The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data, or 100 rows per split, respectively:

TABLESAMPLE(0.5 PERCENT) TABLESAMPLE(1G) TABLESAMPLE(100 ROWS)

If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input format and file formats that are supported.

Writing scripts

We can place Hive commands in a file and run them with the -f option in the hive CLI utility:

$ cat show_tables.hql
show tables;
$ hive -f show_tables.hql

We can parameterize HiveQL statements by means of the hiveconf mechanism. This allows us to refer to a variable by name at the point it is used and to supply its value at the point of invocation.
For example:

$ cat show_tables2.hql
show tables like '${hiveconf:TABLENAME}';
$ hive -hiveconf TABLENAME=user -f show_tables2.hql

The variable can also be set within the Hive script or an interactive session:

SET TABLENAME=user;

The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows:

$ cat show_tables3.hql
show tables like '${hivevar:TABLENAME}';
$ hive -hivevar TABLENAME=user -f show_tables3.hql

Or we can set the variable interactively:

SET hivevar:TABLENAME=user;

Summary

In this article, we learned that in its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world. HiveQL is an implementation of SQL on Hadoop and was the primary focus of this article. In regard to HiveQL and its implementations, we covered the following topics:

- How HiveQL provides a logical model atop data stored in HDFS in contrast to relational databases where the table structure is enforced in advance
- How HiveQL offers the ability to extend its core set of operators with user-defined code and how this contrasts to the Pig UDF mechanism
- The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez



Packt

29 Jun 2011

8 min read

Manifest Assurance: Security and Android Permissions for Flash


How-To Tutorials

Packt

27 Nov 2014

6 min read

Dealing with Upstream Proxies


This article is written by Akash Mahajan, the author of Burp Suite Essentials.

We know that setting up Mozilla Firefox with the FoxyProxy Standard add-on to create a selective, pattern-based forwarding process allows us to ensure that only white-listed traffic from our browser reaches Burp. Burp also lets us set this through its own configuration options. Think of it like this: the less traffic that reaches Burp, the more certain we can be that Burp is dealing with legitimate traffic and that its filters are keeping us within our scope.

As a security professional testing web applications, you hear and read about scope everywhere. Many times, we are expected to test only parts of an application, and usually, the scope is limited by domain, subdomain, folder name, and even certain filenames. Burp gives us a nice, simple-to-use interface to add, edit, and remove targets from the scope.

Dealing with upstream proxies and SOCKS proxies

Sometimes, the application that we need to test lies inside a corporate network. The client gives access to a specific IP address that is white-listed in the corporate firewall. At other times, we work at the client location, but we are required to go through an internal proxy to reach the staging site for testing. In all such cases and more, we need to be able to add an additional proxy that Burp can send data to before it reaches our target.

In some cases, this proxy can be the one that the browser requires to reach the intranet or even the Internet. Since we would like to intercept all the browser traffic and Burp has become the proxy for the browser, we need to chain the proxies by configuring that same proxy in Burp.

Types of proxies supported by Burp

We can configure additional proxies by navigating to Options | Connections. If you look carefully, the upstream proxy rule editor looks like the FoxyProxy add-on proxy window.
That is not surprising, as both of them operate with URL patterns. We can add the target as the destination that requires a proxy to reach. Most standard proxies that support authentication are supported in Burp. Of these, NTLM flavors are regularly found in networks with Microsoft Active Directory infrastructure. The usage is straightforward: add the destination and the other details, which should be provided to you by the network administrators.

Working with SOCKS proxies

SOCKS proxies are another common form of proxy. The most popular SOCKS-based proxy is Tor, which allows your entire browser traffic, including DNS lookups, to occur at the proxy end. Since the SOCKS proxy protocol works by taking all the traffic through it, the destination server sees the IP address of the SOCKS proxy rather than ours. You can give this a whirl by running the Tor browser bundle: http://www.torproject.org/projects/torbrowser.html.en. Once the Tor browser bundle is running successfully, add the values shown in the following screenshot to the SOCKS proxy settings of Burp. Make sure you check Use SOCKS proxy after adding the correct values.

[Screenshot: SOCKS proxy settings in Burp]

Using SSH tunneling as a SOCKS proxy

Using SSH tunneling as a SOCKS proxy is quite useful when we want to give a firewall administrator a single IP address to white-list for access to an application. The scenario here requires you to have access to a GNU/Linux server with a static IP address, which you can connect to using Secure Shell (SSH). In a Mac OS X or GNU/Linux shell, the following command will start a local SOCKS proxy:

    ssh -D 12345 user@hostname.com

Once you are successfully logged in to your server, leave the session open so that Burp can keep using it. Now add localhost as the SOCKS proxy host and 12345 as the SOCKS proxy port, and you are good to go.

In Windows, if we use a command-line SSH client that comes with GNU tools, the process remains the same.
Otherwise, if you are a PuTTY fan, let's see how we can configure the same thing in it. In PuTTY, follow these steps to get the SSH tunnel working, which will be our SOCKS proxy:

1. Start PuTTY and click on SSH and then on Tunnels.
2. Here, add a new forwarded port. Give it the value 12345.
3. Under Destination, there is a bunch of radio buttons; choose Auto and Dynamic, and then click on the Add button.
4. Once this is set, connect to the server.
5. Add the values localhost and 12345 in the Host and Port fields, respectively, in the Burp options for the SOCKS proxy.

You can verify that your traffic is going through the SOCKS proxy by visiting any site that gives you your external IP address. I personally use my own web page for that, http://akashm.com/ip.php; you might want to try http://icanhazip.com or http://whatismyip.com.

Burp allows maximum connectivity with upstream and SOCKS proxies to make our job easier. By adding URL patterns, we can choose which upstream proxy is used for which destination. SOCKS proxies, due to their nature, take all the traffic and send it to another computer, so we can't choose which URLs to use them for. But this gives us a simple-to-use workflow for testing applications that are behind corporate firewalls and need to white-list our static IP before allowing access.

Setting up Burp to be a proxy server for other devices

So far, we have run Burp on our computer. This is good enough when we want to intercept the traffic of browsers running on our computer. But what if we would like to intercept traffic from our television or from our iOS or Android devices?

Currently, in the default configuration, Burp has started one listener on an internal (loopback) interface on port number 8080. We can start multiple listeners on different ports and interfaces. We can do this in the Options subtab under the Proxy tab. Note that this is different from the main Options tab.
We can add more than one proxy listener at the same time by following these steps:

1. Click on the Add button under Proxy Listeners.
2. Enter a port number. It can be the same 8080, but if that is confusing, you can use 8081.
3. Specify an interface and choose your LAN IP address.
4. Once you click on OK, check Running, and you have started an external listener for Burp.

You can add the LAN IP address and the port number you chose as the proxy server on your mobile device, and all HTTP traffic will get intercepted by Burp.

Summary

In this article, you learned how to use a SOCKS proxy server, particularly over an SSH tunnel. You also learned how simple it is to create multiple listeners for Burp, which allows other devices in the network to send their HTTP traffic to the Burp interception proxy.

Resources for Article:

Further resources on this subject:
Quick start – Using Burp Proxy [article]
Nginx proxy module [article]
Using Nginx as a Reverse Proxy [article]
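To see what pointing a device at the external listener amounts to, here is a minimal sketch of a client on the same LAN sending its HTTP requests through Burp. It uses only the Python standard library. The function names, and any address and port passed to them (for example 192.168.1.10 and 8081), are hypothetical placeholders for the LAN IP address and port chosen in the steps above.

```python
import urllib.request


def burp_proxy_map(host: str, port: int) -> dict:
    """Build the proxy mapping that points plain HTTP traffic at a Burp listener."""
    return {"http": f"http://{host}:{port}"}


def build_opener_via_burp(host: str, port: int) -> urllib.request.OpenerDirector:
    """Return a urllib opener whose HTTP requests are forwarded to the listener."""
    handler = urllib.request.ProxyHandler(burp_proxy_map(host, port))
    return urllib.request.build_opener(handler)


def fetch_via_burp(url: str, host: str, port: int) -> str:
    """Fetch a URL through the Burp listener and return the response body as text."""
    opener = build_opener_via_burp(host, port)
    with opener.open(url, timeout=10) as response:
        return response.read().decode()
```

Calling fetch_via_burp("http://icanhazip.com", "192.168.1.10", 8081) from another machine on the network should make the request appear in Burp's proxy history, assuming the listener is running on that address and port.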


