
Tech Guides - Big Data

50 Articles

The biggest Big Data & Business Intelligence salary and skills survey of 2015

Packt Publishing
03 Aug 2015
1 min read
See the highlights from our comprehensive Skill Up IT industry salary reports, with data from over 20,000 IT professionals. Find out what trends are emerging in the world of data science and business intelligence, and what skills you should be learning to further your career. Download the full-size infographic here.


What we learned from Oracle OpenWorld 2017

Amey Varangaonkar
06 Oct 2017
5 min read
“Amazon’s lead is over.” These words from Oracle CTO Larry Ellison at Oracle OpenWorld 2016 garnered a lot of attention, as Oracle promised their customers an extensive suite of cloud offerings and offered a closer look at their second-generation IaaS data centers. At the recently concluded OpenWorld 2017, Oracle continued their quest to take on AWS and the other major cloud vendors by unveiling a host of cloud-based products and services. Not just that, they have juiced these offerings up with Artificial Intelligence-based features, in line with all the buzz surrounding AI.

Key highlights from Oracle OpenWorld 2017

Autonomous Database

Oracle announced a fully automated, self-driving database that requires no human intervention for management or fine-tuning. Using machine learning and AI to eliminate human error, the new database guarantees 99.995% availability. Taking another shot at AWS, Ellison promised in his keynote that customers moving from Amazon's Redshift to Oracle's database can expect a 50% cost reduction. Likely to be named Oracle 18c, the new database is expected to ship worldwide by December 2017.

Oracle Blockchain Cloud Service

Oracle joined IBM in the race to dominate the Blockchain space by unveiling its new cloud-based Blockchain service. Built on top of the Hyperledger Fabric project, the service promises to transform the way business is done by offering secure, transparent and efficient transactions. Enterprise-critical features such as provisioning, monitoring, backup and recovery will also be offered as standard. “There are not a lot of production-ready capabilities around Blockchain for the enterprise. There [hasn’t been] a fully end-to-end, distributed and secure blockchain as a service,” said Amit Zavery, Senior VP at Oracle Cloud. It is worth remembering that Oracle joined the Hyperledger consortium just two months ago, and the signs that they would release their own service were already there.

Improvements to Business Management Services

The new features and enhancements introduced for the business management services were another key highlight of OpenWorld 2017. These features empower businesses to manage their customers better and to plan for the future with better organization of resources. Some important announcements in this area were:

- Adding AI capabilities to its cloud services: the Oracle Adaptive Intelligent Apps will now make use of AI capabilities to improve services for any kind of business
- Developers can now create their own AI-powered Oracle applications, making use of deep learning
- Oracle introduced AI-powered chatbots for better customer and employee engagement
- New features such as an enhanced user experience in the Oracle ERP cloud and improved recruiting in the HR cloud services were introduced

Key takeaways from Oracle OpenWorld 2017

With these announcements, Oracle have given a clear signal that they are to be taken seriously. They are already buoyed by a strong Q1 result, which saw revenue from their cloud platforms hit $1.5 billion, a growth of 51% compared to Q1 2016. Here are some key takeaways from OpenWorld 2017, underlined by the announcements above:

- Oracle undoubtedly see cloud as the future, and have placed a lot of focus on the performance of their cloud platform. They are betting that their familiarity with traditional enterprise workloads will help them win a lot more customers - something Amazon cannot claim.
- Oracle are riding the AI wave and are trying to make their products as autonomous as possible, to reduce human intervention and, to some extent, human error. With enterprises looking to cut costs wherever possible, this could be a smart move to attract more customers.
- The autonomous database will require Oracle to automatically fine-tune, patch, and upgrade its database without causing any downtime. It will be interesting to see if the database can live up to its promise of 99.995% availability.
- Is the role of Oracle DBAs at risk because of this automation? While it is doubtful that they will be out of jobs, there is bound to be a significant shift in their day-to-day operations. DBAs are likely to spend less time on traditional administration tasks such as fine-tuning, patching and upgrading, and instead focus on efficient database design, setting data policies and securing the data.
- Cybersecurity was a key theme in Ellison's keynote and at OpenWorld 2017 in general. As enterprise Blockchain adoption grows, so does the need for a secure, efficient digital transaction system. Oracle seem to have identified this opportunity, and it will be interesting to see how they compete with the likes of IBM and SAP for major market share.

Oracle's CEO Mark Hurd has predicted that Oracle can win the cloud wars, overcoming the likes of Amazon, Microsoft and Google. Judging by the announcements at OpenWorld 2017, it seems they may have a plan in place to actually pull it off.

You can watch highlights from Oracle OpenWorld 2017 on demand here. Don't forget to check out our highly popular book Oracle Business Intelligence Enterprise Edition 12c, your one-stop guide to building an effective Oracle BI 12c system.


Reducing Cost in Big Data using Statistics and In-memory Technology - Part 2

Praveen Rachabattuni
06 Jul 2015
6 min read
In the first part of this two-part blog series, we learned that using statistical algorithms gives us a 95 percent accuracy rate for big data analytics, is faster, and is a lot more beneficial than waiting for the exact results. We also took a look at a few algorithms along with a quick introduction to Spark. Now let's take an in-depth look at two tools that are used with statistical algorithms: Apache Spark and Apache Pig.

Apache Spark

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, and Python, as well as an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

At its core, Spark provides a general programming model that enables developers to write applications by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. This composition makes it easy to express a wide array of computations, including iterative machine learning, streaming, complex queries, and batch processing. In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory. This is the key to Spark's performance, as it allows applications to avoid costly disk accesses. It would be wonderful to have one tool for everyone, and one architecture and language for investigative as well as operational analytics.

Spark's ease of use comes from its general programming model, which does not constrain users to structure their applications into a bunch of map and reduce operations. Spark's parallel programs look very much like sequential programs, which makes them easier to develop and reason about. Finally, Spark allows users to easily combine batch, interactive, and streaming jobs in the same application. As a result, a Spark job can be up to 100 times faster than an equivalent Hadoop job, while requiring 2 to 10 times less code.

Spark allows users and applications to explicitly cache a dataset by calling the cache() operation. This means that your applications can access data from RAM instead of disk, which can dramatically improve the performance of iterative algorithms that access the same dataset repeatedly. This use case covers an important class of applications, as many machine learning and graph algorithms are iterative in nature.

When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you, so a scheduler tool such as Apache Oozie is often required to carefully construct the sequence. With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so that the system has a complete picture of the execution graph. This approach allows the core scheduler to correctly map the dependencies across different stages in the application and automatically parallelize the flow of operators without user intervention.

With a low-latency data analysis system at your disposal, it's natural to extend the engine towards processing live data streams. Spark has an API for working with streams, providing exactly-once semantics and full recovery of stateful operators. It also has the distinct advantage of giving you the same Spark APIs to process your streams, including reuse of your regular Spark application code.
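To make the caching behaviour concrete, here is a minimal PySpark sketch (my own illustration, not from the original article), assuming a local Spark installation and a hypothetical log path. Only the first action pays the cost of reading from storage; the second is served from memory.

```python
from pyspark.sql import SparkSession

# Minimal sketch, assuming a local Spark installation; the log path is hypothetical.
spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load raw log lines as an RDD and mark them for in-memory caching.
logs = spark.sparkContext.textFile("hdfs:///data/access-logs/*.log")
logs.cache()  # nothing happens yet - Spark evaluates lazily

# The first action triggers the read from storage and populates the cache.
error_count = logs.filter(lambda line: "ERROR" in line).count()

# A second action over the same RDD reuses the cached copy instead of rescanning disk.
distinct_ips = logs.map(lambda line: line.split(" ")[0]).distinct().count()

print(error_count, distinct_ips)
spark.stop()
```

Without the cache() call, each action would re-read the input from storage, which is exactly the repeated disk cost that iterative algorithms need to avoid.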
Pig on Spark

Pig on Spark combines the power and simplicity of Apache Pig with Apache Spark, making existing ETL pipelines 100 times faster than before. We do that via a unique mix of our operator toolkit, called DataDoctor, and Spark. The primary goals for the project are to:

- Make data processing more powerful
- Make data processing simpler
- Make data processing 100 times faster than before

DataDoctor is a high-level operator DSL on top of Spark. It has frameworks for non-symmetrical joins, sorting, grouping, and embedding native Spark functions. It hides a lot of complexity and makes it simple to implement the data operators used in applications like Pig and Apache Hive on Spark.

Pig operates in a similar manner to big data applications like Hive and Cascading. It has a query language quite akin to SQL that allows analysts and developers to design and write data flows. The query language is translated into a "logical plan", which is further translated into a "physical plan" containing operators. Those operators are then run on the designated execution engine (MapReduce, Apache Tez, and now Spark). There are a whole bunch of details around tracking progress, handling errors, and so on that I will skip here.

Query planning on Spark differs significantly from MapReduce, as Spark handles data wrangling in a much more optimized way. Further query planning can benefit greatly from the ongoing effort on Catalyst inside Spark. At this moment, we have simply introduced a SparkPlanner that undertakes the conversion from a logical to a physical plan for Pig. Databricks is working actively to enable Catalyst to handle much of the operator optimization that will plug into SparkPlanner in the near future. Longer term, we plan to rely on Spark itself for logical plan generation. An early version of this integration has been prototyped in partnership with Databricks.

Pig Core hands off Spark execution to SparkLauncher with the physical plan. SparkLauncher creates a SparkContext, providing all the Pig dependency JAR files and Pig itself. SparkLauncher gets an MR plan object created from the physical plan. At this point, we override all the Pig operators with DataDoctor operators recursively in the whole plan. Two iterations are performed over the plan: one that looks at the store operations and recursively travels down the execution tree, and a second that does a breadth-first traversal over the plan and calls convert on each of the operators (a purely hypothetical sketch of this conversion pass appears at the end of this article). The base class of converters in DataDoctor is a POConverter class, which defines the abstract method convert that is called during plan execution. More details on Pig on Spark can be found at PIG-4059.

As we merge this work with Apache Pig, we need to focus on the following enhancements to further improve the speed of Pig:

- Cache operator: adding a new operator to explicitly tell Spark to cache certain datasets for faster execution
- Storage hints: allowing the user to specify the storage location of datasets in Spark for better control of memory
- YARN and Mesos support: adding resource manager support for broader deployment

Conclusion

In many large-scale data applications, statistical perspectives provide us with fruitful analytics in many ways, including speed and efficiency.

About the author

Praveen Rachabattuni is a tech lead at Sigmoid Analytics, a company that provides a real-time streaming and ETL framework on Apache Spark. Praveen is also a committer to Apache Pig.
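As promised above, here is a purely hypothetical Python sketch of the second, breadth-first conversion pass described in this article. None of these class or method names are Pig's or DataDoctor's real APIs; they only mirror the pattern of a converter base class exposing an abstract convert method that is applied to each operator in the plan.

```python
# Purely hypothetical sketch of the breadth-first conversion pass described above.
# None of these names are real Pig or DataDoctor classes.
from abc import ABC, abstractmethod


class ConverterSketch(ABC):
    """Stand-in for a POConverter-style base class with an abstract convert()."""

    @abstractmethod
    def convert(self, operator, inputs):
        ...


class FilterConverter(ConverterSketch):
    def convert(self, operator, inputs):
        # In the real integration this would emit a Spark operation.
        return f"filter({operator['predicate']}) over {inputs}"


CONVERTERS = {"FILTER": FilterConverter()}


def convert_plan(plan, roots):
    """Breadth-first traversal that calls convert on each operator in the plan."""
    queue, converted = list(roots), {}
    while queue:
        op_id = queue.pop(0)
        op = plan[op_id]
        inputs = [converted.get(i) for i in op.get("inputs", [])]
        handler = CONVERTERS.get(op["type"])
        converted[op_id] = handler.convert(op, inputs) if handler else op["type"]
        queue.extend(op.get("successors", []))
    return converted


# Tiny made-up physical plan: LOAD -> FILTER -> STORE.
plan = {
    "load": {"type": "LOAD", "successors": ["filter"]},
    "filter": {"type": "FILTER", "predicate": "age > 30",
               "inputs": ["load"], "successors": ["store"]},
    "store": {"type": "STORE", "inputs": ["filter"], "successors": []},
}
print(convert_plan(plan, ["load"]))
```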


Reducing Cost in Big Data using Statistics and In-memory Technology - Part 1

Praveen Rachabattuni
03 Jul 2015
4 min read
The world is shifting from private, dedicated data centers to on-demand computing in the cloud. This shift moves the onus of cost from the hands of IT companies to the hands of developers, and as your data sizes start to rise, the computing cost grows linearly with them. We have found that using statistical algorithms gives us a 95 percent accuracy rate, is faster, and is a lot more beneficial than waiting for the exact results.

The following are some common analytical queries that we have often come across in applications:

- How many distinct elements are in the data set (that is, what is its cardinality)?
- What are the most frequent elements (that is, the "heavy hitters" and "top elements")?
- What are the frequencies of the most frequent elements?
- Does the data set contain a particular element (search query)?
- Can you filter data based upon a category?

Statistical algorithms for quicker analytics

Frequently, statistical algorithms avoid storing the original data, replacing it with hashes, which eliminates a lot of network traffic. Let's get into the details of some of these algorithms that can help answer queries similar to those mentioned previously.

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is suitable in cases when we need to quickly filter items that are present in a set.

HyperLogLog is an approximate technique for computing the number of distinct entries in a set (its cardinality) while using only a small amount of memory. For instance, to achieve 99 percent accuracy, it needs only 16 KB. In cases when we need to count distinct elements in a dataset spread across a Hadoop cluster, we can compute the hashes on different machines, build the bit indexes locally, and combine them to compute the overall number of distinct elements. This eliminates the need to move the data across the network and thus saves us a lot of time.

The Count-min sketch is a probabilistic sub-linear space streaming algorithm that can be used to summarize a data stream and obtain the frequency of elements. It allocates a fixed amount of space to store count information, which does not grow over time even as more and more counts are updated. Nevertheless, it is able to provide useful estimated counts, because the accuracy scales with the total sum of all the counts stored.

Spark - a faster execution engine

Spark is a faster execution engine that provides 10 times the performance of MapReduce when combined with these statistical algorithms. Using Spark with statistical algorithms gives us a huge benefit in terms of both cost and time savings. Spark gets most of its speed by constructing Directed Acyclic Graphs (DAGs) out of the job operations and by using memory to save intermediate data, thus making reads faster. When using statistical algorithms, keeping the hashes in memory makes the algorithms work much faster.

Case study

Let's say we have a continuous stream of user log data arriving at a rate of 4.4 GB per hour, and we need to analyze the distinct IPs in the logs on a daily basis. At my old company, when MapReduce was used to process the data, it took about 6 hours to process one day's worth of data at a size of 106 GB. We had an AWS cluster consisting of 50 spot instances and 4 on-demand instances running to perform the analysis, at a cost of $150 per day. Our system was then shifted to use Spark and HyperLogLog. This shift brought the cost down to $16.50 per day.
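As a rough illustration of the HyperLogLog approach used in the case study, here is a short Python sketch. It relies on the third-party datasketch package (an assumption on my part; the production pipeline described above ran on Spark), and the log lines are made up.

```python
from datasketch import HyperLogLog  # third-party package; assumed installed via pip

# Made-up log lines; in the case study these would stream in at ~4.4 GB per hour.
log_lines = [
    "203.0.113.7 GET /index.html",
    "198.51.100.23 GET /about",
    "203.0.113.7 GET /contact",
]

hll = HyperLogLog(p=12)  # 2**12 registers; a larger p trades memory for accuracy
for line in log_lines:
    ip = line.split(" ")[0]
    hll.update(ip.encode("utf-8"))

# Sketches built on different machines can be combined with hll.merge(other)
# before reading the estimate, which is what avoids shuffling raw data around.
print("estimated distinct IPs:", hll.count())
```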
To summarize, we had a 3.1 TB stream of data processed every month at a cost of $495, compared with about $4,500 per month on the original MapReduce system without the statistical algorithm in place.

Further reading

In the second part of this two-part blog series, we will discuss two tools in depth: Apache Spark and Apache Pig. We will take a look at how Pig combined with Spark makes existing ETL pipelines 100 times faster, and we will further our understanding of how statistical perspectives positively affect data analytics.

About the author

Praveen Rachabattuni is a tech lead at Sigmoid Analytics, a company that provides a real-time streaming and ETL framework on Apache Spark. Praveen is also a committer to Apache Pig.


Points to consider while prepping your data for your data science project

Amarabha Banerjee
30 Nov 2017
5 min read
Note: In this article by Jen Stirrup and Ruben Oliva Ramos, from their book Advanced Analytics with R and Tableau, we look at the steps involved in prepping for any data science project, taking the example of a data classification project using R and Tableau.

Business Understanding

When we are modeling data, it is crucial to keep the original business objectives in mind. These business objectives will direct the subsequent work in the data understanding, preparation and modeling steps, and the final evaluation and selection (after revisiting earlier steps if necessary) of a classification model or models. At later stages, this will help to streamline the project, because we will be able to keep the model's performance in line with the original requirement while retaining a focus on ensuring a return on investment from the project.

The main business objective is to identify individuals who are higher earners so that they can be targeted by a marketing campaign. For this purpose, we will investigate the data mining of demographic data in order to create a classification model in R. The model will be able to accurately determine whether individuals earn a salary that is above or below $50K per annum.

Working with Data

In this section, we will use Tableau as a visual data preparation tool in order to prepare the data for further analysis. Here is a summary of some of the things we will explore:

- Columns that do not add any value to the model
- Columns that have so many missing categorical values that they cannot predict the outcome reliably
- Missing values in the columns

The dataset used in this project has 49,000 records, divided into a training dataset of approximately 32,000 records and a test dataset of around 16,000 records. It's helpful to note that there is a column that indicates the salary level, that is, whether it is greater than or less than fifty thousand dollars per annum. This can be called a binomial label, which basically means that it can hold one of two possible values.

When we import the data, we can filter for records where no income is specified. There is one record that has a NULL, and we can exclude it.

Let's explore the binomial label in more detail. How many records belong to each label? Visualizing the distribution quickly shows that 76 percent of the records in the dataset have a class label of <50K. Browsing the data in Tableau, it's easy to see from the grid that there are 14 attributes in total. We can see the characteristics of the data:

- Seven polynomial attributes: workclass, education, marital-status, occupation, relationship, race, native-country
- One binomial attribute: sex
- Six continuous attributes: age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week

Nearly 2 percent of the records are missing a value for native-country, and the vast majority of individuals are from the United States. This means that we could consider the native-country feature as a candidate for removal from the model, because the lack of variation means that it isn't going to add anything interesting to the analysis.

Data Exploration

We can now visualize the data in boxplots, so we can see the range of the data.
In the first example, let's look at the age column, visualized as a boxplot in Tableau. We can see that age values tend to be higher for one income level than the other, with a different pattern for each group. When we look at education, we can also see a difference between the two groups.

We can focus on age and education, while discarding attributes that do not add value, such as native-country. The fnlwgt column does not add value because it is specific to the census collection process. When we visualize the race feature, the White value appears in 85 percent of cases overall, which means it is not likely to add much value as a predictor.

Now we can look at the number of years that people spend in education. When the education-num attribute is plotted, we can see that the lower values tend to predominate in the <50K class, while higher levels of time spent in education are more common in the >50K class. This finding may indicate some predictive capability in the education feature: the visualization suggests a difference between the two groups, since the group that earns over $50K per annum does not appear much at the lower education levels.

To summarize, we will focus on age and education as providing some predictive capability in determining the income level. The purpose of the model is to classify people by their earning level. Now that we have visualized the data in Tableau, we can use this information to model and analyze the data in R and produce the model.

If you liked this article, please be sure to check out Advanced Analytics with R and Tableau, which consists of this article and many useful analytics techniques with R and Tableau.
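The chapter itself works in Tableau and R, but for readers who want a scriptable equivalent of the exploration above, here is a rough Python/pandas sketch. The file path is hypothetical and the column names follow the standard UCI Adult census layout, which matches the attributes described in the article.

```python
import pandas as pd

# Column names follow the standard UCI Adult census layout; the path is hypothetical.
columns = ["age", "workclass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "sex",
           "capital-gain", "capital-loss", "hours-per-week",
           "native-country", "income"]
df = pd.read_csv("adult.data", names=columns, na_values="?",
                 skipinitialspace=True)

# Drop records where no income label is specified, mirroring the Tableau filter.
df = df.dropna(subset=["income"])

# Class balance of the binomial label - roughly 76% of records are <=50K here.
print(df["income"].value_counts(normalize=True))

# Near-constant features add little predictive value; check how dominant they are.
print(df["native-country"].value_counts(normalize=True).head(3))
print(df["race"].value_counts(normalize=True).head(3))

# Compare age distributions across the two income classes, as in the boxplots.
print(df.groupby("income")["age"].describe())

# Candidate columns to discard before modelling.
df = df.drop(columns=["fnlwgt", "native-country"])
```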