
Tech Guides - Data


Best practices for deploying self-service BI with Qlik Sense

Amey Varangaonkar
31 May 2018
7 min read
As part of a successful deployment of Qlik Sense, it is important that IT recognizes that self-service Business Intelligence has its own dynamics and adoption rules. The various use cases and subsequent user groups need to be assessed and captured. Governance should always be present, but power users should never get the feeling that they are restricted. Once they are won over, gaining traction and adoption among the other user types is very easy. In this article, we will look at the most important points to keep in mind while deploying self-service with Qlik Sense. The following excerpt is taken from the book Mastering Qlik Sense, authored by Martin Mahler and Juan Ignacio Vitantonio. The book demonstrates techniques to design useful and highly profitable Business Intelligence solutions using Qlik Sense. Here is the list of points to keep in mind:

Qlik Sense is not QlikView

Not even nearly. The biggest challenge and fallacy is that the organization was sold, by Qlik or someone else, just the next version of the tool. It did not help at all that Qlik itself worked for years on Qlik Sense under the initial product name Qlik.Next. Whatever you are being told, and however it is being sold to you, Qlik Sense is at best the cousin of QlikView. Same family, but no blood relation. Thinking otherwise sets the wrong expectation: the business gives the wrong message to stakeholders and does not make IT aware that self-service BI cannot be deployed in the same fashion as guided analytics (QlikView, in this case). Disappointment is imminent when stakeholders realize Qlik Sense cannot replicate their QlikView dashboards.

Simply installing Qlik Sense does not create a self-service BI environment

Installing Qlik Sense and giving users access to the tool is a start, but there is more to it than that. The infrastructure requires design and planning, data quality processing, data collection, and determining who intends to use the platform to consume what type of data. If data is not available and accessible to the user, data analytics serves no purpose. Make sure a data warehouse or similar is in place and the business has a use case for self-service data analytics. A good indicator is when the business or project works with a lot of data, and there are business users with lots of Excel spreadsheets lying around, analyzing it in different ways. That is your best candidate for Qlik Sense.

IT should monitor the Qlik Sense environment rather than control it

IT needs to unlearn old habits in order to learn new ones, and the same applies when it comes to deploying self-service. Create a framework with guidelines and principles, and monitor that users are following it, rather than limiting them in their capabilities. This framework needs the input of the users as well, and it needs to be elastic. Not many IT professionals agree with giving away so much power to the user in the development process, believing this leads to chaos and anarchy. While the risk is there, this fear needs to be overcome. Users love data analytics, and they are keen to get the help of IT to create the most valuable dashboard possible and ensure it is well received by a wide audience.

Identifying key users and user groups is crucial

For a strong adoption of the tool, IT needs to prepare the environment, identify the key power users in the organization, and win them over to using the technology. It is important that they are intensively supported, especially in the beginning, and that they are allowed to drive how the technology should be used rather than having principles imposed on them. Governance should always be present, but power users should never get the feeling they are restricted by it. Once they are won over, gaining traction and adoption among the other user types is very easy.

Qlik Sense sells well - do a lot of demos

Data analytics, compelling visualizations, and the interactivity of Qlik Sense are something almost everyone is interested in. The business wants to see its own data aggregated and distilled in a cool and glossy dashboard. Utilize the momentum and do as many demos as you can to win advocates for the technology and promote a consciousness of becoming a data-driven culture in the organization. Even the simplest Qlik Sense dashboards amaze people and boost their creativity for use cases where data analytics in their area could apply and create value.

Promote collaboration

Sharing is caring. This applies not only to insights, which are naturally shared with the excitement of having found out something new and valuable, but also to how the new insight has been derived. People keep their approach and methodology to themselves, but this is counterproductive. It is important that applications, visualizations, and dashboards created with Qlik Sense are shared and demonstrated to other Qlik Sense users as frequently as possible. This not only promotes a data-driven culture but also encourages collaboration between users and teams across business functions that would not have happened otherwise. They could be sharing knowledge, tips, and tricks, or even realizing they look at the same slices of data and could create additional value by connecting them together.

Market the success of Qlik Sense within the organization

If Qlik Sense has been successful in a project, tell others about it. Create a success story and propose doing demos of the dashboard and its analytics. IT has historically been very bad at promoting its work, which is counterproductive. Data analytics creates value, and there is nothing embarrassing about boasting about its success; as Muhammad Ali suggested, it's not bragging if it's true.

Introduce guidelines on design and terminology

Avoid the pitfall of having multiple different-looking dashboards by promoting a consistent branding look across all Qlik Sense dashboards and applications, including terminology and best practices. Ensure the document is easily accessible to all users. Also, create predesigned templates with some sample sheets so that users can duplicate them, modify and extend them to their liking, and apply the same design.

Protect less experienced users from complexities

Don't overwhelm users if they have never developed anything in their lives. Approach less technically savvy users in a different way by providing them with sample data and sample templates, including a library of predefined visualizations, dimensions, and measures (so-called Master Items). Be aware that what is intuitive to Qlik professionals or power users is not necessarily intuitive to other users - be patient and appreciative of their feedback, and try to understand how a typical business user might think.

If you found the excerpt useful, make sure you check out the book Mastering Qlik Sense to learn more techniques for efficient Business Intelligence using Qlik Sense.

Read more:
How Qlik Sense is driving self-service Business Intelligence
Overview of a Qlik Sense® Application’s Life Cycle
What we learned from Qlik Qonnections 2018


Top 4 Business Intelligence Tools

Ed Bowkett
04 Dec 2014
4 min read
With the boom of data analytics, Business Intelligence has taken something of a front stage in recent years, and as a result, a number of Business Intelligence (BI) tools have appeared. These tools allow a business to obtain a reliable set of data faster and more easily, and to set business objectives. This article covers the more prominent tools, listing the advantages and disadvantages of each.

Pentaho

Pentaho was founded in 2004 and offers, among others, a suite of open source BI applications under the name Pentaho Business Analytics. It has two editions, enterprise and community. It allows easy access to data and even easier ways of visualizing that data from a variety of sources, including Excel and Hadoop, and it covers almost every platform, ranging from mobile (Android and iPhone) through to Windows and even web-based. With the pros come the cons, however, which include the Pentaho Metadata Editor, which is difficult to understand, and documentation that offers few solutions for this key component. Also, compared to the other tools mentioned below, the advanced analytics in Pentaho need improving. That said, given that it is open source, there is continual improvement.

Tableau

Founded in 2003, Tableau also offers a range of suites, focusing on three products: Desktop, Server, and Public. Some benefits of using Tableau over other products include its ease of use and a fairly simple UI with drag-and-drop tools, which allows pretty much everyone to use it. Creating a highly interactive dashboard drawing on various data sources is simple and quick. To sum up, Tableau is fast. Incredibly fast! There are relatively few cons when it comes to Tableau, but some automated features you would usually expect in other suites aren't offered for most of the processes and uses here.

Jaspersoft

As well as being another open source suite, Jaspersoft ships with a number of data visualization, data integration, and reporting tools. Added to the small licensing cost, Jaspersoft is justifiably one of the leaders in this area. It can be used with a variety of databases, including Cassandra, CouchDB, MongoDB, Neo4j, and Riak. Other benefits include ease of installation, and the functionality of the tools in Jaspersoft is better than most competitors on the market. However, the documentation has been said to be lacking in helping customers dive deeper into Jaspersoft, and if you customize it, customer service can no longer assist you if it breaks. Given the functionality and the ability to extend it, though, these cons seem minor.

Qlikview

Qlikview is one of the oldest Business Intelligence software tools on the market, having been around since 1993. It has multiple features and, as a result, many pros and cons, including ones I have mentioned for the previous suites. Some advantages of Qlikview are that it takes a very small amount of time to implement and it's incredibly quick - quicker than Tableau in this regard! It also has a 64-bit in-memory engine, which is among the best in the market. Qlikview also has good data mining tools, good features (having been in the market for a long time), and a visualization function. These aspects make it much easier to deal with than others on the market, and the learning curve is relatively small. Some cons in relation to Qlikview include that while Qlikview is easy to use, Tableau is seen as the better suite for analyzing data in depth. Qlikview also has difficulties integrating map data, which other BI tools are better at doing.

This list is not definitive! It lays out some tools that companies and individuals can use to help them analyze data and prepare business performance KPIs. There are other tools used by businesses, including Microsoft BI tools, Cognos, MicroStrategy, and Oracle Hyperion. I've chosen to explore some BI tools that are quick to use out of the box and are incredibly popular and expanding in usage.


Why use JVM (Java Virtual Machine) for deep learning

Guest Contributor
10 Nov 2019
5 min read
Deep learning is one of the revolutionary breakthroughs of the decade for enterprise application development. Today, the majority of organizations and enterprises have to transform their applications to exploit the capabilities of deep learning. In this article, we will discuss how to leverage the capabilities of the JVM (Java Virtual Machine) to build deep learning applications.

Enterprises prefer the JVM

The major JVM languages used in enterprises are Java, Scala, Groovy, and Kotlin. Java is the most widely used programming language in the world, and nearly all major enterprises use it in some way or the other. Enterprises use JVM-based languages such as Java to build complex applications because JVM features are optimal for production applications. JVM applications are also significantly faster and require far fewer resources to run compared to counterparts such as Python. Java can perform more computational operations per second than Python, and there are interesting performance benchmarks comparing the two.

The JVM optimizes performance benchmarks

Production applications represent a business and are very sensitive to performance degradation, latency, and other disruptions. Application performance is estimated from latency/throughput measures, and memory overload and high resource usage can influence those measures. Applications that demand more resources or memory require good hardware and further optimization in the application itself. The JVM helps in optimizing performance benchmarks and tuning the application to the hardware's fullest capabilities, and it can also help keep the application's memory footprint in check.

We have discussed JVM features so far, but there is important context on why there is such demand for JVM-based deep learning in production, which we discuss next. Python is undoubtedly the leading programming language used in deep learning applications. For that very reason, the majority of enterprise developers, i.e., Java developers, are forced to switch to a technology stack they are less familiar with. On top of that, they need to address compatibility issues and production deployment while integrating neural network models.

DeepLearning4J, a deep learning library for the JVM

Java developers working on enterprise applications want to exploit build and deployment tools like Maven or Gradle for hassle-free deployments, so there is demand for a JVM-based deep learning library that simplifies the whole process. Although multiple deep learning libraries serve this purpose, DL4J (Deeplearning4J) is one of the top choices. DL4J is a deep learning library for the JVM and is among the most popular repositories on GitHub. DL4J, developed by the Skymind team, is the first open-source deep learning library that is commercially supported. What makes it so special is that it is backed by ND4J (N-Dimensional Arrays for Java) and JavaCPP. ND4J is a scientific computing library developed by the Skymind team; it acts as the backend dependency for all neural network computations in DL4J and is much faster in computations than NumPy. JavaCPP acts as a bridge between Java and native C++ libraries, and ND4J internally depends on JavaCPP to run native C++ code. DL4J also has a dedicated ETL component called DataVec, which helps transform data into a format that a neural network can understand. Data analysis can be done with DataVec much as with Pandas, a popular Python data analysis library.
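To give a feel for what this looks like in practice, here is a minimal, hypothetical DL4J sketch that configures and initializes a small feed-forward network. The layer sizes and hyperparameters are illustrative assumptions, not values from the article:

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class Dl4jSketch {
    public static void main(String[] args) {
        // A small classifier: 784 inputs (e.g. 28x28 images), one hidden layer, 10 output classes
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
            .updater(new Adam(0.001))   // illustrative learning rate
            .list()
            .layer(new DenseLayer.Builder()
                .nIn(784).nOut(100)
                .activation(Activation.RELU).build())
            .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nIn(100).nOut(10)
                .activation(Activation.SOFTMAX).build())
            .build();

        MultiLayerNetwork model = new MultiLayerNetwork(conf);
        model.init();
        // model.fit(trainIterator);  // trainIterator would be a DataSetIterator fed by DataVec
        System.out.println(model.summary());
    }
}
```

All of the numeric work behind fit() runs on ND4J arrays, which is where the native C++ speed-up described above comes from.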
DL4J also uses the Arbiter component for hyperparameter optimization. Arbiter finds the best configuration to obtain good model scores by performing random/grid search over the hyperparameter values defined in a search space.

Why choose DL4J for your deep learning applications?

DL4J is a good choice for developing distributed deep learning applications. It can leverage the capabilities of Apache Spark and Hadoop to develop high-performing distributed deep learning applications, and its performance is equivalent to Caffe when multi-GPU hardware is used. We can use DL4J to develop multi-layer perceptrons, convolutional neural networks, recurrent neural networks, and autoencoders, and there are a number of hyperparameters that can be adjusted to further optimize neural network training. The Skymind team has done a good job of explaining the important basics of DL4J on their website, and they also have a Gitter channel for discussing or reporting bugs straight to the developers. If you are keen on exploring reinforcement learning further, there is a dedicated library called RL4J (Reinforcement Learning for Java), also developed by Skymind; it can already play the game Doom! DL4J combines all the above-mentioned components (DataVec, ND4J, Arbiter, and RL4J) into a single deep learning workflow, forming a powerful software suite. Most importantly, DL4J enables the productionization of deep learning applications for the business.

If you are interested in learning how to develop real-time applications on DL4J, check out my new book, Java Deep Learning Cookbook. In this book, I show you how to install and configure Deeplearning4j to implement deep learning models. You can also explore recipes for training and fine-tuning your neural network models in Java. By the end of this book, you'll have a clear understanding of how you can use Deeplearning4j to build robust deep learning applications in Java.

Author Bio

Rahul Raj has more than 7 years of IT industry experience in software development, business analysis, client communication, and consulting for medium/large-scale projects. He has extensive experience in development activities comprising requirement analysis, design, coding, implementation, code review, testing, user training, and enhancements. He has written a number of articles about neural networks in Java and has been featured by DL4J and the official Java community channel. You can follow Rahul on Twitter, LinkedIn, and GitHub.

Read more:
Top 6 Java Machine Learning/Deep Learning frameworks you can't miss
6 most commonly used Java Machine learning libraries
Deeplearning4J 1.0.0-beta4 released with full multi-datatype support, new attention layers, and more!


Say hello to Streaming Analytics

Amey Varangaonkar
05 Oct 2017
5 min read
In this data-driven age, businesses want fast, accurate insights from their huge data repositories in the shortest time span - and in real time where possible. These insights are essential: they help businesses understand relevant trends, improve their existing processes, enhance customer satisfaction, improve their bottom line, and, most importantly, build and sustain their competitive advantage in the market. Doing all of this is quite an ask - one that is becoming increasingly difficult to achieve using just traditional data processing systems, where analytics is limited to the back-end. There is now a burning need for a new kind of system in which larger, more complex data can be processed and analyzed on the go.

Enter: Streaming Analytics

Streaming Analytics, also referred to as real-time event processing, is the processing and analysis of large streams of data in real time. These streams are basically events that occur as a result of some action, such as a transaction, a system failure, or a trigger that changes the state of a system at any point in time. Even something as minor or granular as a click can constitute an event, depending on the context. Consider this scenario: you are the CTO of an organization that deals with sensor data from wearables. Your organization has to deal with terabytes of data coming in on a daily basis from thousands of sensors, and one of your biggest challenges as CTO is to implement a system that processes and analyzes the data from these sensors as it enters the system. Here is where streaming analytics can help you, by giving you the ability to derive insights from your data on the go. According to IBM, a streaming system demonstrates the following qualities:

- It can handle large volumes of data
- It can handle a variety of data, structured or unstructured, analyze it efficiently, and identify relevant patterns
- It can process every event as it occurs, unlike traditional analytics systems that rely on batch processing
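As a rough illustration of that last quality, here is a minimal, hypothetical Apache Flink job (Flink is one of the streaming frameworks mentioned at the end of this article) that reacts to each sensor reading the moment it arrives. The socket source, input format, and threshold are assumptions made for the sketch:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SensorAlertJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Assumed source: one comma-separated reading per line, e.g. "sensor-42,103.7"
        DataStream<String> readings = env.socketTextStream("localhost", 9999);

        readings
            // Flag abnormal readings as each event arrives - no batch window needed
            .filter(line -> Double.parseDouble(line.split(",")[1]) > 100.0)
            .print();

        env.execute("sensor-alert-job");
    }
}
```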
Why is Streaming Analytics important?

The volume of data that companies have to deal with today is almost unimaginable. Add to that the varied nature of this data, and the urgency with which value needs to be extracted from it, and it all makes for a pretty tricky proposition. In such scenarios, choosing a solution that integrates seamlessly with different data sources, is fine-tuned for performance, is fast and reliable, and, most importantly, is flexible to changes in technology, is critical. Streaming analytics offers all these features, thereby empowering organizations to gain a significant edge over their competition. Another significant argument in favour of streaming analytics is the speed at which one can derive insights from the data: data in a real-time streaming system is processed and analyzed before it registers in a database. This is in stark contrast to analytics on traditional systems, where information is gathered and stored first, and the analytics is performed afterwards. Streaming analytics thus supports much faster decision-making than traditional data analytics systems.

Is Streaming Analytics right for my business?

Not all organizations need streaming analytics - especially those that deal with static data or data that hardly changes over long intervals of time, or those that do not require real-time insights for decision-making. For instance, consider the HR unit of a call centre. It is sufficient and efficient to use a traditional analytics solution to analyze thousands of past employee records, rather than run them through a streaming analytics system. On the other hand, the same call centre can find real value in applying streaming analytics to something like a real-time customer log monitoring system, in which customer interactions and context-sensitive information are processed on the go. This can help the organization find opportunities to provide unique customer experiences and improve its customer satisfaction score, alongside a whole host of other benefits. Streaming analytics is slowly finding adoption in a variety of domains where companies are looking for that crucial competitive advantage - sensor data analytics, mobile analytics, and business activity monitoring being some of them. With the rise of the Internet of Things, data from IoT devices is also increasing exponentially, and streaming analytics is the way to go there as well. In short, streaming analytics is ideal for businesses dealing with time-critical missions and those working with continuous streams of incoming data, where decision-making has to be instantaneous. Companies that obsess over real-time monitoring of their business will also find streaming analytics useful - just integrate your dashboards with your streaming analytics platform!

What next?

It is safe to say that, with time, the amount of information businesses manage is going to rise exponentially, and so will the variety of that information. As a result, it will get increasingly difficult to process volumes of unstructured data and gain insights from them using traditional analytics systems alone. Adopting streaming analytics into the business workflow will therefore become a necessity for many businesses. Apache Flink, Spark Streaming, Microsoft's Azure Stream Analytics, SQLstream Blaze, Oracle Stream Analytics, and SAS Event Processing are all good places to begin your journey through the fleeting world of streaming analytics. You can browse through this list of learning resources from Packt to know more:

Learning Apache Flink
Learning Real Time processing with Spark Streaming
Real Time Streaming using Apache Spark Streaming (video)
Real Time Analytics with SAP Hana
Real-Time Big Data Analytics


Why is Hadoop dying?

Aaron Lazar
23 Apr 2018
5 min read
Hadoop has been the definitive big data platform for some time; the name has practically been synonymous with the field. But while its ascent followed the trajectory of what was referred to as the 'big data revolution', Hadoop now seems to be in danger. The question is everywhere - is Hadoop dying out? And if it is, why? Is it because big data is no longer the buzzword it once was, or are there simply other ways of working with big data that have become more useful?

Hadoop was essential to the growth of big data

When Hadoop was open sourced in 2007, it opened the door to big data. It brought compute to data, as against bringing data to compute, and organisations had the opportunity to scale their data without having to worry too much about the cost. It obviously had initial hiccups with security, the complexity of querying, and querying speeds, but most of that was taken care of in the long run. Still, although querying speeds remained quite a pain, that wasn't the real reason behind Hadoop's slow decline.

As cloud grew, Hadoop started falling

One of the main reasons behind Hadoop's decline in popularity was the growth of cloud. The cloud vendor market was pretty crowded, and each vendor provided its own big data processing services. These services all basically did what Hadoop was doing, but they did it in an even more efficient and hassle-free way. Customers didn't have to think about administration, security, or maintenance in the way they had to with Hadoop.

One person's big data is another person's small data

This is clearly a fact. Several organisations that adopted big data technologies without really gauging the amount of data they actually needed to process have suffered. Imagine sitting on 10TB Hadoop clusters when you don't have that much data. The two biggest organisations that built products on Hadoop, Hortonworks and Cloudera, saw a decline in revenue in 2015, owing to their heavy reliance on Hadoop; customers weren't pleased with Hadoop's limitations.

Apache Hadoop v Apache Spark

Hadoop processing is way behind in terms of processing speed. In 2014, Spark took the world by storm, and its growth curve quickly outpaced Hadoop's. Spark was a general-purpose, easy-to-use platform built after studying the pitfalls of Hadoop. Spark was not bound to just HDFS (the Hadoop Distributed File System), which meant that it could leverage storage systems like Cassandra and MongoDB as well. Spark 2.3 was also able to run on Kubernetes - a big leap for containerized big data processing in the cloud. Spark also brings along GraphX, which allows developers to view data in the form of graphs. Some of the major areas where Spark wins are iterative algorithms in machine learning, interactive data mining and data processing, stream processing, sensor data processing, and so on.
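Part of Spark's ease-of-use claim comes from how little code a typical job needs. As a hypothetical illustration (the file name and schema are assumptions), a batch aggregation over JSON event records in Spark's Java API looks roughly like this:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EventCounts {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("event-counts")
            .master("local[*]")  // assumption: local run; on a cluster this comes from spark-submit
            .getOrCreate();

        // Assumed input: newline-delimited JSON records with a "status" field
        Dataset<Row> events = spark.read().json("events.json");

        // Count events per status and print the result
        events.groupBy("status").count().show();

        spark.stop();
    }
}
```

The same program can read from HDFS, Cassandra, or cloud storage by changing only the input path or data source - exactly the storage flexibility discussed above.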
Machine Learning in Hadoop is not straightforward

Unlike with MLlib in Spark, machine learning is not possible in Hadoop unless tied to a third-party library. Mahout used to be quite popular for doing ML on Hadoop, but its adoption has gone down in the past few years. Tools like RHadoop, a collection of three R packages, have grown for ML, but they are still nowhere comparable to the power of modern MLaaS offerings from cloud providers. All the more reason to move away from Hadoop, right? Maybe.

Hadoop is not only Hadoop

The general misconception is that Hadoop is quickly going to be extinct. On the contrary, the Hadoop family consists of YARN, HDFS, MapReduce, Hive, HBase, Spark, Kudu, Impala, and some 20 other products. While folks may be moving away from Hadoop as their choice for big data processing, they will still be using Hadoop in some form or the other. As for Cloudera and Hortonworks, though the market has seen a downward trend, they're in no way letting go of Hadoop anytime soon, although they have shifted part of their processing operations to Spark.

Is Hadoop dying? Perhaps not...

In the long run, it's not completely accurate to say that Hadoop is dying. December last year brought with it Hadoop 3.0, which is supposed to be a much-improved version of the framework. Some of its most noteworthy features are an improved shell script, a more powerful YARN, improved fault tolerance with erasure coding, and many more. Although that hasn't caused any major spike in adoption, there are still users who will adopt Hadoop based on their use case, or simply use an alternative like Spark along with another framework from the Hadoop family. So, Hadoop's not going away anytime soon.

Read more:
Pandas is an effective tool to explore and analyze data - Interview insights


Machine Learning as a Service (MLaaS): How Google Cloud Platform, Microsoft Azure, and AWS are democratizing Artificial Intelligence

Bhagyashree R
07 Sep 2018
13 min read
There has been a huge shift in the way that businesses build technology in recent years, driven by a move towards cloud and microservices. Public cloud services like AWS, Microsoft Azure, and Google Cloud Platform are transforming the way companies of all sizes understand and use software. Not only do public cloud services reduce the resourcing costs associated with on-site server resources, they also make it easier to leverage cutting-edge technological innovations like machine learning and artificial intelligence. Cloud is giving rise to what's known as 'Machine Learning as a Service' - a trend that could prove transformative for organizations of all types and sizes. According to a report published on Research and Markets, Machine Learning as a Service is set to grow at a compound annual growth rate (CAGR) of 49% between 2017 and 2023. The main drivers of this growth include the increased application of advanced analytics in manufacturing, the high volume of structured and unstructured data, and the integration of machine learning with big data. Of course, with machine learning a relatively new area for many businesses, demand for MLaaS is ultimately self-fulfilling - if it's there and people can see the benefits it can bring, demand is only going to continue. But it's important not to get fazed by the hype: plenty of money will be spent on cloud-based machine learning products that won't help anyone but the tech giants who run the public clouds. With that in mind, let's dive deeper into Machine Learning as a Service and what the biggest cloud vendors offer.

What does Machine Learning as a Service (MLaaS) mean?

Machine Learning as a Service (MLaaS) is an array of services that provides machine learning tools to users. Businesses and developers can incorporate a machine learning model into their application without having to work on its implementation. These services range from data visualization, facial recognition, natural language processing, and chatbots to predictive analytics and deep learning, among others. Typically, for a given machine learning task, a user has to perform various steps: data preprocessing, feature identification, implementing the machine learning model, and training the model. MLaaS services simplify this process by exposing only a subset of the steps to the user while automatically managing the rest. Some services even provide a 1-click mode, where the user does not have to perform any of the steps mentioned earlier.

What type of businesses can benefit from Machine Learning as a Service?

Large companies

Large companies can afford to hire expert machine learning engineers and data scientists, but they still have to build and manage their own custom machine learning models - a time-intensive and complicated process. By leveraging MLaaS services, these companies can use pre-trained machine learning models via APIs that perform specific tasks, and save time.

Small and mid-sized businesses

Big companies can invest in their own machine learning solutions because they have the resources. For small and mid-sized businesses (SMBs), however, this simply isn't the case. Fortunately, MLaaS changes all that and makes machine learning accessible to organizations with resource limitations. By using MLaaS, businesses can leverage machine learning without a huge investment in infrastructure or talent.
Whether it's for smarter and more intelligent customer-facing apps or improved operational intelligence and automation, this could bring huge gains for a reasonable amount of spending.

What types of roles will benefit from MLaaS?

Machine learning can contribute to any kind of app development, provided you have data to train your app. However, adding AI features to your app is not easy. As a developer, you have to worry about a lot of factors beyond the regular app development checklist in order to make your app intelligent. Some of them are:

- Data preprocessing
- Model training
- Model evaluation
- Predictions
- Expertise in data science

The development tools provided by MLaaS can simplify these tasks, allowing you to easily embed machine learning in your applications. Developers can build quickly and efficiently with MLaaS offerings, because they have access to pre-built algorithms and models that would otherwise take extensive resources to build. MLaaS can also support data scientists and analysts. While most data scientists should have the necessary skills to build and train machine learning models from scratch, it can nevertheless be a time-consuming task. MLaaS can, as already mentioned, simplify the machine learning engineering process, which means data scientists can focus on optimizations that require more thought and expertise.

Top Machine Learning as a Service (MLaaS) providers

Amazon Web Services (AWS), Azure, and Google all have MLaaS products in their cloud offerings. Let's take a look at them.

Google Cloud AI at a glance

Google's Cloud AI provides modern machine learning services. It consists of pre-trained models and a service to generate your own tailored models. The services provided are fast, scalable, and easy to use. The following are the services that Google provides at an unprecedented scale and speed to your applications:

Cloud AutoML Beta

This is a suite of machine learning products with the help of which developers with limited machine learning expertise can train high-quality models specific to their business needs. It provides a simple GUI to train, evaluate, improve, and deploy models based on your own data.

Read also: AmoebaNets: Google's new evolutionary AutoML

Google Cloud Machine Learning (ML) Engine

Google Cloud Machine Learning Engine is a service that offers training and prediction services to enable developers and data scientists to build superior machine learning models and deploy them in production. You don't have to worry about infrastructure and can instead focus on model development and deployment. It offers two types of predictions: online prediction deploys ML models with serverless, fully managed hosting that responds in real time with high availability, while batch prediction is cost-effective and provides unparalleled throughput for asynchronous applications.

Read also: Google announces Cloud TPUs on the Cloud Machine Learning Engine (ML Engine)

Google BigQuery

BigQuery is a cloud data warehouse for data analytics. It uses SQL and provides Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers to make integration fast and easy. It provides benefits like auto-scaling and high-performance streaming to load data. You can create amazing reports and dashboards using your favorite BI tool, like Tableau, MicroStrategy, or Looker.

Read also: Getting started with Google Data Studio: An intuitive tool for visualizing BigQuery Data
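To show how little code a managed warehouse query takes, here is a hypothetical sketch using the google-cloud-bigquery Java client against one of Google's public sample datasets; the query is illustrative, and credentials are assumed to come from the environment:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class BigQuerySketch {
    public static void main(String[] args) throws InterruptedException {
        // Uses application-default credentials; assumes a GCP project is configured
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
            "SELECT name, SUM(number) AS total "
          + "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
          + "GROUP BY name ORDER BY total DESC LIMIT 5").build();

        TableResult result = bigquery.query(query);
        for (FieldValueList row : result.iterateAll()) {
            System.out.printf("%s: %s%n",
                row.get("name").getStringValue(),
                row.get("total").getStringValue());
        }
    }
}
```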
Dialogflow Enterprise Edition

Dialogflow is an end-to-end, build-once deploy-everywhere development suite for creating conversational interfaces for websites, mobile applications, popular messaging platforms, and IoT devices. Dialogflow Enterprise Edition users have access to Google Cloud Support and a service level agreement (SLA) for production deployments.

Read also: Google launches the Enterprise edition of Dialogflow, its chatbot API

Cloud Speech-to-Text

Google Cloud Speech-to-Text allows you to convert speech to text by applying neural network models. 120 languages are supported by the API, which will help you extend your user base. It can process both real-time streaming and prerecorded audio.

Read also: Google announces the largest overhaul of their Cloud Speech-to-Text

Microsoft Azure AI at a glance

The Azure platform consists of various AI tools and services that can help you build smart applications. It provides Cognitive Services and Conversational AI with Bot tools, which facilitate building custom models with Azure Machine Learning for any scenario. You can run AI workloads anywhere at scale using its enterprise-grade AI infrastructure. The following are the services provided by Azure AI to help you achieve maximum productivity and reliability:

Pre-built services

You need not be an expert in data science to make your systems more intelligent and engaging. The pre-built services come with high-quality RESTful intelligent APIs for the following:

- Vision: Make your apps identify and analyze content within images and videos. Provides capabilities such as image classification, optical character recognition in images, face detection, person identification, and emotion identification.
- Speech: Integrate speech processing capabilities into your app or services, such as text-to-speech, speech-to-text, speaker recognition, and speech translation.
- Language: Your application or service will understand the meaning of unstructured text or the intent behind a speaker's utterances. It comes with capabilities such as text sentiment analysis, key phrase extraction, and automated and customizable text translation.
- Knowledge: Create knowledge-rich resources that can be integrated into apps and services. It provides features such as Q&A extraction from unstructured text, knowledge base creation from collections of Q&As, and semantic matching for knowledge bases.
- Search: Using the Search API, you can find exactly what you are looking for across billions of web pages. It provides features like ad-free, safe, location-aware web search, Bing visual search, custom search engine creation, and many more.

Custom services

Azure Machine Learning is a fully managed cloud service which helps you to easily prepare data, and build and train your own models:

- You can rapidly prototype on your desktop, then scale up on VMs or scale out using Spark clusters.
- You can manage model performance, identify the best model, and promote it using data-driven insight.
- Deploy and manage your models everywhere. Using Docker containers, you can deploy models into production faster in the cloud, on-premises, or at the edge.
- Promote your best-performing models into production and retrain them whenever necessary.
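As a hypothetical example of what calling one of these pre-built language services looks like, here is a sketch using the Azure Text Analytics client library for Java (azure-ai-textanalytics); the endpoint and key are placeholders you would get from your own Azure resource:

```java
import com.azure.ai.textanalytics.TextAnalyticsClient;
import com.azure.ai.textanalytics.TextAnalyticsClientBuilder;
import com.azure.ai.textanalytics.models.DocumentSentiment;
import com.azure.core.credential.AzureKeyCredential;

public class SentimentSketch {
    public static void main(String[] args) {
        // Placeholder endpoint and key from an Azure Cognitive Services resource
        TextAnalyticsClient client = new TextAnalyticsClientBuilder()
            .credential(new AzureKeyCredential("<your-key>"))
            .endpoint("https://<your-resource>.cognitiveservices.azure.com/")
            .buildClient();

        DocumentSentiment sentiment =
            client.analyzeSentiment("The new dashboard is fantastic and easy to use!");
        System.out.println("Overall sentiment: " + sentiment.getSentiment());
    }
}
```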
Read also: Microsoft supercharges its Azure AI platform with new features

AWS machine learning services at a glance

The machine learning services provided by AWS help developers easily add intelligence to any application with pre-trained services. For training and inference, AWS offers a broad array of compute options, with powerful GPU-based instances, compute- and memory-optimized instances, and even FPGAs. You also get to choose from a set of services for data analysis, including data warehousing, business intelligence, batch processing, stream processing, and data workflow orchestration. The following are the services provided by AWS:

AWS machine learning applications

- Amazon Comprehend: This is a natural language processing (NLP) service that identifies relationships and finds insights in text using machine learning. It recognizes the language of the text, understands how positive or negative it is, and extracts key phrases, places, people, brands, or events. It then analyzes text using tokenization and parts of speech, and automatically organizes a collection of text files by topic.
- Amazon Lex: This service provides the same deep learning technologies used by Amazon Alexa to developers, helping them easily build sophisticated, natural language, conversational bots. It comes with advanced deep learning functionalities like automatic speech recognition (ASR) and natural language understanding (NLU) to facilitate a more lifelike conversational interaction with users.
- Amazon Polly: This text-to-speech service produces speech that sounds like a human voice using advanced deep learning technologies. It provides dozens of lifelike voices across a variety of languages, so you can simply select the ideal voice and build speech-enabled applications that work in many different countries.
- Amazon Rekognition: This service can identify objects, people, text, scenes, and activities, as well as any inappropriate content, in an image or a video. It also provides highly accurate facial analysis and facial recognition on images and video.

Read also: AWS makes Amazon Rekognition, its image recognition AI, available for Asia-Pacific developers
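For a feel of how thin the integration layer is, here is a hypothetical sketch calling Amazon Comprehend's sentiment detection through the AWS SDK for Java v2; the input text is made up, and credentials and region are assumed to be resolved from the environment:

```java
import software.amazon.awssdk.services.comprehend.ComprehendClient;
import software.amazon.awssdk.services.comprehend.model.DetectSentimentRequest;
import software.amazon.awssdk.services.comprehend.model.DetectSentimentResponse;

public class ComprehendSketch {
    public static void main(String[] args) {
        // Credentials and region are picked up from the default provider chain (assumption)
        try (ComprehendClient comprehend = ComprehendClient.create()) {
            DetectSentimentRequest request = DetectSentimentRequest.builder()
                .text("The support team resolved my issue quickly. Great service!")
                .languageCode("en")
                .build();

            DetectSentimentResponse response = comprehend.detectSentiment(request);
            System.out.println("Sentiment: " + response.sentimentAsString());
            System.out.println("Scores: " + response.sentimentScore());
        }
    }
}
```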
AWS machine learning platforms

- Amazon SageMaker: This is a platform that removes the complexities from the machine learning process, from building a model through to deploying it. It is a fully managed platform that helps developers and data scientists quickly and easily build, train, and deploy machine learning models at any scale.
- AWS DeepLens: This is a fully programmable video camera, which comes with tutorials, code, and pre-trained models designed to expand deep learning skills. It provides sample projects giving you practical, hands-on experience in deep learning in less than 10 minutes. Models trained in Amazon SageMaker can be sent to AWS DeepLens with just a few clicks from the AWS Management Console.
- Amazon ML: This is a service that provides visualization tools and wizards that guide you in creating a machine learning model without having to learn complex ML algorithms and technology. Using simple APIs, it makes it easy to obtain predictions for your application. It is highly scalable, can generate billions of predictions daily, and serves those predictions in real time and at high throughput.

Read also: Amazon SageMaker makes machine learning on the cloud easy

Deep Learning on AWS

- AWS Deep Learning AMIs: These provide the infrastructure and tools to accelerate deep learning in the cloud, at any scale. To train sophisticated, custom AI models, or to experiment with new algorithms, you can quickly launch Amazon EC2 instances which come pre-installed with popular deep learning frameworks such as Apache MXNet and Gluon, TensorFlow, Microsoft Cognitive Toolkit, Caffe, Caffe2, Theano, Torch, PyTorch, Chainer, and Keras.
- Apache MXNet on AWS: This is a fast and scalable training and inference framework with an easy-to-use, concise API for machine learning. It allows developers of all skill levels to get started with deep learning in the cloud, on edge devices, and in mobile apps using Gluon. You can build linear regressions, convolutional networks, and recurrent LSTMs for object detection, speech recognition, recommendation, and personalization in just a few lines of Gluon code.
- TensorFlow on AWS: You can quickly and easily get started with deep learning in the cloud using TensorFlow. AWS provides a fully managed TensorFlow experience with Amazon SageMaker. You can also use the AWS Deep Learning AMIs to build custom environments and workflows with TensorFlow and other popular frameworks such as Apache MXNet and Gluon, Caffe, Caffe2, Chainer, Torch, Keras, and Microsoft Cognitive Toolkit.

Conclusion

Machine learning and artificial intelligence can be expensive - skills and resources can cost a lot. For that reason, MLaaS is going to be a hugely influential development within cloud. Yes, the range of services on offer from AWS, Azure, and GCP is impressive, but it's really the ease and convenience that are most remarkable. With these services it's easy to set up and run machine learning algorithms that enhance business processes and operations, customer interactions, and overall business strategy. You don't need a PhD, and you don't need to code algorithms from scratch. The MLaaS market will likely continue to grow as more companies realise the potential machine learning has for their business - however, whether anyone can deliver a better set of services than the established cloud providers remains to be seen.

Read more:
Predictive Analytics with AWS: A quick look at Amazon ML
Microsoft supercharges its Azure AI platform with new features
AmoebaNets: Google's new evolutionary AutoML

2018 is the year of graph databases. Here's why.

Amey Varangaonkar
04 May 2018
5 min read
With the explosion of data, businesses are looking to innovate as they connect their operations to a whole host of different technologies. The need for consistency across all data elements is now stronger than ever, and that's where graph databases come in handy. Because they allow for a high level of flexibility when it comes to representing your data, while also handling complex interactions between different elements, graph databases are considered by many to be the next big trend in databases. In this article, we dive deep into the current graph database scene and list the 3 top reasons why graph databases will continue to soar in popularity in 2018.

What are graph databases, anyway?

Simply put, graph databases are databases that follow the graph model. What is a graph model, then? In mathematical terms, a graph is simply a collection of nodes, with different nodes connected by edges. Each node contains some information about the graph, while edges denote the connections between the nodes. How are graph databases different from relational databases, you might ask? Well, the key difference between the two is the fact that graph data models allow for more flexible and fine-grained relationships between data objects than relational models do. There are some further differences between the graph data model and the relational data model, which you should read through for more information. Often, you will see that graph databases are without a schema. This allows for a very flexible data model, much like the document or key/value store database models. A unique feature of graph databases, however, is that they also support relationships between data objects, like a relational database does. This is useful because it allows for a more flexible and faster database, which can be invaluable to a project that demands a quick response time.

(Image courtesy: DB-Engines)

The rise in popularity of graph database models over the last 5 years has been stunning, but not exactly surprising. If we were to drill down into the 3 key factors that have propelled the popularity of graph databases to a whole new level, what would they be? Let's find out.

Major players entering the graph database market

About a decade ago, the graph database family included just Neo4j and a couple of other less popular graph databases. More recently, however, all the major players in the industry, such as Oracle (Oracle Spatial and Graph), Microsoft (Graph Engine), SAP (SAP HANA as a graph store), and IBM (Compose for JanusGraph), have come up with graph offerings of their own. The most recent entrant to the graph database market is Amazon, with Amazon Neptune announced just last year. According to Andy Jassy, CEO of Amazon Web Services, graph databases are becoming a part of the growing trend of multi-model databases. Per Jassy, these databases are finding increased adoption on the cloud as they support a myriad of useful data processing methods, and the traditional over-reliance on relational databases is slowly breaking down.

Rise of the Cypher Query Language

With graph databases slowly getting mainstream recognition and adoption, the major companies have identified the need for a standard query language for all graph databases. Similar to SQL, Cypher has emerged as a standard and is a widely adopted way to write efficient and easy-to-understand graph queries. As of today, the Cypher Query Language is used in popular graph databases such as Neo4j, SAP HANA, RedisGraph, and so on. The openCypher project, which develops and maintains Cypher, has also released Cypher for popular Big Data frameworks like Apache Spark. Cypher's popularity has risen tremendously over the last few years. The primary reason for this is that, like SQL, Cypher is declarative: users state what they want from their graph data without having to spell out exactly how to retrieve it.
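To make the declarative style concrete, here is a hypothetical sketch that runs a Cypher query from Java using the official Neo4j driver; the connection details, graph schema (Person nodes, FRIENDS_WITH relationships), and data are assumptions made for illustration:

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class CypherSketch {
    public static void main(String[] args) {
        // Placeholder connection details for a local Neo4j instance
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                                                  AuthTokens.basic("neo4j", "<password>"));
             Session session = driver.session()) {

            // Declarative Cypher: state the pattern you want, not how to traverse it
            Result result = session.run(
                "MATCH (p:Person {name: $name})-[:FRIENDS_WITH]->(friend) " +
                "RETURN friend.name AS friendName",
                Values.parameters("name", "Alice"));

            while (result.hasNext()) {
                System.out.println(result.next().get("friendName").asString());
            }
        }
    }
}
```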
Finding critical real-world applications

Graph databases were in the news as early as 2016, when the Panama Papers leak was revealed with the help of Neo4j and Linkurious, a data visualization tool. In more recent times, graph databases have found increased application in online recommendation engines, as well as in tasks such as fraud detection and managing social media. Facebook's search app uses graph technology to map social relationships. Graph databases are also finding applications in virtual assistants to drive conversations - eBay's virtual shopping assistant is an example. Even NASA uses a knowledge graph architecture to find critical data.

What next for graph databases?

With the growing adoption of graph databases, we expect graph-based platforms to soon become foundational elements of many corporate tech stacks. The next focus area for these databases will be practical implementations such as graph analytics and building graph-based applications. The rising number of graph databases will also mean more competition, and that is a good thing: competition will bring more innovation and enable the incorporation of more cutting-edge features. With a healthy and steadily growing community of developers, data scientists, and even business analysts, this evolution may be on the cards sooner than we might expect.

Read more:
Amazon Neptune: A graph database service for your applications
When, why and how to use Graph analytics for your big data


Healthcare Analytics: Logistic Regression to Reduce Patient Readmissions

Guest Contributor
20 Dec 2017
8 min read
We bring you another guest post, by Benjamin Rogojan, on using logistic regression to help the healthcare sector reduce patient readmissions. Ben's previous post, on ensemble methods to optimize machine learning models, is also available for a quick read here.

ER visits are not cheap for any party involved, whether that be the patient or the insurance company. However, this does not stop some patients from being regular repeat visitors. These recurring visits are due to a lack of intervention for problems such as substance abuse, chronic diseases, and mental illness. This increases costs for everybody in the healthcare system and reduces quality of care by playing a role in the overflowing of Emergency Departments (EDs). Research teams at UW and other universities are partnering with companies like KenSci to figure out how to approach the problem of reducing readmission rates. The ability to predict the likelihood of a patient's readmission will allow for targeted intervention, which in turn will help reduce the frequency of readmissions, making the population healthier and hopefully reducing the estimated 41.3 billion USD in healthcare costs for the entire system. How do they plan to do it? With big data and statistics, of course.

A plethora of algorithms is available for data scientists to approach this problem. Many possible variables could affect readmission and medical costs, and there are many different ways researchers might pose their questions. However, the researchers at UW and many other institutions have been heavily focused on reducing the readmission rate simply by trying to calculate whether a person will or will not be readmitted. In particular, this team of researchers was curious about chronic ailments. Patients with chronic ailments are likely to have random flare-ups that require immediate attention, so being able to predict whether a patient will have an ER visit can lead to managing the cause more effectively. One approach taken by the data science team at UW, as well as by the Department of Family and Community Medicine at the University of Toronto, was to utilize logistic regression to predict whether or not a patient would be readmitted. Patient readmission can be broken down into a binary output: either the patient is readmitted or not. As such, logistic regression has, in my experience, been a useful model for this problem.

Logistic Regression to predict patient readmissions

Why do data scientists like to use logistic regression? Where is it used? And how does it compare to other algorithms? Logistic regression is a statistical method that statisticians and data scientists use to classify people, products, entities, and so on. It is used for analyzing data that produces a binary classification based on one or many independent variables. This means it produces two clear classifications (Yes or No, 1 or 0, etc.). With the example above, the binary classification is: is the patient readmitted or not? Other examples could be whether to give a customer a loan or not, whether a medical claim is fraudulent or not, or whether a patient has diabetes or not. Despite its name, logistic regression does not produce the same kind of output as linear regression (per se). There are some similarities; for instance, a linear component persists, as you will notice in the equation below, which contains something very similar to a linear equation. But the final output is based on the log odds.
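The equations in the original post were images and did not survive extraction; a standard statement of the model the text describes, writing p for the probability (the symbol the text calls pi) and β0 + x · β for the linear part, is:

$$
p = \frac{e^{\beta_0 + x \cdot \beta}}{1 + e^{\beta_0 + x \cdot \beta}},
\qquad
\log\left(\frac{p}{1 - p}\right) = \beta_0 + x \cdot \beta
$$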
Linear regression and multivariate regression both take one to many independent variables and produce some form of continuous function. Linear regression could be used to predict the price of a house, a person's age, or the cost of a product an e-commerce site should display to each customer. The output is not limited to a few discrete classifications, whereas logistic regression produces discrete classifications. For instance, an algorithm using logistic regression could be used to classify whether a certain stock would trade above or below $50 a share, while linear regression would be used to predict whether a share would be worth $50.01, $50.02, and so on.

Logistic regression is a calculation that uses the odds of a certain classification. In the equation above, the symbol you might know as pi represents the probability, from which the odds are derived. To reduce the error rate, we should predict Y = 1 when p ≥ 0.5 and Y = 0 when p < 0.5. This creates a linear classifier: a boundary such that when the coefficients β0 + x · β correspond to a probability p < 0.5, we predict Y = 0. By generating coefficients that help predict the logit transformation, the method allows us to classify for the characteristic of interest. Now, that is a lot of complex math mumbo jumbo. Let's try to break it down into simpler terms.

Probability vs. Odds

Let's start with probability. Say a patient has a probability of 0.6 of being readmitted. Then the probability that the patient won't be readmitted is .4. Now, we want to take this and convert it into odds, which is what the formula above is doing. You would take .6/.4 and get odds of 1.5. That means the odds of the patient being readmitted are 1.5 to 1. If instead the probability were .5 for both being readmitted and not being readmitted, then the odds would be 1:1. The next step in the logistic regression model is to take the odds and get the log odds. You do this by putting the 1.5 into the log portion of the equation: with the base-10 logarithm you get .18 (rounded), while with the natural logarithm, which the standard logit uses, you get .41. In logistic regression, we don't actually know p; that is what we are trying to find and model using various coefficients and input variables. Each input provides a value that changes how much more likely an event will or will not occur, and all of these coefficients are used to calculate the log odds. The model can take multiple variables, like age, sex, height, etc., and specify how much of an effect each variable has on the odds that an event will occur.

Once the initial model is developed, then comes the work of deciding its value. How does a business go from creating an algorithm inside a computer to translating it into action? Some of us like to say the "computers" are the easy part; personally, I find the hard part to be the "people". After all, at the end of the day, it comes down to business value. Will an algorithm save money or not? That means it has to be applied in real life, which could take the form of a new initiative, strategy, product recommendation, and so on. You need to find the outliers that are worth going after! For instance, going back to the patient readmission example: the algorithm points out patients with high probabilities of being readmitted. However, if the readmission costs are low, those patients will probably be ignored, sadly. That is how businesses (including hospitals) look at problems. Logistic regression is a great tool for binary classification; it is unlike many other algorithms that estimate continuous variables or estimate distributions.
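The probability-to-odds arithmetic above is easy to check in a few lines; here is a minimal sketch using the numbers from the worked example:

```java
public class OddsSketch {
    public static void main(String[] args) {
        double p = 0.6;                       // probability of readmission
        double odds = p / (1 - p);            // 0.6 / 0.4 = 1.5, i.e. odds of 1.5 to 1
        double logOdds = Math.log(odds);      // natural log: ~0.405, what the logit model uses
        double log10Odds = Math.log10(odds);  // base-10 log: ~0.176, the ".18 (rounded)" above

        System.out.printf("odds = %.2f, ln(odds) = %.3f, log10(odds) = %.3f%n",
                          odds, logOdds, log10Odds);
    }
}
```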
This statistical method can be utilized to classify whether a person is likely to get cancer because of environmental variables such as proximity to a highway, smoking habits, and so on. It has been used effectively in the medical, financial, and insurance industries for a while. Knowing when to use which algorithm takes time. However, the more problems a data scientist faces, the faster they will recognize whether to use logistic regression or decision trees. Using logistic regression provides the opportunity for healthcare institutions to accurately target at-risk individuals who should receive a more tailored behavioral health plan to help improve their daily health habits. This in turn opens the opportunity for better health for patients and lower costs for hospitals. [box type="shadow" align="" class="" width=""] About the Author Benjamin Rogojan Ben has spent his career focused on healthcare data. He has focused on developing algorithms to detect fraud, reduce patient readmission, and redesign insurance provider policies to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben privately consults on data science and engineering problems, both solo as well as with a company called Acheron Analytics. He has experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.[/box]


Four self-service business intelligence user types in Qlik Sense

Amey Varangaonkar
29 May 2018
7 min read
With the introduction of self-service to BI, there is segmentation at various levels and breaths on how self-service is conducted and to what extent. There are, quite frankly, different user types that differ from each other in level of interest, technical expertise, and the way in which they consume data. While each user will almost be unique in the way they use self-service, the user base can be divided into four different groups. In this article, we take a look at the four types of users in self-service business intelligence model. The following excerpt is taken from the book Mastering Qlik Sense, authored by Martin Mahler and Juan Ignacio Vitantonio. This book presents expert techniques to design and deploy enterprise-grade Business Intelligence solutions for your business, by leveraging the power of Qlik Sense. Power Users or Data Champions Power users are the most tech-savvy business users, who show a great interest in self-service BI. They produce and build dashboards themselves and know how to load data and process it to create a logical data model. They tend to be self-learning and carry a hybrid set of skills, usually a mixture of business knowledge and some advanced technical skills. This user group is often frustrated with existing reporting or BI solutions and finds IT inadequate in delivering the same. As a result, especially in the past, they take away data dumps from IT solutions and create their own dashboards in Excel, using advanced skills such as VBA, Visual Basic for Applications. They generally like to participate in the development process but have been unable to do so due to governance rules and a strict old-school separation of IT from the business. Self-service BI is addressing this group in particular, and identifying those users is key in reaching adoption within an organization. Within an established self-service environment, power users generally participate in committees revolving around the technical environments and represent the business interest. They also develop the bulk of the first versions of the apps, which, as part of a naturally evolving process, are then handed over to more experienced IT for them to be polished and optimized. Power users advocate the self-service BI technology and often not only demo the insights and information they achieved to extract from their data, but also the efficiency and timeliness of doing so. At the same time, they also serve as the first point of contact for other users and consumers when it comes to questions about their apps and dashboards. Sometimes they also participate in a technical advisory capacity on whether other projects are feasible to be implemented using the same technology. Within a self-service BI environment, it is safe to say that those power users are the pillars of a successful adoption. Business Users or Data Visualizers Users are frequent users of data analytics, with the main goal to extract value from the data they are presented with. They represent the group of the user base which is interested in conducting data analysis and data discovery to better understand their business in order to make better-informed decisions. Presentation and ease of use of the application are key to this type of user group and they are less interested in building new analytics themselves. That being said, some form of creating new charts and loading data is sometimes still of interest to them, albeit on a very basic level. Timeliness, the relevance of data, and the user experience are most relevant to them. 
They are the ones who are slicing and dicing the data and drilling down into dimensions, and who are keen to click around in the app to obtain valuable information. Usually, a group of users belong to the same department and have a power user overseeing them with regard to questions but also in receiving feedback on how the dashboard can be improved even more. Their interaction with IT is mostly limited to requesting access and resolving unexpected technical errors. Consumers or Data Readers Consumers usually form the largest user group of a self-service BI analytics solution. They are the end recipients of the insights and data analytics that have been produced and, normally, are only interested in distilled information which is presented to them in a digested form. They are usually the kind of users who are happy with a report, either digital or in printed form, which summarizes highlights and lowlights in a few pages, requiring no interaction at all. Also, they are most sensitive to the timeliness and availability of their reports. While usually the largest audience, at the same time this user group leverages the self-service capabilities of a BI tool the least. This poses a licensing challenge, as those users don’t take full advantage of the functionality on offer, but are costing the full amount in order to access the reports. It is therefore not uncommon to assign this type of user group a bucket of login access passes or not give them access to the self-service BI platform at all and give them the information they need in (digitally) printed format or within presentations, prepared by users. IT or Data Overseers IT represents the technical user group within this context, who sit in the background and develop and manage the framework within which the self-service BI solution operates. They are the backbone of the deployment and ensure the environment is set up correctly to cater for the various use cases required by the above-described user groups. At the same time, they ensure a security policy is in place and maintained and they introduce a governance framework for deployment, data quality, and best practices. They are in effect responsible for overseeing the power users and helping them with technical questions, but at the same time ensuring terms and definition as well as the look and feel is consistent and maintained across all apps. With self-service BI, IT plays a lesser role in actually developing the dashboards but assumes a more mentoring position, where training, consultation, and advisory in best practices are conducted. While working closely with power users, IT also provides technical support to users and liaises with the IT infrastructure to ensure the server infrastructure is fit for purpose and up and running to serve the users. This also includes upgrading the platform where required and enriching it with additional functionality if and when available. Bringing them together The previous four groups can be distinguished within a typical enterprise environment; however, this is not to say hybrid or fewer user groups are not viable models for self-service BI. It is an evolutionary process in how an organization adapts self-service data analytics with a lot of dependencies on available skills, competing established solutions, culture, and appetite on new technologies. It usually begins with IT being the first users in a newly deployed self-service environment, not only setting up the infrastructure but also developing the first apps for a couple of consumers. 
Power users then follow up; generally, they are the business sponsors themselves who are often big fans of data analytics, modifying the app to their liking and promoting it to their users. The user base emerges with the success of the solution, where analytics are integrated into their business as the usual process. The last group, the consumers, is mostly the last type of user group that is established, which more often than not doesn’t have actual access to the platform itself, but rather receives printouts, email summaries with screenshots, or PowerPoint presentations. Due to licensing cost and the size of the consumer audience, it is not always easy to give them access to the self-service platform; hence, most of the time, an automated and streamlined PDF printing process is the most elegant solution to cater to this type of user group. At the same time, the size of the deployment also determines the number of various user groups. In small enterprise environments, it will be mostly power users and IT who will be using self-service. This greatly simplifies the approach as well as the setup considerations. If you found the above excerpt useful, make sure you check out the book Mastering Qlik Sense to learn helpful tips and tricks to perform effective Business Intelligence using Qlik Sense. Read more: How Qlik Sense is driving self-service Business Intelligence What we learned from Qlik Qonnections 2018 How self-service analytics is changing modern-day businesses


30 common data science terms explained

Aarthi Kumaraswamy
16 May 2018
27 min read
Let’s begin at the beginning. What do terms like statistical population, statistical comparison, statistical inference mean? What good is munging, coding, booting, regularization etc. On a scale of 1 to 30 (1 being the lowest and 30, the highest), rate yourself as a data scientist. No matter what you have scored yourself, we hope to have improved that score at least by a little, by the end of this post. Let’s start with a basic question: What is data science? [box type="shadow" align="" class="" width=""]The following is an excerpt from the book, Statistics for Data Science written by James D. Miller and published by Packt Publishing.[/box] The idea of how data science is defined is a matter of opinion. I personally like the explanation that data science is a progression or, even better, an evolution of thought or steps, as shown in the following figure: Although a progression or evolution implies a sequential journey, in practice, this is an extremely fluid process; each of the phases may inspire the data scientist to reverse and repeat one or more of the phases until they are satisfied. In other words, all or some phases of the process may be repeated until the data scientist determines that the desired outcome is reached. Depending on your sources and individual beliefs, you may say the following: Statistics is data science, and data science is statistics. Based upon personal experience, research, and various industry experts' advice, someone delving into the art of data science should take every opportunity to understand and gain experience as well as proficiency with the following list of common data science terms: Statistical population Probability False positives Statistical inference Regression Fitting Categorical data Classification Clustering Statistical comparison Coding Distributions Data mining Decision trees Machine learning Munging and wrangling Visualization D3 Regularization Assessment Cross-validation Neural networks Boosting Lift Mode Outlier Predictive modeling Big data Confidence interval Writing Statistical population You can perhaps think of a statistical population as a recordset (or a set of records). This set or group of records will be of similar items or events that are of interest to the data scientist for some experiment. For a data developer, a population of data may be a recordset of all sales transactions for a month, and the interest might be reporting to the senior management of an organization which products are the fastest sellers and at which time of the year. For a data scientist, a population may be a recordset of all emergency room admissions during a month, and the area of interest might be to determine the statistical demographics for emergency room use. [box type="note" align="" class="" width=""]Typically, the terms statistical population and statistical model are or can be used interchangeably. Once again, data scientists continue to evolve with their alignment on their use of common terms. [/box] Another key point concerning statistical populations is that the recordset may be a group of (actually) existing objects or a hypothetical group of objects. Using the preceding example, you might draw a comparison of actual objects as those actual sales transactions recorded for the month while the hypothetical objects as sales transactions are expected, forecast, or presumed (based upon observations or experienced assumptions or other logic) to occur during a month. 
Finally, through the use of statistical inference, the data scientist can select a portion or subset of the recordset (or population) with the intention that it will represent the total population for a particular area of interest. This subset is known as a statistical sample. If a sample of a population is chosen accurately, characteristics of the entire population (that the sample is drawn from) can be estimated from the corresponding characteristics of the sample. Probability Probability is concerned with the laws governing random events.                                           -www.britannica.com When thinking of probability, you think of possible upcoming events and the likelihood of them actually occurring. This compares to a statistical thought process that involves analyzing the frequency of past events in an attempt to explain or make sense of the observations. In addition, the data scientist will associate various individual events, studying the relationship of these events. How these different events relate to each other governs the methods and rules that will need to be followed when we're studying their probabilities. [box type="note" align="" class="" width=""]A probability distribution is a table that is used to show the probabilities of various outcomes in a sample population or recordset. [/box] False positives The idea of false positives is a very important statistical (data science) concept. A false positive is a mistake or an errored result. That is, it is a scenario where the results of a process or experiment indicate a fulfilled or true condition when, in fact, the condition is not true (not fulfilled). This situation is also referred to by some data scientists as a false alarm and is most easily understood by considering the idea of a recordset or statistical population (which we discussed earlier in this section) that is determined not only by the accuracy of the processing but by the characteristics of the sampled population. In other words, the data scientist has made errors during the statistical process, or the recordset is a population that does not have an appropriate sample (or characteristics) for what is being investigated. Statistical inference What developer at some point in his or her career, had to create a sample or test data? For example, I've often created a simple script to generate a random number (based upon the number of possible options or choices) and then used that number as the selected option (in my test recordset). This might work well for data development, but with statistics and data science, this is not sufficient. To create sample data (or a sample population), the data scientist will use a process called statistical inference, which is the process of deducing options of an underlying distribution through analysis of the data you have or are trying to generate for. The process is sometimes called inferential statistical analysis and includes testing various hypotheses and deriving estimates. When the data scientist determines that a recordset (or population) should be larger than it actually is, it is assumed that the recordset is a sample from a larger population, and the data scientist will then utilize statistical inference to make up the difference. [box type="note" align="" class="" width=""]The data or recordset in use is referred to by the data scientist as the observed data. 
Inferential statistics can be contrasted with descriptive statistics, which is only concerned with the properties of the observed data and does not assume that the recordset came from a larger population. [/box] Regression Regression is a process or method (selected by the data scientist as the best fit technique for the experiment at hand) used for determining the relationships among variables. If you're a programmer, you have a certain understanding of what a variable is, but in statistics, we use the term differently. Variables are determined to be either dependent or independent. An independent variable (also known as a predictor) is the one that is manipulated by the data scientist in an effort to determine its relationship with a dependent variable. A dependent variable is a variable that the data scientist is measuring. [box type="note" align="" class="" width=""]It is not uncommon to have more than one independent variable in a data science progression or experiment. [/box] More precisely, regression is the process that helps the data scientist comprehend how the typical value of the dependent variable (or criterion variable) changes when any one or more of the independent variables is varied while the other independent variables are held fixed. Fitting Fitting is the process of measuring how well a statistical model or process describes a data scientist's observations pertaining to a recordset or experiment. These measures will attempt to point out the discrepancy between observed values and probable values. The probable values of a model or process are known as a distribution or a probability distribution. Therefore, a probability distribution fitting (or distribution fitting) is when the data scientist fits a probability distribution to a series of data concerning the repeated measurement of a variable phenomenon. The object of a data scientist performing a distribution fitting is to predict the probability or to forecast the frequency of, the occurrence of the phenomenon at a certain interval. [box type="note" align="" class="" width=""]One of the most common uses of fitting is to test whether two samples are drawn from identical distributions.[/box] There are numerous probability distributions a data scientist can select from. Some will fit better to the observed frequency of the data than others will. The distribution giving a close fit is supposed to lead to good predictions; therefore, the data scientist needs to select a distribution that suits the data well. Categorical data Earlier, we explained how variables in your data can be either independent or dependent. Another type of variable definition is a categorical variable. This type of variable is one that can take on one of a limited, and typically fixed, number of possible values, thus assigning each individual to a particular category. Often, the collected data's meaning is unclear. Categorical data is a method that a data scientist can use to put meaning to the data. For example, if a numeric variable is collected (let's say the values found are 4, 10, and 12), the meaning of the variable becomes clear if the values are categorized. Let's suppose that based upon an analysis of how the data was collected, we can group (or categorize) the data by indicating that this data describes university students, and there is the following number of players: 4 tennis players 10 soccer players 12 football players Now, because we grouped the data into categories, the meaning becomes clear. 
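As a small illustration, the university-players example above could be coded along the following lines; the pandas usage here is a generic sketch, and the column names are made up for the example:

```python
import pandas as pd

# Raw observations: one row per student, with the sport they play.
students = pd.DataFrame({
    "student_id": range(1, 27),
    "sport": ["tennis"] * 4 + ["soccer"] * 10 + ["football"] * 12,
})

# Declaring the column as categorical makes the grouping explicit.
students["sport"] = students["sport"].astype("category")

# The distribution of the categorical variable:
# 12 football players, 10 soccer players, 4 tennis players.
print(students["sport"].value_counts())
```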
Some other examples of categorized data might be individual pet preferences (grouped by the type of pet), or vehicle ownership (grouped by the style of a car owned), and so on. So, categorical data, as the name suggests, is data grouped into some sort of category or multiple categories. Some data scientists refer to categories as sub-populations of data. [box type="note" align="" class="" width=""]Categorical data can also be data that is collected as a yes or no answer. For example, hospital admittance data may indicate that patients either smoke or do not smoke. [/box] Classification Statistical classification of data is the process of identifying which category (discussed in the previous section) a data point, observation, or variable should be grouped into. The data science process that carries out a classification process is known as a classifier. Read this post: Classification using Convolutional Neural Networks [box type="note" align="" class="" width=""]Determining whether a book is fiction or non-fiction is a simple example classification. An analysis of data about restaurants might lead to the classification of them among several genres. [/box] Clustering Clustering is the process of dividing up the data occurrences into groups or homogeneous subsets of the dataset, not a predetermined set of groups as in classification (described in the preceding section) but groups identified by the execution of the data science process based upon similarities that it found among the occurrences. Objects in the same group (a group is also referred to as a cluster) are found to be more analogous (in some sense or another) to each other than to those objects found in other groups (or found in other clusters). The process of clustering is found to be very common in exploratory data mining and is also a common technique for statistical data analysis. Statistical comparison Simply put, when you hear the term statistical comparison, one is usually referring to the act of a data scientist performing a process of analysis to view the similarities or variances of two or more groups or populations (or recordsets). As a data developer, one might be familiar with various utilities such as FC Compare, UltraCompare, or WinDiff, which aim to provide the developer with a line-by-line comparison of the contents of two or more (even binary) files. In statistics (data science), this process of comparing is a statistical technique to compare populations or recordsets. In this method, a data scientist will conduct what is called an Analysis of Variance (ANOVA), compare categorical variables (within the recordsets), and so on. [box type="note" align="" class="" width=""]ANOVA is an assortment of statistical methods that are used to analyze the differences among group means and their associated procedures (such as variations among and between groups, populations, or recordsets). This method eventually evolved into the Six Sigma dataset comparisons. [/box] Coding Coding or statistical coding is again a process that a data scientist will use to prepare data for analysis. In this process, both quantitative data values (such as income or years of education) and qualitative data (such as race or gender) are categorized or coded in a consistent way. 
Coding is performed by a data scientist for various reasons such as follows: More effective for running statistical models Computers understand the variables Accountability--so the data scientist can run models blind, or without knowing what variables stand for, to reduce programming/author bias [box type="shadow" align="" class="" width=""]You can imagine the process of coding as the means to transform data into a form required for a system or application. [/box] Distributions The distribution of a statistical recordset (or of a population) is a visualization showing all the possible values (or sometimes referred to as intervals) of the data and how often they occur. When a distribution of categorical data (which we defined earlier in this chapter) is created by a data scientist, it attempts to show the number or percentage of individuals in each group or category. Linking an earlier defined term with this one, a probability distribution, stated in simple terms, can be thought of as a visualization showing the probability of occurrence of different possible outcomes in an experiment. Data mining With data mining, one is usually more absorbed in the data relationships (or the potential relationships between points of data, sometimes referred to as variables) and cognitive analysis. To further define this term, we can say that data mining is sometimes more simply referred to as knowledge discovery or even just discovery, based upon processing through or analyzing data from new or different viewpoints and summarizing it into valuable insights that can be used to increase revenue, cuts costs, or both. Using software dedicated to data mining is just one of several analytical approaches to data mining. Although there are tools dedicated to this purpose (such as IBM Cognos BI and Planning Analytics, Tableau, SAS, and so on.), data mining is all about the analysis process finding correlations or patterns among dozens of fields in the data and that can be effectively accomplished using tools such as MS Excel or any number of open source technologies. [box type="note" align="" class="" width=""]A common technique to data mining is through the creation of custom scripts using tools such as R or Python. In this way, the data scientist has the ability to customize the logic and processing to their exact project needs. [/box] Decision trees A statistical decision tree uses a diagram that looks like a tree. This structure attempts to represent optional decision paths and a predicted outcome for each path selected. A data scientist will use a decision tree to support, track, and model decision making and their possible consequences, including chance event outcomes, resource costs, and utility. It is a common way to display the logic of a data science process. Machine learning Machine learning is one of the most intriguing and exciting areas of data science. It conjures all forms of images around artificial intelligence which includes Neural Networks, Support Vector Machines (SVMs), and so on. Fundamentally, we can describe the term machine learning as a method of training a computer to make or improve predictions or behaviors based on data or, specifically, relationships within that data. Continuing, machine learning is a process by which predictions are made based upon recognized patterns identified within data, and additionally, it is the ability to continuously learn from the data's patterns, therefore continuingly making better predictions. 
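To make the decision tree and machine learning ideas above a little more concrete, here is a minimal sketch that trains a decision tree classifier on scikit-learn's bundled iris dataset and prints the learned decision paths; the dataset and the depth limit are illustrative choices only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# A bundled example dataset stands in for the recordset being modeled.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

# Each split in the tree is a decision point; the leaves are the predicted outcomes.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Held-out accuracy:", round(tree.score(X_test, y_test), 2))
# Show the optional decision paths the model has learned, in tree form.
print(export_text(tree, feature_names=iris.feature_names))
```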
It is not uncommon for someone to mistake the process of machine learning for data mining, but data mining focuses more on exploratory data analysis and is known as unsupervised learning. Machine learning can be used to learn and establish baseline behavioral profiles for various entities and then to find meaningful anomalies. Here is the exciting part: the process of machine learning (using data relationships to make predictions) is known as predictive analytics. Predictive analytics allow the data scientists to produce reliable, repeatable decisions and results and uncover hidden insights through learning from historical relationships and trends in the data. Munging and wrangling The terms munging and wrangling are buzzwords or jargon meant to describe one's efforts to affect the format of data, recordset, or file in some way in an effort to prepare the data for continued or otherwise processing and/or evaluations. With data development, you are most likely familiar with the idea of Extract, Transform, and Load (ETL). In somewhat the same way, a data developer may mung or wrangle data during the transformation steps within an ETL process. Common munging and wrangling may include removing punctuation or HTML tags, data parsing, filtering, all sorts of transforming, mapping, and tying together systems and interfaces that were not specifically designed to interoperate. Munging can also describe the processing or filtering of raw data into another form, allowing for more convenient consumption of the data elsewhere. Munging and wrangling might be performed multiple times within a data science process and/or at different steps in the evolving process. Sometimes, data scientists use munging to include various data visualization, data aggregation, training a statistical model, as well as much other potential work. To this point, munging and wrangling may follow a flow beginning with extracting the data in a raw form, performing the munging using various logic, and lastly, placing the resulting content into a structure for use. Although there are many valid options for munging and wrangling data, preprocessing and manipulation, a tool that is popular with many data scientists today is a product named Trifecta, which claims that it is the number one (data) wrangling solution in many industries. [box type="note" align="" class="" width=""]Trifecta can be downloaded for your personal evaluation from https://www.trifacta.com/. Check it out! [/box] Visualization The main point (although there are other goals and objectives) when leveraging a data visualization technique is to make something complex appear simple. You can think of visualization as any technique for creating a graphic (or similar) to communicate a message. Other motives for using data visualization include the following: To explain the data or put the data in context (which is to highlight demographic statistics) To solve a specific problem (for example, identifying problem areas within a particular business model) To explore the data to reach a better understanding or add clarity (such as what periods of time do this data span?) 
To highlight or illustrate otherwise invisible data (such as isolating outliers residing in the data) To predict, such as potential sales volumes (perhaps based upon seasonality sales statistics) And others Statistical visualization is used in almost every step in the data science process, within the obvious steps such as exploring and visualizing, analyzing and learning, but can also be leveraged during collecting, processing, and the end game of using the identified insights. D3 D3 or D3.js, is essentially an open source JavaScript library designed with the intention of visualizing data using today's web standards. D3 helps put life into your data, utilizing Scalable Vector Graphics (SVG), Canvas, and standard HTML. D3 combines powerful visualization and interaction techniques with a data-driven approach to DOM manipulation, providing data scientists with the full capabilities of modern browsers and the freedom to design the right visual interface that best depicts the objective or assumption. In contrast to many other libraries, D3.js allows inordinate control over the visualization of data. D3 is embedded within an HTML webpage and uses pre-built JavaScript functions to select elements, create SVG objects, style them, or add transitions, dynamic effects, and so on. Regularization Regularization is one possible approach that a data scientist may use for improving the results generated from a statistical model or data science process, such as when addressing a case of overfitting in statistics and data science. [box type="note" align="" class="" width=""]We defined fitting earlier (fitting describes how well a statistical model or process describes a data scientist's observations). Overfitting is a scenario where a statistical model or process seems to fit too well or appears to be too close to the actual data.[/box] Overfitting usually occurs with an overly simple model. This means that you may have only two variables and are drawing conclusions based on the two. For example, using our previously mentioned example of daffodil sales, one might generate a model with temperature as an independent variable and sales as a dependent one. You may see the model fail since it is not as simple as concluding that warmer temperatures will always generate more sales. In this example, there is a tendency to add more data to the process or model in hopes of achieving a better result. The idea sounds reasonable. For example, you have information such as average rainfall, pollen count, fertilizer sales, and so on; could these data points be added as explanatory variables? [box type="note" align="" class="" width=""]An explanatory variable is a type of independent variable with a subtle difference. When a variable is independent, it is not affected at all by any other variables. When a variable isn't independent for certain, it's an explanatory variable. [/box] Continuing to add more and more data to your model will have an effect but will probably cause overfitting, resulting in poor predictions since it will closely resemble the data, which is mostly just background noise. To overcome this situation, a data scientist can use regularization, introducing a tuning parameter (additional factors such as a data points mean value or a minimum or maximum limitation, which gives you the ability to change the complexity or smoothness of your model) into the data science process to solve an ill-posed problem or to prevent overfitting. 
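One common, concrete form of this tuning-parameter idea is ridge (L2) regularization, where a single parameter, alpha, controls how strongly large coefficients are penalized. The sketch below is only an illustration of that mechanism, using made-up daffodil-sales-style data rather than anything from the text:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Illustrative data: temperature plus several mostly-noise variables
# (say rainfall, pollen count, fertilizer sales) versus daffodil sales.
X = rng.normal(size=(50, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=50)  # only temperature matters

plain = LinearRegression().fit(X, y)
regularized = Ridge(alpha=10.0).fit(X, y)  # alpha is the tuning parameter

# Ridge shrinks the coefficients on the noise variables toward zero,
# which helps keep the model from fitting the background noise.
print("Ordinary least squares:", plain.coef_.round(2))
print("Ridge (alpha=10):      ", regularized.coef_.round(2))
```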
Assessment When a data scientist evaluates a model or data science process for performance, this is referred to as assessment. Performance can be defined in several ways, including the model's growth of learning or its ability to improve (to obtain a better score) with additional experience (for example, more rounds of training with additional samples of data), or the accuracy of its results. One popular method of assessing a model's or process's performance is called bootstrap sampling. This method examines performance on certain subsets of data, repeatedly generating results that can be used to calculate an estimate of accuracy (performance). The bootstrap sampling method takes a random sample of data and splits it into three files--a training file, a testing file, and a validation file. The model or process logic is developed based on the data in the training file and then evaluated (or tested) using the testing file. This tune-and-test process is repeated until the data scientist is comfortable with the results of the tests. At that point, the model or process is tested again, this time using the validation file, and the results should provide a true indication of how it will perform. [box type="note" align="" class="" width=""]You can imagine using the bootstrap sampling method to develop program logic by analyzing test data to determine logic flows and then running (or testing) your logic against the test data file. Once you are satisfied that your logic handles all of the conditions and exceptions found in your testing data, you can run a final test on a new, never-before-seen data file for a final validation test. [/box] Cross-validation Cross-validation is a method for assessing a data science process's performance. Mainly used with predictive modeling to estimate how accurately a model might perform in practice, cross-validation is often used to check how a model will potentially generalize, in other words, how the model can apply what it infers from samples to an entire population (or recordset). With cross-validation, you identify a (known) dataset as your validation dataset on which training is run, along with a dataset of unknown data (or first-seen data) against which the model will be tested (this is known as your testing dataset). The objective is to ensure that problems such as overfitting (allowing non-inclusive information to influence results) are controlled and also to provide an insight into how the model will generalize to a real problem or a real data file. The cross-validation process consists of separating data into samples of similar subsets, performing the analysis on one subset (called the training set) and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple iterations (also called folds or rounds) of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. Typically, a data scientist will use a model's stability to determine the actual number of rounds of cross-validation that should be performed. Neural networks Neural networks are also called artificial neural networks (ANNs), and the objective is to solve problems in the same way that the human brain would. A Google search will turn up the following explanation of an ANN, as stated in Neural Network Primer: Part I, by Maureen Caudill, AI Expert, Feb.
1989: [box type="note" align="" class="" width=""]A computing system made up of several simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs. [/box] To oversimplify the idea of neural networks, recall the concept of software encapsulation, and consider a computer program with an input layer, a processing layer, and an output layer. With this thought in mind, understand that neural networks are also organized in a network of these layers, usually with more than a single processing layer. Patterns are presented to the network by way of the input layer, which then communicates to one (or more) of the processing layers (where the actual processing is done). The processing layers then link to an output layer where the result is presented. Most neural networks will also contain some form of learning rule that modifies the weights of the connections (in other words, the network learns which processing nodes perform better and gives them a heavier weight) per the input patterns that it is presented with. In this way (in a sense), neural networks learn by example as a child learns to recognize a cat from being exposed to examples of cats. Boosting In a manner of speaking, boosting is a process generally accepted in data science for improving the accuracy of a weak learning data science process. [box type="note" align="" class="" width=""]Data science processes defined as weak learners are those that produce results that are only slightly better than if you would randomly guess the outcome. Weak learners are basically thresholds or a 1-level decision tree. [/box] Specifically, boosting is aimed at reducing bias and variance in supervised learning. What do we mean by bias and variance? Before going on further about boosting, let's take note of what we mean by bias and variance. Data scientists describe bias as a level of favoritism that is present in the data collection process, resulting in uneven, disingenuous results and can occur in a variety of different ways. A sampling method is called biased if it systematically favors some outcomes over others. A variance may be defined (by a data scientist) simply as the distance from a variable mean (or how far from the average a result is). The boosting method can be described as a data scientist repeatedly running through a data science process (that has been identified as a weak learning process), with each iteration running on different and random examples of data sampled from the original population recordset. All the results (or classifiers or residue) produced by each run are then combined into a single merged result (that is a gradient). This concept of using a random subset of the original recordset for each iteration originates from bootstrap sampling in bagging and has a similar variance-reducing effect on the combined model. In addition, some data scientists consider boosting a means to convert weak learners into strong ones; in fact, to some, the process of boosting simply means turning a weak learner into a strong learner. Lift In data science, the term lift compares the frequency of an observed pattern within a recordset or population with how frequently you might expect to see that same pattern occur within the data by chance or randomly. If the lift is very low, then typically, a data scientist will expect that there is a very good probability that the pattern identified is occurring just by chance. 
The larger the lift, the more likely it is that the pattern is real. Mode In statistics and data science, when a data scientist uses the term mode, he or she refers to the value that occurs most often within a sample of data. Mode is not calculated but is determined manually or through processing of the data. Outlier Outliers can be defined as follows: A data point that is way out of keeping with the others That piece of data that doesn't fit Either a very high value or a very low value Unusual observations within the data An observation point that is distant from all others Predictive modeling The development of statistical models and/or data science processes to predict future events is called predictive modeling. Big Data Again, we have some variation of the definition of big data. A large assemblage of data, data sets that are so large or complex that traditional data processing applications are inadequate, and data about every aspect of our lives have all been used to define or refer to big data. In 2001, then Gartner analyst Doug Laney introduced the 3V's concept. The 3V's, as per Laney, are volume, variety, and velocity. The V's make up the dimensionality of big data: volume (or the measurable amount of data), variety (meaning the number of types of data), and velocity (referring to the speed of processing or dealing with that data). Confidence interval The confidence interval is a range of values that a data scientist will specify around an estimate to indicate their margin of error, combined with a probability that a value will fall in that range. In other words, confidence intervals are good estimates of the unknown population parameter. Writing Although visualizations grab much more of the limelight when it comes to presenting the output or results of a data science process or predictive model, writing skills are still not only an important part of how a data scientist communicates but still considered an essential skill for all data scientists to be successful. Did we miss any of your favorite terms? Now that you are at the end of this post, we ask you again: On a scale of 1 to 30 (1 being the lowest and 30, the highest), how do you rate yourself as a data scientist? Why You Need to Know Statistics To Be a Good Data Scientist [interview] How data scientists test hypotheses and probability 6 Key Areas to focus on while transitioning to a Data Scientist role Soft skills every data scientist should teach their child

Why Oracle is losing the Database Race

Aaron Lazar
06 Apr 2018
3 min read
When you think of databases, the first thing that comes to mind is Oracle or IBM. Oracle has been ruling the database world for decades now, and it has been able to acquire tonnes of applications that use its databases. However, that’s changing now, and if you didn’t know already, you might be surprised to know that Oracle is losing the database race. Oracle = Goliath Oracle was and still is ranked number one among databases, owing to its legacy in the database ballpark. Source - DB Engines The main reason why Oracle has managed to hold its position is because of lock-in, a CIO’s worst nightmare. Migrating data that’s accumulated over the years is not a walk in the park and usually has top management flinching every time it’s mentioned. Another reason is because Oracle is known to be aggressive when it comes to maintaining and enforcing licensing terms. You won’t be surprised to find Oracle ‘agents’ at the doorstep of your organisation, slapping you with a big fine for non-compliance! Oracle != Goliath for everyone You might wonder whether even the biggies are in the same position, locked-in with Oracle. Well, the Amazons and Salesforces of the world have quietly moved away from lock-in hell and have their applications now running on open-source projects. In fact, Salesforce plans to be completely free of Oracle databases by 2023 and has even codenamed this project “Sayonara”. I wonder what inspired the name! Enter the “Davids” of Databases While Oracle’s databases have been declining, alternatives like SQL Server and PostgreSQL have been steadily growing. SQL Server has been doing it in leaps and bounds, with a growth rate of over 30%. Amazon and Microsoft’s cloud based databases have seen close to 10x growth. While one might think that all Cloud solutions would have dominated the database world, databases like Google Cloud SQL and IBM Cognos have been suffering very slow to no growth as the question of lock-in arises again, only this time with a cloud vendor. MongoDB has been another shining star in the database race. Several large organisations like HSBC, Adobe, Ebay, Forbes and MTV have adopted MongoDB as their database solution. Newer organisations have been resorting to adopt these databases instead to looking to Oracle. However, it’s not really eating into Oracle’s existing market, at least not yet. Is 18c Oracle’s silver bullet? Oracle bragged a lot about 18c, last year, positioning it as a database that needs little to no human interference thanks to its ground-breaking machine learning; one that operates at less than 30 minutes of downtime a year and many more features. Does this make Microsoft and Amazon break into a sweat? Hell no! Although Oracle has strategically positioned 18c as a database that lowers operational cost by cutting down on the human element, it still is quite expensive when compared to its competitors - they haven’t dropped their price one bit. Moreover, it can’t really automate “everything” and there’s always a need for a human administrator - not really convincing enough. Quite naturally customers will be drawn towards competition. In the end, the way I look at it, Oracle already had a head start and is now inches from the elusive finish line, probably sniggering away at all the customers that it has on a leash. All while cloud databases are slowly catching up and will soon be leaving Oracle in a heap of dirt. Reminds me of that fable mum used to read to me...what’s it called...The hare and the tortoise.


5 ways artificial intelligence is upgrading software engineering

Melisha Dsouza
02 Sep 2018
8 min read
47% of digitally mature organizations, or those that have advanced digital practices, said they have a defined AI strategy (Source: Adobe). It is estimated that  AI-enabled tools alone will generate $2.9 trillion in business value by 2021.  80% of enterprises are smartly investing in AI. The stats speak for themselves. AI clearly follows the motto “go big or go home”. This explosive growth of AI in different sectors of technology is also beginning to show its colors in software development. Shawn Drost, co-founder and lead instructor of coding boot camp ‘Hack Reactor’ says that AI still has a long way to go and is only impacting the workflow of a small portion of software engineers on a minority of projects right now. AI promises to change how organizations will conduct business and to make applications smarter. It is only logical then that software development, i.e., the way we build apps, will be impacted by AI as well. Forrester Research recently surveyed 25 application development and delivery (AD&D) teams, and respondents said AI will improve planning, development and especially testing. We can expect better software created under traditional environments. 5 areas of Software Engineering AI will transform The 5 major spheres of software development-  Software design, Software testing, GUI testing, strategic decision making, and automated code generation- are all areas where AI can help. A majority of interest in applying AI to software development is already seen in automated testing and bug detection tools. Next in line are the software design precepts, decision-making strategies, and finally automating software deployment pipelines. Let's take an in-depth look into the areas of high and medium interest of software engineering impacted by AI according to the Forrester Research report.     Source: Forbes.com #1 Software design In software engineering, planning a project and designing it from scratch need designers to apply their specialized learning and experience to come up with alternative solutions before settling on a definite solution. A designer begins with a vision of the solution, and after that retracts and forwards investigating plan changes until they reach the desired solution. Settling on the correct plan choices for each stage is a tedious and mistake-prone action for designers. Along this line, a few AI developments have demonstrated the advantages of enhancing traditional methods with intelligent specialists. The catch here is that the operator behaves like an individual partner to the client. This associate should have the capacity to offer opportune direction on the most proficient method to do design projects. For instance, take the example of AIDA- The Artificial Intelligence Design Assistant, deployed by Bookmark (a website building platform). Using AI, AIDA understands a users needs and desires and uses this knowledge to create an appropriate website for the user. It makes selections from millions of combinations to create a website style, focus, image and more that are customized for the user. In about 2 minutes, AIDA designs the first version of the website, and from that point it becomes a drag and drop operation. You can get a detailed overview of this tool on designshack. #2 Software testing Applications interact with each other through countless  APIs. They leverage legacy systems and grow in complexity everyday. Increase in complexity also leads to its fair share of challenges that can be overcome by machine-based intelligence. 
AI tools can be used to create test information, explore information authenticity, advancement and examination of the scope and also for test management. Artificial intelligence, trained right, can ensure the testing performed is error free. Testers freed from repetitive manual tests thus have more time to create new automated software tests with sophisticated features. Also, if software tests are repeated every time source code is modified, repeating those tests can be not only time-consuming but extremely costly. AI comes to the rescue once again by automating the testing for you! With AI automated testing, one can increase the overall scope of tests leading to an overall improvement of software quality. Take, for instance, the Functionize tool. It enables users to test fast and release faster with AI enabled cloud testing. The users just have to type a test plan in English and it will be automatically get converted into a functional test case. The tool allows one to elastically scale functional, load, and performance tests across every browser and device in the cloud. It also includes Self-healing tests that update autonomously in real-time. SapFix is another AI Hybrid tool deployed by Facebook which can automatically generate fixes for specific bugs identified by 'Sapienz'. It then proposes these fixes to engineers for approval and deployment to production.   #3 GUI testing Graphical User Interfaces (GUI) have become important in interacting with today's software. They are increasingly being used in critical systems and testing them is necessary to avert failures. With very few tools and techniques available to aid in the testing process, testing GUIs is difficult. Currently used GUI testing methods are ad hoc. They require the test designer to perform humongous tasks like manually developing test cases, identifying the conditions to check during test execution, determining when to check these conditions, and finally evaluate whether the GUI software is adequately tested. Phew! Now that is a lot of work. Also, not forgetting that if the GUI is modified after being tested, the test designer must change the test suite and perform re-testing. As a result, GUI testing today is resource intensive and it is difficult to determine if the testing is adequate. Applitools is a GUI tester tool empowered by AI. The Applitools Eyes SDK automatically tests whether visual code is functioning properly or not. Applitools enables users to test their visual code just as thoroughly as their functional UI code to ensure that the visual look of the application is as you expect it to be. Users can test how their application looks in multiple screen layouts to ensure that they all fit the design. It allows users to keep track of both the web page behaviour, as well as the look of the webpage. Users can test everything they develop from the functional behavior of their application to its visual look. #4 Using Artificial Intelligence in Strategic Decision-Making Normally, developers have to go through a long process to decide what features to include in a product. However, machine learning AI solution trained on business factors and past development projects can analyze the performance of existing applications and help both teams of engineers and business stakeholders like project managers to find solutions to maximize impact and cut risk. Normally, the transformation of business requirements into technology specifications requires a significant timeline for planning. 
Machine learning can help software development companies to speed up the process, deliver the product in lesser time, and increase revenue within a short span. AI canvas is a well known tool for Strategic Decision making.The canvas helps identify the key questions and feasibility challenges associated with building and deploying machine learning models in the enterprise. The AI Canvas is a simple tool that helps enterprises organize what they need to know into seven categories, namely- Prediction, Judgement, Action, Outcome, Input, Training and feedback. Clarifying these seven factors for each critical decision throughout the organization will help in identifying opportunities for AIs to either reduce costs or enhance performance.   #5 Automatic Code generation/Intelligent Programming Assistants Coding a huge project from scratch is often labour intensive and time consuming. An Intelligent AI programming assistant will reduce the workload by a great extent. To combat the issues of time and money constraints, researchers have tried to build systems that can write code before, but the problem is that these methods aren’t that good with ambiguity. Hence, a lot of details are needed about what the target program aims at doing, and writing down these details can be as much work as just writing the code. With AI, the story can be flipped. ”‘Bayou’- an A.I. based application is an Intelligent programming assistant. It began as an initiative aimed at extracting knowledge from online source code repositories like GitHub. Users can try it out at askbayou.com. Bayou follows a method called neural sketch learning. It trains an artificial neural network to recognize high-level patterns in hundreds of thousands of Java programs. It does this by creating a “sketch” for each program it reads and then associates this sketch with the “intent” that lies behind the program. This DARPA initiative aims at making programming easier and less error prone. Sounds intriguing? Now that you know how this tool works, why not try it for yourself on i-programmer.info. Summing it all up Software engineering has seen massive transformation over the past few years. AI and software intelligence tools aim to make software development easier and more reliable. According to a Forrester Research report on AI's impact on software development, automated testing and bug detection tools use AI the most to improve software development. It will be interesting to see the future developments in software engineering empowered with AI. I’m expecting faster, more efficient, more effective, and less costly software development cycles while engineers and other development personnel focus on bettering their skills to make advanced use of AI in their processes. Implementing Software Engineering Best Practices and Techniques with Apache Maven Intelligent Edge Analytics: 7 ways machine learning is driving edge computing adoption in 2018 15 millions jobs in Britain at stake with AI robots set to replace humans at workforce

How Artificial Intelligence and Machine Learning can turbocharge a Game Developer's career

Guest Contributor
06 Sep 2018
7 min read
Gaming - whether board games or games set in the virtual realm - has been a massively popular form of entertainment since time immemorial. In the pursuit of creating more sophisticated, thrilling, and intelligent games, game developers have delved into ML and AI technologies to fuel innovation in the gaming sphere. The gaming domain is the ideal experimentation bed for evolving technologies because not only does it put up complex and challenging problems for ML and AI to solve, it also serves as a ground for creativity - a meeting ground for machine learning and the art of interaction.

Machine Learning and Artificial Intelligence in Gaming

The reliance on AI for gaming is not a recent development. In fact, it dates back to 1949, when the famous cryptographer and mathematician Claude Shannon made his musings public about how a supercomputer could be made to master Chess. Then again, in 1952, a graduate student in the UK developed an AI that could play tic-tac-toe with ultimate perfection.

However, it isn't just ML and AI that are progressing through experimentation on games. Game development, too, has benefited a great deal from these pioneering technologies. AI and ML have helped enhance the gaming experience on many grounds, such as game design, the interactive quotient, and the inner functionalities of games. These AI use cases focus on two primary things: one is to impart enhanced realism to the virtual gaming environment, and the second is to create a more naturalistic interface between the gaming environment and the players.

As of now, the focus of game developers, data scientists, and ML researchers lies in two specific categories of the gaming domain - games of perfect information and games of imperfect information. In games of perfect information, a player is aware of all the aspects of the game throughout the playing session, whereas, in games of imperfect information, players are oblivious to specific aspects of the game. When it comes to games of perfect information such as Chess and Go, AI has shown various instances of overpowering human intelligence. Back in 1997, IBM's Deep Blue successfully defeated world Chess champion Garry Kasparov in a six-game match. In 2016, Google's AlphaGo emerged as the victor in a Go match, scoring 4-1 after defeating South Korean Go champion Lee Sedol. One of the most advanced chess AIs developed yet, Stockfish, uses a combination of advanced heuristics and brute force to compute numeric values for each and every move in a specific position in Chess. It also effectively eliminates bad moves using the alpha-beta pruning search algorithm, sketched below.
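Alpha-beta pruning is a refinement of classic minimax search: it skips branches that provably cannot influence the final decision. The snippet below is not Stockfish's implementation, just a minimal, generic sketch of minimax with alpha-beta pruning over a toy game tree, where inner lists are positions and plain numbers are heuristic evaluations of leaf positions.

```python
import math

def alphabeta(node, alpha, beta, maximizing):
    """Minimax with alpha-beta pruning over a nested-list game tree."""
    if not isinstance(node, list):
        return node  # leaf: its heuristic evaluation
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cut-off: the minimizing player avoids this branch
        return value
    else:
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break  # alpha cut-off: the maximizing player avoids this branch
        return value

# A tiny hand-made tree; a real engine would generate children from board positions.
tree = [[3, 5], [6, [9, 1]], [1, 2]]
print(alphabeta(tree, -math.inf, math.inf, maximizing=True))  # prints 6
```

Engines such as Stockfish combine this pruning with far more sophisticated move ordering, evaluation functions, and search extensions, but the cut-off logic is the same idea.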
While the progress and contribution of AI and ML to the field of games of perfect information is laudable, researchers are now intrigued by games of imperfect information. Games of imperfect information offer much more challenging situations that are essentially difficult for machines to learn and master. Thus, the next evolution in the world of gaming will be to create spontaneous gaming environments using AI technology, in which developers build only the gaming environment and its mechanics instead of creating a game with pre-programmed/scripted plots. In such a scenario, the AI will have to confront and solve spontaneous challenges, with personalized scenarios generated on the spot. Games like StarCraft and StarCraft II have stirred up massive interest among game researchers and developers. In these games, the players are only partially aware of the gaming aspects, and the game is largely determined not just by the AI moves and the previous state of the game, but also by the moves of other players. Since in these games you have little knowledge about your rival's moves, you have to take decisions on the go, and your moves have to be spontaneous.

The recent win of OpenAI Five over amateur human players in Dota 2 is a good case in point. OpenAI Five is a team of five neural networks that leverages an advanced version of Proximal Policy Optimization and uses a separate LSTM to learn identifiable strategies. The progress of OpenAI Five shows that even without human data, reinforcement learning can facilitate long-term planning, thus allowing us to make further progress in games of imperfect information.

Career in Game Development with ML and AI

As ML and AI continue to penetrate the gaming industry, they are creating a huge demand for talented and skilled game developers who are well-versed in these technologies. Today, game development is at a place where it's no longer necessary to build games using time-consuming manual techniques. ML and AI have made the task of game developers easier: by leveraging these technologies, they can design and build innovative gaming environments and test them automatically.

The integration of AI and ML in the gaming domain is giving birth to new job positions like Gameplay Software Engineer (AI), Gameplay Programmer (AI), and Game Security Data Scientist, to name a few. The salaries of traditional game developers stand in stark contrast with those of developers who have AI/ML skills. While the average salary of game developers is usually around $44,000, it can scale up to and over $120,000 if one possesses AI/ML skills.

Gameplay Engineer

Average salary: $73,000 - $116,000

Gameplay engineers are usually part of the core game dev team and are entrusted with the responsibility of enhancing the existing gameplay systems to enrich the player experience. Companies today demand gameplay engineers who are proficient in C/C++ and well-versed with AI/ML technologies.

Gameplay Programmer

Average salary: $98,000 - $149,000

Gameplay programmers work in close collaboration with the production and design teams to develop cutting-edge features in existing and upcoming gameplay systems. Programming skills are a must, and knowledge of AI/ML technologies is an added bonus.

Game Security Data Scientist

Average salary: $73,000 - $106,000

The role of a game security data scientist is to combine both security and data science approaches to detect anomalies and fraudulent behavior in games. This calls for a high degree of expertise in AI, ML, and other statistical methods.

With impressive salaries and exciting job opportunities cropping up fast in the game development sphere, the industry is attracting some major talent. Game developers and software developers around the world are choosing the field due to the promise of rapid career growth. If you wish to bag better and more challenging roles in the domain of game development, you should definitely try to upskill your talent and knowledge base by mastering the fields of ML and AI.

Packt Publishing is the leading UK provider of Technology eBooks, Coding eBooks, Videos and Blogs, helping IT professionals to put software to work. It offers several books and videos on game development with AI and machine learning. It's never too late to learn new disciplines and expand your knowledge base.
There are numerous online platforms that offer great artificial intelligence courses. The perk of learning from a registered online platform is that you can learn and grow at your own pace and at your convenience. So, enroll yourself in one and spice up your career in game development!

About the author: Abhinav Rai is a Data Analyst at UpGrad, an online education platform providing industry-oriented programs in collaboration with world-class institutes such as MICA, IIIT Bangalore, and BITS, and with various industry leaders including MakeMyTrip, Ola, and Flipkart.

Best game engines for AI game development
Implementing Unity game engine and assets for 2D game development [Tutorial]
How to use arrays, lists, and dictionaries in Unity for 3D game development

Quantum expert Robert Sutor explains the basics of Quantum Computing

Packt Editorial Staff
12 Dec 2019
9 min read
What if we could do chemistry inside a computer instead of in a test tube or beaker in the laboratory? What if running a new experiment was as simple as running an app and having it completed in a few seconds? For this to really work, we would want it to happen with complete fidelity. The atoms and molecules as modeled in the computer should behave exactly like they do in the test tube. The chemical reactions that happen in the physical world would have precise computational analogs. We would need a completely accurate simulation.

If we could do this at scale, we might be able to compute the molecules we want and need. These might be new materials for shampoos or even alloys for cars and airplanes. Perhaps we could more efficiently discover medicines that are customized to your exact physiology. Maybe we could get better insight into how proteins fold, thereby understanding their function, and possibly creating custom enzymes to positively change our body chemistry. Is this plausible? We have massive supercomputers that can run all kinds of simulations. Can we model molecules in the above ways today?

This article is an excerpt from the book Dancing with Qubits written by Robert Sutor. Robert helps you understand how quantum computing works and delves into the math behind it with this quantum computing textbook.

Can supercomputers model chemical simulations?

Let's start with C8H10N4O2 – 1,3,7-Trimethylxanthine. This is a very fancy name for a molecule that millions of people around the world enjoy every day: caffeine. An 8-ounce cup of coffee contains approximately 95 mg of caffeine, and this translates to roughly 2.95 × 10^20 molecules. Written out, this is 295,000,000,000,000,000,000 molecules. A 12-ounce can of a popular cola drink has 32 mg of caffeine, the diet version has 42 mg, and energy drinks often have about 77 mg.
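If you want to sanity-check that molecule count, it is a one-line calculation with Avogadro's number. The molar mass of caffeine (roughly 194.19 g/mol) is supplied here for the check; it is not part of the original excerpt.

```python
# Back-of-the-envelope check of the caffeine molecule count quoted above.
AVOGADRO = 6.022e23            # molecules per mole
CAFFEINE_MOLAR_MASS = 194.19   # grams per mole for C8H10N4O2

grams_per_cup = 0.095          # 95 mg of caffeine in an 8-ounce coffee
molecules = grams_per_cup / CAFFEINE_MOLAR_MASS * AVOGADRO
print(f"{molecules:.2e}")      # ~2.95e+20, matching the figure above
```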
These numbers are large because we are counting physical objects in our universe, which we know is very big. Scientists estimate, for example, that there are between 10^49 and 10^50 atoms in our planet alone. To put these values in context, one thousand = 10^3, one million = 10^6, one billion = 10^9, and so on. A gigabyte of storage is one billion bytes, and a terabyte is 10^12 bytes.

Getting back to the question I posed at the beginning of this section, can we model caffeine exactly on a computer? We don't have to model the huge number of caffeine molecules in a cup of coffee, but can we fully represent a single molecule at a single instant? Caffeine is a small molecule and contains protons, neutrons, and electrons. In particular, if we just look at the energy configuration that determines the structure of the molecule and the bonds that hold it all together, the amount of information to describe this is staggering. More precisely, the number of bits, the 0s and 1s, needed is approximately 10^48: 1,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000. And this is just one molecule! Yet somehow nature manages to deal quite effectively with all this information. It handles the single caffeine molecule, all those in your coffee, tea, or soft drink, and every other molecule that makes up you and the world around you. How does it do this? We don't know! Of course, there are theories, and these live at the intersection of physics and philosophy. However, we do not need to understand it fully to try to harness its capabilities.

We have no hope of providing enough traditional storage to hold this much information. Our dream of exact representation appears to be dashed. This is what Richard Feynman meant in his quote: "Nature isn't classical." However, 160 qubits (quantum bits) could hold 2^160 ≈ 1.46 × 10^48 bits while the qubits were involved in a computation. To be clear, I'm not saying how we would get all the data into those qubits, and I'm also not saying how many more we would need to do something interesting with the information. It does give us hope, however. In the classical case, we will never fully represent the caffeine molecule. In the future, with enough very high-quality qubits in a powerful quantum computing system, we may be able to perform chemistry on a computer.

How quantum computing is different from classical computing

I can write a little app on a classical computer that can simulate a coin flip. This might be for my phone or laptop. Instead of heads or tails, let's use 1 and 0. The routine, which I call R, starts with one of those values and randomly returns one or the other. That is, 50% of the time it returns 1 and 50% of the time it returns 0. We have no knowledge whatsoever of how R does what it does. When you see "R," think "random." This is called a "fair flip." It is not weighted to slightly prefer one result over the other. Whether we can produce a truly random result on a classical computer is another question. Let's assume our app is fair.

If I apply R to 1, half the time I expect 1 and the other half 0. The same is true if I apply R to 0. I'll call these applications R(1) and R(0), respectively. If I look at the result of R(1) or R(0), there is no way to tell if I started with 1 or 0. This is just like a secret coin flip, where I can't tell whether I began with heads or tails just by looking at how the coin has landed. By "secret coin flip," I mean that someone else has flipped it and I can see the result, but I have no knowledge of the mechanics of the flip itself or the starting state of the coin. If R(1) and R(0) are randomly 1 and 0, what happens when I apply R twice? I write this as R(R(1)) and R(R(0)). It's the same answer: a random result with an equal split. The same thing happens no matter how many times we apply R. The result is random, and we can't reverse things to learn the initial value.

Now for the quantum version. Instead of R, I use H. It too returns 0 or 1 with equal chance, but it has two interesting properties:

- It is reversible. Though it produces a random 1 or 0 starting from either of them, we can always go back and see the value with which we began.
- It is its own reverse (or inverse) operation. Applying it two times in a row is the same as having done nothing at all.

There is a catch, though. You are not allowed to look at the result of what H does if you want to reverse its effect. If you apply H to 0 or 1, peek at the result, and apply H again to that, it is the same as if you had used R. If you observe what is going on in the quantum case at the wrong time, you are right back at strictly classical behavior. To summarize using the coin language: if you flip a quantum coin and then don't look at it, flipping it again will yield the heads or tails with which you started. If you do look, you get classical randomness.
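The H described above behaves like the Hadamard operation on a single qubit, and its two properties are easy to verify with a little linear algebra. The following numpy sketch is an illustration under that assumption (it is not from the book's text): the values 0 and 1 become the basis vectors, R is an ordinary random choice, and H is a 2×2 matrix that is its own inverse.

```python
import numpy as np

rng = np.random.default_rng()

def R(_bit):
    """Classical fair flip: the result tells you nothing about the input."""
    return rng.integers(0, 2)

# The quantum 'coin flip': a 2x2 matrix acting on state vectors.
H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)

ket0 = np.array([1.0, 0.0])   # the state "0"

once = H @ ket0
print(once)                   # [0.707 0.707]: an equal superposition
print(np.abs(once) ** 2)      # [0.5 0.5]: measuring gives 0 or 1 with equal chance

twice = H @ once              # apply H again *without* measuring in between
print(np.round(twice, 10))    # [1. 0.]: right back at the state "0"
print(np.allclose(H @ H, np.eye(2)))  # True: H is its own inverse
```

Measuring after the first H (looking at the coin) collapses the state to a plain 0 or 1, which is why peeking destroys the reversibility.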
A second area where quantum is different is in how we can work with simultaneous values. Your phone or laptop uses bytes as individual units of memory or storage. That's where we get phrases like "megabyte," which means one million bytes of information. A byte is further broken down into eight bits, which we've seen before. Each bit can be a 0 or 1. Doing the math, each byte can represent 2^8 = 256 different numbers composed of eight 0s or 1s, but it can only hold one value at a time. Eight qubits can represent all 256 values at the same time. This is through superposition, but also through entanglement, the way we can tightly tie together the behavior of two or more qubits. This is what gives us the (literally) exponential growth in the amount of working memory.

How quantum computing can help artificial intelligence

Artificial intelligence and one of its subsets, machine learning, are extremely broad collections of data-driven techniques and models. They are used to help find patterns in information, learn from the information, and automatically perform more "intelligently." They also give humans help and insight that might have been difficult to get otherwise.

Here is a way to start thinking about how quantum computing might be applicable to large, complicated, computation-intensive systems of processes such as those found in AI and elsewhere. These three cases are in some sense the "small, medium, and large" ways quantum computing might complement classical techniques:

- There is a single mathematical computation somewhere in the middle of a software component that might be sped up via a quantum algorithm.
- There is a well-described component of a classical process that could be replaced with a quantum version.
- There is a way to avoid the use of some classical components entirely in the traditional method because of quantum, or the entire classical algorithm can be replaced by a much faster or more effective quantum alternative.

As I write this, quantum computers are not "big data" machines. This means you cannot take millions of records of information and provide them as input to a quantum calculation. Instead, quantum may be able to help where the number of inputs is modest but the computations "blow up" as you start examining relationships or dependencies in the data. In the future, however, quantum computers may be able to input, output, and process much more data. Even if it is just theoretical now, it makes sense to ask if there are quantum algorithms that can be useful in AI someday.

To summarize, we explored how quantum computing works and where it might eventually help artificial intelligence. Get the quantum computing book Dancing with Qubits by Robert Sutor today, where he explores the inner workings of quantum computing. The book entails some sophisticated mathematical exposition and is therefore best suited for those with a healthy interest in mathematics, physics, engineering, and computer science.

Intel introduces cryogenic control chip, ‘Horse Ridge’ for commercially viable quantum computing
Microsoft announces Azure Quantum, an open cloud ecosystem to learn and build scalable quantum solutions
Amazon re:Invent 2019 Day One: AWS launches Braket, its new quantum service and releases

Why Neo4j is the most popular graph database

Amey Varangaonkar
02 Aug 2018
7 min read
Neo4j is an open source, distributed data store used to model graph problems. It departs from the traditional nomenclature of database technologies: entities are stored in schema-less, entity-like structures called nodes, which are connected to other nodes via relationships or edges. In this article, we are going to discuss the different features and use cases of Neo4j. This article is an excerpt taken from the book 'Seven NoSQL Databases in a Week' written by Aaron Ploetz et al.

Neo4j's best features

Aside from its support of the property graph model, Neo4j has several other features that make it a desirable data store. Here, we will examine some of those features and discuss how they can be utilized in a successful Neo4j cluster.

Clustering

Enterprise Neo4j offers horizontal scaling through two types of clustering. The first is the typical high-availability clustering, in which several slave servers process data overseen by an elected master. In the event that one of the instances should fail, a new master is chosen. The second type of clustering is known as causal clustering. This option provides additional features, such as disposable read replicas and built-in load balancing, that help abstract the distributed nature of the clustered database from the developer. It also supports causal consistency, which aims to support Atomicity, Consistency, Isolation, and Durability (ACID)-compliant consistency in use cases where eventual consistency becomes problematic. Essentially, causal consistency is delivered with a distributed transaction algorithm that ensures that a user will be able to immediately read their own write, regardless of which instance handles the request.

Neo4j Browser

Neo4j ships with Neo4j Browser, a web-based application that can be used for database management, operations, and the execution of Cypher queries. In addition to monitoring the instance on which it runs, Neo4j Browser also comes with a few built-in learning tools designed to help new users acclimate themselves to Neo4j and graph databases. Neo4j Browser is a huge step up from the command-line tools that dominate the NoSQL landscape.

Cache sharding

In most clustered Neo4j configurations, a single instance contains a complete copy of the data. At the moment, true sharding is not available, but Neo4j does have a feature known as cache sharding. This feature involves directing queries to instances that only have certain parts of the cache preloaded, so that read requests for extremely large data sets can be adequately served.

Help for beginners

One of the things that Neo4j does better than most NoSQL data stores is the amount of documentation and tutorials that it has made available for new users. The Neo4j website provides a few links to get started with in-person or online training, as well as meetups and conferences to become acclimated to the community. The Neo4j documentation is very well done and kept up to date, complete with well-written manuals on development, operations, and data modeling. The blogs and videos by the Neo4j, Inc. engineers are also quite helpful in getting beginners started on the right path. Additionally, when first connecting to your instance/cluster with Neo4j Browser, the first thing that is shown is a list of links directed at beginners. These links direct the user to information about the Neo4j product, graph modeling and use cases, and interactive examples. In fact, executing the play movies command brings up a tutorial that loads a database of movies. This database consists of various nodes and edges that are designed to illustrate the relationships between actors and their roles in various films.
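That same movies graph is a convenient playground for your first Cypher queries outside the browser. The sketch below uses the official Neo4j Python driver; the connection URI, credentials, and movie title are assumptions you would replace with values for your own instance.

```python
# Query the tutorial movies graph: which people acted in a given film?
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH (p:Person)-[:ACTED_IN]->(m:Movie {title: $title})
RETURN p.name AS actor
ORDER BY actor
"""

with driver.session() as session:
    for record in session.run(CYPHER, title="The Matrix"):
        print(record["actor"])

driver.close()
```

The traversal-centric MATCH pattern is what sets Cypher apart from SQL here: the relationship (p)-[:ACTED_IN]->(m) is expressed directly instead of through join tables.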
Neo4j's versatility demonstrated in its wide use cases

Because of Neo4j's focus on node/edge traversal, it is a good fit for use cases requiring the analysis and examination of relationships. The property graph model helps to define those relationships in meaningful ways, enabling the user to make informed decisions. Bearing that in mind, there are several use cases for Neo4j (and other graph databases) that seem to fit naturally.

Social networks

Social networks seem to be a natural fit for graph databases. Individuals have friends, attend events, check in to geographical locations, create posts, and send messages. All of these different aspects can be tracked and managed with a graph database such as Neo4j. Who can see a certain person's posts? Friends? Friends of friends? Who will be attending a certain event? How is a person connected to others attending the same event? In small numbers, these problems could be solved with a number of data stores. But what about an event with several thousand people attending, where each person has a network of 500 friends? Neo4j can help to solve a multitude of problems in this domain, and appropriately scale to meet increasing levels of operational complexity.

Matchmaking

Like social networks, Neo4j is also a good fit for solving problems presented by matchmaking or dating sites. In this way, a person's interests, goals, and other properties can be traversed and matched to profiles that share certain levels of similarity. Additionally, the underlying model can also be applied to prevent certain matches or block specific contacts, which can be useful for this type of application.

Network management

Working with an enterprise-grade network can be quite complicated. Devices are typically broken up into different domains, sometimes have physical and logical layers, and tend to share a delicate relationship of dependencies with each other. In addition, networks might be very dynamic because of hardware failure/replacement and organizational and personnel changes. The property graph model can be applied to adequately work with the complexity of such networks. In a use case study with Enterprise Management Associates (EMA), the property graph was reported as an excellent format for capturing and modeling the interdependencies that can help to diagnose failures. For instance, if a particular device needs to be shut down for maintenance, you would need to be aware of other devices and domains that are dependent on it, in a multitude of directions. Neo4j allows you to capture that easily and naturally without having to define a whole mess of linear relationships between each device. The path of relationships can then be easily traversed at query time to provide the necessary results.

Analytics

Many scalable data store technologies are not particularly suitable for business analysis or online analytical processing (OLAP) uses. When working with large amounts of data, coalescing the desired data can be tricky with relational database management systems (RDBMS). Some enterprises will even duplicate their RDBMS into a separate system for OLAP so as not to interfere with their online transaction processing (OLTP) workloads. Neo4j can scale to present meaningful data about relationships between different enterprise-marketing entities, which is crucial for businesses.
Recommendation engines

Many brick-and-mortar and online retailers collect data about their customers' shopping habits. However, many of them fail to properly utilize this data to their advantage. Graph databases such as Neo4j can help assemble the bigger picture of customer habits for searching and purchasing, and even take trends in geographic areas into consideration. For example, purchasing data may contain patterns indicating that certain customers tend to buy certain beverages on Friday evenings. Based on the relationships of other customers to products in that area, the engine could also suggest things such as cups, mugs, or glassware. Is the customer also a male in his thirties from a sports-obsessed area? Perhaps suggesting a mug supporting the local football team may spark an additional sale. An engine backed by Neo4j may be able to help a retailer uncover these small troves of insight.

To summarize, we saw that Neo4j is widely used across enterprises and businesses, primarily due to its speed, efficiency, and accuracy. Check out the book Seven NoSQL Databases in a Week to learn more about Neo4j and other popularly used NoSQL databases such as Redis, HBase, and MongoDB.

Read more

Top 5 programming languages for crunching Big Data effectively
Top 5 NoSQL Databases
Is Apache Spark today’s Hadoop?