How to handle categorical data for machine learning algorithms

Packt Editorial Staff
20 Sep 2019
9 min read
The quality of data and the amount of useful information it contains are key factors that determine how well a machine learning algorithm can learn. It is therefore critical that we encode categorical variables correctly before we feed data into a machine learning algorithm. In this article, we will explain, with simple yet effective examples, how to deal with categorical data when building machine learning models, and how to map ordinal and nominal feature values to integer representations.

This article is an excerpt from the book Python Machine Learning - Third Edition by Sebastian Raschka and Vahid Mirjalili. The book is a comprehensive guide to machine learning and deep learning with Python. It acts as both a clear step-by-step tutorial and a reference you'll keep coming back to as you build your machine learning systems.

It is not uncommon for real-world datasets to contain one or more categorical feature columns. When we are talking about categorical data, we have to further distinguish between nominal and ordinal features. Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don't imply any order and, to continue with the previous example, we could think of t-shirt color as a nominal feature, since it typically doesn't make sense to say that, for example, red is larger than blue.

Categorical data encoding with pandas

Before we explore different techniques to handle such categorical data, let's create a new DataFrame to illustrate the problem:

>>> import pandas as pd
>>> df = pd.DataFrame([
...            ['green', 'M', 10.1, 'class1'],
...            ['red', 'L', 13.5, 'class2'],
...            ['blue', 'XL', 15.3, 'class1']])
>>> df.columns = ['color', 'size', 'price', 'classlabel']
>>> df
    color size  price classlabel
0   green    M   10.1     class1
1     red    L   13.5     class2
2    blue   XL   15.3     class1

As we can see in the preceding output, the newly created DataFrame contains a nominal feature (color), an ordinal feature (size), and a numerical feature (price). The class labels (assuming that we created a dataset for a supervised learning task) are stored in the last column.

Mapping ordinal features

To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually. In the following simple example, let's assume that we know the numerical difference between features, for example, XL = L + 1 = M + 2:

>>> size_mapping = {
...                 'XL': 3,
...                 'L': 2,
...                 'M': 1}
>>> df['size'] = df['size'].map(size_mapping)
>>> df
    color  size  price classlabel
0   green     1   10.1     class1
1     red     2   13.5     class2
2    blue     3   15.3     class1
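As an aside (not part of the book excerpt), scikit-learn also ships an OrdinalEncoder (available since version 0.20) that performs the same kind of integer conversion. It cannot infer the semantic order M < L < XL either, but you can supply the order explicitly. A minimal sketch, assuming the original string values:

>>> from sklearn.preprocessing import OrdinalEncoder
>>> # The category order is domain knowledge and must be given by hand
>>> ord_enc = OrdinalEncoder(categories=[['M', 'L', 'XL']])
>>> ord_enc.fit_transform([['M'], ['L'], ['XL']])
array([[0.],
       [1.],
       [2.]])

Note that OrdinalEncoder starts counting at 0 rather than 1, so its codes differ from the manual size_mapping by an offset of one.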
If we want to transform the integer values back to the original string representation at a later stage, we can simply define a reverse-mapping dictionary, inv_size_mapping = {v: k for k, v in size_mapping.items()}, which can then be used via the pandas map method on the transformed feature column, similar to the size_mapping dictionary that we used previously. We can use it as follows:

>>> inv_size_mapping = {v: k for k, v in size_mapping.items()}
>>> df['size'].map(inv_size_mapping)
0     M
1     L
2    XL
Name: size, dtype: object

Encoding class labels

Many machine learning libraries require that class labels are encoded as integer values. Although most estimators for classification in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches. To encode the class labels, we can use an approach similar to the mapping of ordinal features discussed previously. We need to remember that class labels are not ordinal, and it doesn't matter which integer number we assign to a particular string label. Thus, we can simply enumerate the class labels, starting at 0:

>>> import numpy as np
>>> class_mapping = {label: idx for idx, label in
...                  enumerate(np.unique(df['classlabel']))}
>>> class_mapping
{'class1': 0, 'class2': 1}

Next, we can use the mapping dictionary to transform the class labels into integers:

>>> df['classlabel'] = df['classlabel'].map(class_mapping)
>>> df
    color  size  price  classlabel
0   green     1   10.1           0
1     red     2   13.5           1
2    blue     3   15.3           0

We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class labels back to the original string representation:

>>> inv_class_mapping = {v: k for k, v in class_mapping.items()}
>>> df['classlabel'] = df['classlabel'].map(inv_class_mapping)
>>> df
    color  size  price classlabel
0   green     1   10.1     class1
1     red     2   13.5     class2
2    blue     3   15.3     class1

Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to achieve this:

>>> from sklearn.preprocessing import LabelEncoder
>>> class_le = LabelEncoder()
>>> y = class_le.fit_transform(df['classlabel'].values)
>>> y
array([0, 1, 0])

Note that the fit_transform method is just a shortcut for calling fit and transform separately, and we can use the inverse_transform method to transform the integer class labels back into their original string representation:

>>> class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)
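To make the fit/transform split explicit, here is a minimal sketch (our illustration, not code from the book) that performs the two steps separately. This is the pattern you would use when the encoder is fitted on training data and later applied to new data:

>>> class_le = LabelEncoder()
>>> class_le.fit(df['classlabel'].values)        # learn the string-to-integer mapping
LabelEncoder()
>>> class_le.transform(df['classlabel'].values)  # apply the learned mapping
array([0, 1, 0])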
Performing one-hot encoding on nominal features

In the Mapping ordinal features section, we used a simple dictionary-mapping approach to convert the ordinal size feature into integers. Since scikit-learn's estimators for classification treat class labels as categorical data that does not imply any order (nominal), we used the convenient LabelEncoder to encode the string labels into integers. It may appear that we could use a similar approach to transform the nominal color column of our dataset, as follows:

>>> X = df[['color', 'size', 'price']].values
>>> color_le = LabelEncoder()
>>> X[:, 0] = color_le.fit_transform(X[:, 0])
>>> X
array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

After executing the preceding code, the first column of the NumPy array X now holds the new color values, which are encoded as follows: blue = 0, green = 1, red = 2.

If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green. Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal.

A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of an example; for example, a blue example can be encoded as blue=1, green=0, red=0. To perform this transformation, we can use the OneHotEncoder that is implemented in scikit-learn's preprocessing module:

>>> from sklearn.preprocessing import OneHotEncoder
>>> X = df[['color', 'size', 'price']].values
>>> color_ohe = OneHotEncoder()
>>> color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()
array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

Note that we applied the OneHotEncoder to a single column (X[:, 0].reshape(-1, 1)) only, to avoid modifying the other two columns in the array as well. If we want to selectively transform columns in a multi-feature array, we can use the ColumnTransformer, which accepts a list of (name, transformer, column(s)) tuples, as follows:

>>> from sklearn.compose import ColumnTransformer
>>> X = df[['color', 'size', 'price']].values
>>> c_transf = ColumnTransformer([
...     ('onehot', OneHotEncoder(), [0]),
...     ('nothing', 'passthrough', [1, 2])
... ])
>>> c_transf.fit_transform(X).astype(float)
array([[ 0. ,  1. ,  0. ,  1. , 10.1],
       [ 0. ,  0. ,  1. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  3. , 15.3]])

In the preceding code example, we specified that we only want to modify the first column and leave the other two columns untouched via the 'passthrough' argument.

An even more convenient way to create those dummy features via one-hot encoding is to use the get_dummies method implemented in pandas. Applied to a DataFrame, get_dummies will only convert string columns and leave all other columns unchanged:

>>> pd.get_dummies(df[['price', 'color', 'size']])
   price  size  color_blue  color_green  color_red
0   10.1     1           0            1          0
1   13.5     2           0            0          1
2   15.3     3           1            0          0

When we are using one-hot encoding on datasets, we have to keep in mind that it introduces multicollinearity, which can be an issue for certain methods (for instance, methods that require matrix inversion). If features are highly correlated, matrices are computationally difficult to invert, which can lead to numerically unstable estimates. To reduce the correlation among variables, we can simply remove one feature column from the one-hot encoded array. Note that we do not lose any important information by removing a feature column; for example, if we remove the column color_blue, the feature information is still preserved, since observing color_green=0 and color_red=0 implies that the observation must be blue.

If we use the get_dummies function, we can drop the first column by passing True to the drop_first parameter, as shown in the following code example:

>>> pd.get_dummies(df[['price', 'color', 'size']],
...                drop_first=True)
   price  size  color_green  color_red
0   10.1     1            1          0
1   13.5     2            0          1
2   15.3     3            0          0
In order to drop a redundant column via the OneHotEncoder, we need to set drop='first' and categories='auto', as follows:

>>> color_ohe = OneHotEncoder(categories='auto', drop='first')
>>> c_transf = ColumnTransformer([
...            ('onehot', color_ohe, [0]),
...            ('nothing', 'passthrough', [1, 2])
... ])
>>> c_transf.fit_transform(X).astype(float)
array([[ 1. ,  0. ,  1. , 10.1],
       [ 0. ,  1. ,  2. , 13.5],
       [ 0. ,  0. ,  3. , 15.3]])

In this article, we have gone through some of the methods for dealing with categorical data in datasets. We distinguished between nominal and ordinal features, and explained with examples how they can be handled; a short consolidated sketch of the full flow follows the links below. To harness the power of the latest Python open source libraries in machine learning, check out the book Python Machine Learning - Third Edition, written by Sebastian Raschka and Vahid Mirjalili.

Other interesting reads in data!

The best business intelligence tools 2019: when to use them and how much they cost
Introducing Microsoft's AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine
Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data & Society report
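To recap the full flow from this excerpt, the following is a minimal, self-contained sketch (our consolidation, not code from the book) that chains the manual ordinal mapping, the class-label encoding, and one-hot encoding with one redundant dummy column dropped:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']],
                  columns=['color', 'size', 'price', 'classlabel'])

# Ordinal feature: the order XL > L > M is domain knowledge, mapped by hand
df['size'] = df['size'].map({'M': 1, 'L': 2, 'XL': 3})

# Class labels: nominal, so any integer assignment works
y = LabelEncoder().fit_transform(df['classlabel'].values)

# Nominal feature: one-hot encode color (dropping one redundant dummy),
# and pass size and price through unchanged
c_transf = ColumnTransformer([
    ('onehot', OneHotEncoder(categories='auto', drop='first'), [0]),
    ('nothing', 'passthrough', [1, 2])])
X = c_transf.fit_transform(df[['color', 'size', 'price']].values).astype(float)

print(X)  # 3 x 4 matrix: color_green, color_red, size, price
print(y)  # array([0, 1, 0])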

GitHub acquires Semmle to secure open-source supply chain; attains CVE Numbering Authority status

Savia Lobo
19 Sep 2019
5 min read
Yesterday, GitHub announced that it has acquired Semmle, a code analysis platform provider, and also that it is now a Common Vulnerabilities and Exposures (CVE) Numbering Authority.

https://twitter.com/github/status/1174371016497405953

The Semmle acquisition is part of GitHub's plan to secure the open-source supply chain, Nat Friedman explains in his blog post. Semmle provides a code analysis engine, named QL, which allows developers to write queries that identify code patterns in large codebases and search for vulnerabilities and their variants. Security researchers use Semmle to quickly find vulnerabilities in code with simple declarative queries.

"Semmle is trusted by security teams at Uber, NASA, Microsoft, Google, and has helped find thousands of vulnerabilities in some of the largest codebases in the world, as well as over 100 CVEs in open source projects to date," Friedman writes.

Also Read: GitHub now supports two-factor authentication with security keys using the WebAuthn API

Semmle, which originally spun out of research at Oxford in 2006, announced a $21 million Series B investment led by Accel Partners last year. "In total, the company raised $31 million before this acquisition," TechCrunch reports.

Shanku Niyogi, Senior Vice President of Product at GitHub, writes in his blog post, "An important measure of the success of Semmle's approach is the number of vulnerabilities that have been identified and disclosed through their technology. Today, over 100 CVEs in open source projects have been found using Semmle, including high-profile projects like Apache Struts, Apple's XNU, the Linux Kernel, Memcached, U-Boot, and VLC. No other code analysis tool has a similar success rate."

GitHub also announced that it has been approved as a CVE Numbering Authority for open source projects. Now, GitHub will be able to issue CVEs for security advisories opened on GitHub, allowing for even broader awareness across the industry. With the Semmle integration, every CVE-ID can be associated with a Semmle QL query, which can then be shared and tracked by the broader developer community. The CVE approval will make it easier for project maintainers to report security flaws directly from their repositories. Also, GitHub can assign CVE identifiers directly and post them to the CVE List and the National Vulnerability Database (NVD).

Earlier this year, GitHub acquired Dependabot to provide automatic security fixes natively within GitHub. With automatic security fixes, developers no longer need to manually patch their dependencies. When a vulnerability is found in a dependency, GitHub will automatically issue a pull request on downstream repositories with the information needed to accept the patch.

In August, GitHub was in the limelight for being a part of the Capital One data breach that affected 106 million users in the US and Canada. The law firm Tycko & Zavareei LLP filed a lawsuit in California's federal district court on behalf of their plaintiffs Seth Zielicke and Aimee Aballo.

Also Read: GitHub acquires Spectrum, a community-centric conversational platform

Both plaintiffs claimed Capital One and GitHub were unable to protect users' personal data. The complaint highlighted that Paige A. Thompson, the alleged hacker, stole the data in March and posted about the theft on GitHub in April.
According to the lawsuit, "As a result of GitHub's failure to monitor, remove, or otherwise recognize and act upon obviously-hacked data that was displayed, disclosed, and used on or by GitHub and its website, the Personal Information sat on GitHub.com for nearly three months."

The Semmle acquisition may be GitHub's move to improve security for its users in the future. It will be interesting to see how GitHub shapes security for users now that it also holds CVE Numbering Authority status.

A user on Reddit writes, "I took part in a tutorial session Semmle held at a university CS society event, where we were shown how to use their system to write semantic analysis passes to look for things like use-after-free and null pointer dereferences. It was only an hour and a bit long, but I found the query language powerful & intuitive and the platform pretty effective. At the time, you could set up your codebase to run Semmle passes on pre-commit hooks or CI deployments etc. and get back some pretty smart reporting if you had introduced a bug."

The user further writes, "The session focused on Java, but a few other languages were supported as first-class, iirc. It felt kinda like writing an SQL query, but over AST rather than tuples in a table, and using modal logic to choose the selections. It took a little while to first get over the 'wut' phase (like 'how do I even express this'), but I imagine that a skilled team, once familiar with the system, could get a lot of value out of Semmle's QL/semantic analysis, especially for large/enterprise-scale codebases."

https://twitter.com/kurtseifried/status/1174395660960796672
https://twitter.com/timneutkens/status/1174598659310313472

To know more about this announcement in detail, read GitHub's official blog post.

Other news in Data

Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out
Introducing Microsoft's AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine
GQL (Graph Query Language) joins SQL as a Global Standards Project and is now the international standard declarative query language for graphs

The best business intelligence tools 2019: when to use them and how much they cost

Richard Gall
19 Sep 2019
11 min read
Business intelligence is big business. Salesforce's purchase of Tableau earlier this year (for a cool $16 billion) proves the value of a powerful data analytics platform, and demonstrates how the business intelligence space is reshaping expectations and demands in the established CRM and ERP marketplace.

To a certain extent, the amount Salesforce paid for Tableau highlights that when it comes to business intelligence, tooling is paramount. Without a tool that fits the needs and skill levels of those who use BI and analytics, discussions around architecture and strategy are practically moot. So, what are the best business intelligence tools? More importantly, how do they differ from one another, and which one is right for you?

Read next: 4 important business intelligence considerations for the rest of 2019

The best business intelligence tools 2019

Tableau

Let's start with the obvious one: Tableau. It's one of the most popular business intelligence tools on the planet, and with good reason: it makes data visualization and compelling data storytelling surprisingly easy. With a drag-and-drop interface, Tableau requires no coding knowledge from users. It also allows users to model 'what if' scenarios with variable changes, which means you can get some fairly sophisticated insights with just a few simple interactions. But while Tableau is undoubtedly designed to be simple, it doesn't sacrifice complexity. Unlike other business intelligence tools, Tableau allows users to include an unlimited number of datapoints in their analytics projects.

When should you use Tableau and how much does it cost?

The core Tableau product is aimed at data scientists and data analysts who want to be able to build end-to-end analytics pipelines. You can trial the product for free for 14 days, but it will then cost you $70/month. This is perhaps one of the clearest use cases: if you're interested in and passionate about data, Tableau practically feels like a toy. For those that want to employ Tableau across their organization, the product offers a neat pricing tier: Tableau Creator is built for individual power users like those described above, Tableau Explorer for self-service analytics, and Tableau Viewer for those that need limited access to dashboards and analytics.

Tableau eBooks and videos

Mastering Tableau 2019.1 - Second Edition
Tableau 2019.x Cookbook
Getting Started with Tableau 2019.2 - Second Edition
Tableau in 7 Steps [Video]

PowerBI

PowerBI is Microsoft's business intelligence platform. Compared to Tableau, it is designed more for reporting and dashboards than for data exploration and storytelling. If you use a wide range of Microsoft products, PowerBI is particularly powerful, and it can become a centralized space for business reporting and insights. Like Tableau, it's also relatively easy to use. With support from Microsoft Cortana, the company's digital assistant, it's possible to perform natural language queries.

When should you use PowerBI and how much does it cost?

PowerBI is an impressive business intelligence product. But to get the most value, you need to be committed to Microsoft. This isn't to say you shouldn't be: the company has been on form in recent years and appears to really understand what modern businesses and users need. On a similar note, a good reason to use PowerBI is for unified and aligned business insights.
If Tableau is more suited to personal exploration or project-based storytelling, PowerBI is an effective option for organizations that want more clarity and shared visibility on key performance metrics. This is reflected in the price. For personal users, the desktop version of PowerBI is free, while a pro license is $9.99 a month. A premium plan, which includes cloud resources (storage and compute), starts at $4,995. This is the option for larger organizations that are fully committed to the Microsoft suite and have a clear vision of how they want to coordinate analytics and reporting.

PowerBI eBooks and videos

Learn Power BI
Microsoft Power BI Quick Start Guide
Learning Microsoft Power BI [Video]

Qlik Sense and QlikView

Okay, so here we're going to include two business intelligence products together: Qlik Sense and QlikView. They're both part of the same family - they're built by business intelligence company Qlik - but they're quite different products.

What's the difference between Qlik Sense and QlikView?

QlikView is the older and more established tool. It's what's usually described as a 'guided analytics' platform, which means dashboards and analytics applications can be built for end users. The tool gives engineers and data scientists the freedom to build what they want, but it doesn't allow end users to explore the data in any more detail than what is provided. QlikView is quite a sophisticated platform and is widely regarded as more complex to use than Tableau or PowerBI. While PowerBI or Tableau can be used by anyone with an intermediate level of data literacy and a willingness to learn, QlikView will always be the preserve of data scientists and analysts. This doesn't make it a poor choice. If you know how to use it properly, QlikView can provide you with more in-depth analysis than other business intelligence platforms, helping users to see patterns and relationships across different data sets. If you're working with big data, for example, and you have a team of data scientists and data engineers, QlikView is a good option.

Qlik Sense, meanwhile, could be seen as Qlik's attempt to compete with the likes of Tableau and PowerBI. It's a self-service BI tool, which allows end users to create their own data visualisations and explore data through a process of 'data discovery'.

When should you use QlikView and how much does it cost?

QlikView should be used when you need to build a cohesive reporting and business intelligence solution. It's perfect for when you need a space to manage KPIs and metrics across different teams. Although a free edition is available for personal use, Qlik doesn't publish prices for enterprise users; you'll need to get in touch with the company's sales team to purchase.

QlikView eBooks and videos

QlikView: Advanced Data Visualization
QlikView Dashboard Development [Video]

When should you use Qlik Sense and how much does it cost?

Qlik Sense should be used when you have an organization full of people who are curious and prepared to get their hands on their data. If you already have an established model of reporting performance, Qlik Sense is a useful extra that can give employees more autonomy over how data can be used. When it comes to pricing, Qlik Sense is one of the more complicated business intelligence options. Like QlikView, there's a free option for personal use, and again like QlikView, there's no public price available, so you'll have to contact Qlik directly.
To add an additional layer of complexity, there's also a product called 'Cloud Basic'. This is free and can be shared between up to 5 users; it's essentially a SaaS version of the Qlik Sense product. If you need to add more than 5 users, it costs $15 per user/month.

Qlik Sense eBooks and videos

Mastering Qlik Sense [Video]
Qlik Sense Cookbook - Second Edition
Data Storytelling with Qlik Sense [Video]
Hands-On Business Intelligence with Qlik Sense

Read next: Top 5 free Business Intelligence tools

Splunk

Splunk isn't just a business intelligence tool. To a certain extent, it's related to application monitoring and logging tools such as Datadog, New Relic, and AppDynamics. It's built for big data and real-time analytics, which means that it's well suited to offering insights on business processes and product performance. The copy on the Splunk website talks about "real-time visibility across the enterprise" and describes Splunk as a "data-to-everything" platform. The product, then, pitches itself as something that can embed itself inside existing systems and bring insight and intelligence to the places where it's particularly valuable. This is in contrast to PowerBI and Tableau, which are designed for exploration and accessibility. This isn't to say that Splunk doesn't enable sophisticated data exploration, but rather that it is geared towards monitoring systems and processes and understanding change. It's a tool built for companies that need full transparency - or, in other words, dynamic operational intelligence.

When should you use Splunk and how much does it cost?

Splunk is a tool that should be used if you're dealing with dynamic and real-time data. If you want to be able to model and explore wide-ranging existing sets of data, Tableau or PowerBI are probably a better bet. But if you need to be able to make decisions in an active and ongoing scenario, Splunk is a tool that can provide substantial support.

The reason Splunk is included in this list of business intelligence tools is that real-time visibility and insight are vital for businesses. Typically, understanding application performance or process efficiency might have been embedded within particular departments, such as a centralized IT function. Now, with businesses dependent on operational excellence, and with security and reliability in the digital arena becoming business critical, Splunk is a tool that deserves its status inside (and across) organizations.

Splunk's pricing is complicated. Prices generally depend on how much data you want to index - in other words, how much you're giving Splunk to deal with. On top of that, Splunk has both a perpetual license (a one-time payment) and an annual term license, which needs to be renewed. So, you can index 1GB/day for $4,500 on a perpetual license, or $1,800 on an annual license. If you want to learn more about Splunk's pricing options, this post is very useful.

Splunk eBooks and videos

Splunk 7 Essentials [E-Learning]
Splunk 7.x Quick Start Guide
Splunk Operational Intelligence Cookbook - Third Edition

IBM Cognos

IBM Cognos is IBM's flagship business intelligence tool. It's probably best viewed as sitting somewhere between PowerBI and Tableau. It's designed for reporting dashboards that allow monitoring and analytics, but it is nevertheless also intended for self-service. To that end, you might say it's more limited in capabilities than PowerBI, but more accessible for non-technical end users to explore data.
It's also relatively easy to integrate with other systems and data sources. So, if your data is stored in Microsoft or Oracle cloud services and databases, it's relatively straightforward to get started with IBM Cognos. However, it's worth noting that despite the accessibility of IBM's product, it still needs centralized control and implementation. It doesn't offer the level of ease that you get with Tableau, for example.

When should you use IBM Cognos and how much does it cost?

Cognos is perhaps the go-to option if PowerBI and Tableau don't quite work for you. Perhaps you like the idea of Tableau but need more centralization. Or maybe you need a strong and cohesive reporting system but don't feel prepared to buy into Microsoft. This isn't to make IBM Cognos sound like the outsider - in fact, from an efficiency perspective, it's possibly the best way to ensure some degree of portability between data sources and to manage the age-old problem of data silos.

If you're not quite sure which business intelligence tool is right for you, it's well worth taking advantage of Cognos's free trial - you get unlimited access for a month. If you like what you get, you then have a choice between a premium version, which costs $70 per user/month, and the enterprise plan, the price of which isn't publicly available.

IBM Cognos eBooks and videos

IBM Cognos Framework Manager [Video]
IBM Cognos Report Studio [Video]
IBM Cognos Connection and Workspace Advanced [Video]

Conclusion: To choose the best business intelligence solution for your organization, you need to understand your needs and goals

Business intelligence is a crowded market. The products listed here are the tip of the iceberg when it comes to analytics, monitoring, and data visualization. This is both good and bad: it means there are plenty of options and opportunities, but it also means that sorting through the options to find the right one might take up some of your time. That's okay - if possible, take advantage of free trial periods. And if you're in a rush to get work done, use them on active projects. You could even allocate different platforms and tools to different team members and get them to report on what worked well and what didn't. That way you will have documented insights on how the products might actually be used within the organization, which will help you reach a conclusion about the best tool for the job.

Business intelligence done well can be extremely valuable - so don't waste money, and don't waste time, on tools that aren't going to deliver what you need.

GQL (Graph Query Language) joins SQL as a Global Standards Project and will be the international standard declarative query language for graphs

Amrata Joshi
19 Sep 2019
6 min read
On Tuesday, the team at Neo4j, the graph database management system company, announced that the international committees behind the development of the SQL standard have voted to initiate GQL (Graph Query Language) as a new database query language. GQL is now going to be the international standard declarative query language for property graphs, and it is also a Global Standards Project. GQL is developed and maintained by the same international group that maintains the SQL standard.

How did the proposal for GQL pass?

The GQL initiative was first put forward in the GQL Manifesto in May last year. This June, national standards bodies from across the world in ISO/IEC's Joint Technical Committee 1 (responsible for IT standards) started voting on the GQL project proposal. The ballot closed earlier this week and the proposal passed: ten countries, including Germany, Korea, the United States, the UK, and China, voted in favor, and seven countries agreed to put forward their experts to work on the project. Japan was the only country to vote against, arguing that existing languages already do the job, and that SQL/Property Graph Query extensions, along with the rest of the SQL standard, can serve the same purpose.

According to the Neo4j team, the GQL project will initiate the development of next-generation technology standards for accessing data. Its charter mandates building on the core foundations established by SQL and ongoing collaboration to ensure SQL and GQL interoperability and compatibility. GQL reflects the rapid growth of the graph database market and the increasing adoption of the Cypher language.

Stefan Plantikow, GQL project lead and editor of the planned GQL specification, said, "I believe now is the perfect time for the industry to come together and define the next generation graph query language standard."

Plantikow further added, "It's great to receive formal recognition of the need for a standard language. Building upon a decade of experience with property graph querying, GQL will support native graph data types and structures, its own graph schema, a pattern-based approach to data querying, insertion and manipulation, and the ability to create new graphs, and graph views, as well as generate tabular and nested data. Our intent is to respect, evolve, and integrate key concepts from several existing languages including graph extensions to SQL."

Keith Hare, who has served as the chair of the international SQL standards committee for database languages since 2005, charted the progress toward GQL: "We have reached a balance of initiating GQL, the database query language of the future, whilst preserving the value and ubiquity of SQL." Hare further added, "Our committee has been heartened to see strong international community participation to usher in the GQL project. Such support is the mark of an emerging de jure and de facto standard."

The need for a graph-specific query language

Researchers and vendors needed a graph-specific query language because of the following limitations:

The SQL/PGQ language is restricted to read-only queries.
SQL/PGQ cannot project new graphs.
SQL/PGQ can only access graphs that are based on taking a graph view over SQL tables.

Researchers and vendors needed a language like Cypher that would cover the insertion and maintenance of data, not just data querying. But SQL wasn't the apt model for a graph-centric language that takes graphs as query inputs and outputs a graph as a result.
GQL, on the other hand, builds on openCypher, a project that brings Cypher to Apache Spark and gives users a composable graph query language.

SQL and GQL can work together

According to most of the companies and national standards bodies supporting the GQL initiative, GQL and SQL are not competitors. Instead, these languages can complement each other via interoperation and shared foundations. Alastair Green, Query Languages Standards & Research Lead at Neo4j, writes, "A SQL/PGQ query is in fact a SQL sub-query wrapped around a chunk of proto-GQL." SQL is a language built around tables, whereas GQL is built around graphs. Users can use GQL to find and project a graph from a graph.

Green further writes, "I think that the SQL standards community has made the right decision here: allow SQL, a language built around tables, to quote GQL when the SQL user wants to find and project a table from a graph, but use GQL when the user wants to find and project a graph from a graph. Which means that we can produce and catalog graphs which are not just views over tables, but discrete complex data objects."

It is still not clear when the first implementable version of GQL will be out. The official page reads, "The work of the GQL project starts in earnest at the next meeting of the SQL/GQL standards committee, ISO/IEC JTC 1 SC 32/WG3, in Arusha, Tanzania, later this month. It is impossible at this stage to say when the first implementable version of GQL will become available, but it is highly likely that some reasonably complete draft will have been created by the second half of 2020."

Developer community welcomes the new addition

Users are excited to see how GQL will incorporate Cypher. A user commented on Hacker News, "It's been years since I've worked with the product and while I don't miss Neo4j, I do miss the query language. It's a little unclear to me how GQL will incorporate Cypher but I hope the initiative is successful if for no other reason than a selfish one: I'd love Cypher to be around if I ever wind up using a GraphDB again."

A few others mistook GQL for Facebook's GraphQL and are sceptical about the name. A comment on Hacker News reads, "Also, the name is of course justified, but it will be a mess to search for due to (Facebook) GraphQL." A user commented, "I read the entire article and came away mistakenly thinking this was the same thing as GraphQL." Another user commented, "That's quiet an unfortunate name clash with the existing GraphQL language in a similar domain."

Other interesting news in Data

Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data & Society report
Percona announces Percona Distribution for PostgreSQL to support open source databases
Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out

4 important business intelligence considerations for the rest of 2019

Richard Gall
16 Sep 2019
7 min read
Business intelligence occupies a strange position, often overshadowed by fields like data science and machine learning. But it remains a critical aspect of modern business - indeed, the less attention the world appears to pay to it, the more it becomes embedded in modern businesses. Where analytics and dashboards once felt like a shiny and exciting interruption in our professional lives, today they are merely the norm.

But with business intelligence almost baked into the day-to-day routines and activities of many individuals, teams, and organizations, what does this actually mean in practice? As much as we'd like to think that we're all data-driven now, the reality is that there's much we can do to use data more effectively. Research confirms that data-driven initiatives often fail - so with that in mind, here's what's important when it comes to business intelligence in 2019.

Popular business intelligence eBooks and videos

Oracle Business Intelligence Enterprise Edition 12c - Second Edition
Microsoft Power BI Quick Start Guide
Implementing Business Intelligence with SQL Server 2019 [Video]
Hands-On Business Intelligence with Qlik Sense
Hands-On Dashboard Development with QlikView

Getting the balance between self-service business intelligence and centralization

Self-service business intelligence is one of the biggest trends to emerge in the last two years. In practice, this means that a diverse range of stakeholders (marketers and product managers, for example) have access to analytics tools. They're no longer purely the preserve of data scientists and analysts. Self-service BI makes a lot of sense in the context of today's data-rich and data-driven environment. The best way to empower team members to actually use data is to remove any bottlenecks (like a centralized data team) and allow them to go directly to the data and tools they need to make decisions. In essence, self-service business intelligence solutions are a step towards the democratization of data.

However, while the notion of democratizing data sounds like a noble cause, the reality is a little more complex. There are a number of issues that make self-service BI challenging to get right. One of the biggest pain points, for example, is the skill gap of the teams using these tools. Although self-service BI should make using data easy for team members, even the most user-friendly dashboards need a level of data literacy to be useful.

Read next: What are the limits of self-service BI?

Many analytics products are being developed with this problem in mind. But it's still hard to get around - you don't, after all, want to sacrifice the richness of data for simplicity and accessibility. Another problem is the messiness of data itself - and this ultimately points to one of the paradoxes of self-service BI: you need strong alignment - centralization, even - if you're to ensure true democratization.

The answer isn't to get tied up in debates over decentralization versus centralization. Instead, what's important is striking a balance between the two. Decentralization needs centralization: there must be strong governance and clarity over what data exists and how it's used and accessed, and someone needs to be accountable for that if decentralized, self-service BI is actually to work.
Read next: How Qlik Sense is driving self-service Business Intelligence

Self-service business intelligence: recommended viewing

Power BI Masterclass - Beginners to Advanced [Video]

Data storytelling that makes an impact

Data storytelling is a phrase that's used too much without real consideration of what it means or how it can be done. All too often it's used to refer to stylish graphs and visualizations. And yes, stylish graphs and data visualizations are part of data storytelling, but you can't just expect some nice graphics to communicate in-depth data insights to your colleagues and senior management.

To do data storytelling well, you need to establish a clear sense of objectives and goals. By that I'm not referring only to your goals, but also to those of the people around you. It goes without saying that data and insight need context, but what that context should be, exactly, is often the hard part. Objectives and aims are perhaps the most straightforward way of establishing that context and ensuring your insights can establish the scope of a problem and propose a way forward.

Data storytelling can only really make an impact if you are able to strike a balance between centralization and self-service. Stakeholders that use self-service need confidence that everything they need is both available and accurate - and this can only really be ensured by a centralized team of data scientists, architects, and analysts.

Data storytelling: recommended viewing

Data Storytelling with Qlik Sense [Video]
Data Storytelling with Power BI [Video]

The impact of cloud

It's impossible to properly appreciate the extent to which cloud is changing the data landscape. Not only is it easier than ever to store and process data, it's also easy to do different things with it. This means it's now possible to run machine learning or artificial intelligence projects with relative ease (the word relative being important, of course).

For business intelligence, this means there needs to be a clear strategy that joins together every piece of the puzzle, from data collection to analysis. There needs to be buy-in and input from stakeholders before a solution is purchased - or built - and then the solution needs to be developed with every individual use case properly understood and supported. This requires a combination of business acumen, soft skills, and technical expertise. Much of this will rest on the shoulders of an organization's technical leadership team, but those in other departments still have a part to play. If stakeholders are unable to present a clear vision of their needs and goals, it's highly likely that the advantages of cloud will pass them by when it comes to business intelligence.

Cloud and business intelligence: recommended viewing

Going beyond Dashboards with IBM Cognos Analytics [Video]

Business intelligence ethics

Ethics has become a huge issue for organizations over the last couple of years. With the Cambridge Analytica scandal placing the spotlight on how companies use customer data, and GDPR forcing organizations to take a new approach to (European) user data, ethical considerations have undoubtedly added a new dimension to business intelligence. But what does this actually mean in practice?

Ethics manifests itself in numerous ways in business intelligence. Perhaps the most obvious is data collection: do you have the right to use someone's data in a certain way?
Sometimes the law will make it clear. But other times it will require individuals to exercise judgment and be sensitive to the issues that could arise. There are also other ways in which individuals and organizations need to think about ethics. Being data-driven is great, especially if you can approach insight in a way that is actionable and proactive. But at the same time, it's vital that business intelligence isn't seen as simply a replacement for human intelligence. This is true not just in an ethical sense, but also in terms of sound strategic thinking. Business intelligence without human insight and judgment is really just the opposite of intelligence.

Conclusion: business intelligence needs organizational alignment and buy-in

Many issues have been slowly emerging in the business intelligence world over the last half a decade. This might make things feel confusing, but in actual fact it underlines the very nature of the challenges that organizations, leadership teams, and engineers face when it comes to business intelligence. Essentially, doing business intelligence well requires you - and those around you - to tie all these different elements together. It's certainly not straightforward, but with focus and clarity of thought, it's possible to build a really effective BI program that can fulfil organizational needs well into the future.

How artificial intelligence and machine learning can help us tackle the climate change emergency

Vincy Davis
16 Sep 2019
14 min read
"I don't want you to be hopeful. I want you to panic. I want you to feel the fear I feel every day. And then I want you to act on changing the climate" - Greta Thunberg

Greta Thunberg is a 16-year-old Swedish schoolgirl famously known as a climate change warrior. She has started an international youth movement against climate change and has been nominated as a candidate for the Nobel Peace Prize 2019 for her climate activism. According to a recent report by the Intergovernmental Panel on Climate Change (IPCC), climate change is seen as the top global threat by many countries, and a UN report warns that its effects could drive 1 million species to extinction.

The Earth's rising temperatures are fueling longer and hotter heat waves, more frequent droughts, heavier rainfall, and more powerful hurricanes. Antarctica is breaking apart. Indonesia, the world's fourth most populous country, has just decided to move its capital from Jakarta because the city is sinking. Singapore is worried that investments will move away. Last year, Europe experienced an 'extreme year' for unusual weather events: after a couple of months of extremely cold weather, heat and drought plagued spring and summer, with temperatures well above average in most of the northern and western areas. The UK Parliament declared a 'climate change emergency' after a series of intense protests earlier this month. More than 1,200 people were killed across South Asia due to heavy monsoon rains and intense flooding (in some places the worst in nearly 30 years). The Camp Fire, in November 2018, was the deadliest and most destructive wildfire in California's history, causing the deaths of at least 85 people and destroying about 14,000 homes. Australia's most populous state, New South Wales, suffered an intense drought in 2018. According to a report released by the UN last year, there are "only 11 years left to prevent irreversible damage from climate change".

Addressing climate change: how Artificial Intelligence (AI) can help

As seen above, the environmental impacts of climate change are clear, and the list is vast and depressing. It is important to address climate change issues because they play a key role in the workings of the natural ecosystem - changes in the patterns of global rainfall, diminishing ice sheets, and other factors on which the human economy and civilization depend.

With the help of Artificial Intelligence (AI), we can increase our probability of becoming more efficient, or at least slow down the damage caused by climate change. At the recently held ICLR 2019 (International Conference on Learning Representations), Emily Shuckburgh, a climate scientist and deputy head of the Polar Oceans team at the British Antarctic Survey, highlighted the need for actionable information on climate risk. She elaborated on how we can monitor, treat, and find solutions to climate change using machine learning, and on how AI can synthesize and interpolate different datasets within a framework that allows easy interrogation by users and near-real-time ingestion of new data.

According to the MIT Technology Review's take on climate change, there are three approaches to addressing it: mitigation, navigation (adaptation), and suffering. Technologies generally concentrate on mitigation, but it is high time we gave more focus to the other two approaches. In a catastrophically altered world, it will be necessary to concentrate on adaptation and suffering. The review argues that mitigation steps taken so far have done almost nothing to curb the use of fossil fuels.
Thus it is important for us to learn to adapt to these changes. Building predictive models that rely on masses of data will also give us a better idea of how bad the effects of a disaster can be, and help us visualize the suffering. By implementing Artificial Intelligence in these approaches, we can not only reduce the causes of climate change but also adapt to it. Using AI, we can make more accurate predictions about climate change, which will help create better climate models. These predictions can be used to identify our biggest vulnerabilities and risk zones, helping us respond better to the impacts of climate change, such as hurricanes, rising sea levels, and higher temperatures. Let's see how Artificial Intelligence is being used in all three approaches.

Mitigation: Reducing the severity of climate change

Looking at the extreme climatic changes, many researchers have started exploring how AI can step in to reduce the effects of climate change. These include ways to reduce greenhouse gas emissions or enhance the removal of these gases from the atmosphere.

With a view to consuming less energy, there has been an active increase in technologies for using energy smartly. One such startup is Verv, an intelligent IoT hub which uses patented AI technology to give users the power to take control of their energy usage. This home energy system provides information about your home appliances and other electricity data directly from the mains, which helps to reduce your electricity bills and lower your carbon footprint. Igloo Energy is another system which helps customers use energy efficiently and save money. It uses smart meters to analyse behavioural, property occupancy, and surrounding environmental data in order to lower users' energy consumption. Nnergix is a weather analytics startup focused on the renewable energy industry. It collects weather and energy data from multiple sources across the industry to feed machine learning algorithms that run several analytic solutions, with the main goal of helping any system become more efficient during operations and reduce costs.

Recently, Google announced that by using Artificial Intelligence, it has boosted the value of its wind energy by roughly 20 percent. A neural network is trained on widely available weather forecasts and historical turbine data, and the DeepMind system is configured to predict wind power output 36 hours ahead of actual generation. Based on these predictions, the model recommends hourly delivery commitments to the power grid a full day in advance.

Large industrial systems account for 54% of global energy consumption, and this high level of energy consumption is a primary contributor to greenhouse gas emissions. In 2016, Google's DeepMind was able to reduce the energy required to cool Google's data centers by up to 40%. Initially, the team built a general-purpose learning algorithm, which was later developed into a full-fledged AI system with features including continuous monitoring and human override. Just last year, Google put an AI system in charge of keeping its data centers cool. Every five minutes, the AI pulls a snapshot of the data center cooling system from thousands of sensors. This data is fed into deep neural networks, which predict how different choices will affect future energy consumption.
The neural networks are trained to predict the future PUE (Power Usage Effectiveness) as well as the future temperature and pressure of the data center over the next hour, to ensure that any tweaks do not take the data center beyond its operating limits. Google has found that the machine learning systems were able to consistently achieve around a 30 percent reduction in the amount of energy used for cooling, the equivalent of a 15 percent reduction in overall PUE. As seen, there are many companies trying to reduce the severity of climate change.

Navigation: Adapting to current conditions

Though there have been brave initiatives to reduce the causes of climate change, they have failed to show major results. This could be due to the increasing demand for energy resources, which is expected to grow immensely worldwide. It is now necessary to concentrate more on adapting to climate change, as we are at a point where it is almost impossible to undo its effects. Thus, it is better to learn to navigate through this climate change.

A startup in Berlin called GreenAdapt has created AI-based software that can tackle local impacts induced both by gradual changes and by extreme weather events such as storms. It identifies the effects of climate change and proposes adequate adaptation measures. Another startup, Zuli, has a smart plug that reduces energy use. It contains sensors that can estimate energy usage, wirelessly communicate with your smartphone, and accurately sense your location. A firm called Gridcure provides real-time analytics and insights for energy and utilities. It helps power companies recover losses and boost revenue by operating more efficiently, and it also helps them provide better delivery to consumers, big reductions in energy waste, and increased adoption of clean technologies.

With mitigation and navigation being pursued, let's see how firms are working towards more futuristic goals.

Visualization: Predicting the future

It is equally important to build accurate climate models, which will help humans cope with the after-effects of climate change. Climate models are mathematical representations of the Earth's climate system which take into account humidity, temperature, air pressure, wind speed and direction, as well as cloud cover, and predict future weather conditions. This can help in tackling disasters. It is also imperative to keep expanding our knowledge of global climate change, which will help create more accurate models.

A modeling startup called Jupiter is trying to improve the accuracy of predictions about climate change. It makes physics-based and Artificial Intelligence-powered decisions using data from millions of ground-based and orbital sensors. Another firm, BioCarbon Engineering, plans to use drones which will fly over potentially suitable areas and compile 3D maps. The drones will then scatter small containers holding fertilized seeds, nutrients, and moisture gel over the best areas. In this way, 36,000 trees can be planted every day, more cheaply than with other methods. After planting, drones will continue to monitor the germinating seeds and deliver further nutrients when necessary to ensure their healthy growth. This could help absorb carbon dioxide from the atmosphere. Another initiative is by an ETH doctoral student at the Functional Materials Laboratory, who has developed a cooling curtain made of a porous triple-layer membrane as an alternative to electrically powered air conditioning.
In 2017, Microsoft came up with its 'AI for Earth' initiative, which primarily focuses on climate conservation, biodiversity, and related areas. AI for Earth awards grants to projects that use artificial intelligence to address critical areas that are vital for building a sustainable future. Microsoft is also using its cloud computing service, Azure, to give computing resources to scientists working on environmental sustainability programs. Intel has deployed AI-equipped drones in Costa Rica to construct models of the forest terrain and calculate the amount of carbon being stored based on tree height, health, biomass, and other factors. The collected data about carbon capture can enhance management and conservation efforts, support scientific research projects on forest health and sustainability, and enable many other kinds of applications. The 'Green Horizon' project from IBM analyzes environmental data and predicts pollution, and tests scenarios that involve pollution-reducing tactics. IBM's 'Deep Thunder' group works with research centers in Brazil and India to accurately predict flooding and potential mudslides caused by severe storms.

As seen above, many organizations and companies, ranging from startups to big tech, have understood the adverse effects of climate change and are taking steps to address them. However, there are certain challenges and limitations acting as barriers to these systems' success.

What do big tech firms and startups lack?

Though many big tech and influential companies boast of immense contributions to fighting climate change, there have been instances where these firms entered into lucrative deals with oil companies. Just last year, Amazon, Google, and Microsoft struck deals with oil companies to provide them with cloud, automation, and AI services. These deals were reported openly by Gizmodo and yet didn't attract much criticism. This trend of powerful companies venturing into oil businesses, even while knowing the effects of dangerous climate change, is depressing.

Last year, Amazon quietly launched the 'Amazon Sustainability Data Initiative'. It helps researchers store weather observations and forecasts, satellite images, and metrics about oceans and air quality, so that they can be used for modeling and analysis. This encourages organizations to use the data to make decisions that support sustainable development. This year, Amazon expanded its vision by announcing 'Shipment Zero', aiming to make 50% of all Amazon shipments net zero by 2030, with a wider aim of reaching 100% in the future. However, Shipment Zero only commits to net carbon reductions. Recently, Amazon ordered 20,000 diesel vans whose emissions will need to be offset with carbon credits. Offsets can entail forest management policies that displace indigenous communities, and they do nothing to reduce diesel pollution, which disproportionately harms communities of color. Some in the industry expressed disappointment that Amazon's order is for 20,000 diesel vans, and not a single electric vehicle. In April, over 4,520 Amazon employees organized against Amazon's continued profiting from climate devastation. They signed an open letter addressed to Jeff Bezos and the Amazon board of directors asking for a company-wide action plan to address climate change and an end to the company's reliance on dirty energy resources. Recently, Microsoft doubled its internal carbon fee to $15 per metric ton on all carbon emissions.
The funds from this higher fee will maintain Microsoft's carbon neutrality and help meet its sustainability goals. On the other hand, Microsoft is also two years into a seven-year deal, rumored to be worth over a billion dollars, to help Chevron, one of the world's largest oil companies, better extract and distribute oil. Microsoft Azure has also partnered with Equinor, a multinational energy company, to provide data services in a deal worth hundreds of millions of dollars. Instead of profiting from these deals, Microsoft could have taken a stand by ending partnerships with fossil fuel companies that accelerate oil and gas exploration and extraction.

With respect to smaller firms, it is often difficult for a climate-focused startup to survive due to a dearth of finance. Many such organizations are small and relatively fragile, struggling to grow in a sector that attracts little attention and lacks steady financing. Because startups lack name recognition, it is also difficult for them to market their ideas and convince people to try their systems; they invariably need a commercial boost to find more takers.

Pitfalls of using Artificial Intelligence for climate preservation

Though AI has enormous potential to help us create a sustainable future, it is only part of a bigger set of tools and pathways needed to reach that goal. It also comes with its own limitations and side effects. An inability to control malicious AI can cause unexpected outcomes: hackers can use AI to develop smart malware that interferes with early warnings, enables bad actors to control energy, transportation, or other critical systems, and could also give them access to sensitive data. AI bias is another dangerous phenomenon that can produce irrational results in a working system. Bias in an AI system mainly occurs in the data or in the system's algorithmic model, and may produce incorrect results and compromise its functions and security.

More importantly, we should not rely on Artificial Intelligence alone to fight the effects of climate change. Our focus should be on addressing the causes of climate change and trying to minimize them, starting at the individual level. Governments in every country must also contribute by initiating climate policies that will help their citizens in the long run. One vital task is to implement quick responses to climate emergencies. In the recent Odisha storms, for example, the pinpoint accuracy of India's weather forecasters helped move millions of people to safety, resulting in minimal casualties.

Next up in Climate
- Amazon employees plan to walkout for climate change during the Sept 20th Global Climate Strike
- Machine learning experts on how we can use machine learning to mitigate and adapt to the changing climate
- Now there's a CycleGAN to visualize the effects of climate change. But is this enough to mobilize action?
The CAP Theorem in practice: The consistency vs. availability trade-off in distributed databases

Richard Gall
12 Sep 2019
7 min read
When you choose a database, you are making a design decision. One of the best frameworks for understanding what this means in practice is the CAP Theorem.

What is the CAP Theorem?

The CAP Theorem, developed by computer scientist Eric Brewer in the late nineties, states that databases can only ever fulfil two out of three elements:

- Consistency - reads are always up to date, which means any client making a request to the database will get the same view of data.
- Availability - database requests always receive a response (when valid).
- Partition tolerance - a network fault doesn't prevent messaging between nodes.

In the context of distributed (NoSQL) databases, this means there is always going to be a trade-off between consistency and availability. This is because distributed systems must be partition tolerant (i.e. it simply wouldn't be a distributed database if it wasn't partition tolerant).

Read next: Different types of NoSQL databases and when to use them

How do you use the CAP Theorem when making database decisions?

Although the CAP Theorem can feel quite abstract, it has practical, real-world consequences. From both a technical and a business perspective, the trade-offs will lead you to some very important questions. There are no right answers. Ultimately it will all be about the context in which your database is operating, the needs of the business, and the expectations and needs of users. You will have to consider things like:

- Is it important to avoid throwing up errors in the client? Or are we willing to sacrifice the visible user experience to ensure consistency?
- Is consistency actually an important part of the user's experience?
- Or can we do what we want with a relational database and avoid the need for partition tolerance altogether?

As you can see, these are ultimately user experience questions. To answer them properly, you need to be sensitive to the overall goals of the project and, as said above, the context in which your database solution is operating. (For example, is it powering an internal analytics dashboard? Or is it supporting a widely used external-facing website or application?) And, as the final bullet point highlights, it's always worth considering whether the consistency vs. availability trade-off should matter at all. Avoid the temptation to think a complex database solution will always be better when a simple, more traditional solution will do the job. Of course, it's important to note that a system that isn't partition tolerant contains a single point of failure, which introduces the potential for unreliability.

Prioritizing consistency in a distributed database

It's possible to get into a lot of technical detail when talking about consistency and availability, but at a fundamental level the principle is straightforward: you need consistency (what is called a CP database) if the data in the database must always be up to date and aligned, even in the instance of a network failure (e.g. when the partitioned nodes are unable to communicate with one another for whatever reason). Use cases where you would prioritize consistency are those where multiple clients must have the same view of the data. For example, when you're dealing with financial or personal information, you need a database that gives you confidence that the data you are looking at is up to date, even when the network is unreliable or fails.
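To make this trade-off concrete, here is a minimal, hypothetical Python sketch - not the API of any real database - contrasting how a CP-style and an AP-style store might answer a read during a network partition:

# A toy model of one replica in a distributed store; "CP" refuses reads it
# cannot verify during a partition, while "AP" serves possibly stale data.
class ReplicatedStore:
    def __init__(self, mode):
        self.mode = mode                    # "CP" or "AP"
        self.local_value = "balance=100"    # local copy, possibly stale
        self.partitioned = False            # True when nodes can't reach each other

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: error out rather than risk a stale answer.
            raise RuntimeError("unavailable: cannot confirm value is current")
        # Availability first: always answer, even if the value may be stale.
        return self.local_value

cp, ap = ReplicatedStore("CP"), ReplicatedStore("AP")
cp.partitioned = ap.partitioned = True

print(ap.read())            # AP: responds, possibly with stale data
try:
    cp.read()
except RuntimeError as err:
    print(err)              # CP: errors instead of serving stale data

The point of the sketch is only to show where the trade-off surfaces: in what the client sees when the network misbehaves.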
Examples of CP databases

- MongoDB (Learning MongoDB 4 [Video], MongoDB 4 Quick Start Guide, MongoDB, Express, Angular, and Node.js Fundamentals)
- Redis (Build Complex Express Sites with Redis and Socket.io [Video], Learning Redis)
- HBase (Learn by Example: HBase - The Hadoop Database [Video], HBase Design Patterns)

Prioritizing availability in a distributed database

Availability is essential when data accumulation is a priority. Think here of things like behavioral data or user preferences. In scenarios like these, you will want to capture as much information as possible about what a user or customer is doing, but it isn't critical that the database is constantly up to date. It simply needs to be accessible and available even when network connections aren't working. The growing demand for offline application use is another reason why you might use a NoSQL database that prioritizes availability over consistency.

Examples of AP databases

- Cassandra (Learn Apache Cassandra in Just 2 Hours [Video], Mastering Apache Cassandra 3.x - Third Edition)
- DynamoDB (Managed NoSQL Database In The Cloud - Amazon AWS DynamoDB [Video], Hands-On Amazon DynamoDB for Developers [Video])

Limitations and criticisms of the CAP Theorem

It's worth noting that the CAP Theorem can pose problems. As with most things, in truth, things are a little more complicated. Even Eric Brewer is circumspect about the theorem, especially with regard to what we now expect from distributed databases. Back in 2012, twelve years after he first put his theorem into the world, he wrote that:

"Although designers still need to choose between consistency and availability when partitions are present, there is an incredible range of flexibility for handling partitions and recovering from them. The modern CAP goal should be to maximize combinations of consistency and availability that make sense for the specific application. Such an approach incorporates plans for operation during a partition and for recovery afterward, thus helping designers think about CAP beyond its historically perceived limitations."

So, this means we must think about the trade-off between consistency and availability as a balancing act rather than a binary design decision.

Elsewhere, there have been more robust criticisms of the CAP Theorem. Software engineer Martin Kleppmann, for example, pleaded "Please stop calling databases CP or AP" in 2015. In a blog post he argues that the CAP Theorem only works if you adhere to specific definitions of consistency, availability, and partition tolerance. "If your use of words matches the precise definitions of the proof, then the CAP theorem applies to you," he writes. "But if you're using some other notion of consistency or availability, you can't expect the CAP theorem to still apply." The consequences of this are much like those described in Brewer's piece from 2012: you need to take a nuanced approach to database trade-offs in which you think them through on your own terms and against your own needs.

The PACELC Theorem

One development of this line of argument is an extension to the CAP Theorem: the PACELC Theorem. This moves beyond thinking only about consistency and availability and instead places an emphasis on the trade-off between consistency and latency. The PACELC Theorem builds on the CAP Theorem (the 'PAC') and adds an else (the 'E').
What this means is that while you need to choose between availability and consistency when communication between nodes has failed in a distributed system, even when things are running properly and there are no network issues there is still going to be a trade-off between consistency and latency (the 'LC').

Conclusion: Learn to align context with technical specs

Although the CAP Theorem might seem somewhat dated, it is valuable in providing a way to think about database architecture design. It not only forces engineers and architects to ask questions about what they want from the technologies they use; it also forces them to think carefully about the requirements of a given project. What are the business goals? What are user expectations? The PACELC Theorem builds on CAP in an effective way. However, the most important thing about these frameworks is how they help you to think about your problems. Of course the CAP Theorem has limitations. Because it abstracts a problem, it is necessarily going to lack nuance; there are things it simplifies. It's important, as Kleppmann reminds us, to be mindful of these nuances. But at the same time, we shouldn't let an obsession with nuance and detail allow us to miss the bigger picture.
Different types of NoSQL databases and when to use them

Richard Gall
10 Sep 2019
8 min read
Why NoSQL databases?

The popularity of NoSQL databases over the last decade or so has been driven by an explosion of data. Before what's commonly described as 'the big data revolution', relational databases were the norm - databases that contain structured data. Data can only be structured if it is based on an existing schema that defines the relationships (hence 'relational') between the data inside the database. However, with the vast quantities of data now available to just about every business with an internet connection, relational databases simply aren't equipped to handle the complexity and scale of large datasets.

Why not SQL databases?

This is for a couple of reasons. First, the defined schemas that are a necessary component of every relational database can undermine the richness and integrity of the data you're working with. Second, relational databases are hard to scale: they can only scale vertically, not horizontally. That's fine to a certain extent, but when you start getting high volumes of data - such as when millions of people use a web application - things get really slow and you need more processing power. You can do this by upgrading your hardware, but that isn't really sustainable. By scaling out, as you can with NoSQL databases, you can use a distributed network of computers to handle data. That gives you more speed and more flexibility. This isn't to say that relational and SQL databases have had their day. They still fulfil many use cases. The difference is that NoSQL offers far greater power and control for data-intensive use cases. Indeed, using a NoSQL database when SQL will do is only going to add complexity to something that just doesn't need it. (See also Seven NoSQL Databases in a Week.)

Different types of NoSQL databases and when to use them

So, now we've looked at why NoSQL databases have grown in popularity in recent years, let's dig into some of the different options available. There are a huge number of NoSQL databases out there - some of them open source, some premium products - many of them built for very different purposes. Broadly speaking, there are four different models of NoSQL database:

- Key-value pair-based databases
- Column-based databases
- Document-oriented databases
- Graph databases

Let's take a look at these four models, how they differ from one another, and some examples of the product options in each.

Key-value pair-based NoSQL database management systems

Key/value pair-based NoSQL databases store data in, as you might expect, pairs of keys and values. Data is stored with a matching key - keys have no relation or structure (so keys could be height, age, or hair color, for example).

When should you use a key/value pair-based NoSQL DBMS? Key/value pair-based NoSQL databases are the most basic type of NoSQL database. They're useful for storing fairly basic information, like details about a customer.

Which key/value pair-based DBMS should you use? There are a number of different key/value pair databases. The most popular is Redis. Redis is incredibly fast and very flexible in terms of the languages and tools it can be used with. It can be used for a wide variety of purposes - one of the reasons high-profile organizations use it, including Verizon, Atlassian, and Samsung. It's also open source, with enterprise options available for users with significant requirements. (See Redis 4.x Cookbook.) Other options include Memcached and Ehcache.
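As a quick illustration of the key/value model, here is a minimal sketch using the redis-py client. It assumes a Redis server running locally on the default port, and the key and field names are purely illustrative:

# Install the client with `pip install redis`; assumes Redis on localhost:6379.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Store basic customer details as a hash keyed by a customer ID.
r.hset("customer:1001", mapping={
    "name": "Ada Lopez",
    "email": "ada@example.com",
    "city": "Valencia",
})

# Fetch a single field, or the whole record, by key.
print(r.hget("customer:1001", "email"))   # b'ada@example.com'
print(r.hgetall("customer:1001"))

Because everything is looked up directly by key, reads like these are extremely fast - which is exactly why this model suits the simple customer-details use case described above.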
As well as Redis, Memcached, and Ehcache, there are a number of multi-model options (which will crop up again later, no doubt) such as Amazon DynamoDB, Microsoft's Cosmos DB, and OrientDB. (See Hands-On Amazon DynamoDB for Developers [Video] and RDS PostgreSQL and DynamoDB CRUD: AWS with Python and Boto3 [Video].)

Column-based NoSQL database management systems

Column-based databases separate data into discrete columns. Instead of using rows - whereby the row ID is the main key - column-based database systems flip things around to make the data the main key. By using columns you can gain much greater speed when querying data. Although it's true that querying a whole row of data would take longer in a column-based DBMS, the use cases for column-based databases mean you probably won't be doing this. Instead you'll be querying a specific part of the data rather than the whole row.

When should you use a column-based NoSQL DBMS? Column-based systems are most appropriate for big data and instances where data is relatively simple and consistent (they don't handle volatility particularly well).

Which column-based NoSQL DBMS should you use? The most popular column-based DBMS is Cassandra. The software prides itself on its performance, boasting 100% availability thanks to the lack of a single point of failure, and offering impressive scalability at a good price. Cassandra's popularity speaks for itself - Cassandra is used by 40% of the Fortune 100. (See Mastering Apache Cassandra 3.x - Third Edition and Learn Apache Cassandra in Just 2 Hours [Video].) There are other options available, such as HBase and Cosmos DB. (See HBase High Performance Cookbook.)

Document-oriented NoSQL database management systems

Document-oriented NoSQL systems are very similar to key/value pair database management systems. The only difference is that the value paired with a key is stored as a document. Each document is self-contained, which means no schema is required - giving a significant degree of flexibility over the data you have. For software developers this is essential - it's for this reason that document-oriented databases such as MongoDB and CouchDB are useful components of the full-stack development toolchain. Some search platforms, such as Elasticsearch, use mechanisms similar to standard document-oriented systems - so they could be considered part of the same family of database management systems.

When should you use a document-oriented DBMS? Document-oriented databases can help power many different types of websites and applications - from stores to content systems. However, the flexibility of document-oriented systems means they are not built for complex queries.

Which document-oriented DBMS should you use? The leader in this space is MongoDB. With an amazing 40 million downloads (and apparently 30,000 more every single day), it's clear that MongoDB is a cornerstone of the NoSQL database revolution. (See MongoDB 4 Quick Start Guide, MongoDB Administrator's Guide, and MongoDB Cookbook - Second Edition.) There are other options as well as MongoDB - these include CouchDB, Couchbase, DynamoDB, and Cosmos DB. (See Learning Azure Cosmos DB and Guide to NoSQL with Azure Cosmos DB.)
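To see what 'schema-free' means in practice, here is a minimal, hypothetical sketch using pymongo. It assumes a MongoDB instance running locally, and the database, collection, and document fields are invented for illustration:

# Install the client with `pip install pymongo`; assumes MongoDB on localhost.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# No schema is required: each document is self-contained, and documents in
# the same collection can carry completely different fields.
orders.insert_one({"customer": "Ada Lopez", "total": 42.50,
                   "items": [{"sku": "B123", "qty": 2}]})
orders.insert_one({"customer": "Sam Chen", "total": 9.99,
                   "gift_wrap": True})

print(orders.find_one({"customer": "Ada Lopez"}))

Note how the second document adds a gift_wrap field the first one lacks - no migration or schema change needed.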
Graph-based NoSQL database management systems

The final type of NoSQL database is graph-based. The notable distinction of graph-based NoSQL databases is that they contain the relationships between different pieces of data. As a result, graph databases look quite different from any of the databases above - they store data as nodes, with the 'edges' between nodes describing their relationships. Compared to relational databases, graph databases are multidimensional in nature. They display not just basic relationships between tables and data, but more complex and multifaceted ones.

When should you use a graph database? Because graph databases encode the relationships between a set of data (customers, products, prices, etc.), they can be used to build and model networks. This makes graph databases extremely useful for applications ranging from fraud detection to smart homes to search.

Which graph database should you use? The world's most popular graph database is Neo4j. It's purpose-built for datasets that contain strong relationships and connections. Widely used in industry by companies such as eBay and Walmart, it has established its reputation as one of the world's best NoSQL database products. Back in 2015, Packt's Data Scientist demonstrated how he used Neo4j to build a graph application. Read more. (See also Learning Neo4j 3.x [Video] and Exploring Graph Algorithms with Neo4j [Video].)

NoSQL databases are the future - but know when to use the right one for the job

Although NoSQL databases will remain a fixture in the engineering world, SQL databases will always be around. This is an important point - when it comes to databases, using the right tool for the job is essential. It's a valuable exercise to explore a range of options and get to know how they work - sometimes the difference might just be a personal preference about usability. And that's fine - you need to be productive, after all. But what's ultimately most essential is having a clear sense of what you're trying to accomplish, and choosing the database based on your fundamental needs.
What can you expect at NeurIPS 2019?

Sugandha Lahoti
06 Sep 2019
5 min read
Popular machine learning conference NeurIPS 2019 (Conference on Neural Information Processing Systems) will be held from Sunday, December 8 through Saturday, December 14 at the Vancouver Convention Center. The conference invites papers, tutorials, and submissions on cross-disciplinary research where machine learning methods are being used in other fields, as well as on methods and ideas from other fields being applied to ML.

NeurIPS 2019 accepted papers

Yesterday, the conference published the list of its accepted papers. A total of 1429 papers have been selected. Submissions opened on May 1 on a variety of topics such as Algorithms, Applications, Data Implementations, Neuroscience and Cognitive Science, Optimization, Probabilistic Methods, Reinforcement Learning and Planning, and Theory. (The full list of subject areas is available here.) This year at NeurIPS 2019, authors of accepted submissions were required to prepare either a 3-minute video or a PDF of slides summarizing the paper, or a PDF of the poster used at the conference. This was done to make NeurIPS content accessible to those unable to attend. NeurIPS 2019 also introduced a mandatory abstract submission deadline a week before final submissions were due; only submissions with a full abstract were allowed to have the full paper uploaded. Authors were also asked to answer questions from the Reproducibility Checklist during the submission process.

NeurIPS 2019 tutorial program

NeurIPS also invites experts to present tutorials featuring topics that are of interest to a sizable portion of the NeurIPS community and that differ from those already presented at other ML conferences like ICML or ICLR. The organizers looked for tutorial speakers who could cover topics beyond their own research in a comprehensive manner that encompasses multiple perspectives. The tutorial chairs for NeurIPS 2019 are Danielle Belgrave and Alice Oh. They initially compiled a list based on the last few years' publications, workshops, and tutorials presented at NeurIPS and related venues. They asked colleagues for recommendations and conducted independent research. In reviewing the potential candidates, the chairs read papers to understand each candidate's expertise and watched videos to appreciate their style of delivery. The list of candidates was emailed to the General Chair, the Diversity & Inclusion Chairs, and the rest of the Organizing Committee for comments. Following a few adjustments based on their input, the speakers were selected. A total of nine tutorials have been selected for NeurIPS 2019:

- Deep Learning with Bayesian Principles - Emtiyaz Khan
- Efficient Processing of Deep Neural Networks: from Algorithms to Hardware Architectures - Vivienne Sze
- Human Behavior Modeling with Machine Learning: Opportunities and Challenges - Nuria Oliver, Albert Ali Salah
- Interpretable Comparison of Distributions and Models - Wittawat Jitkrittum, Dougal Sutherland, Arthur Gretton
- Language Generation: Neural Modeling and Imitation Learning - Kyunghyun Cho, Hal Daume III
- Machine Learning for Computational Biology and Health - Anna Goldenberg, Barbara Engelhardt
- Reinforcement Learning: Past, Present, and Future Perspectives - Katja Hofmann
- Representation Learning and Fairness - Moustapha Cisse, Sanmi Koyejo
- Synthetic Control - Alberto Abadie, Vishal Misra, Devavrat Shah

NeurIPS 2019 workshops

NeurIPS workshops are primarily used for discussion of work in progress and future directions.
This time the number of Workshop Chairs doubled from two to four; the selected chairs are Jenn Wortman Vaughan, Marzyeh Ghassemi, Shakir Mohamed, and Bob Williamson. However, the number of workshop submissions went down from 140 in 2018 to 111 in 2019. Of these 111 submissions, 51 workshops were selected. The full list of selected workshops is available here. The NeurIPS 2019 chair committee introduced new guidelines, expectations, and selection criteria for the workshops. This time, workshop selection placed an important focus on the nature of the problem, the intellectual excitement of the topic, diversity and inclusion, the quality of proposed invited speakers, the organizational experience and ability of the team, and more. The Workshop Program Committee consisted of 37 reviewers, with each workshop proposal assigned to two reviewers. The reviewer committee included more senior researchers who have been involved with the NeurIPS community. Reviewers were asked to provide a summary and overall rating for each workshop, a detailed list of pros and cons, and specific ratings for each of the new criteria. After all reviews were submitted, each proposal was assigned to two of the four chair committee members. The chair members looked through the assigned proposals and their reviews to form an educated assessment of the pros and cons of each. Finally, the entire chair committee held a meeting to discuss every submitted proposal and make decisions. You can check more details about the conference on the NeurIPS website. As always, keep checking this space for more content about the conference. In the meantime, you can read our previous years' coverage:

- NeurIPS Invited Talk: Reproducible, Reusable, and Robust Reinforcement Learning
- NeurIPS 2018: How machine learning experts can work with policymakers to make good tech decisions [Invited Talk]
- NeurIPS 2018: Rethinking transparency and accountability in machine learning
- NeurIPS 2018: Developments in machine learning through the lens of Counterfactual Inference [Tutorial]
- Accountability and algorithmic bias: Why diversity and inclusion matters [NeurIPS Invited Talk]
Google is circumventing GDPR, reveals Brave's investigation for the Authorized Buyers ad business case

Bhagyashree R
06 Sep 2019
6 min read
Last year, Dr. Johnny Ryan, the Chief Policy & Industry Relations Officer at Brave, filed a complaint against Google's DoubleClick/Authorized Buyers ad business with the Irish Data Protection Commission (DPC). New evidence produced by Brave reveals that Google is circumventing GDPR and undermining its own data protection measures.

Brave calls Google's Push Pages a GDPR workaround

Brave's new evidence rebuts some of Google's claims regarding its DoubleClick/Authorized Buyers system, the world's largest real-time advertising auction house. Google says that it prohibits companies that use its real-time bidding (RTB) ad system "from joining data they receive from the Cookie Matching Service." In September last year, Google announced that it had removed encrypted cookie IDs and list names from bid requests with buyers in its Authorized Buyers marketplace. Brave's research, however, found otherwise: "Brave's new evidence reveals that Google allowed not only one additional party, but many, to match with Google identifiers. The evidence further reveals that Google allowed multiple parties to match their identifiers for the data subject with each other."

When you visit a website that has Google ads embedded on its pages, Google runs a real-time bidding ad auction to determine which advertiser gets to display its ads. For this, it uses Push Pages - the mechanism in question here. Brave hired Zach Edwards, the co-founder of digital analytics startup Victory Medium, and MetaX, a company that audits data supply chains, to investigate and analyze a log of Dr. Ryan's web browsing. The research revealed that Google's Push Pages can essentially be used as a workaround for user IDs: Google shares a 'google_push' identifier with the participating companies to identify a user. Brave says the problem is that the identifier shared was common to multiple companies, which means these companies could have cross-referenced what they learned about the user with each other.

Used by more than 8.4 million websites, Google's DoubleClick/Authorized Buyers broadcasts the personal data of users to 2,000+ companies. This data includes the category of what a user is reading, which can reveal their political views, sexual orientation, and religious beliefs, as well as their location. There are also unique ID codes specific to a user that let companies uniquely identify that user. All this information can give these companies a way to keep tabs on what users are "reading, watching, and listening to online." Brave calls Google's RTB data protection policies "weak" because they ask these companies to self-regulate, and Google does not have much control over what the companies do with the data once broadcast. "Its policy requires only that the thousands of companies that Google shares peoples' sensitive data with monitor their own compliance, and judge for themselves what they should do," Brave wrote. In response to this news, a Google spokesperson told Forbes: "We do not serve personalised ads or send bid requests to bidders without user consent. The Irish DPC - as Google's lead DPA - and the UK ICO are already looking into real-time bidding in order to assess its compliance with GDPR. We welcome that work and are co-operating in full."
Users recommend starting an "information campaign" instead of a penalty that will hardly affect big tech

This news triggered a discussion on Hacker News, where users talked about the implications of RTB and what strict actions the EU could take to protect user privacy. One user explained: "So, let's say you're an online retailer, and you have Google IDs for your customers. You probably have some useful and sensitive customer information, like names, emails, addresses, and purchase histories. In order to better target your ads, you could participate in one of these exchanges, so that you can use the information you receive to suggest products that are as relevant as possible to each customer. To participate, you send all this sensitive information, along with a Google ID, and receive similar information from other retailers, online services, video games, banks, credit card providers, insurers, mortgage brokers, service providers, and more! And now you know what sort of vehicles your customers drive, how much they make, whether they're married, how many kids they have, which websites they browse, etc. So useful! And not only do you get all these juicy private details, but you've also shared your customers' sensitive purchase history with anyone else who is connected to the exchange."

Others said that a penalty is not going to deter Google: "The whole penalty system is quite silly. The fines destroy small companies who are the ones struggling to comply, and do little more than offer extremely gentle pokes on the wrist for megacorps that have relatively unlimited resources available for complete compliance, if they actually wanted to comply." Users suggested that the EU should instead start an information campaign: "EU should ignore the fines this time and start an 'information campaign' regarding behavior of Google and others. I bet that hurts Google 10 times more." Some also said that not just Google but the RTB participants should be held responsible: "Because what Google is doing is not dissimilar to how any other RTB participant is acting, saying this is a Google workaround seems disingenuous."

With this case, Brave has launched a full-fledged campaign that aims to reform the multi-billion-dollar RTB industry, spanning sixteen EU countries. To achieve this goal it has collaborated with several privacy NGOs and academics, including the Open Rights Group and Dr. Michael Veale of the Turing Institute, among others. In other news, a Bloomberg report reveals that Google and other internet companies have recently asked for an amendment to the California Consumer Privacy Act, which will be enacted in 2020. The law currently limits how digital advertising companies collect and make money from user data. The proposed amendments include approval for collecting user data for targeted advertising, using the data collected from websites for their own analysis, and many others. Read the Bloomberg report to learn more.

Other news in Data
- Facebook content moderators work in filthy, stressful conditions and experience emotional trauma daily, reports The Verge
- GDPR complaint in EU claims billions of personal data records leaked via online advertising bids
- European Union fined Google 1.49 billion euros for antitrust violations in online advertising
Key skills every database programmer should have

Sugandha Lahoti
05 Sep 2019
7 min read
According to Robert Half Technology's 2019 IT salary report, 'database programmer' is one of the 13 most in-demand tech jobs for 2019. For an entry-level programmer, the average salary is $98,250, rising to $167,750 for a seasoned expert. A typical database programmer is responsible for designing, developing, testing, deploying, and maintaining databases. In this article, we list the critical tech skills essential to database programmers.

#1 Ability to perform data modelling

The first step is to learn to model the data. In data modeling, you create a conceptual model of how data items relate to each other. In order to plan a database design efficiently, you should know the organization you are designing the database for. This is because data models describe real-world entities such as 'customer', 'service', and 'product', and the relations between these entities. Data models provide an abstraction for the relations in the database. They aid programmers in modeling business requirements and translating those requirements into relations, and they are also used for exchanging information between developers and business owners. During the design phase, the database developer should pay great attention to the underlying design principles, run a benchmark stack to ensure performance, and validate user requirements. They should also avoid pitfalls such as data redundancy, null saturation, and tight coupling.

#2 Know a database programming language, preferably SQL

Database programmers need to design, write, and modify programs to improve their databases. SQL is one of the top languages used to manipulate the data in a database and to query it. It's also used to define and change the structure of the data - in other words, to implement the data model. Therefore, it is essential that you learn SQL. In general, SQL has three parts:

- Data Definition Language (DDL): used to create and manage the structure of the data
- Data Manipulation Language (DML): used to manage the data itself
- Data Control Language (DCL): controls access to the data

Considering that data is constantly inserted into the database, changed, or retrieved, DML is used more often in day-to-day operations than DDL, so you should have a strong grasp of DML. If you plan to grow into a database architect role in the near future, a good grasp of DDL will also go a long way. Another reason to learn SQL is that almost every modern relational database supports it. Although different databases might support different features and implement their own dialect of SQL, the basics of the language remain the same: if you know SQL, you can quickly adapt to MySQL, for example. At present, the predominant categories of database models are relational, object-relational, and NoSQL databases, all meant for different purposes. Relational databases often adhere to SQL, and object-relational databases (ORDs) are similar to relational databases. NoSQL, which stands for "not only SQL," is an alternative to traditional relational databases, useful for working with large sets of distributed data. NoSQL databases provide benefits such as availability, schema-free design, and horizontal scaling, but also have limitations in performance, data retrieval constraints, and learning time. For beginners, it is advisable to start by experimenting with relational databases and learning SQL, gradually transitioning to NoSQL systems.
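To make the DDL/DML distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and data are illustrative, and note that SQLite has no DCL layer (no GRANT/REVOKE), so only DDL and DML appear:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the structure of the data.
cur.execute("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT UNIQUE
    )
""")

# DML: insert, update, and query the data itself.
cur.execute("INSERT INTO customers (name, email) VALUES (?, ?)",
            ("Ada Lopez", "ada@example.com"))
cur.execute("UPDATE customers SET email = ? WHERE name = ?",
            ("ada.lopez@example.com", "Ada Lopez"))

for row in cur.execute("SELECT id, name, email FROM customers"):
    print(row)

conn.commit()
conn.close()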
#3 Know how to Extract, Transform, Load various data types and sources

A database programmer should have a good working knowledge of ETL (Extract, Transform, Load) programming. ETL developers extract data from different databases, transform it, and then load it into the data warehouse system. A data warehouse provides a common data repository that is essential for business needs. A database programmer should know how to tune existing packages, tables, and queries for faster ETL processing, and should conduct unit tests before applying any change to the existing ETL process. Since ETL takes data from many different sources (SQL Server, CSV, and flat files, for example), a database developer should know how to deal with each of them.

#4 Design and test database plans

Database programmers perform regular tests to identify and solve database usage concerns and malfunctions. As databases usually sit at the lowest level of the software architecture, testing is done extremely cautiously, because changes to the database schema affect many other software components. A database developer should make sure that when they change the database structure, they do not break existing applications and that the new structures are used properly. You should be proficient in unit testing your database. Unit tests are typically used to check that small units of code function properly. For databases, unit testing can be difficult, so the easiest approach is often to write the tests as SQL scripts. You should also know about System Integration Testing (SIT), which is done on the complete system after its hardware and software modules have been integrated. SIT validates the behavior of the system and ensures that its modules function together suitably.

#5 Secure your database

Data protection and security are essential for the continuity of business. Databases often store sensitive data such as user information, email addresses, geographical addresses, and payment information, so a robust security system to protect your database against any data breach is necessary. While a database architect is responsible for designing and implementing secure design options, and a database admin must ensure that the right security and privacy policies are in place and being observed, this does not absolve database programmers from adopting secure coding practices. Database programmers need to ensure that data integrity is maintained over time and that data is secure from unauthorized changes or theft. They need to be especially careful about table permissions, i.e. who can read and write to which tables: you should know who is allowed to perform the four basic operations of INSERT, UPDATE, DELETE, and SELECT against which tables. Database programmers should also adopt authentication best practices depending on the infrastructure setup, the application's nature, the users' characteristics, and data sensitivity. If the database server is accessed from the outside world, it is beneficial to encrypt sessions using SSL certificates to avoid packet sniffing. You should also secure database servers that trust all localhost connections, as anyone who accesses the localhost can then access the database server.

#6 Optimize your database performance

A database programmer should also know how to optimize database performance to achieve the best results. At the basic level, this means knowing how to rewrite SQL queries and maintain indexes.
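As a small illustration of why indexes matter, the following sqlite3 sketch (with a hypothetical orders table) compares the query plan before and after an index is added. Without the index SQLite reports a full table scan; with it, an index search:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
cur.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                [(i % 100, i * 1.5) for i in range(10_000)])

query = "SELECT total FROM orders WHERE customer_id = 42"

# Without an index, SQLite scans the whole table.
print(cur.execute(f"EXPLAIN QUERY PLAN {query}").fetchall())

# With an index on the filtered column, it can seek directly to matching rows.
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(cur.execute(f"EXPLAIN QUERY PLAN {query}").fetchall())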
Other aspects of optimizing database performance include hardware configuration, network settings, and database configuration. Generally speaking, tuning database performance requires knowledge of the system's nature. Once the database server is configured, you should calculate the number of transactions per second (TPS) for the server setup. Once the system is up and running, you should set up a monitoring system or log analysis that periodically finds slow queries, the most time-consuming queries, and so on.

#7 Develop your soft skills

Apart from the above technical skills, a database programmer needs to be comfortable communicating with developers, testers, and project managers while working on any software project. A keen eye for detail and critical thinking can often spot malfunctions and errors that might otherwise be overlooked. A database programmer should be able to quickly fix issues within the database and streamline the code, and should possess the quick thinking needed to prioritize tasks and meet deadlines effectively. Database programmers are often required to work on documentation and technical user guides, so strong writing and technical-communication skills are a must.

Get started

If you want to get started as a database programmer, Packt has a range of products. Here are some of the best:

- PostgreSQL 11 Administration Cookbook
- Learning PostgreSQL 11 - Third Edition
- PostgreSQL 11 in 7 days [Video]
- Using MySQL Databases With Python [Video]
- Basic Relational Database Design [Video]

Read next:
- How to learn data science: from data mining to machine learning
- How to ace a data science interview
- 5 barriers to learning and technology training for small software development teams
How to learn data science: from data mining to machine learning

Richard Gall
04 Sep 2019
6 min read
Data science is a field that's complex and diverse. If you're trying to learn data science and become a data scientist, it can be easy to fall down a rabbit hole of machine learning or data processing. To a certain extent, that's good: to be an effective data scientist you need to be curious, and you need to be prepared to take on a range of different tasks and challenges. But it isn't always efficient. If you want to learn quickly and effectively, you need a clear structure - a curriculum - that you can follow. This post will show you what you need to learn and how to go about it.

Statistics

Statistics is arguably the cornerstone of data science. Nate Silver called data scientists "sexed up statisticians", a comment that was perhaps unfair but nevertheless contains a kernel of truth: data scientists are always working in the domain of statistics. Once you understand this, everything else you need to learn will follow easily. Machine learning, data manipulation, data visualization - these are all ultimately technological methods for performing statistical analysis really well.

Best Packt books and videos for learning statistics:
- Statistics for Data Science
- R Statistics Cookbook
- Statistical Methods and Applied Mathematics in Data Science [Video]

Before you go any deeper into data science, it's critical that you gain a solid foundation in statistics.

Data mining and wrangling

This is an important element of data science that often gets overlooked amid all the hype about machine learning. However, without effective data collection and cleaning, all your efforts elsewhere are going to be pointless at best. At worst, they might even be misleading or problematic. Sometimes called data manipulation or data munging, data wrangling is all about managing and cleaning data from different sources so that it can be used for analytics projects. To do it well, you need a clear sense of where you want to get to: do you need to restructure the data? Sort or remove certain parts of a dataset? Once you understand this, it's much easier to wrangle data effectively.

Data mining and wrangling tools

There are a number of different tools you can use for data wrangling. Python and R are the two key programming languages, and both have useful tools for data mining and manipulation. Python in particular has a great range of tools for data mining and wrangling, such as pandas and NLTK (the Natural Language Toolkit), but that isn't to say R isn't powerful in this domain. Other tools are available too - Weka and Apache Mahout, for example, are popular. Weka is written in Java, so it's a good option if you have experience with that programming language, while Mahout integrates well with the Hadoop ecosystem.

Data mining and data wrangling books and videos:
- Data Wrangling with R
- Data Wrangling with Python
- Python Data Mining Quick Start Guide
- Machine Learning for Data Mining

Machine learning and artificial intelligence

Although machine learning and artificial intelligence are huge trends in their own right, they are closely aligned with data science. Indeed, you might even say their prominence today has grown out of the excitement around data science we first witnessed just under a decade ago. It's a data scientist's job to use machine learning and artificial intelligence in a way that drives business value.
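As a minimal illustration of what that looks like in code - the dataset and model choice here are purely for demonstration, not a recommendation - a supervised scikit-learn workflow can be sketched in a few lines:

# Install with `pip install scikit-learn`; uses a bundled toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on data the model has never seen.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))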
In practice, that could mean recommending products or services to customers, gaining a better understanding of existing products, or managing strategic and financial risks through predictive modelling. So, while we can see machine learning in a massive range of digital products and platforms - all of which require smart development and design - for it to work successfully it needs to be supported by a capable and creative data scientist.

Machine learning and artificial intelligence books for data scientists:
- Machine Learning Algorithms
- Machine Learning with R - Third Edition
- Machine Learning with Apache Spark Quick Start Guide
- Machine Learning with TensorFlow 1.x
- Keras Deep Learning Cookbook

Data visualization

A talented data scientist isn't just a great statistician and engineer; they're also a great communicator. This means so-called soft skills are highly valuable - the ability to communicate insights and ideas to key stakeholders is essential. But great communication isn't just about soft skills; it's also about data visualization. Data visualization is, at a fundamental level, about organizing and presenting data in a way that tells a story, clarifies a problem, or illustrates a solution. It's essential that you don't overlook this step. Indeed, spending time learning about effective data visualization can also help you develop your soft skills: the principles behind storytelling and communication through visualization are, in truth, exactly the same when applied to other scenarios.

Data visualization tools

There is a huge range of data visualization tools available. As with machine learning, understanding the differences between them and working out which solution will work for you is an important part of the learning process. For that reason, don't be afraid to spend a little time with a range of data visualization tools. Many of the most popular data visualization tools are paid-for products. Perhaps the best known of these is Tableau (which, incidentally, was bought by Salesforce earlier this year). Tableau and its competitors are very user-friendly, which means the barrier to entry is pretty low, and they allow you to create some sophisticated data visualizations fairly easily. However, sticking to these tools is not only expensive, it can also limit your abilities. We'd recommend trying a number of different data visualization tools, such as Seaborn, D3.js, Matplotlib, and ggplot2.

Data visualization books and videos for data scientists:
- Applied Data Visualization with R and ggplot2
- Tableau 2019.1 for Data Scientists [Video]
- D3.js Data Visualization Projects [Video]
- Tableau in 7 Steps [Video]
- Data Visualization with Python

If you want to learn data science, just get started!

As we've seen, data science requires a number of very different skills and takes in a huge breadth of tools. That means that if you're going to be a data scientist, you need to be prepared to commit to learning forever: you're never going to reach a point where you know everything. While that might sound intimidating, it's important to have confidence. With a sense of direction and purpose, and a learning structure that works for you, it's possible to develop and build your data science capabilities in a way that could unlock new opportunities and act as the basis for some really exciting projects.
How to ace a data science interview

Richard Gall
02 Sep 2019
12 min read
So, you want to be a data scientist. It's a smart move: it's a job that's in high demand, can command a healthy salary, and can also be richly rewarding and engaging. But to get the job, you're going to have to pass a data science interview - something that's notoriously tough. One of the reasons for this is that data science is an incredibly diverse field. I mean that in two different ways: on the one hand, it's a role that demands a variety of different skills (being a good data scientist is about much more than just being good at math). But it's also diverse in the sense that data science is done differently at every company, which means every data science interview is going to be different. If you specialize too much in one area, you might well be severely limiting your opportunities. There are plenty of articles out there that pretend to have all the answers to your next data science interview. While these can be useful, they also treat job interviews like exams you need to pass. They're not: you need a wide range of knowledge, but you also need to present yourself as a curious and critical thinker, and as someone who is very good at communicating. You won't get a data science job by knowing all the answers, but you might get it by asking the right questions and talking in the right way. So, with all that in mind, here's what you need to do to ace your data science interview.

Know the basics of data science

This is obvious, but it's impossible to overstate. If you don't know the basics, there's no way you'll get the job - indeed, it's probably better for your sake that you don't get it! But what are these basics?

Basic data science interview questions

- "What is data science?" This seems straightforward, but proving you've thought about what the role actually involves demonstrates that you're thoughtful and self-aware - a sign of any good employee.
- "What's the difference between supervised and unsupervised learning?" Again, this is straightforward, but it gives the interviewer confidence that you understand the basics of machine learning algorithms.
- "What is the bias-variance tradeoff? What are overfitting and underfitting?" Being able to explain these concepts clearly and concisely demonstrates your clarity of thought. It also shows that you have a strong awareness of the challenges of using machine learning and statistical systems.

If you're applying for a job as a data scientist you'll probably already know the answers to all of these. Just make sure you have a clear answer for each and that you can explain it concisely.

Know your algorithms

Knowing your algorithms is a really important part of any data science interview. However, it's important not to get hung up on the details. Trying to learn everything about every algorithm isn't only impossible, it's also not going to get you the job. What's important instead is demonstrating that you understand the differences between algorithms, and when to use one over another.

Data science interview questions about algorithms you might be asked

- "When would you use a supervised machine learning algorithm?"
- "Can you name some supervised machine learning algorithms and the differences between them?" (Supervised machine learning algorithms include Support Vector Machines, Naive Bayes, the K-nearest Neighbor algorithm, regression, and decision trees.)
- "When would you use an unsupervised machine learning algorithm?"
(Unsupervised machine learning algorithms include K-Means, autoencoders, Generative Adversarial Networks, and Deep Belief Nets.)
- "Name some unsupervised machine learning algorithms and explain how they differ from one another."
- "What are classification algorithms?"

There are others, but try to focus on these core areas. Remember, it's also important to talk about your experience - that's just as useful, if not more useful, than listing the differences between machine learning algorithms. Some of the questions you face in a data science interview might even be about how you use algorithms:

- "Tell me about a time you used an algorithm. Why did you decide to use it? Were there any other options?"
- "Tell me about a time you used an algorithm and it didn't work how you expected. What did you do?"

When talking about algorithms in a data science interview, it's useful to present them as tools for solving business problems. It can be tempting to talk about them as mathematical concepts, and although it's good to show off your understanding, showing how algorithms help solve real-world business problems will be a big plus for your interviewer.

Be confident talking about data sources and infrastructure challenges

One of the biggest challenges for data scientists is dealing with incomplete or poor-quality data. If that's something you've faced - or even something you think you might face in the future - make sure you talk about it. Data scientists aren't always responsible for managing a data infrastructure (that varies from company to company), but even if it isn't in the job description, it's likely that you'll have to work with a data architect to make sure data is available and accurate enough to carry out data science projects. This means that understanding topics like data streaming, data lakes, and data warehouses is very important in a data science interview. Again, it's important not to get stuck on the details. You don't need to recite everything you know; instead, talk about your experience or how you might approach problems in different ways.

Data science interview questions you might be asked about using different data sources

- "How do you work with data from different sources?"
- "How have you tackled dirty or unreliable data in the past?"

Data science interview questions you might be asked about infrastructure

- "Talk me through a data infrastructure challenge you've faced in the past."
- "What's the difference between a data lake and a data warehouse? How would you approach each one differently?"

Show that you have a robust understanding of data science tools

You can't get through a data science interview without demonstrating that you have knowledge and experience of data science tools. It's likely that the job you're applying for will mention a number of different skill requirements in the job description, so make sure you have a good knowledge of them all. Obviously, the best-case scenario is that you know all the tools mentioned in the job description inside out - but this is unlikely. If you don't know one - or more - make sure you understand what they're for and how they work. The hiring manager probably won't expect candidates to know everything, but they will expect them to be ready and willing to learn. If you can talk about a time you learned a new tool, that will give the interviewer a lot of confidence that you're someone who can pick up knowledge and skills quickly.
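If it helps to ground the algorithm questions above, here is a minimal scikit-learn sketch contrasting a supervised and an unsupervised algorithm on the same toy data; the model choices are illustrative, not a recommendation:

# Install with `pip install scikit-learn`.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: learns a mapping from features to known labels y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted class:", clf.predict(X[:1]))

# Unsupervised: no labels at all - K-Means discovers cluster structure on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("assigned cluster:", km.labels_[:1])

Being able to narrate the difference this crisply - "one learns from labels, the other finds structure without them" - is exactly the kind of answer interviewers are after.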
Show you can evaluate different tools and programming languages

Another element here is being able to talk about the advantages and disadvantages of different tools. Why might you use R over Python? Which Python libraries should you use to solve a specific problem? And when should you just use Excel? Sometimes the interviewer might ask for your own personal preferences. Don’t be scared of giving your opinion - as long as you have a considered explanation for why you hold it, you’re fine!

Read next: Why is Python so good for AI and Machine Learning? 5 Python Experts Explain

Data science interview questions about tools that you might be asked

"What tools have you used - or could you use - for data processing and cleaning? What are their benefits and disadvantages?" (These include tools such as Hadoop, Pentaho, Flink, Storm, and Kafka.)

"What tools do you think are best for data visualization, and why?" (This includes tools like Tableau, PowerBI, D3.js, Infogram, and Chartblocks - there are so many different products in this space that it’s important to be able to talk about what you value most in a data visualization tool.)

"Do you prefer using Python or R? Are there times when you’d use one over the other?"

"Talk me through some machine learning libraries. How do they compare to one another?" (This includes libraries like TensorFlow, Keras, and PyTorch. If you don’t have any experience with them, make sure you’re aware of the differences, and talk about which you are most curious about learning.)

Always focus on business goals and results

This sounds obvious, but it’s easy to forget - especially if you’re a data geek who loves to talk about statistical models and machine learning. To combat this, make sure you’re very clear on how your experience was tied to business goals. Take some time to think about why you were doing what you were doing. What were you trying to find out? What metrics were you trying to drive?

Interpersonal and communication skills

Another element of this is talking about your interpersonal skills and your ability to work with a range of different stakeholders. Think carefully about how you worked alongside other teams, how you went about capturing requirements, and how you built solutions for them. Think also about how you managed - or would manage - expectations. It’s well known that business leaders can expect data to be a silver bullet when it comes to results, so how do you make sure that people stay realistic?

Show off your data science portfolio

A good way of showing your business acumen as a data scientist is to build a portfolio of work. Portfolios are typically viewed as something for creative professionals, but they’re becoming increasingly popular in the tech industry as competition for roles gets tougher. This post explains everything you need to build a great data science portfolio. Broadly, the most important thing is that it demonstrates how you have added value to an organization. This could be:

Insights you’ve shared in reports with management

Customer-facing applications you’ve built that rely on data

Internal dashboards and applications you’ve built

Bringing a portfolio to an interview can give you a solid foundation on which to answer questions. But remember - you might be asked questions about your work, so make sure you have answers prepared!

Data science interview questions about business performance

"Talk about a time you have worked across different teams."

"How do you manage stakeholder expectations?"
"What do you think are the most important elements in communicating data insights to management?"

If you can talk fluently about how your work impacts business performance, and about how you worked alongside people in non-technical positions, you will give yourself a good chance of landing the job!

Show that you understand ethical and privacy issues in data science

This might seem like a superfluous point, but given the events of recent years - like the Cambridge Analytica scandal - ethics has become a big topic of conversation. Employers will expect prospective data scientists to have an awareness of some of these problems and of how to go about mitigating them.

To some extent, this is an extension of the previous point. Showing you are aware of ethical issues, such as privacy and discrimination, proves that you are fully engaged with the needs and risks a business might face. It also underlines that you are aware of the consequences and potential impact of data science activities on customers - what your work does in the real world.

Read next: Introducing Deon, a tool for data scientists to add an ethics checklist

Data science interview questions about ethics and privacy

"What are some of the ethical issues around machine learning and artificial intelligence?"

"How can you mitigate any of these issues? What steps would you take?"

"Has GDPR impacted the way you do data science?"

"What are some other privacy implications for data scientists?"

"How do you understand explainability and interpretability in machine learning?"

Ethics is a topic that’s easy to overlook, but it’s essential for every data scientist. To get a good grasp of the issues, it’s worth investigating more technical content on things like machine learning interpretability, as well as following news and commentary around emergent issues in artificial intelligence.

Conclusion: Don’t treat a data science interview like an exam

Data science is a complex and multi-faceted field. That can make data science interviews feel like a serious test of your knowledge - and it can be tempting to revise as you would for an exam. But, as we’ve seen, that’s a mistake. To ace a data science interview you can’t just recite information and facts. You need to talk clearly and confidently about your experience and demonstrate your drive and curiosity.

That doesn’t mean you shouldn’t make sure you know the basics. But rather than getting too hung up on definitions and statistical details, it’s a better use of your time to consider how you have performed in your roles in the past, and what you might do in the future. A thoughtful, curious data scientist is immensely valuable. Show your interviewer that you are one.
article-image-data-science-vs-machine-learning-understanding-the-difference-and-what-it-means-today
Richard Gall
02 Sep 2019
8 min read
Save for later

Data science vs. machine learning: understanding the difference and what it means today

One of the things that I really love about the tech industry is how often different terms - buzzwords especially - can cause confusion. It isn’t hard to see this in the wild: Quora is replete with confused people asking about the difference between a ‘developer’ and an ‘engineer’, and how ‘infrastructure’ is different from ‘architecture’. One of the biggest points of confusion is the difference between data science and machine learning. Both terms refer to different but related domains - given their popularity, it isn’t hard to see how some people might be a little perplexed.

This might seem like a purely semantic problem, but in the context of people’s careers - as they make decisions about the resources they use and the courses they pay for - the distinction becomes much more important. Indeed, it can be perplexing for developers thinking about their careers: with ‘machine learning engineer’ starting to appear across job boards, it’s not always clear where that role ends and ‘data scientist’ begins.

Tl;dr: To put it simply - and if you can’t be bothered to read further - data science is a discipline or job role that’s all about answering business questions through data. Machine learning, meanwhile, is a technique that can be used to analyze or organize data. So, data scientists might well use machine learning to find something out, but it would only be one aspect of their job.

But what are the implications of this distinction between machine learning and data science? What can the relationship between the two terms tell us about how technology trends evolve? And how can it help us better understand them both?

Read next: 9 data science myths debunked

What’s causing confusion about the difference between machine learning and data science?

The data science v machine learning confusion comes from the fact that both terms have a significant grip on the collective imagination of the tech and business world. Back in 2012 the Harvard Business Review declared data scientist to be the ‘sexiest job of the 21st century’. This was before the machine learning and artificial intelligence boom, but it’s the point we need to go back to if we want to understand how data has shaped the tech industry as we know it today.

Data science v machine learning on Google Trends

Take a look at the Google Trends data for the two terms. Both broadly received a similar level of interest: ‘machine learning’ was slightly higher throughout the noughties, and a larger gap has emerged more recently. Despite that, it’s worth looking at the period around 2014, when ‘data science’ managed to eclipse machine learning. Today, that feels remarkable given how far machine learning has extended into popular consciousness. It suggests that the HBR article was incredibly timely, identifying the emergence of the field. But more importantly, it’s worth noting that this spike for ‘data science’ comes at the time that both terms surge in popularity. So, although machine learning eventually wins out, ‘data science’ was becoming particularly important just as these twin trends were starting to grow.

This is interesting, and it’s contrary to what I’d expect. Typically, I’d imagine the more technical term to take precedence over the more conceptual field: a technical trend emerges first, and a more abstract concept gains traction afterwards. But here the concept - the discipline - spikes just before machine learning properly takes off.
This suggests that the evolution and growth of machine learning begins with the foundations of data science. This is important. It highlights that the obsession with data science - which might well have seemed somewhat self-indulgent - was, in fact, an integral step for business to properly make sense of what the ‘big data revolution’ (a phrase that sounds eighty years old) meant in practice.

Insofar as ‘data science’ is a term that really just refers to a role that’s performed, its growth was ultimately evidence of a space being carved out inside modern businesses that gave a domain expert the freedom to explore and invent in the service of business objectives. If that was the baseline, then the continued rise of machine learning feels inevitable. Machine learning started out contained within computer science departments in academia, then spread into business thanks to the emergence of the data scientist job role; from there, we started to see a whole suite of tools and use cases that were about much more than analytics and insight. Machine learning became a practical tool with applications everywhere. From cybersecurity to mobile applications, from marketing to accounting, machine learning couldn’t be contained within the data science discipline. This wasn’t just a conceptual point - practically speaking, a data scientist simply couldn’t provide support for all the different ways in which business functions wanted to use machine learning.

So, the confusion around the relationship between machine learning and data science stems from the fact that the two trends go hand in hand - or at least they used to. To properly understand how they’re different, let’s look at what a data scientist actually does.

Read next: Data science for non-techies: How I got started (Part 1)

What is data science, exactly?

I know you’re not supposed to use Wikipedia as a reference, but the opening sentence in the entry for ‘data science’ is instructive: “Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.” The word that deserves your attention is ‘multi-disciplinary’, as this underlines what makes data science unique and why it stands outside the more specific taxonomy of machine learning terms. Essentially, it’s a human activity as much as a technical one - it’s about arranging, organizing, interpreting, and communicating data.

To a certain extent it shares a common thread of DNA with statistics. But although Nate Silver said that ‘data scientist’ was “a sexed up term for statistician”, I think there are some important distinctions. To do data science well you need to be deeply engaged with how your work integrates with the wider business strategy and processes. The term ‘statistics’ - like ‘machine learning’ - doesn’t quite capture this. Indeed, to a certain extent this has made data science a challenging field to work in. It isn’t hard to find evidence that data scientists are trying to leave their jobs, frustrated with how their roles are being used and how they integrate into existing organizational structures.

How do data scientists use machine learning?

As a data scientist, your job is to answer questions. These are questions like:

What might happen if we change the price of a product in this way?

What do our customers think of our products?

How often do customers purchase products?

How are customers using our products?

How can we understand the existing market? How might we tackle it?

Where could we improve efficiencies in our processes?

That’s just a small set. The types of questions data scientists tackle will vary depending on the industry, the company - everything. Every data science job is unique. But whatever questions data scientists are asking, it’s likely that at some point they’ll be using machine learning. Whether it’s to analyze customer sentiment (grouping and sorting) or to predict outcomes, a data scientist will have a number of algorithms up their proverbial sleeves, ready to tackle whatever the business throws at them.
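To make that concrete, here’s a minimal sketch of the kind of thing a data scientist might reach for: grouping customers into segments with k-means clustering in scikit-learn. The data and column names here are invented purely for illustration - in practice they’d come from your own customer table:

>>> import pandas as pd
>>> from sklearn.cluster import KMeans
>>> from sklearn.preprocessing import StandardScaler
>>> # Hypothetical purchase data - the columns are stand-ins for
>>> # whatever your real customer table actually contains
>>> customers = pd.DataFrame({
...     'orders_per_month': [1, 2, 8, 9, 1, 10, 3, 2],
...     'avg_order_value': [20, 25, 90, 110, 15, 95, 30, 22]})
>>> # Scale the features so neither one dominates the distance metric
>>> scaled = StandardScaler().fit_transform(customers)
>>> # Two segments for illustration; in practice you'd choose k with
>>> # something like the elbow heuristic or silhouette scores
>>> kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
>>> customers['segment'] = kmeans.fit_predict(scaled)

The clustering itself is three lines of scikit-learn; what makes it data science is being able to say what business question the segments answer (who are our high-value customers?) and what the business should do with the result.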
Machine learning beyond data science

The machine learning revolution might have started in data science, but it has rapidly expanded far beyond that strict discipline. Indeed, one of the reasons some people are confused about the relationship between the two concepts is that machine learning today touches just about everything, like water spilling out of its neat data science container.

Machine learning is for everyone

Machine learning is being used in everything from mobile apps to cybersecurity. And although data scientists might sometimes play a part in these domains, we’re also seeing subject-specific developers and engineers taking more responsibility for how machine learning is used. One of the reasons for this is, as I mentioned earlier, the fact that a data scientist - or even a couple of them - can’t do everything a business might want when it comes to machine learning. But another is the fact that machine learning is getting easier. You no longer need to be an expert to employ machine learning algorithms - instead, you need the confidence and foundational knowledge to use existing machine learning tools and products.

This ‘productization’ of machine learning is arguably what’s having the biggest impact on how we understand the topic. It’s even shrinking data science, making it a more specific role. That might sound like data science is less important today than it was in 2014, but it can only be a good thing for data scientists - it means they’re no longer being asked to spread themselves so thinly.

So, if you’ve been googling ‘data science v machine learning’, you now know the answer. The two terms are distinct, but they both come out of the ‘big data revolution’, which we’re still living through. Both trends and terms are likely to evolve in the future, but they’re certainly not going to disappear - as the data at our disposal grows, making effective use of it is only going to become more important.

article-image-bitbucket-to-no-longer-support-mercurial-users-must-migrate-to-git-by-may-2020
Fatema Patrawala
21 Aug 2019
6 min read
Save for later

Bitbucket to no longer support Mercurial, users must migrate to Git by May 2020

Yesterday marked the end of an era for Mercurial users, as Bitbucket announced it will no longer support Mercurial repositories after May 2020.

Bitbucket, owned by Atlassian, is a web-based version control repository hosting service for source code and development projects. It has supported Mercurial since its launch in 2008, and Git since October 2011. Now, after almost ten years of sharing its journey with Mercurial, the Bitbucket team has decided to remove Mercurial support from Bitbucket Cloud and its API. The official announcement reads, “Mercurial features and repositories will be officially removed from Bitbucket and its API on June 1, 2020.”

The Bitbucket team also communicated the timeline for sunsetting the Mercurial functionality. After February 1, 2020, users will no longer be able to create new Mercurial repositories. After June 1, 2020, users will not be able to use Mercurial features in Bitbucket or via its API, and all Mercurial repositories will be removed. All current Mercurial functionality in Bitbucket will remain available through May 31, 2020.

The team said the decision was not an easy one, and that Mercurial holds a special place in their hearts. But according to a Stack Overflow Developer Survey, almost 90% of developers use Git, while Mercurial is the least popular version control system with only about 3% developer adoption. Mercurial usage on Bitbucket has also seen a steady decline, with the percentage of new Bitbucket users choosing Mercurial falling to less than 1%. Hence the decision to remove the Mercurial repos.

How can users migrate and export their Mercurial repos?

The Bitbucket team recommends that users migrate their existing Mercurial repos to Git. They have also extended support for migration, and kept the available options open for discussion in a dedicated Community thread, where users can discuss conversion tools and migration, share tips, and get troubleshooting help. Users who prefer to continue using Mercurial can choose from a number of free and paid Mercurial hosting services. The Bitbucket team has also created a Git tutorial that covers everything from the basics of creating pull requests to rebasing and Git hooks.

Community shows anger and sadness over the decision to discontinue Mercurial support

Mercurial users are extremely unhappy and sad about this decision, and they have expressed their anger not just on one platform but across multiple forums and community discussions. Users feel that Bitbucket’s decision to stop offering Mercurial support is bad, but the decision to also delete the repos is evil.

On Hacker News, users speculated that Git’s dominance owes more to marketing than to technically superior architecture or ease of use. They feel GitHub successfully marketed Git, and that’s how the two have become synonymous for much of the developer community. One of them comments:

“It's very sad to see bitbucket dropping mercurial support. Now only Facebook and volunteers are keeping mercurial alive. Sometimes technically better architecture and user interface lose to a non user friendly hard solutions due to inertia of mass adoption. So a lesson in Software development is similar to betamax and VHS, so marketing is still a winner over technically superior architecture and ease of use. GitHub successfully marketed git, so git and GitHub are synonymous for most developers.
Now majority of open source projects are reliant on a single proprietary solution Github by Microsoft, for managing code and project. Can understand the difficulty of bitbucket, when Python language itself moved out of mercurial due to the same inertia. Hopefully gitlab can come out with mercurial support to migrate projects using it from bitbucket.”

Another user comments that Mercurial support was the only reason they still used Bitbucket, given that GitHub is miles ahead, and predicts that with Mercurial gone, Bitbucket will fade away:

“Mercurial support was the one reason for me to still use Bitbucket: there is no other Bitbucket feature I can think of that Github doesn't already have, while Github's community is miles ahead since everyone and their dog is already there. More importantly, Bitbucket leaves the migration to you (if I read the article correctly). Once I download my repo and convert it to git, why would I stay with the company that just made me go through an annoying (and often painful) process, when I can migrate to Github with the exact same command? And why isn't there a "migrate this repo to git" button right there? I want to believe that Bitbucket has smart people and that this choice is a good one. But I'm with you there - to me, this definitely looks like Bitbucket will die.”

On Reddit, programmers see this as a big change, since Bitbucket has been the major Mercurial hosting provider, and many feel the announcement came at short notice and that they need more time to migrate. Users have also expressed their displeasure on Atlassian’s own community blog. A team of scientists commented:

“Let's get this straight : Bitbucket (offering hosting support for Mercurial projects) was acquired by Atlassian in September 2010. Nine years later Atlassian decides to drop Mercurial support and delete all Mercurial repositories. Atlassian, I hate you :-) The image you have for me is that of a harmful predator. We are a team of scientists working in a university. We don't have computer scientists, we managed to use a version control simple as Mercurial, and it was a hard work to make all scientists in our team to use a version control system (even as simple as Mercurial). We don't have the time nor the energy to switch to another version control system. But we will, forced and obliged. I really don't want to check out Github or something else to migrate our projects there, but we will, forced and obliged.”

Read next:

Atlassian Bitbucket, GitHub, and GitLab take collective steps against the Git ransomware attack

Attackers wiped many GitHub, GitLab, and Bitbucket repos with ‘compromised’ valid credentials leaving behind a ransom note

BitBucket goes down for over an hour