Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases now! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon

How-To Tutorials - Data

1205 Articles
article-image-vertex-ai-workbench-your-complete-guide-to-scaling-machine-learning-with-google-cloud
Jasmeet Bhatia, Kartik Chaudhary
04 Nov 2024
15 min read
Save for later

Vertex AI Workbench: Your Complete Guide to Scaling Machine Learning with Google Cloud

Jasmeet Bhatia, Kartik Chaudhary
04 Nov 2024
15 min read
This article is an excerpt from the book, "The Definitive Guide to Google Vertex AI", by Jasmeet Bhatia, Kartik Chaudhary. The Definitive Guide to Google Vertex AI is for ML practitioners who want to learn Google best practices, MLOps tooling, and turnkey AI solutions for solving large-scale real-world AI/ML problems. This book takes a hands-on approach to help you become an ML rockstar on Google Cloud Platform in no time.Introduction While working on an ML project, if we are running a Jupyter Notebook in a local environment, or using a web-based Colab- or Kaggle-like kernel, we can perform some quick experiments and get some initial accuracy or results from ML algorithms very fast. But we hit a wall when it comes to performing large-scale experiments, launching long-running jobs, hosting a model, and also in the case of model monitoring. Additionally, if the data related to a project requires some more granular permissions on security and privacy (fine-grained control over who can view/access the data), it’s not feasible in local or Colab-like environments. All these challenges can be solved just by moving to the cloud. Vertex AI Workbench within Google Cloud is a JupyterLab-based environment that can be leveraged for all kinds of development needs of a typical data science project. The JupyterLab environment is very similar to the Jupyter Notebook environment, and thus we will be using these terms interchangeably throughout the book. Vertex AI Workbench has options for creating managed notebook instances as well as user-managed notebook instances. User-managed notebook instances give more control to the user, while managed notebooks come with some key extra features. We will discuss more about these later in this section. Some key features of the Vertex AI Workbench notebook suite include the following: Fully managed–Vertex AI Workbench provides a Jupyter Notebook-based fully managed environment that provides enterprise-level scale without managing infrastructure, security, and user-management capabilities. Interactive experience–Data exploration and model experiments are easier as managed notebooks can easily interact with other Google Cloud services such as storage systems, big data solutions, and so on. Prototype to production AI–Vertex AI notebooks can easily interact with other Vertex AI tools and Google Cloud services and thus provide an environment to run end-to-end ML projects from development to deployment with minimal transition. Multi-kernel support–Workbench provides multi-kernel support in a single managed notebook instance including kernels for tools such as TensorFlow, PyTorch, Spark, and R. Each of these kernels comes with pre-installed useful ML libraries and lets us install additional libraries as required. Scheduling notebooks–Vertex AI Workbench lets us schedule notebook runs on an ad hoc and recurring basis. This functionality is quite useful in setting up and running large-scale experiments quickly. This feature is available through managed notebook instances. More information will be provided on this in the coming sections. With this background, we can now start working with Jupyter Notebooks on Vertex AI Workbench. The next section provides basic guidelines for getting started with notebooks on Vertex AI. Getting started with Vertex AI Workbench Go to the Google Cloud console and open Vertex AI from the products menu on the left pane or by using the search bar on the top. Inside Vertex AI, click on Workbench, and it will open a page very similar to the one shown in Figure 4.3. More information on this is available in the official  documentation (https://cloud.google.com/vertex-ai/docs/workbench/ introduction).  Figure 4.3 – Vertex AI Workbench UI within the Google Cloud console As we can see, Vertex AI Workbench is basically Jupyter Notebook as a service with the flexibility of working with managed as well as user-managed notebooks. User-managed notebooks are suitable for use cases where we need a more customized environment with relatively higher control. Another good thing about user-managed notebooks is that we can choose a suitable Docker container based on our development needs; these notebooks also let us change the type/size of the instance later on with a restart. To choose the best Jupyter Notebook option for a particular project, it’s important to know about the common differences between the two solutions. Table 4.1 describes some common differences between fully managed and user-managed notebooks: Table 4.1 – Differences between managed and user-managed notebook instances Let’s create one user-managed notebook to check the available options:  Figure 4.4 – Jupyter Notebook kernel configurations As we can see in the preceding screenshot, user-managed notebook instances come with several customized image options to choose from. Along with the support of tools such as TensorFlow Enterprise, PyTorch, JAX, and so on, it also lets us decide whether we want to work with GPUs (which can be changed later, of course, as per needs). These customized images come with all useful libraries pre-installed for the desired framework, plus provide the flexibility to install any third-party packages within the instance. After choosing the appropriate image, we get more options to customize things such as notebook name, notebook region, operating system, environment, machine types, accelerators, and so on (see the following screenshot):  Figure 4.5 – Configuring a new user-managed Jupyter Notebook Once we click on the CREATE button, it can take a couple of minutes to create a notebook instance. Once it is ready, we can launch the Jupyter instance in a browser tab using the link provided inside Workbench (see Figure 4.6). We also get the option to stop the notebook for some time when we are not using it (to reduce cost):  Figure 4.6 – A running Jupyter Notebook instance This Jupyter instance can be accessed by all team members having access to Workbench, which helps in collaborating and sharing progress with other teammates. Once we click on OPEN JUPYTERLAB, it opens a familiar Jupyter environment in a new tab (see Figure 4.7):  Figure 4.7 – A user-managed JupyterLab instance in Vertex AI Workbench A Google-managed JupyterLab instance also looks very similar (see Figure 4.8):  Figure 4.8 – A Google-managed JupyterLab instance in Vertex AI Workbench Now that we can access the notebook instance in the browser, we can launch a new Jupyter Notebook or terminal and get started on the project. After providing sufficient permissions to the service account, many useful Google Cloud services such as BigQuery, GCS, Dataflow, and so on can be accessed from the Jupyter Notebook itself using SDKs. This makes Vertex AI Workbench a one-stop tool for every ML development need. Note: We should stop Vertex AI Workbench instances when we are not using them or don’t plan to use them for a long period of time. This will help prevent us from incurring costs from running them unnecessarily for a long period of time. In the next sections, we will learn how to create notebooks using custom containers and how to schedule notebooks with Vertex AI Workbench. Custom containers for Vertex AI Workbench Vertex AI Workbench gives us the flexibility of creating notebook instances based on a custom container as well. The main advantage of a custom container-based notebook is that it lets us customize the notebook environment based on our specific needs. Suppose we want to work with a new TensorFlow version (or any other library) that is currently not available as a predefined kernel. We can create a custom Docker container with the required version and launch a Workbench instance using this container. Custom containers are supported by both managed and user-managed notebooks. Here is how to launch a user-managed notebook instance using a custom container: 1. The first step is to create a custom container based on the requirements. Most of the time, a derivative container (a container based on an existing DL container image) would be easy to set up. See the following example Dockerfile; here, we are first pulling an existing TensorFlow GPU image and then installing a new TensorFlow version from the source: FROM gcr.io/deeplearning-platform-release/tf-gpu:latest RUN pip install -y tensorflow2. Next, build and push the container image to Container Registry, such that it should be accessible to the Google Compute Engine (GCE) service account. See the following source to build and push the container image: export PROJECT=$(gcloud config list project --format "value(core.project)") docker build . -f Dockerfile.example -t "gcr.io/${PROJECT}/ tf-custom:latest" docker push "gcr.io/${PROJECT}/tf-custom:latest"Note that the service account should be provided with sufficient permissions to build and push the image to the container registry, and the respective APIs should be enabled. 3. Go to the User-managed notebooks page, click on the New Notebook button, and then select Customize. Provide a notebook name and select an appropriate Region and Zone value. 4. In the Environment field, select Custom Container. 5. In the Docker Container Image field, enter the address of the custom image; in our case, it would look like this: gcr.io/${PROJECT}/tf-custom:latest 6. Make the remaining appropriate selections and click the Create button. We are all set now. While launching the notebook, we can select the custom container as a kernel and start working on the custom environment. Conclusion Vertex AI Workbench stands out as a powerful, cloud-based environment that streamlines machine learning development and deployment. By leveraging its managed and user-managed notebook options, teams can overcome local development limitations, ensuring better scalability, enhanced security, and integrated access to Google Cloud services. This guide has explored the foundational aspects of working with Vertex AI Workbench, including its customizable environments, scheduling features, and the use of custom containers. With Vertex AI Workbench, data scientists and ML practitioners can focus on innovation and productivity, confidently handling projects from inception to production. Author BioJasmeet Bhatia is a machine learning solution architect with over 18 years of industry experience, with the last 10 years focused on global-scale data analytics and machine learning solutions. In his current role at Google, he works closely with key GCP enterprise customers to provide them guidance on how to best use Google's cutting-edge machine learning products. At Google, he has also worked as part of the Area 120 incubator on building innovative data products such as Demand Signals, and he has been involved in the launch of Google products such as Time Series Insights. Before Google, he worked in similar roles at Microsoft and Deloitte.When not immersed in technology, he loves spending time with his wife and two daughters, reading books, watching movies, and exploring the scenic trails of southern California.He holds a bachelor's degree in electronics engineering from Jamia Millia Islamia University in India and an MBA from the University of California Los Angeles (UCLA) Anderson School of Management.Kartik Chaudhary is an AI enthusiast, educator, and ML professional with 6+ years of industry experience. He currently works as a senior AI engineer with Google to design and architect ML solutions for Google's strategic customers, leveraging core Google products, frameworks, and AI tools. He previously worked with UHG, as a data scientist, and helped in making the healthcare system work better for everyone. Kartik has filed nine patents at the intersection of AI and healthcare.Kartik loves sharing knowledge and runs his own blog on AI, titled Drops of AI.Away from work, he loves watching anime and movies and capturing the beauty of sunsets.
Read more
  • 0
  • 0
  • 109

article-image-essential-sql-for-data-engineers
Kedeisha Bryan, Taamir Ransome
31 Oct 2024
10 min read
Save for later

Essential SQL for Data Engineers

Kedeisha Bryan, Taamir Ransome
31 Oct 2024
10 min read
This article is an excerpt from the book, Cracking the Data Engineering Interview, by Kedeisha Bryan, Taamir Ransome. The book is a practical guide that’ll help you prepare to successfully break into the data engineering role. The chapters cover technical concepts as well as tips for resume, portfolio, and brand building to catch the employer's attention, while also focusing on case studies and real-world interview questions.Introduction In the world of data engineering, SQL is the unsung hero that empowers us to store, manipulate, transform, and migrate data easily. It is the language that enables data engineers to communicate with databases, extract valuable insights, and shape data to meet their needs. Regardless of the nature of the organization or the data infrastructure in use, a data engineer will invariably need to use SQL for creating, querying, updating, and managing databases. As such, proficiency in SQL can often the difference between a good data engineer and a great one. Whether you are new to SQL or looking to brush up your skills, this chapter will serve as a comprehensive guide. By the end of this chapter, you will have a solid understanding of SQL as a data engineer and be prepared to showcase your knowledge and skills in an interview setting. In this article, we will cover the following topics: Must-know foundational SQL concepts Must-know advanced SQL concepts Technical interview questions Must-know foundational SQL concepts In this section, we will delve into the foundational SQL concepts that form the building blocks of data engineering. Mastering these fundamental concepts is crucial for acing SQL-related interviews and effectively working with databases. Let’s explore the critical foundational SQL concepts every data engineer should be comfortable with, as follows: SQL syntax: SQL syntax is the set of rules governing how SQL statements should be written. As a data engineer, understanding SQL syntax is fundamental because you’ll be writing and reviewing SQL queries regularly. These queries enable you to extract, manipulate, and analyze data stored in relational databases. SQL order of operations: The order of operations dictates the sequence in which each of the following operators is executed in a query: FROM and JOIN WHERE GROUP BY HAVING SELECT DISTINCT ORDER BY LIMIT/OFFSET Data types: SQL supports a variety of data types, such as INT, VARCHAR, DATE, and so on. Understanding these types is crucial because they determine the kind of data that can be stored in a column, impacting storage considerations, query performance, and data integrity. As a data engineer, you might also need to convert data types or handle mismatches. SQL operators: SQL operators are used to perform operations on data. They include arithmetic operators (+, -, *, /), comparison operators (>, <, =, and so on), and logical operators (AND, OR, and NOT). Knowing these operators helps you construct complex queries to solve intricate data-related problems. Data Manipulation Language (DML), Data Definition Language (DDL), and Data Control  Language (DCL) commands: DML commands such as SELECT, INSERT, UPDATE, and DELETE allow you to manipulate data stored in the database. DDL commands such as CREATE, ALTER, and DROP enable you to manage database schemas. DCL commands such as GRANT and REVOKE are used for managing permissions. As a data engineer, you will frequently use these commands to interact with databases. Basic queries: Writing queries to select, filter, sort, and join data is an essential skill for any data engineer. These operations form the basis of data extraction and manipulation. Aggregation functions: Functions such as COUNT, SUM, AVG, MAX, MIN, and GROUP BY are used to perform calculations on multiple rows of data. They are essential for generating reports and deriving statistical insights, which are critical aspects of a data engineer’s role. The following section will dive deeper into must-know advanced SQL concepts, exploring advanced techniques to elevate your SQL proficiency. Get ready to level up your SQL game and unlock new possibilities in data engineering! Must-know advanced SQL concepts This section will explore advanced SQL concepts that will elevate your data engineering skills to the next level. These concepts will empower you to tackle complex data analysis, perform advanced data transformations, and optimize your SQL queries. Let’s delve into must-know advanced SQL concepts, as follows: Window functions: These do a calculation on a group of rows that are related to the current row. They are needed for more complex analyses, such as figuring out running totals or moving averages, which are common tasks in data engineering. Subqueries: Queries nested within other queries. They provide a powerful way to perform complex data extraction, transformation, and analysis, often making your code more efficient and readable. Common Table Expressions (CTEs): CTEs can simplify complex queries and make your code more maintainable. They are also essential for recursive queries, which are sometimes necessary for problems involving hierarchical data. Stored procedures and triggers: Stored procedures help encapsulate frequently performed tasks, improving efficiency and maintainability. Triggers can automate certain operations, improving data integrity. Both are important tools in a data engineer’s toolkit. Indexes and optimization: Indexes speed up query performance by enabling the database to locate data more quickly. Understanding how and when to use indexes is key for a data engineer, as it affects the efficiency and speed of data retrieval. Views: Views simplify access to data by encapsulating complex queries. They can also enhance security by restricting access to certain columns. As a data engineer, you’ll create and manage views to facilitate data access and manipulation. By mastering these advanced SQL concepts, you will have the tools and knowledge to handle complex data scenarios, optimize your SQL queries, and derive meaningful insights from your datasets. The following section will prepare you for technical interview questions on SQL. We will equip you with example answers and strategies to excel in SQL-related interview discussions. Let’s further enhance your SQL expertise and be well prepared for the next phase of your data engineering journey. Technical interview questions This section will address technical interview questions specifically focused on SQL for data engineers. These questions will help you demonstrate your SQL proficiency and problem-solving abilities. Let’s explore a combination of primary and advanced SQL interview questions and the best methods to approach and answer them, as follows: Question 1: What is the difference between the WHERE and HAVING clauses? Answer: The WHERE clause filters data based on conditions applied to individual rows, while the HAVING clause filters data based on grouped results. Use WHERE for filtering before aggregating data and HAVING for filtering after aggregating data. Question 2: How do you eliminate duplicate records from a result set? Answer: Use the DISTINCT keyword in the SELECT statement to eliminate duplicate records and retrieve unique values from a column or combination of columns. Question 3: What are primary keys and foreign keys in SQL? Answer: A primary key uniquely identifies each record in a table and ensures data integrity. A foreign key establishes a link between two tables, referencing the primary key of another table to enforce referential integrity and maintain relationships. Question 4: How can you sort data in SQL? Answer: Use the ORDER BY clause in a SELECT statement to sort data based on one or more columns. The ASC (ascending) keyword sorts data in ascending order, while the DESC (descending) keyword sorts it in descending order. Question 5: Explain the difference between UNION and UNION ALL in SQL. Answer: UNION combines and removes duplicate records from the result set, while UNION ALL combines all records without eliminating duplicates. UNION ALL is faster than UNION because it does not involve the duplicate elimination process. Question 6: Can you explain what a self join is in SQL? Answer: A self join is a regular join where a table is joined to itself. This is often useful when the data is related within the same table. To perform a self join, we have to use table aliases to help SQL distinguish the left from the right table. Question 7: How do you optimize a slow-performing SQL query? Answer: Analyze the query execution plan, identify bottlenecks, and consider strategies such as creating appropriate indexes, rewriting the query, or using query optimization techniques such as JOIN order optimization or subquery optimization.  Question 8: What are CTEs, and how do you use them? Answer: CTEs are temporarily named result sets that can be referenced within a query. They enhance query readability, simplify complex queries, and enable recursive queries. Use the WITH keyword to define CTEs in SQL. Question 9: Explain the ACID properties in the context of SQL databases. Answer: ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. These are basic properties that make sure database operations are reliable and transactional. Atomicity makes sure that a transaction is handled as a single unit, whether it is fully done or not. Consistency makes sure that a transaction moves the database from one valid state to another. Isolation makes sure that transactions that are happening at the same time don’t mess with each other. Durability makes sure that once a transaction is committed, its changes are permanent and can survive system failures. Question 10: How can you handle NULL values in SQL? Answer: Use the IS NULL or IS NOT NULL operator to check for NULL values. Additionally, you can use the COALESCE function to replace NULL values with alternative non-null values. Question 11: What is the purpose of stored procedures and functions in SQL? Answer: Stored procedures and functions are reusable pieces of SQL code encapsulating a set of SQL statements. They promote code modularity, improve performance, enhance security, and simplify database maintenance. Question 12: Explain the difference between a clustered and a non-clustered index. Answer: The physical order of the data in a table is set by a clustered index. This means that a table can only have one clustered index. The data rows of a table are stored in the leaf nodes of a clustered index. A non-clustered index, on the other hand, doesn’t change the order of the data in the table. After sorting the pointers, it keeps a separate object in a table that points back to the original table rows. There can be more than one non-clustered index for a table. Prepare for these interview questions by understanding the underlying concepts, practicing SQL queries, and being able to explain your answers. ConclusionThis article explored the foundational and advanced principles of SQL that empower data engineers to store, manipulate, transform, and migrate data confidently. Understanding these concepts has unlocked the door to seamless data operations, optimized query performance, and insightful data analysis. SQL is the language that bridges the gap between raw data and valuable insights. With a solid grasp of SQL, you possess the skills to navigate databases, write powerful queries, and design efficient data models. Whether preparing for interviews or tackling real-world data engineering challenges, the knowledge you have gained in this chapter will propel you toward success. Remember to continue exploring and honing your SQL skills. Stay updated with emerging SQL technologies, best practices, and optimization techniques to stay at the forefront of the ever-evolving data engineering landscape. Embrace the power of SQL as a critical tool in your data engineering arsenal, and let it empower you to unlock the full potential of your data. Author BioKedeisha Bryan is a data professional with experience in data analytics, science, and engineering. She has prior experience combining both Six Sigma and analytics to provide data solutions that have impacted policy changes and leadership decisions. She is fluent in tools such as SQL, Python, and Tableau.She is the founder and leader at the Data in Motion Academy, providing personalized skill development, resources, and training at scale to aspiring data professionals across the globe. Her other works include another Packt book in the works and an SQL course for LinkedIn Learning.Taamir Ransome is a Data Scientist and Software Engineer. He has experience in building machine learning and artificial intelligence solutions for the US Army. He is also the founder of the Vet Dev Institute, where he currently provides cloud-based data solutions for clients. He holds a master's degree in Analytics from Western Governors University.
Read more
  • 1
  • 0
  • 102

article-image-scientific-analysis-of-donald-trumps-tweets-on-covid-19-with-transformers
Expert Network
19 May 2021
7 min read
Save for later

Scientific Analysis of Donald Trump’s Tweets on COVID-19 with Transformers

Expert Network
19 May 2021
7 min read
It takes time and effort to figure out what is fake news and what isn't. Like children, we have to work our way through something we perceive as fake news. This article is an excerpt from the book Transformers for Natural Language Processing by Denis Rothman – A comprehensive guide for deep learning & NLP practitioners, data analysts and data scientists who want an introduction to AI language understanding to process the increasing amounts of language-driven functions.  In this article, we will focus on the logic of fake news. We will run the BERT model on SRL and visualize the results on AllenNLP.org. Now, let's go through some presidential tweets on COVID-19.  Our goal is certainly not to judge anybody or anything. Fake news involves both opinion and facts. News often depends on the perception of facts by local culture. We will provide ideas and tools to help others gather more information on a topic and find their way in the jungle of information we receive every day. Semantic Role Labeling (SRL)   SRL is an excellent educational tool for all of us. We tend just to read Tweets passively and listen to what others say about them. Breaking messages down with SRL is a good way to develop social media analytical skills to distinguish fake from accurate information.   I recommend using SRL transformers for educational purposes in class. A young student can enter a Tweet and analyze each verb and its arguments. It could help younger generations become active readers on social media. We will first analyze a relatively undivided Tweet and then a conflictual Tweet. Analyzing the undivided Tweet  Let's analyze the latest Tweet found on July 4 while writing the book, Transformers for Natural Language Processing. I took the name of the person who is referred to as a "Black American" out and paraphrased some of the former President's text:   "X is a great American, is hospitalized with coronavirus, and has requested prayer. Would you join me in praying for him today, as well as all those who are suffering from COVID-19?"    Let's go to AllenNLP.org, visualize our SRL using https://demo.allennlp.org/semantic-role-labeling, run the sentence, and look at the result. The verb "hospitalized" shows the member is staying close to the facts:   Figure: SRL arguments of the verb "hospitalized"   The message is simple: "X" + "hospitalized" + "coronavirus."   The verb "requested" shows that the message is becoming political:   Figure: SRL arguments of the verb "requested"   We don't know if the person requested the former President to pray or he decided he would be the center of the request.   A good exercise would be to display an HTML page and ask the users what they think. For example, the users could be asked to look at the results of the SRL task and answer the two following questions:   "Was former President Trump asked to pray, or did he deviate a request made to others for political reasons?"   "Is the fact that former President Trump states that he was indirectly asked to pray for X fake news or not?"  You can think about it and decide for yourself!   Analyzing the Banned Tweet Let's have a look at one that was banned from Twitter. I took the names out and paraphrased it and toned it down. Still, when we run it on AllenNLP.org and visualize the results, we get some surprising SRL outputs.   Here is the toned-down and paraphrased Tweet:   These thugs are dishonoring the memory of X.   When the looting starts, actions must be taken.   Although I suppressed the main part of the original Tweet, we can see that the SRL task shows the bad associations made in the Tweet:   Figure: SRL arguments of the verb "dishonoring"   An educational approach to this would be to explain that we should not associate the arguments "thugs" and "memory" and "looting." They do not fit together at all.   An important exercise would be to ask a user why the SRL arguments do not fit together.   I recommend many such exercises so that the transformer model users develop SRL skills to have a critical view of any topic presented to them.   Critical thinking is the best way to stop the propagation of the fake news pandemic!   We have gone through rational approaches to fake news with transformers, heuristics, and instructive websites. However, in the end, a lot of the heat in fake news debates boils down to emotional and irrational reactions.   In a world of opinion, you will never find an entirely objective transformer model that detects fake news since opposing sides never agree on what the truth is in the first place! One side will agree with the transformer model's output. Another will say that the model is biased and built by enemies of their opinion!   The best approach is to listen to others and try to keep the heat down!       Looking for the silver bullet   Looking for a silver bullet transformer model can be time-consuming or rewarding, depending on how much time and money you want to spend on continually changing models.   For example, a new approach to transformers can be found through disentanglement. Disentanglement in AI allows you to separate the features of a representation to make the training process more flexible. Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen designed DeBERTa, a disentangled version of a transformer, and described the model in an interesting article:   DeBERTa: Decoding-enhanced BERT with Disentangled Attention, https://arxiv.org/ abs/2006.03654 The two main ideas implemented in DeBERTa are:   Disentangle the content and position in the transformer model to train the two vectors separately.  Use an absolute position in thedecoderto predict masked tokens in the pretraining process.   The authors provide the code on GitHub: https://github.com/microsoft/DeBERTa DeBERTa exceeds the human baseline on the SuperGLUE leaderboard in December 2020 using 1.5B parameters.   Should you stop everything you are doing on transformers and rush to this model, integrate your data, train the model, test it, and implement it?   It is very probable that by the end of 2021, another model will beat this one and so on. Should you change models all of the time in production? That will be your decision.   You can also choose to design better training methods.   Looking for reliable training methods   Looking for reliable training methods with smaller models such as the PET designed by Timo Schick can also be a solution.   Why? Being in a good position on the SuperGLUE leaderboard does not mean that the model will provide a high quality of decision-making for medical, legal, and other critical areas for sequence predications.   Looking for customized training solutions for a specific topic could be more productive than trying all the best transformers on the SuperGLUE leaderboard.   Take your time to think about implementing transformers to find the best approach for your project.   We will now conclude the article.   Summary   Fake news begins deep inside our emotional history as humans. When an event occurs, emotions take over to help us react quickly to a situation. We are hardwired to react strongly when we are threatened.   We went through raging conflicts over COVID-19, former President Trump, and climate change. In each case, we saw that emotional reactions are the fastest ones to build up into conflicts.   We then designed a roadmap to take the emotional perception of fake news to a rational level. We showed that it is possible to find key information in Tweets, Facebook messages, and other media. The news used in this article is perceived by some as real news and others as fake news to create a rationale for teachers, parents, friends, co-workers, or just people talking.  About the Author Denis Rothman graduated from Sorbonne University and Paris-Diderot University, patenting one of the very first word2matrix embedding solutions. Denis Rothman is the author of three cutting-edge AI solutions: one of the first AI cognitive chatbots more than 30 years ago; a profit-orientated AI resource optimizing system; and an AI APS (Advanced Planning and Scheduling) solution based on cognitive patterns used worldwide in aerospace, rail, energy, apparel, and many other fields. Designed initially as a cognitive AI bot for IBM, it then went on to become a robust APS solution used to this day. 
Read more
  • 0
  • 0
  • 5247

article-image-distributed-training-in-tensorflow-2-x
Expert Network
30 Apr 2021
7 min read
Save for later

Distributed training in TensorFlow 2.x

Expert Network
30 Apr 2021
7 min read
TensorFlow 2 is a rich development ecosystem composed of two main parts: Training and Serving. Training consists of a set of libraries for dealing with datasets (tf.data), a set of libraries for building models, including high-level libraries (tf.Keras and Estimators), low-level libraries (tf.*), and a collection of pretrained models (tf.Hub). Training can happen on CPUs, GPUs, and TPUs via distribution strategies and the result can be saved using the appropriate libraries.  This article is an excerpt from the book, Deep Learning with TensorFlow 2 and Keras, Second Edition by Antonio Gulli, Amita Kapoor, and Sujit Pal. This book teaches deep learning techniques alongside TensorFlow (TF) and Keras. In this article, we’ll review the addition of the powerful new feature, distributed training, in TensorFlow 2.x.  One very useful addition to TensorFlow 2.x is the possibility to train models using distributed GPUs, multiple machines, and TPUs in a very simple way with very few additional lines of code. tf.distribute.Strategy is the TensorFlow API used in this case and it supports both tf.keras and tf.estimator APIs and eager execution. You can switch between GPUs, TPUs, and multiple machines by just changing the strategy instance. Strategies can be synchronous, where all workers train over different slices of input data in a form of sync data parallel computation, or asynchronous, where updates from the optimizers are not happening in sync. All strategies require that data is loaded in batches via the tf.data.Dataset api.  Note that the distributed training support is still experimental. A roadmap is given in Figure 1:  Figure 1: Distributed training support fr different strategies and APIs  Let’s discuss in detail all the different strategies reported in Figure 1.  Multiple GPUs  TensorFlow 2.x can utilize multiple GPUs. If we want to have synchronous distributed training on multiple GPUs on one machine, there are two things that we need to do: (1) We need to load the data in a way that will be distributed into the GPUs, and (2) We need to distribute some computations into the GPUs too:  In order to load our data in a way that can be distributed into the GPUs, we simply need tf.data.Dataset (which has already been discussed in the previous paragraphs). If we do not have a tf.data.Dataset but we have a normal tensor, then we can easily convert the latter into the former using tf.data.Dataset.from_tensors_slices(). This will take a tensor in memory and return a source dataset, the elements of which are slices of the given tensor. In our toy example, we use NumPy to generate training data x and labels y, and we transform it into tf.data.Dataset with tf.data.Dataset.from_tensor_slices(). Then we apply a shuffle to avoid bias in training across GPUs and then generate SIZE_BATCHES batches:  import tensorflow as tf import numpy as np from tensorflow import keras N_TRAIN_EXAMPLES = 1024*1024 N_FEATURES = 10 SIZE_BATCHES = 256  # 10 random floats in the half-open interval [0.0, 1.0). x = np.random.random((N_TRAIN_EXAMPLES, N_FEATURES)) y = np.random.randint(2, size=(N_TRAIN_EXAMPLES, 1)) x = tf.dtypes.cast(x, tf.float32) print (x) dataset = tf.data.Dataset.from_tensor_slices((x, y)) dataset = dataset.shuffle(buffer_size=N_TRAIN_EXAMPLES).batch(SIZE_BATCHES) In order to distribute some computations to GPUs, we instantiate a distribution = tf.distribute.MirroredStrategy() object, which supports synchronous distributed training on multiple GPUs on one machine. Then, we move the creation and compilation of the Keras model inside the strategy.scope(). Note that each variable in the model is mirrored across all the replicas. Let’s see it in our toy example: # this is the distribution strategy distribution = tf.distribute.MirroredStrategy() # this piece of code is distributed to multiple GPUs with distribution.scope(): model = tf.keras.Sequential()   model.add(tf.keras.layers.Dense(16, activation=‘relu’, input_shape=(N_FEATURES,)))   model.add(tf.keras.layers.Dense(1, activation=‘sigmoid’))   optimizer = tf.keras.optimizers.SGD(0.2)   model.compile(loss=‘binary_crossentropy’, optimizer=optimizer) model.summary()  # Optimize in the usual way but in reality you are using GPUs. model.fit(dataset, epochs=5, steps_per_epoch=10)  Note that each batch of the given input is divided equally among the multiple GPUs. For instance, if using MirroredStrategy() with two GPUs, each batch of size 256 will be divided among the two GPUs, with each of them receiving 128 input examples for each step. In addition, note that each GPU will optimize on the received batches and the TensorFlow backend will combine all these independent optimizations on our behalf. In short, using multiple GPUs is very easy and requires minimal changes to the tf.Keras code used for a single server.  MultiWorkerMirroredStrategy  This strategy implements synchronous distributed training across multiple workers, each one with potentially multiple GPUs. As of September 2019 the strategy works only with Estimators and it has experimental support for tf.Keras. This strategy should be used if you are aiming at scaling beyond a single machine with high performance. Data must be loaded with tf.Dataset and shared across workers so that each worker can read a unique subset.  TPUStrategy  This strategy implements synchronous distributed training on TPUs. TPUs are Google’s specialized ASICs chips designed to significantly accelerate machine learning workloads in a way often more efficient than GPUs. According to this public information (https://github.com/tensorflow/tensorflow/issues/24412):  “the gist is that we intend to announce support for TPUStrategy alongside Tensorflow 2.1. Tensorflow 2.0 will work under limited use-cases but has many improvements (bug fixes, performance improvements) that we’re including in Tensorflow 2.1, so we don’t consider it ready yet.”  ParameterServerStrategy  This strategy implements either multi-GPU synchronous local training or asynchronous multi-machine training. For local training on one machine, the variables of the models are placed on the CPU and operations are replicated across all local GPUs. For multi-machine training, some machines are designated as workers and some as parameter servers with the variables of the model placed on parameter servers. Computation is replicated across all GPUs of all workers. Multiple workers can be set up with the environment variable TF_CONFIG as in the following example:  os.environ[“TF_CONFIG”] = json.dumps({    “cluster”: {        “worker”: [“host1:port”, “host2:port”, “host3:port”],         “ps”: [“host4:port”, “host5:port”]    },    “task”: {“type”: “worker”, “index”: 1} })  In this article, we have seen how it is possible to train models using distributed GPUs, multiple machines, and TPUs in a very simple way with very few additional lines of code. Learn how to build machine and deep learning systems with the newly released TensorFlow 2 and Keras for the lab, production, and mobile devices with Deep Learning with TensorFlow 2 and Keras, Second Edition by Antonio Gulli, Amita Kapoor and Sujit Pal.  About the Authors  Antonio Gulli is a software executive and business leader with a passion for establishing and managing global technological talent, innovation, and execution. He is an expert in search engines, online services, machine learning, information retrieval, analytics, and cloud computing.   Amita Kapoor is an Associate Professor in the Department of Electronics, SRCASW, University of Delhi and has been actively teaching neural networks and artificial intelligence for the last 20 years. She is an active member of ACM, AAAI, IEEE, and INNS. She has co-authored two books.   Sujit Pal is a technology research director at Elsevier Labs, working on building intelligent systems around research content and metadata. His primary interests are information retrieval, ontologies, natural language processing, machine learning, and distributed processing. He is currently working on image classification and similarity using deep learning models. He writes about technology on his blog at Salmon Run. 
Read more
  • 0
  • 0
  • 6120

article-image-how-to-create-tensors-in-pytorch
Expert Network
20 Apr 2021
6 min read
Save for later

How to Create Tensors in PyTorch

Expert Network
20 Apr 2021
6 min read
A tensor is the fundamental building block of all DL toolkits. The name sounds rather mystical, but the underlying idea is that a tensor is a multi-dimensional array. Building analogy with school math, one single number is like a point, which is zero-dimensional, while a vector is one-dimensional like a line segment, and a matrix is a two-dimensional object. Three-dimensional number collections can be represented by a parallelepiped of numbers, but they don't have a separate name in the same way as a matrix. We can keep this term for collections of higher dimensions, which are named multi-dimensional arrays.  Another thing to note about tensors used in DL is that they are only partially related to tensors used in tensor calculus or tensor algebra. In DL, tensor is any multi-dimensional array, but in mathematics, tensor is a mapping between vector spaces, which might be represented as a multi-dimensional array in some cases but has much more semantical payload behind it. Mathematicians usually frown at everybody who uses well-established mathematical terms to name different things, so, be warned.  Figure 1: Going from a single number to an n-dimension tensor This article is an excerpt from the book Deep Reinforcement Learning Hands-On - Second Edition by Maxim Lapan. This book is an updated and expanded version of the bestselling guide to the very latest RL tools and techniques. In this article, we’ll discuss the fundamental building block of all DL toolkits, tensor.  Creation of tensors  If you're familiar with the NumPy library, then you already know that its central purpose is the handling of multi-dimensional arrays in a generic way. In NumPy, such arrays aren't called tensors, but they are in fact tensors. Tensors are used very widely in scientific computations as generic storage for data. For example, a color image could be encoded as a 3D tensor with dimensions of width, height, and color plane.  Apart from dimensions, a tensor is characterized by the type of its elements. There are eight types supported by PyTorch: three float types (16-bit, 32-bit, and 64-bit) and five integer types (8-bit signed, 8-bit unsigned, 16-bit, 32-bit, and 64-bit). Tensors of different types are represented by different classes, with the most commonly used being torch.FloatTensor (corresponding to a 32-bit float), torch.ByteTensor (an 8-bit unsigned integer), and torch.LongTensor (a 64-bit signed integer). The rest can be found in the PyTorch documentation.  There are three ways to create a tensor in PyTorch:  By calling a constructor of the required type.  By converting a NumPy array or a Python list into a tensor. In this case, the type will be taken from the array's type.  By asking PyTorch to create a tensor with specific data for you. For example, you can use the torch.zeros() function to create a tensor filled with zero values.  To give you examples of these methods, let's look at a simple session:  >>> import torch >>> import numpy as np >>> a = torch.FloatTensor(3, 2) >>> a tensor([[4.1521e+09,  4.5796e-41],        [ 1.9949e-20, 3.0774e-41],        [ 4.4842e-44, 0.0000e+00]]) Here, we imported both PyTorch and NumPy and created an uninitialized tensor of size 3×2. By default, PyTorch allocates memory for the tensor, but doesn't initialize it with anything. To clear the tensor's content, we need to use its operation:  >> a.zero_() tensor([[ 0., 0.],         [ 0., 0.],         [ 0., 0.]]) There are two types of operation for tensors: inplace and functional. Inplace operations have an underscore appended to their name and operate on the tensor's content. After this, the object itself is returned. The functional equivalent creates a copy of the tensor with the performed modification, leaving the original tensor untouched. Inplace operations are usually more efficient from a performance and memory point of view.  Another way to create a tensor by its constructor is to provide a Python iterable (for example, a list or tuple), which will be used as the contents of the newly created tensor:  >>> torch.FloatTensor([[1,2,3],[3,2,1]]) tensor([[ 1., 2., 3.],         [ 3., 2., 1.]])  Here we are creating the same tensor with zeroes using NumPy:  >>> n = np.zeros(shape=(3, 2)) >>> n array([[ 0., 0.],        [ 0., 0.],        [ 0., 0.]]) >>> b = torch.tensor(n) >>> b tensor([[ 0., 0.],         [ 0., 0.],         [ 0., 0.]], dtype=torch.float64)  The torch.tensor method accepts the NumPy array as an argument and creates a tensor of appropriate shape from it. In the preceding example, we created a NumPy array initialized by zeros, which created a double (64-bit float) array by default. So, the resulting tensor has the DoubleTensor type (which is shown in the preceding example with the dtype value). Usually, in DL, double precision is not required and it adds an extra memory and performance overhead. The common practice is to use the 32-bit float type, or even the 16-bit float type, which is more than enough. To create such a tensor, you need to specify explicitly the type of NumPy array: >>> n = np.zeros(shape=(3, 2), dtype=np.float32) >>> torch.tensor(n) tensor([[ 0., 0.],         [ 0., 0.],         [ 0., 0.]])  As an option, the type of the desired tensor could be provided to the torch.tensor function in the dtype argument. However, be careful, since this argument expects to get a PyTorch type specification, not the NumPy one. PyTorch types are kept in the torch package, for example, torch.float32 and torch.uint8.  >>> n = np.zeros(shape=(3,2)) >>> torch.tensor(n, dtype=torch.float32) tensor([[ 0., 0.],         [ 0., 0.],         [ 0., 0.]])  In this article, we saw a quick overview of tensor, the fundamental building block of all DL toolkits. We talked about tensor and how to create it in the PyTorch library. Discover ways to increase efficiency of RL methods both from theoretical and engineering perspective with the book Deep Reinforcement Learning Hands-on, Second Edition by Maxim Lapan.   About the Author  Maxim Lapan is a deep learning enthusiast and independent researcher. He has spent 15 years working as a software developer and systems architect. His projects have ranged from low-level Linux kernel driver development to performance optimization and the design of distributed applications working on thousands of servers.   With his areas of expertise including big data, machine learning, and large parallel distributed HPC and non-HPC systems, Maxim is able to explain complicated concepts using simple words and vivid examples. His current areas of interest are practical applications of deep learning, such as deep natural language processing and deep reinforcement learning. Maxim lives in Moscow, Russian Federation, with his family.  
Read more
  • 0
  • 0
  • 23780

article-image-5-reasons-why-you-should-use-an-open-source-data-analytics-stack-in-2020
Amey Varangaonkar
28 Jan 2020
7 min read
Save for later

5 reasons why you should use an open-source data analytics stack in 2020

Amey Varangaonkar
28 Jan 2020
7 min read
Today, almost every company is trying to be data-driven in some sense or the other. Businesses across all the major verticals such as healthcare, telecommunications, banking, insurance, retail, education, etc. make use of data to better understand their customers, optimize their business processes and, ultimately, maximize their profits. This is a guest post sponsored by our friends at RudderStack. When it comes to using data for analytics, companies face two major challenges: Data tracking: Tracking the required data from a multitude of sources in order to get insights out of it. As an example, tracking customer activity data such as logins, signups, purchases, and even clicks such as bookmarks from platforms such as mobile apps and websites becomes an issue for many eCommerce businesses. Building a link between the Data and Business Intelligence: Once data is acquired, transforming it and making it compatible for a BI tool can often prove to be a substantial challenge. A well designed data analytics stack comes is essential in combating these challenges. It will ensure you're well-placed to use the data at your disposal in more intelligent ways. It will help you drive more value. What does a data analytics stack do? A data analytics stack is a combination of tools which when put together, allows you to bring together all of your data in one platform, and use it to get actionable insights that help in better decision-making. As seen the diagram above illustrates, a data analytics stack is built upon three fundamental steps: Data Integration: This step involves collecting and blending data from multiple sources and transforming them in a compatible format, for storage. The sources could be as varied as a database (e.g. MySQL), an organization’s log files, or event data such as clicks, logins, bookmarks, etc from mobile apps or websites. A data analytics stack allows you to use all of such data together and use it to perform meaningful analytics. Data Warehousing: This next step involves storing the data for the purpose of analytics. As the complexity of data grows, it is feasible to consolidate all the data in a single data warehouse. Some of the popular modern data warehouses include Amazon’s Redshift, Google BigQuery and platforms such as Snowflake and MarkLogic. Data Analytics: In this final step, we use a visualization tool to load the data from the warehouse and use it to extract meaningful insights and patterns from the data, in the form of charts, graphs and reports. Choosing a data analytics stack - proprietary or open-source? When it comes to choosing a data analytics stack, businesses are often left with two choices - buy it or build it. On one hand, there are proprietary tools such as Google Analytics, Amplitude, Mixpanel, etc. - where the vendors alone are responsible for their configuration and management to suit your needs. With the best in class features and services that come along with the tools, your primary focus can just be project management, rather than technology management. While using proprietary tools have their advantages, there are also some major cons to them that revolve mainly around cost, data sharing, privacy concerns, and more. As a result, businesses today are increasingly exploring the open-source alternatives to build their data analytics stack. The advantages of open source analytics tools Let's now look at the 5 main advantages that open-source tools have over these proprietary tools. Open source analytics tools are cost effective Proprietary analytics products can cost hundreds of thousands of dollars beyond their free tier. For small to medium-sized businesses, the return on investment does not often justify these costs. Open-source tools are free to use and even their enterprise versions are reasonably priced compared to their proprietary counterparts. So, with a lower up-front costs, reasonable expenses for training, maintenance and support, and no cost for licensing, open-source analytics tools are much more affordable. More importantly, they're better value for money. Open source analytics tools provide flexibility Proprietary SaaS analytics products will invariably set restrictions on the ways in which they can be used. This is especially the case with the trial or the lite versions of the tools, which are free. For example, full SQL is not supported by some tools. This makes it hard to combine and query external data alongside internal data. You'll also often find that warehouse dumps provide no support either. And when they do, they'll probably cost more and still have limited functionality. Data dumps from Google Analytics, for instance, can only be loaded into Google BigQuery. Also, these dumps are time-delayed. That means the loading process can be very slow.. With open-source software, you get complete flexibility: from the way you use your tools, how you combine to build your stack, and even how you use your data. If your requirements change - which, let's face it, they probably will - you can make the necessary changes without paying extra for customized solutions. Avoid vendor lock-in Vendor lock-in, also known as proprietary lock-in, is essentially a state where a customer becomes completely dependent on the vendor for their products and services. The customer is unable to switch to another vendor without paying a significant switching cost. Some organizations spend a considerable amount of money on proprietary tools and services that they heavily rely on. If these tools aren't updated and properly maintained, the organization using it is putting itself at a real competitive disadvantage. This is almost never the case with open-source tools. Constant innovation and change is the norm. Even if the individual or the organization handling the tool moves on, the community catn take over the project and maintain it. With open-source, you can rest assured that your tools will always be up-to-date without heavy reliance on anyone. Improved data security and privacy Privacy has become a talking point in many data-related discussions of late. This is thanks, in part, to data protection laws such as the GDPR and CCPA coming into force. High-profile data leaks have also kept the issue high on the agenda. An open-source stack analytics running inside your cloud or on-prem environment gives complete control of your data. This lets you decide which data is to be used when, and how. It lets you dictate how third parties can access and use your data, if at all. Open-source is the present It's hard to counter the fact that open-source is now mainstream. Companies like Microsoft, Apple, and IBM are now not only actively participating in the open-source community, they're also contributing to it. Open-source puts you on the front foot when it comes to innovation. With it, you'll be able to leverage the power of a vibrant developer community to develop better products in more efficient ways. How RudderStack helps you build an ideal open-source data analytics stack RudderStack is a completely open-source, enterprise-ready platform to simplify data management in the most secure and reliable way. It works as a perfect data integration platform by routing your event data from data sources such as websites, mobile apps and servers, to multiple destinations of your choice - thus helping you save time and effort. RudderStack integrates effortlessly with a multitude of destinations such as Google Analytics, Amplitude, MixPanel, Salesforce, HubSpot, Facebook Ads, and more, as well as popular data warehouses such as Amazon Redshift or S3. If performing efficient clickstream analytics is your goal, RudderStack offers you the perfect data pipeline to collect and route your data securely. Learn more about Rudderstack by visiting the RudderStack website, or check out its GitHub page to find out how it works.
Read more
  • 0
  • 0
  • 10557
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-3-different-types-of-generative-adversarial-networks-gans-and-how-they-work
Packt Editorial Staff
08 Jan 2020
6 min read
Save for later

3 different types of generative adversarial networks (GANs) and how they work

Packt Editorial Staff
08 Jan 2020
6 min read
Generative adversarial networks (GANs) have been greeted with real excitement since their creation back in 2014 by Ian Goodfellow and his research team. Yann LeCun, Facebook's Director of AI Research went as far as describing GANs as "the most interesting idea in the last 10 years in ML." With all this excitement, however, it can be easy to miss the subtle diversity of GANs; there are a number of different types of generative adversarial networks, each one working in slightly different ways and helping engineers to achieve slightly different results. To give you a deeper insight on GANs, in this article we'll look at three different generative adversarial networks: SRGANs, CycleGANs, and InfoGANs. We'll explore how these different GANs work and how they can be used. This should give you a solid foundation to explore GANs in more depth and begin to apply them in your own experiments and projects. This article is an excerpt from the book, Deep Learning with TensorFlow 2 and Keras, Second Edition by Antonio Gulli, Amita Kapoor, and Sujit Pal.  SRGAN - Super Resolution GANs Remember seeing a crime-thriller where our hero asks the computer guy to magnify the faded image of the crime scene? With the zoom we are able to see the criminal’s face in detail, including the weapon used and anything engraved upon it! Well, SRGAN can perform similar magic. Here a GAN is trained in such a way that it can generate a photorealistic high-resolution image when given a low-resolution image. The SRGAN architecture consists of three neural networks: a very deep generator network, a discriminator network, and a pretrained VGG-16 network. How do SRGANs work? SRGANs use the perceptual loss function (developed by Johnson et al, Perceptual Losses for Real-Time Style Transfer and Super-Resolution). The difference in the feature map activations in high layers of a VGG network between the network output part and the high-resolution part comprises the perceptual loss function. Besides perceptual loss, the authors further added content loss and an adversarial loss so that images generated look more natural and the finer details more artistic. The perceptual loss is defined as the weighted sum of content loss and adversarial loss: lSR = lSR X+ 10−3×lSRGen The first term on the right-hand side is the content loss, obtained using the feature maps generated by pretrained VGG 19. Mathematically it is the Euclidean distance between the feature map of the reconstructed image (that is the one generated by the generator) and the original high-resolution reference image. The second term on the right-hand side is the adversarial loss. It is the standard generative loss term, designed to ensure that images generated by the generator are able to fool the discriminator. You can see in the following figure taken from the original paper that the image generated by SRGAN is much closer to the original high-resolution image: [caption id="attachment_31006" align="aligncenter" width="907"] image via https://arxiv.org/pdf/1609.04802.pdf[/caption] CycleGAN Another noteworthy architecture is CycleGAN; proposed in 2017, it can perform the task of image translation. Once trained you can translate an image from one domain to another domain. For example, when trained on horse and zebra data set, if you give it an image with horses in the ground, the CycleGAN can convert the horses to zebra with the same background. How does CycleGAN work? Have you ever imagined how a scenery would look if Van Gogh or Manet had painted it? We have many sceneries, and many landscapes painted by Gogh/Manet, but we do not have any collection of input-output pairs. CycleGAN performs the image translation, that is, transfers an image given in one domain (scenery for example) to another domain (Van Gogh painting of the same scene, for instance) in the absence of training examples. CycleGAN’s ability to perform image translation in the absence of training pairs is what makes it unique. To achieve image translation the authors of CycleGAN used a very simple and yet effective procedure. They made use of two GANs, the generator of each GAN performing the image translation from one domain to another. To elaborate, let us say the input is X, then the generator of the first GAN performs a mapping G: X → Y, thus its output would be Y = G(X). The generator of the second GAN performs an inverse mapping F: Y → X, resulting in X = F(Y). Each discriminator is trained to distinguish between real images and synthesized images. The idea is shown as follows: To train the combined GANs, the authors added beside the conventional GAN adversarial loss a forward cycle consistency loss (left figure) and a backward cycle consistency loss (right figure). This ensures that if an image X is given as input, then after the two translations F(G(X)) ~ X the obtained image is the same X (similarly the backward cycle consistency loss ensures the G(F(Y)) ~ Y). Following are some of the successful image translations by CycleGAN: Following are few more examples, you can see the translation of seasons (summer → winter), photo → painting and vice versa, horses → zebra: InfoGAN The GAN architectures that we have considered up to now provide us with little or no control over the generated images. InfoGAN changes this; it provides control over various attributes of the images generated. The InfoGAN uses concepts from information theory such that the noise term is transformed into latent codes which provide predictable and systematic control over the output. How does InfoGAN work? The generator in InfoGAN takes two inputs the latent space Z and a latent code c, thus the output of generator is G(Z,c). The GAN is trained such that it maximizes the mutual information between the latent code c and the generated image G(Z,c). The following figure shows the architecture of InfoGAN:   The concatenated vector (Z,c) is fed to the Generator. Q(c|X) is also a neural network, combined with the generator it works to form a mapping between random noise Z and its latent code c_hat, it aims to estimate c given X. This is achieved by adding a regularization term to the objective function of conventional GAN: minDmaxG VI(D,G) = VG(D,G) −λI(c;G(Z,c)) The term VG(D,G) is the loss function of conventional GAN, and the second term is the regularization term, where λ is a constant. Its value was set to 1 in the paper, and I(c;G(Z,c)) is the mutual information between the latent code c and the Generator generated image G(Z,c). Below is the results of InfoGAN on the MNIST dataset: That concludes our brief look at three different types of generative adversarial networks. You can find the book from which this article was taken on the Packt store or you can read the first chapter for free on the Packt subscription platform.
Read more
  • 0
  • 0
  • 16713

article-image-emmanuel-tsukerman-on-why-a-malware-solution-must-include-a-machine-learning-component
Savia Lobo
30 Dec 2019
11 min read
Save for later

Emmanuel Tsukerman on why a malware solution must include a machine learning component

Savia Lobo
30 Dec 2019
11 min read
Machine learning is indeed the tech of present times! Security, which is a growing concern for many organizations today and machine learning is one of the solutions to deal with it. ML can help cybersecurity systems analyze patterns and learn from them to help prevent similar attacks and respond to changing behavior. To know more about machine learning and its application in Cybersecurity, we had a chat with Emmanuel Tsukerman, a Cybersecurity Data Scientist and the author of Machine Learning for Cybersecurity Cookbook. The book also includes modern AI to create powerful cybersecurity solutions for malware, pentesting, social engineering, data privacy, and intrusion detection. In 2017, Tsukerman's anti-ransomware product was listed in the Top 10 ransomware products of 2018 by PC Magazine. In his interview, Emmanuel talked about how ML algorithms help in solving problems related to cybersecurity, and also gave a brief tour through a few chapters of his book. He also touched upon the rise of deepfakes and malware classifiers. On using machine learning for cybersecurity Using Machine learning in Cybersecurity scenarios will enable systems to identify different types of attacks across security layers and also help to take a correct POA. Can you share some examples of the successful use of ML for cybersecurity you have seen recently? A recent and interesting development in cybersecurity is that the bad guys have started to catch up with technology; in particular, they have started utilizing Deepfake tech to commit crime; for example,they have used AI to imitate the voice of a CEO in order to defraud a company of $243,000. On the other hand, the use of ML in malware classifiers is rapidly becoming an industry standard, due to the incredible number of never-before-seen samples (over 15,000,000) that are generated each year. On staying updated with developments in technology to defend against attacks Machine learning technology is not only used by ethical humans, but also by Cybercriminals who use ML for ML-based intrusions. How can organizations counter such scenarios and ensure the safety of confidential organizational/personal data? The main tools that organizations have at their disposal to defend against attacks are to stay current and to pentest. Staying current, of course, requires getting educated on the latest developments in technology and its applications. For example, it’s important to know that hackers can now use AI-based voice imitation to impersonate anyone they would like. This knowledge should be propagated in the organization so that individuals aren’t caught off-guard. The other way to improve one’s security is by performing regular pen tests using the latest attack methodology; be it by attempting to avoid the organization’s antivirus, sending phishing communications, or attempting to infiltrate the network. In all cases, it is important to utilize the most dangerous techniques, which are often ML-based On how ML algorithms and GANs help in solving cybersecurity problems In your book, you have mentioned various algorithms such as clustering, gradient boosting, random forests, and XGBoost. How do these algorithms help in solving problems related to cybersecurity? Unless a machine learning model is limited in some way (e.g., in computation, in time or in training data), there are 5 types of algorithms that have historically performed best: neural networks, tree-based methods, clustering, anomaly detection and reinforcement learning (RL). These are not necessarily disjoint, as one can, for example, perform anomaly detection via neural networks. Nonetheless, to keep it simple, let’s stick to these 5 classes. Neural networks shine with large amounts of data on visual, auditory or textual problems. For that reason, they are used in Deepfakes and their detection, lie detection and speech recognition. Many other applications exist as well. But one of the most interesting applications of neural networks (and deep learning) is in creating data via Generative adversarial networks (GANs). GANs can be used to generate password guesses and evasive malware. For more details, I’ll refer you to the Machine Learning for Cybersecurity Cookbook. The next class of models that perform well are tree-based. These include Random Forests and gradient boosting trees. These perform well on structured data with many features. For example, the PE header of PE files (including malware) can be featurized, yielding ~70 numerical features. It is convenient and effective to construct an XGBoost model (a gradient-boosting model) or a Random Forest model on this data, and the odds are good that performance will be unbeatable by other algorithms. Next there is clustering. Clustering shines when you would like to segment a population automatically. For example, you might have a large collection of malware samples, and you would like to classify them into families. Clustering is a natural choice for this problem. Anomaly detection lets you fight off unseen and unknown threats. For instance, when a hacker utilizes a new tactic to intrude on your network, an anomaly detection algorithm can protect you even if this new tactic has not been documented. Finally, RL algorithms perform well on dynamic problems. The situation can be, for example, a penetration test on a network. The DeepExploit framework, covered in the book, utilizes an RL agent on top of metasploit to learn from prior pen tests and becomes better and better at finding vulnerabilities. Generative Adversarial Networks (GANs) are a popular branch of ML used to train systems against counterfeit data. How can these help in malware detection and safeguarding systems to identify correct intrusion? A good way to think about GANs is as a pair of neural networks, pitted against each other. The loss of one is the objective of the other. As the two networks are trained, each becomes better and better at its job. We can then take whichever side of the “tug of war” battle, separate it from its rival, and use it. In other cases, we might choose to “freeze” one of the networks, meaning that we do not train it, but only use it for scoring. In the case of malware, the book covers how to use MalGAN, which is a GAN for malware evasion. One network, the detector, is frozen. In this case, it is an implementation of MalConv. The other network, the adversarial network, is being trained to modify malware until the detection score of MalConv drops to zero. As it trains, it becomes better and better at this. In a practical situation, we would want to unfreeze both networks. Then we can take the trained detector, and use it as part of our anti-malware solution. We would then be confident knowing that it is very good at detecting evasive malware. The same ideas can be applied in a range of cybersecurity contexts, such as intrusion and deepfakes. On how Machine Learning for Cybersecurity Cookbook can help with easy implementation of ML for Cybersecurity problems What are some of the tools/ recipes mentioned in your book that can help cybersecurity professionals to easily implement machine learning and make it a part of their day-to-day activities? The Machine Learning for Cybersecurity Cookbook offers an astounding 80+ recipes. Themost applicable recipes will vary between individual professionals, and even for each individual different recipes will be applicable at different times in their careers. For a cybersecurity professional beginning to work with malware, the fundamentals chapter, chapter 2:ML-based Malware Detection, provides a solid and excellent start to creating a malware classifier. For more advanced malware analysts, Chapter 3:Advanced Malware Detection will offer more sophisticated and specialized techniques, such as dealing with obfuscation and script malware. Every cybersecurity professional would benefit from getting a firm grasp of chapter 4, “ML for Social Engineering”. In fact, anyone at all should have an understanding of how ML can be used to trick unsuspecting users, as part of their cybersecurity education. This chapter really shows that you have to be cautious because machines are becoming better at imitating humans. On the other hand, ML also provides the tools to know when such an attack is being performed. Chapter 5, “Penetration Testing Using ML” is a technical chapter, and is most appropriate to cybersecurity professionals that are concerned with pen testing. It covers 10 ways in which pen testing can be improved by using ML, including neural network-assisted fuzzing and DeepExploit, a framework that utilizes a reinforcement learning (RL) agent on top of metasploit to perform automatic pen testing. Chapter 6, “Automatic Intrusion Detection” has a wider appeal, as a lot of cybersecurity professionals have to know how to defend a network from intruders. They would benefit from seeing how to leverage ML to stop zero-day attacks on their network. In addition, the chapter covers many other use cases, such as spam filtering, Botnet detection and Insider Threat detection, which are more useful to some than to others. Chapter 7, “Securing and Attacking Data with ML” provides great content to cybersecurity professionals interested in utilizing ML for improving their password security and other forms of data security. Chapter 8, “Secure and Private AI”, is invaluable to data scientists in the field of cybersecurity. Recipes in this chapter include Federated Learning and differential privacy (which allow to train an ML model on clients’ data without compromising their privacy) and testing adversarial robustness (which allows to improve the robustness of ML models to adversarial attacks). Your book talks about using machine learning to generate custom malware to pentest security. Can you elaborate on how this works and why this matters? As a general rule, you want to find out your vulnerabilities before someone else does (who might be up to no-good). For that reason, pen testing has always been an important step in providing security. To pen test your Antivirus well, it is important to use the latest techniques in malware evasion, as the bad guys will certainly try them, and these are deep learning-based techniques for modifying malware. On Emmanuel’s personal achievements in the Cybersecurity domain Dr. Tsukerman, in 2017, your anti-ransomware product was listed in the ‘Top 10 ransomware products of 2018’ by PC Magazine. In your experience, why are ransomware attacks on the rise and what makes an effective anti-ransomware product? Also, in 2018,  you designed an ML-based, instant-verdict malware detection system for Palo Alto Networks' WildFire service of over 30,000 customers. Can you tell us more about this project? If you monitor cybersecurity news, you would see that ransomware continues to be a huge threat. The reason is that ransomware offers cybercriminals an extremely attractive weapon. First, it is very difficult to trace the culprit from the malware or from the crypto wallet address. Second, the payoffs can be massive, be it from hitting the right target (e.g., a HIPAA compliant healthcare organization) or a large number of targets (e.g., all traffic to an e-commerce web page). Thirdly, ransomware is offered as a service, which effectively democratizes it! On the flip side, a lot of the risk of ransomware can be mitigated through common sense tactics. First, backing up one’s data. Second, having an anti-ransomware solution that provides guarantees. A generic antivirus can provide no guarantee - it either catches the ransomware or it doesn’t. If it doesn’t, your data is toast. However, certain anti-ransomware solutions, such as the one I have developed, do offer guarantees (e.g., no more than 0.1% of your files lost). Finally, since millions of new ransomware samples are developed each year, the malware solution must include a machine learning component, to catch the zero-day samples, which is another component of the anti-ransomware solution I developed. The project at Palo Alto Networks is a similar implementation of ML for malware detection. The one difference is that unlike the anti-ransomware service, which is an endpoint security tool, it offers protection services from the cloud. Since Palo Alto Networks is a firewall-service provider, that makes a lot of sense, since ideally, the malicious sample will be stopped at the firewall, and never even reach the endpoint. To learn how to implement the techniques discussed in this interview, grab your copy of the Machine Learning for Cybersecurity Cookbook Don’t wait - the bad guys aren’t waiting. Author Bio Emmanuel Tsukerman graduated from Stanford University and obtained his Ph.D. from UC Berkeley. In 2017, Dr. Tsukerman's anti-ransomware product was listed in the Top 10 ransomware products of 2018 by PC Magazine. In 2018, he designed an ML-based, instant-verdict malware detection system for Palo Alto Networks' WildFire service of over 30,000 customers. In 2019, Dr. Tsukerman launched the first cybersecurity data science course. About the book Machine Learning for Cybersecurity Cookbook will guide you through constructing classifiers and features for malware, which you'll train and test on real samples. You will also learn to build self-learning, reliant systems to handle cybersecurity tasks such as identifying malicious URLs, spam email detection, intrusion detection, network protection, and tracking user and process behavior, and much more! DevSecOps and the shift left in security: how Semmle is supporting software developers [Podcast] Elastic marks its entry in security analytics market with Elastic SIEM and Endgame acquisition Businesses are confident in their cybersecurity efforts, but weaknesses prevail
Read more
  • 0
  • 0
  • 5443

article-image-key-skills-for-data-professionals-to-learn-in-2020
Richard Gall
20 Dec 2019
6 min read
Save for later

Key skills for data professionals to learn in 2020

Richard Gall
20 Dec 2019
6 min read
It’s easy to fall into the trap of thinking about your next job, or even the job after that. It’s far more useful, however, to think more about the skills you want and need to learn now. This will focus your mind and ensure that you don’t waste time learning things that simply aren’t helpful. It also means you can make use of the things you’re learning almost immediately. This will make you more productive and effective - and who knows, maybe it will make the pathway to your future that little bit clearer. So, to help you focus, here are some of the things you should focus on learning as a data professional. Reinforcement learning Reinforcement learning is one of the most exciting and cutting-edge areas of machine learning. Although the area itself is relatively broad, the concept itself is fundamentally about getting systems to ‘learn’ through a process of reward. Because reinforcement learning focuses on making the best possible decision at a given moment, it naturally finds many applications where decision making is important. This includes things like robotics, digital ad-bidding, configuring software systems, and even something as prosaic as traffic light control. Of course, the list of potential applications for reinforcement learning could be endless. To a certain extent, the real challenge with it is finding new use cases that are relevant to you. But to do that, you need to learn and master it - so make 2020 the year you do just that. Get to grips with reinforcement learning with Reinforcement Learning Algorithms with Python. Learn neural networks Neural networks are closely related to reinforcement learning - they’re essentially another element within machine learning. However, neural networks are even more closely aligned with what we think of as typical artificial intelligence. Indeed, even the name itself hints at the fact that these systems are supposed to in some way mimic the human brain. Like reinforcement learning, there are a number of different applications for neural networks. These include image and language processing, as well as forecasting. The complexity of relationships that can be figured inside neural networks systems is useful for handling data with many different variables and intricacies that would otherwise be difficult to capture. If you want to find out how artificial intelligence really works under the hood, make sure you learn neural networks in 2020. Learn how to build real-world neural networks projects with Neural Network Projects with Python. Meta-learning Metalearning is another area of machine learning. It’s designed to help engineers and analysts to use the right machine learning algorithms for specific problems - it’s particularly important in automatic machine learning, where removing human agency from the analytical process can lead to the wrong systems being used on data. Meta learning does this by being applied to metadata about machine learning projects. This metadata will include information about the data, such as algorithm features, performance measures, and patterns identified previously. Once meta learning algorithms have ‘learned’ from this data, they should, in theory, be well optimized to run on other sets of data. It has been said that meta learning is important in the move towards generalized artificial intelligence, or AGI (intelligence that is more akin to human intelligence). This is because getting machines to learn about learning allow systems to move between different problems - something that is incredibly difficult with even the most sophisticated neural networks. Whether it will actually get us any closer to AGI is certainly open to debate, but if you want to be a part of the cutting edge of AI development, getting stuck into meta learning is a good place to begin in 2020. Find out how meta learning works in Hands-on Meta Learning with Python. Learn a new programming language Python is now the undisputed language of data. But that’s far from the end of the story - R still remains relevant in the field, and there are even reasons to use other languages for machine learning. It might not be immediately obvious - especially if you’re content to use R or Python for analytics and algorithmic projects - but because machine learning is shifting into many different fields, from mobile development to cybersecurity, learning how other programming languages can be used to build machine learning algorithms could be incredibly valuable. From the perspective of your skill set, it gives you a level of flexibility that will not only help you to solve a wider range of problems, but also stand out from the crowd when it comes to the job market. The most obvious non-obvious languages to learn for machine learning practitioners and other data professionals are Java and Julia. But even new and emerging languages are finding their way into machine learning - Go and Swift, for example, could be interesting routes to explore, particularly if you’re thinking about machine learning in production software and systems. Find out how to use Go for machine learning with Go Machine Learning Projects. Learn new frameworks For data professionals there are probably few things more important than learning new frameworks. While it’s useful to become a polyglot, it’s nevertheless true that learning new frameworks and ecosystem tools are going to have a more immediate impact on your work. PyTorch and TensorFlow should almost certainly be on your list for 2020. But we’ve mentioned them a lot recently, so it’s probably worth highlighting other frameworks worth your focus: Pandas, for data wrangling and manipulation, Apache Kafka, for stream-processing, scikit-learn for machine learning, and Matplotlib for data visualization. The list could be much, much longer: however, the best way to approach learning a new framework is to start with your immediate problems. What’s causing issues? What would you like to be able to do but can’t? What would you like to be able to do faster? Explore TensorFlow eBooks and videos on the Packt store. Learn how to develop and communicate a strategy It’s easy to just roll your eyes when someone talks about how important ‘soft skills’ are for data professionals. Except it’s true - being able to strategize, communicate, and influence, are what mark you out as a great data pro rather than a merely competent one. The phrase ‘soft skills’ is often what puts people off - ironically, despite the name they’re often even more difficult to master than technical skill. This is because, of course, soft skills involve working with humans in all their complexity. However, while learning these sorts of skills can be tough, it doesn’t mean it's impossible. To a certain extent it largely just requires a level of self-awareness and reflexivity, as well as a sensitivity to wider business and organizational problems. A good way of doing this is to step back and think of how problems are defined, and how they relate to other parts of the business. Find out how to deliver impactful data science projects with Managing Data Science. If you can master these skills, you’ll undoubtedly be in a great place to push your career forward as the year continues.
Read more
  • 0
  • 0
  • 4769

article-image-why-choose-opencv-over-matlab-for-your-next-computer-vision-project
Vincy Davis
20 Dec 2019
6 min read
Save for later

Why choose OpenCV over MATLAB for your next Computer Vision project

Vincy Davis
20 Dec 2019
6 min read
Scientific Computing relies on executing computer algorithms coded in different programming languages. One such interdisciplinary scientific field is the study of Computer Vision, often abbreviated as CV. Computer Vision is used to develop techniques that can automate tasks like acquiring, processing, analyzing and understanding digital images. It is also utilized for extracting high-dimensional data from the real world to produce symbolic information. In simple words, Computer Vision gives computers the ability to see, understand and process images and videos like humans. The vast advances in hardware, machine learning tools, and frameworks have resulted in the implementation of Computer Vision in various fields like IoT, manufacturing, healthcare, security, etc. Major tech firms like Amazon, Google, Microsoft, and Facebook are investing immensely in the research and development of this field. Out of the many tools and libraries available for Computer Vision nowadays, there are two major tools OpenCV and Matlab that stand out in terms of their speed and efficiency. In this article, we will have a detailed look at both of them. Further Reading [box type="shadow" align="" class="" width=""]To learn how to build interesting image recognition models like setting up license plate recognition using OpenCV, read the book “Computer Vision Projects with OpenCV and Python 3” by author Matthew Rever. The book will also guide you to design and develop production-grade Computer Vision projects by tackling real-world problems.[/box] OpenCV: An open-source multiplatform solution tailored for Computer Vision OpenCV, developed by Intel and now supported by Willow Garage, is released under the BSD 3-Clause license and is free for commercial use. It is one of the most popular computer vision tools aimed at providing a well-optimized, well tested, and open-source (C++)-based implementation for computer vision algorithms. The open-source library has interfaces for multiple languages like C++, Python, and Java and supports Linux, macOS, Windows, iOS, and Android. Many of its functions are implemented on GPU. The first stable release of OpenCV version 1.0 was in the year 2006. The OpenCV community has grown rapidly ever since and with its latest release, OpenCV version 4.1.1, it also brings improvements in the dnn (Deep Neural Networks) module, which is a popular module in the library that implements forward pass (inferencing) with deep networks, which are pre-trained using popular deep learning frameworks.  Some of the features offered by OpenCV include: imread function to read the images in the BGR (Blue-Green-Red) format by default. Easy up and downscaling for resizing an image. Supports various interpolation and downsampling methods like INTER_NEAREST to represent the nearest neighbor interpolation. Supports multiple variations of thresholding like adaptive thresholding, bitwise operations, edge detection, image filtering, image contours, and more. Enables image segmentation (Watershed Algorithm) to classify each pixel in an image to a particular class of background and foreground. Enables multiple feature-matching algorithms, like brute force matching, knn feature matching, among others. With its active community and regular updates for Machine Learning, OpenCV is only going to grow by leaps and bounds in the field of Computer Vision projects.  MATLAB: A licensed quick prototyping tool with OpenCV integration One disadvantage of OpenCV, which makes novice computer vision users tilt towards Matlab is the former's complex nature. OpenCV is comparatively harder to learn due to lack of documentation and error handling codes. Matlab, developed by MathWorks is a proprietary programming language with a multi-paradigm numerical computing environment. It has over 3 million users worldwide and is considered one of the easiest and most productive software for engineers and scientists. It has a very powerful and swift matrix library.  Matlab also works in integration with OpenCV. This enables MATLAB users to explore, analyze, and debug designs that incorporate OpenCV algorithms. The support package of MATLAB includes the data type conversions necessary for MATLAB and OpenCV. MathWorks provided Computer Vision Toolbox renders algorithms, functions, and apps for designing and testing computer vision, 3D vision, and video processing systems. It also allows detection, tracking, feature extraction, and matching of objects. Matlab can also train custom object detectors using deep learning and machine learning algorithms such as YOLO v2, Faster R-CNN, and ACF. Most of the toolbox algorithms in Matlab support C/C++ code generation for integrating with existing code, desktop prototyping, and embedded vision system deployment. However, Matlab does not contain as many functions for computer vision as OpenCV, which has more of its functions implemented on GPU. Another issue with Matlab is that it's not open-source, it’s license is costly and the programs are not portable.  Another important factor which matters a lot in computer vision is the performance of a code, especially when working on real-time video processing.  Which has a faster execution time? OpenCV or Matlab? Along with Computer Vision, other fields also require faster execution while choosing a programming language or library for implementing any function. This factor is analyzed in detail in a paper titled “Matlab vs. OpenCV: A Comparative Study of Different Machine Learning Algorithms”.  The paper provides a very practical comparative study between Matlab and OpenCV using 20 different real datasets. The differentiation is based on the execution time for various machine learning algorithms like Classification and Regression Trees (CART), Naive Bayes, Boosting, Random Forest and K-Nearest Neighbor (KNN). The experiments were run on an Intel core 2 duo P7450 machine, with 3GB RAM, and Ubuntu 11.04 32-bit operating system on Matlab version 7.12.0.635 (R2011a), and OpenCV C++ version 2.1.  The paper states, “To compare the speed of Matlab and OpenCV for a particular machine learning algorithm, we run the algorithm 1000 times and take the average of the execution times. Averaging over 1000 experiments is more than necessary since convergence is reached after a few hundred.” The outcome of all the experiments revealed that though Matlab is a successful scientific computing environment, it is outrun by OpenCV for almost all the experiments when their execution time is considered. The paper points out that this could be due to a combination of a number of dimensionalities, sample size, and the use of training sets. One of the listed machine learning algorithms KNN produced a log time ratio of 0.8 and 0.9 on datasets D16 and D17 respectively.  Clearly, Matlab is great for exploring and fiddling with computer vision concepts as researchers and students at universities that can afford the software. However, when it comes to building production-ready real-world computer vision projects, OpenCV beats Matlab hand down. You can learn about building more Computer Vision projects like human pose estimation using TensorFlow from our book ‘Computer Vision Projects with OpenCV and Python 3’. Master the art of face swapping with OpenCV and Python by Sylwek Brzęczkowski, developer at TrustStamp NVIDIA releases Kaolin, a PyTorch library to accelerate research in 3D computer vision and AI Generating automated image captions using NLP and computer vision [Tutorial] Computer vision is growing quickly. Here’s why. Introducing Intel’s OpenVINO computer vision toolkit for edge computing
Read more
  • 0
  • 0
  • 5276
article-image-uber-ai-labs-senior-research-scientist-ankit-jain-tensorflow-updates-learning-machine-learning
Sugandha Lahoti
19 Dec 2019
10 min read
Save for later

Uber AI Labs senior research scientist, Ankit Jain on TensorFlow updates and learning machine learning by doing [Interview]

Sugandha Lahoti
19 Dec 2019
10 min read
No doubt, TensorFlow is one of the most popular machine learning libraries right now. However, newbie developers who want to experiment with TensorFlow often face difficulties in learning TensorFlow, relying just on tutorials.  Recently, we sat down with Ankit Jain, senior research scientist at Uber AI Labs and one of the authors of the book, TensorFlow Machine Learning Projects. Ankit talked about how real-world implementations can be a good way to learn for those developing TF models, specifically the ‘learn by doing’ approach. Talking about TensorFlow 2.0, he considers ‘eager execution by default’ a major paradigm shift and is all game for interoperability between TF 2.0 and other machine learning frameworks. He also gave us an insight into the limitations of AI algorithms (generalization, AI ethics, labeled data to name a few). Continue reading the full interview for a detailed perspective. On why TensorFlow 2 upgrade is paradigm-shifting in more ways than one TensorFlow 2 was released last month. What are some of your top features in TensorFlow 2.0? How do you think it has upgraded the machine learning ecosystem? TF 2.0 is a major upgrade from its predecessor in many ways. It addressed many of the shortcomings of TF 1.x and with this release, the difference between Pytorch and TF has narrowed. One of the biggest paradigm shifts in TF 2.0 is eager execution by default. This means you don’t have to pre-define a static computation graph, create sessions, deal with the unintuitive interface or have painful experience in debugging your deep learning model code. However, you lose on some performance in run time when you switch to complete eager mode. For that purpose, they have introduced tf.function decorator which can help you translate your Python functions to Tensorflow graphs. This way you can retain both code readability and ease of debugging while getting the performance of TensorFlow graphs.  Another major update is that many confusing redundancies have been consolidated and many functions are now integrated with Keras API. This will help to standardize the communication of data/models among various components of TensorFlow ecosystem. TF 2.0 also comes with backward compatibility to TF 1.X with an easy optional way to convert your TF 1.X code into TF 2.0. TF 1.X suffered from a lack of standardization in how we load/save trained machine learning models. TF 2.0 fixed this by defining a single API SavedModels. As SavedModels is integrated with the Tensorflow ecosystem, it becomes much easier to deploy models using Tensorflow Lite, Tensorflow.js to other devices/applications.   With the onset of TensorFlow 2, Tensorflow and Keras are integrated into one module (tf.keras). TF 2.0 now delivers Keras as the central high-level API used to build and train models. What is the future/benefits of TensorFlow + Keras?  Keras has been a very popular high-level API for faster prototyping and production and even for research. As the field of AI/ML is in nascent stages, ease of development can have a huge impact for people getting started in machine learning.  Previously, a developer new to machine learning started from Keras while an experienced researcher used only Tensorflow 1.x due to its flexibility to build custom models. With Keras integrated as a high level API for TF 2.0, we can expect both beginners and experts working on the same framework which can lead to better collaboration and better exchange of ideas in the community.  Additionally, a single high level easy to use API reduces confusion and streamlines consistency across use cases of production and research.  Overall, I think it’s a great step in the right direction by Google which will enable more developers to hop on the Tensorflow ecosystem.  On TensorFlow, NLP and structured learning Recently, Transformers 2.0, a popular OS NLP library, was released that provides TF 2.0 and PyTorch deep interoperability. What are your views on this development? One of the areas where deep learning has made an immense impact is Natural Language Processing (NLP). Research in NLP is moving very fast and it is hard to keep up with all the papers and code releases by various research groups around the world.  Hugging Face, the company behind the library “Transformers” has really eased the usage of state of the art (SOTA) models and process of building new models by simplifying the preprocessing and model building pipeline through an easy to use Keras like interface. “Transformers 2.0” is the recent release from the company and the most important feature is the interoperability between Pytorch and TF 2.0. TF 2.0 is more production-ready while Pytorch is more oriented towards research. With this upgrade, you can pretty much move from one framework to another for training, validation, and deployment of the model.  Interoperability between frameworks is very important for the AI community as it enables development velocity. Moreover, as none of the frameworks can be perfect at everything, it makes the framework developers focus more on their strengths and make those features seamless. This will create greater efficiency going forward. Overall, I think this is a great development and I expect other libraries in domains like Computer Vision, Graph Learning etc. to follow suit. This will enable a lot more application of state of the art models to production.  Google recently launched Neural Structured Learning (NSL), an open-source Tensorflow based framework for training neural networks with graphs and structured data. What are some of the potential applications of NSL? What do you think can be some Machine Learning Projects based around NSL? Neural structured learning is a concept of learning neural network parameters with structured signals other than features. Many real-world datasets contain some structured information like Knowledge graphs or molecular graphs in biology. Incorporating these signals can lead to a more accurate and robust model. From an implementation perspective, it boils down to adding a regularizer to the loss function such that the representation of neighboring nodes in the graph is similar.  Any application where the amount of labeled data is limited but has structural information like Knowledge Graph that can be exploited is a good candidate for these types of models. A possible example could be fraud detection in online systems. Fraud data generally has sparse labels and fraudsters create multiple accounts that are connected to each other through some information like devices etc. This structured information can be utilized to learn a better representation of fraud accounts.  There can be other applications is molecular data and other problems involving the knowledge graph. On Ankit’s experience working on his book, TensorFlow Machine Learning Project Tell us the motivation behind writing your book TensorFlow Machine Learning Projects. Why is TensorFlow ideal for building ML projects? What are some of your favorite machine learning projects from this book? When I started learning Tensorflow, I stumbled upon many tutorials (including the official ones) which explained various concepts on how Tensorflow works. While that was helpful in understanding the basics, most of my learning came from building projects with Tensorflow. That is when I realized the need for a resource that teaches using a ‘learn by doing’ approach. This book is unique in the way that it teaches machine learning theory, Tensorflow utilities and programming concepts all while developing a project in which you can have fun building and is also of practical use.  My favorite chapter from the book is “Generating Uncertainty in Traffic Signs Classifier using Bayesian Neural Networks”. With the development of self-driving cars, traffic signs detection is a major problem that needs to be solved. This chapter explains an advanced AI concept of Bayesian Neural Networks and shows step by step how to use those to detect traffic signs using Tensorflow. Some of the readers of the book have started to use this concept in their practical applications already. Machine Learning challenges and advice to those developing TensorFlow models What are the biggest challenges today in the field of Machine Learning and AI? What do you see as the greatest technology disruptors in the next 5 years? While AI and machine learning has seen huge success in recent years, there are few limitations of AI algorithms as we see today. Some of the major ones are: Labeled Data: Most of the success of AI has come from supervised learning. Many of the recent supervised deep learning algorithms require huge quantities of labeled data which is expensive to obtain. For example, obtaining huge amounts of clinical trial data for healthcare prediction is very challenging. The good news is that there is some research around building good ML models using sparse data labels. Explainability: Deep learning models are essentially a “black box” where you don’t know what factor(s) led to the prediction. For some applications like money lending, disease diagnosis, fraud detection etc. the explanations of predictions become very important. Currently, we see some nascent work in this direction with LIME and SHAP libraries. Generalization: In the current state of AI, we build one model for each application. We still don’t have good generality of models from one task to another. Generalization, if solved, can lead us to truly Artificial General Intelligence (AGI). Thankfully approaches like transfer learning and meta-learning are trying to solve this challenge. Bias, Fairness, and Ethics: An output of the machine learning model is heavily based on the input training data. Many a time, training data can have biases towards particular ethnicities, classes, religions, etc. We need more solutions in this direction to build trust in AI algorithms. Overall, I feel, AI is becoming mainstream and in the next 5 years we will see many traditional industries adopt AI to solve critical business problems and achieve more automation. At the same time, tooling for AI will keep on improving which will also help in its adoption. What is your advice for those developing machine learning projects on TensorFlow? Building projects with new techniques and technologies is a hard process. It requires patience, dealing with failures and hard work. For that reason, it is very important to pick up a project that you are passionate about. This way, you will continue building even if you are stuck somewhere. The selection of the right projects is by far the most important criterion in the project-based learning method.  About the Author Ankit currently works as a Senior Research Scientist at Uber AI Labs, the machine learning research arm of Uber. His work primarily involves the application of Deep Learning methods to a variety of Uber’s problems ranging from food recommendation system, forecasting to self-driving cars.  Previously, he has worked in a variety of data science roles at Bank of America, Facebook and other startups. Additionally, he has been a featured speaker in many of the top AI conferences and universities across the US, including UC Berkeley, OReilly AI conference etc. He completed his MS from UC Berkeley and a BS from IIT Bombay (India). You can find him on Linkedin, Twitter, and GitHub. About the Book With the help of this book, TensorFlow Machine Learning Projects you’ll not only learn how to build advanced projects using different datasets but also be able to tackle common challenges using a range of libraries from the TensorFlow ecosystem. To start with, you’ll get to grips with using TensorFlow for machine learning projects; you’ll explore a wide range of projects using TensorForest and TensorBoard for detecting exoplanets, TensorFlow.js for sentiment analysis, and TensorFlow Lite for digit classification. As you make your way through the book, you’ll build projects in various real-world domains. By the end of this book, you’ll have gained the required expertise to build full-fledged machine learning projects at work.  
Read more
  • 0
  • 0
  • 4220

article-image-data-science-and-machine-learning-what-to-learn-in-2020
Richard Gall
19 Dec 2019
5 min read
Save for later

Data science and machine learning: what to learn in 2020

Richard Gall
19 Dec 2019
5 min read
It’s hard to keep up with the pace of change in the data science and machine learning fields. And when you’re under pressure to deliver projects, learning new skills and technologies might be the last thing on your mind. But if you don’t have at least one eye on what you need to learn next you run the risk of falling behind. In turn this means you miss out on new solutions and new opportunities to drive change: you might miss the chance to do things differently. That’s why we want to make it easy for you with this quick list of what you need to watch out for and learn in 2020. The growing TensorFlow ecosystem TensorFlow remains the most popular deep learning framework in the world. With TensorFlow 2.0 the Google-based development team behind it have attempted to rectify a number of issues and improve overall performance. Most notably, some of the problems around usability have been addressed, which should help the project’s continued growth and perhaps even lower the barrier to entry. Relatedly TensorFlow.js is proving that the wider TensorFlow ecosystem is incredibly healthy. It will be interesting to see what projects emerge in 2020 - it might even bring JavaScript web developers into the machine learning fold. Explore Packt's huge range of TensorFlow eBooks and videos on the store. PyTorch PyTorch hasn’t quite managed to topple TensorFlow from its perch, but it’s nevertheless growing quickly. Easier to use and more accessible than TensorFlow, if you want to start building deep learning systems quickly your best bet is probably to get started on PyTorch. Search PyTorch eBooks and videos on the Packt store. End-to-end data analysis on the cloud When it comes to data analysis, one of the most pressing issues is to speed up pipelines. This is, of course, notoriously difficult - even in organizations that do their best to be agile and fast, it’s not uncommon to find that their data is fragmented and diffuse, with little alignment across teams. One of the opportunities for changing this is cloud. When used effectively cloud platforms can dramatically speed up analytics pipelines and make it much easier for data scientists and analysts to deliver insights quickly. This might mean that we need increased collaboration between data professionals, engineers, and architects, but if we’re to really deliver on the data at our disposal, then this shift could be massive. Learn how to perform analytics on the cloud with Cloud Analytics with Microsoft Azure. Data science strategy and leadership While cloud might help to smooth some of the friction that exists in our organizations when it comes to data analytics, there’s no substitute for strong and clear leadership. The split between the engineering side of data and the more scientific or interpretive aspect has been noted, which means that there is going to be a real demand for people that have a strong understanding of what data can do, what it shows, and what it means in terms of action. Indeed, the article just linked to also mentions that there is likely to be an increasing need for executive level understanding. That means data scientists have the opportunity to take a more senior role inside their organizations, by either working closely with execs or even moving up to that level. Learn how to build and manage a data science team and initiative that delivers with Managing Data Science. Going back to the algorithms In the excitement about the opportunities of machine learning and artificial intelligence, it’s possible that we’ve lost sight of some of the fundamentals: the algorithms. Indeed, given the conversation around algorithmic bias, and unintended consequences it certainly makes sense to place renewed attention on the algorithms that lie right at the center of our work. Even if you’re not an experienced data analyst or data scientist, if you’re a beginner it’s just as important to dive deep into algorithms. This will give you a robust foundation for everything else you do. And while statistics and mathematics will feel a long way from the supposed sexiness of data science, carefully considering what role they play will ensure that the models you build are accurate and perform as they should. Get stuck into algorithms with Data Science Algorithms in a Week. Computer vision and natural language processing Computer vision and Natural Language Processing are two of the most exciting aspects of modern machine learning and artificial intelligence. Both can be used for analytics projects, but they also have applications in real world digital products. Indeed, with augmented reality and conversational UI becoming more and more common, businesses need to be thinking very carefully about whether this could give them an edge in how they interact with customers. These sorts of innovations can be driven from many different departments - but technologists and data professionals should be seizing the opportunity to lead the way on how innovation can transform customer relationships. For more technology eBooks and videos to help you prepare for 2020, head to the Packt store.
Read more
  • 0
  • 0
  • 5831

article-image-thomas-munro-from-enterprisedb-on-parallelism-in-postgresql
Bhagyashree R
17 Dec 2019
7 min read
Save for later

Thomas Munro from EnterpriseDB on parallelism in PostgreSQL

Bhagyashree R
17 Dec 2019
7 min read
PostgreSQL is a powerful, open-source object-relational database system. Since its introduction, it has been well-received by developers for its reliability, feature robustness, data-integrity, better licensing, and much more. However, one of its limitations has been the lack of support for parallelism, which changed in the subsequent releases. At PostgresOpen 2018, Thomas Munro, a programmer at EnterpriseDB and PostgreSQL contributor talked about how parallelism has evolved in PostgreSQL over the years. In this article, we will see some of the key parallelism-specific features that Munro discussed in his talk. [box type="shadow" align="" class="" width=""] Further Learning This article gives you a glimpse of query parallelism in PostgreSQL. If you want to explore it further along with other concepts like data replication, and database performance, check out our book Mastering PostgreSQL 11 - Second Edition by Hans-Jürgen Schönig. This second edition of Mastering PostgreSQL 11 helps you build dynamic database solutions for enterprise applications using PostgreSQL, which enables database analysts to design both the physical and technical aspects of the system architecture with ease. [/box] Evolution of parallelism in PostgreSQL PostgreSQL uses a process-based architecture instead of a thread-based one. On startup, it launches a “postmaster” process and after that creates a new process for every database session. Previously, it did not support parallelism in a single connection and each query used to run serially. The absence of “intra-query parallelism” in PostgreSQL was a huge limitation for answering the queries faster. Parallelism here means allowing a single process to have multiple threads to query the system and utilize the increasing CPU core counts. The foundation for parallelism in PostgreSQL was laid out in the 9.4 and 9.5 releases. These came with infrastructure updates like dynamic shared memory segments, shared memory queues, and background workers. PostgreSQL 9.6 was actually the first release that came with user-visible features for parallel query execution. It supported executor nodes: gather, parallel sequential scan, partial aggregate, and finalize aggregate. However, this was not enabled by default. Then in 2017, PostgreSQL 10 was released, which had parallelism enabled by default. It had a few more executor nodes including gather merge, parallel index scan, and parallel bitmap heap scan. Last year, PostgreSQL 11 came out with a couple of more executor nodes including parallel append and parallel hash join. It also introduced partition-wise joins and parallel CREATE INDEX. Key parallelism-specific features in PostgreSQL Parallel sequential scans Parallel sequential scans was the very first feature for parallel query execution. Introduced in PostgreSQL 9.6, this scan distributes blocks of a table among different processes. This assignment is done one after the other to ensure that the access to the table remains sequential. The processes that run in parallel and scan the tuples of a table are called parallel workers. There is one special worker called leader, which is responsible for coordinating and collecting the output of the scan from each of the worker. The leader may or may not participate in scanning the database depending on its load in dividing and combining processes. Parallel index scan Parallel index scan is based on the same concept as parallel sequential scan, but it involves more communication and waiting. Currently, the parallel index scans are supported only for B-Tree indexes. In a parallel index scan, index pages are scanned in parallel. Each process will scan a single index block and return all tuples referenced by that block. Meanwhile, other processes will also scan different index blocks and return the tuples. The results of a parallel B-Tree scan are then returned in sorted order. Parallel bitmap heap scan Again, this also has the same concept as the parallel sequential scan. Explaining the difference, Munro said, “You’ve got a big bitmap and you are skipping ahead to the pages that contain interesting tuples.” In parallel bitmap heap scan, one process is chosen as the leader, who performs a scan of one or more indexes and creates bitmap indicating which table blocks need to be visited. These table blocks are then divided among the worker processes as in a parallel sequential scan. Here the heap scan is done in parallel, but the underlying index scan is not. Parallel joins PostgreSQL supports all three join strategies in parallel query plans: nested loop join, hash join, or merge join. However, there is no parallelism supported in the inner loop. The entire loop is scanned as a whole, and the parallelism comes into play when each worker executes the inner loop as a whole. The results of each join are sent to gather node to produce the final results. Nested loop join: The nested loop is the most basic way for PostgreSQL to perform a join. Though it is considered to be slow, it can be efficient if the inner side is an index scan. This is because the outer tuples and hence the loops that loop up values in the index will be divided among worker processes. Merge join: The inner side is executed in full. It can be inefficient when sort needs to be performed because the work and resulting data are duplicated in every cooperating process. Hash join: In this join as well, the inner side is executed in full by every worker process to build identical copies of the hash table. It is inefficient in cases when the hash table is large or the plan is expensive. However, in parallel hash join, the inner side is a parallel hash that divides the work of building a shared hash table over the cooperating processes. This is the only join in which we can have parallelism on both sides. Partition-wise join Partition-wise join is a new feature introduced in PostgreSQL 11. In partition-wise join, the planner knows that both sides of the join have matching partition schemes. Here a join between two similarly partitioned tables are broken down into joins between their matching partitions if there is an equi-join condition between the partition key of joining tables. Munro explains, “It becomes parallelizable with the advent of parallel append, which can then run different branches of that query plan in different processes. But if you do that then granularity of parallelism is partitioned, which is in some ways good and in some ways bad compared to block-based granularity.” He further adds, “It means when the last worker runs out of work to do everyone else has to wait for that before the query is finished. Whereas, if you use block-based parallelism you don’t have the problem but there are some advantages as a result of that as well.” Parallel aggregation in PostgreSQL Calculating aggregates can be very expensive and when evaluated in a single process it could take a considerable amount of time. This problem was solved in PostgreSQL 9.6 with the introduction of parallel aggregation. This is essentially a divide and conquer strategy where multiple workers calculate a part of aggregate before the final value based on these calculations is calculated by the leader. This article walked you through some of the parallelism-specific features in PostgreSQL presented by Munro in his PostgresOpen 2018 talk.  If you want to get to grips with other advanced PostgreSQL features and SQL functions, do have a look at our Mastering PostgreSQL 11 - Second Edition book by Hans-Jürgen Schönig. By the end of this book, you will be able to use your database to its utmost capacity by implementing advanced administrative tasks with ease. PostgreSQL committer Stephen Frost shares his vision for PostgreSQL version 12 and beyond Introducing PostgREST, a REST API for any PostgreSQL database written in Haskell Percona announces Percona Distribution for PostgreSQL to support open source databases 
Read more
  • 0
  • 0
  • 3894
article-image-15-things-every-bi-professional-should-know-about-tableau
Fatema Patrawala
17 Dec 2019
8 min read
Save for later

15 things every BI professional should know about Tableau

Fatema Patrawala
17 Dec 2019
8 min read
“The art and practice of visualizing data is becoming ever more important in bridging the human-computer gap to mediate analytical insight in a meaningful way.” ―Edd Dumbill Tableau is a powerful data visualization and discovery tool. It is an important part of a data analyst or data scientist’s - skill set, with many organizations specifying it as a key skill in job adverts. In this article, we’ll take a look at few things in Tableau you need to know to successfully make a mark in your business intelligence career. While architecture of traditional BI tools has hardware limitations, Tableau does not have such dependencies and it can function independently and requires minimum hardware support. Traditional tools are based on a complex set of technologies when Tableau is based on Associative Search technology making it intuitive, fast and dynamic. Tableau supports in-memory, multi-thread and multi-core computing and more advanced capabilities while traditional BI tools do not offer such functionalities. Various Tableau products Tableau Desktop is a self service business analytics and data visualization suite that anyone can use. With tableau desktop, you can extract massive data offline from your data warehouse for live up to date data analysis. Tableau Online / Tableau Server is an online hosting platform designed for enterprise users. It lets users working on Tableau publish and share dashboards across organization and teams. Tableau Reader is a free desktop application that enables you to open and view visualizations that are built in Tableau Desktop. Tableau Public is a free Tableau software which you can use to make visualizations but you will need to save your workbook or worksheets in the Tableau Server for anyone else to view them. Different data types in Tableau All fields in a data source have a data type. The data type reflects the kind of information stored in that field, for example integers (410), dates (1/23/2015) and strings (“Wisconsin”). The data type of a field is identified in the Data pane by one of the icons shown below. Data type icons in Tableau Icon Data type Text (string) values Date values Date & Time values Numerical values Boolean values (relational only) for example True/False Geographic values (used with maps) Cluster Group   Source: Tableau website Measures and Dimensions in Tableau Measures contain numeric, quantitative values that you can measure. Measures can be aggregated. When you drag a measure into the view, Tableau applies an aggregation to that measure (by default). Dimensions, on the other hand, contain qualitative values (such as names, dates, or geographical data). You can use dimensions to categorize, segment, and reveal the details in your data. Dimensions affect the level of detail in the view. Ways to connect data in Tableau We can either connect live to your data set or extract data into Tableau. Live: Connecting live to a data set leverages its computational processing and storage. New queries will go to the database and will be reflected as new or updated within the data. Extract: The Extract API allows you to programmatically extract and combine any data sources for use in Tableau. There can be multiple data source connections to different sources in the same workbook. Each connection will show up under the Data tab on the left sidebar. The benefit of Tableau extract over live connection is that extract can be used anywhere without any connection and you can build your own visualization without connecting to database. You can read a complete section on how to extract data in Tableau from this book, Learning Tableau 2019 - Third Edition, written by Joshua Milligan. This book takes you through the foundations of the Tableau 2019 paradigm to the advanced topics.  Joins and Blends in Tableau Joining tables and blending data sources are two different ways to link related data together in Tableau. Joins are performed to link tables of data together on a row-by-row basis. Blends are performed to link together multiple data sources at an aggregate level.  Different filters in Tableau and different use cases in which these filters are more relevant than others In Tableau, filters are used to restrict the data from database. Often, you will want to filter data in Tableau in order to perform an analysis on a subset of data, narrow your focus, or drill into detail. Tableau offers multiple ways to filter data. If you want to limit the scope of your analysis to a subset of data, you can filter the data at the source using one of the following techniques: Data Source Filters are applied before all other filters and are useful when you want to limit your analysis to a subset of data. These filters are applied before any other filters. Extract Filters limit the data that is stored in an extract (.tde or .hyper). Data source filters are often converted into extract filters if they are present when you extract the data. Custom SQL Filters can be accomplished using a live connection with custom SQL, which has a Tableau parameter in the WHERE clause.    Dual axis in Tableau Dual Axis is an excellent phenomenon supported by Tableau that helps users view two scales of two measures in the same graph. Many websites like Indeed.com and other make use of dual axis to show the comparison between two measures and their growth rate in a septic set of years. Dual axis let you compare multiple measures at once, having two independent axis layered on top of one another.  Key components of a Tableau Dashboard Horizontal – Horizontal layout containers allow the designer to group worksheets and dashboard components left to right across your page and edit the height of all elements at once. Vertical – Vertical containers allow the user to group worksheets and dashboard components top to bottom down your page and edit the width of all elements at once. Text – All textual fields. Image Extract  – A Tableau workbook is in XML format. In order to extract images, Tableau applies some codes to extract an image which can be stored in XML. Web [URL ACTION] – A URL action is a hyperlink that points to a Web page, file, or other web-based resource outside of Tableau. You can use URL actions to link to more information about your data that may be hosted outside of your data source. To make the link relevant to your data, you can substitute field values of a selection into the URL as parameters. If you want to learn how to design dashboards in Tableau, this book Learning Tableau 2019, will give you a step by step process for designing dashboards.  Why automate reports in Tableau Once you have automated reporting, you’ll have time to spend on innovative projects. What can be done manually could be performed by automation, delivering the same results in a fraction of the time. Reducing such a time-consuming and repetitive task will make you more productive, and more efficient.  What is story in Tableau? Why would create a story and what are they used for? A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey information. You can create stories to show how facts are connected, provide context, demonstrate how decisions relate to outcomes, or simply make a compelling case. Each individual sheet in a story is called a story point. The primary objective of creating stories in Tableau is to communicate data to a certain audience with an intended result.  How can you create stories in Tableau? There is a feature in Tableau named as Stories that allows you to tell a story using interactive snapshots of dashboards and views. The snapshots become points in a story. This allows you to construct guided narrative or even an entire presentation. Read this chapter, ‘Telling a Data Story with Dashboards’ from this book, Learning Tableau 2019, to create insightful dashboards in Tableau.    How to embed views into Webpages? You can embed interactive Tableau views and dashboards into web pages, blogs, wiki pages, web applications, and intranet portals. Embedded views update as the underlying data changes, or as their workbooks are updated on Tableau Server. Embedded views follow the same licensing and permission restrictions used on Tableau Server. That is, to see a Tableau view that’s embedded in a web page, the person accessing the view must also have an account on Tableau Server. Alternatively, if your organization uses a core-based license on Tableau Server, a Guest account is available. This allows people in your organization to view and interact with Tableau views embedded in web pages without having to sign in to the server. Contact your server or site administrator to find out if the Guest user is enabled for the site you publish to.  What is Tableau Prep? Can we clean messy data with Tableau? Tableau Prep extends the Tableau platform with robust options for cleaning and structuring data for analysis in Tableau. In the same way that Tableau Desktop provides a hands-on, visual experience for visualizing and analyzing data, Tableau Prep provides a hands-on, visual experience for cleaning and shaping data. If you wish to know more about Tableau Prep or how to clean messy data to create powerful data visualizations and unlock intelligent business insights, read this book Learning Tableau 2019, written by Joshua N. Milligan. ‘Tableau Day’ highlights: Augmented Analytics, Tableau Prep Builder and Conductor, and more! Alteryx vs. Tableau: Choosing the right data analytics tool for your business How to do data storytelling well with Tableau [Video]
Read more
  • 0
  • 0
  • 9795

article-image-artificial-intelligence-data-science-and-big-data-in-2019-what-really-mattered
Richard Gall
16 Dec 2019
6 min read
Save for later

Artificial intelligence, data science, and big data in 2019: what really mattered

Richard Gall
16 Dec 2019
6 min read
The techlash hasn’t died down - it’s just become normalized. Barely a day passes without a new scandal emerging, from questionable surveillance to racist AI algorithms. But it hasn’t all been bad: while negatives get a lot of attention (and so they should - the consequences of tech can be lethal, both societally and literally), there was still plenty to get excited about. And for those working in the data profession - as analysts, scientists, and engineers, there were several important trends that really helped to define where we are now from a purely practical perspective - as well as hinting at where we might go in the future. With just a few weeks left to go of the year (and the decade!), let’s look at some of the key things that defined this year in the field of data science and data engineering. The growth of PyTorch TensorFlow is undoubtedly the most popular deep learning framework. You might even say that its role in popularizing deep learning and artificial intelligence has been understated. But while TensorFlow has held its place for some time, 2019 was the year when things started to change. Look, for example at this Google Trends graph (and yes, I know it’s not in any way scientific): As you can see TensorFlow hit its stride pretty early on. It’s only in the last 12 months or so that PyTorch has been narrowing the gap. One of the reasons for this is the fact that PyTorch 1.0 was released at the end of last year. This has been the foundation that has spurred its growth over the last 12 months, effectively announcing its ‘official’ arrival on the scene. With Facebook (PyTorch’s creator) building on this foundation throughout the year with a few small but important releases. PyTorch 1.3, for example, which was released at the PyTorch Developer Conference in October, included a number of ‘experimental’ new features, including named tensors and PyTorch Mobile. Another reason for PyTorch’s growth this year is that it is finding traction in the research field. This article provides some hard data that proves that PyTorch is starting to grow in this area, citing the tool’s comparable simplicity, API and performance as the reasons that it’s undermining TensorFlow’s utter dominance of the field. Find our PyTorch bundle, and other data bundles, here. Grab 5 titles for just $25. TensorFlow 2.0 While PyTorch has grown significantly in 2019, TensorFlow is nevertheless still holding its place at the top of the deep learning rankings. And TensorFlow 2.0 has undoubtedly cemented its position. With the alpha release getting developers excited since March, the full launch of 2.0 marked an important milestone for the project. The key difference between TensorFlow 2.0 and 1.0 is ultimately accessibility and ease of use. Despite its massive popularity, TensorFlow 1.0 always had a reputation for being a little more difficult to use than many other deep learning tools. The team were clearly aware of this and have done a lot to make life easier for TensorFlow developers. “With tight integration of Keras into TensorFlow, eager execution by default, and Pythonic function execution,” the team write in the release notes, “TensorFlow 2.0 makes the experience of developing applications as familiar as possible for Python developers.” When placed alongside the exciting development of PyTorch, it’s clear that these two tools are going to be defining deep learning in the year - or years - to come. Get up to date with what's new in TensorFlow 2.0 with TensorFlow 2.0 Quick Start Guide. Stream processing with Kafka, Flink, and others Dealing with large quantities of data in real-time is now the cutting-edge of big data. It’s for this reason that this year we’ve started to see stream processing gain headway in the mainstream. Although it’s been an important technique for organizations with data-intensive needs, the use of cloud and hybrid solutions - as well as an overall awareness of the opportunities of real-time data - has become truly mainstream. In turn, this is giving new prominence to a range of stream-processing platforms. Kafka, Spark, and Flink are just three of the most well-known names in this space, but the market is undoubtedly growing. Another key driver here is Nvidia - as one of the leading hardware companies, it deserves a lot of credit for helping to make massive processing power accessible to organizations that wouldn’t have had a chance just a few years ago. With CUDA, Nvidia’s parallel programming paradigm for GPUs, the company is helping all sorts of users to leverage stream processing in different ways. Get started with Apache Kafka with Apache Kafka Quick Start Guide. Data analysis on the cloud Although I've already mentioned how influential TensorFlow was in popularizing deep learning, today public cloud is going even further. It’s making artificial intelligence and analytics accessible to new roles (thinking here about tools like Azure Machine Learning Studio and Amazon SageMaker), as well as making it easier to build and deploy machine learning models in applications and products. In recent weeks, Microsoft has made another step in its bid to eat into AWS’s market share with Azure Synapse. Essentially a next generation Azure SQL Warehouse, Synapse is designed to bridge the gap between data lake and data warehouse - so, offering massive scale, and improving analytical speed. It will be interesting to see how this plays with the wider market. AWS might respond with something similar - but the onus remains on Microsoft to shift mindshare; AWS will want to consolidate its powerful position. Security It would be wrong to suggest that security is a new issue in the world of data engineering and analytics. But in 2019 it’s become almost impossible to think about the two domains as separate from one another. This cuts two different ways: on the one hand the emphasis on securing data and protecting privacy has never been greater. On the other hand, artificial intelligence and machine learning have started to play a critical part in the way that we monitor and identify threats to our systems. To a certain extent this expresses the double bind that data poses: the amount of data at our disposal is a nightmare from a governance and architectural perspective, but it is, at the same time, a way of mitigating that very nightmare. All in all, then, a bit of a vicious cycle, but nevertheless a reminder that however big our data gets, and however much we try to automate, there will always be a need for humans to think creatively and strategically about how we actually go about solving problems. Explore Packt's security bundles now. For more technology eBooks and videos to prepare you for 2020, head to the Packt store.
Read more
  • 0
  • 0
  • 4388