Data | 0 articles | Tech News, Tutorials & Expert Insights

25 Jul 2018

7 min read

Decoding the reasons behind Alphabet’s record high earnings in Q2 2018

Alphabet, Google’s parent company, saw its stock price rise quickly after it announced its Q2 2018 earnings, surprising analysts (in a good way) all over the world. Shares of Alphabet jumped more than 5% in after-hours trading on Monday, hitting a new record high.

[Figure: Alphabet share price. Source: NASDAQ]

It would seem that the EU’s fine of €4.34 billion on Google for breaching EU antitrust laws had little effect on its Q2 earnings. According to Ruth Porat, Google's CFO, Alphabet generated revenue of $32.66 billion during Q2 2018, compared to $26.01 billion during the same quarter last year. Even with the fine included, Alphabet still booked a net income of $3.2 billion, which equals earnings of $4.54 per share. Had the EU decision gone the other way, Alphabet would have had $32.6 billion in revenue and a profit of $8.2 billion.

“We want Google to be the source you think of when you run into a problem.” - Sundar Pichai, Google CEO, in the Q2 2018 Earnings Call

In Monday afternoon’s earnings call, CEO Sundar Pichai focused on three major domains that have helped Alphabet achieve its Q2 earnings. First, he claimed that machine learning and AI were becoming a crucial unifying component across all of Google's products and offerings, helping to cement and consolidate its position in the market. Second, Pichai suggested that investments in computing, video, cloud and advertising platforms have helped push Google into valuable new markets. And third, the company's investment in new businesses and emerging markets was proving to be a real growth driver that should secure Google's future success.

Let us look at the various facets of Google’s growth strategy that have proven to be successful this quarter.

Investing in AI

With the world spinning around the axis of AI, Alphabet is empowering all of its product and service offerings with AI and machine learning.
At Google I/O, its annual developer conference, earlier this year, Google announced new updates to its products that rely on machine learning. For example, the revamped Google News app uses machine learning to provide relevant news stories for users, and improvements to Google Assistant also helped the organization strengthen its position in that particular market. (By the end of 2018, it will be available in more than 30 languages in 80 countries.) This is another smart move by Alphabet in its plan to make information accessible to all while creating more revenue-generating options for itself and expanding its partnerships to new vendors and enterprise clients.

Google Translate also saw a huge bump in volume, especially during the World Cup, as fans from all over the world traveled to Russia to witness the football gala. Another smart decision was adding updates to Google Maps, which has achieved 50% year-on-year growth in Indonesia, India, and Nigeria, three very big and expanding markets.

Defending its Android ecosystem and business model

The first Android phone arrived in 2008. The project was built on the simple idea of a mobile platform that was free and open to everyone. Today, there are more than 24,000 Android-powered devices from over 1,400 phone manufacturers. Google’s decision to build a business model that encourages this open ecosystem to thrive has been a clever strategy. It not only generates significant revenue for the company but also brings a world of developers and businesses into its ecosystem. It's vendor lock-in with a friendly face.

Of course, with the EU watching closely, Google has to be careful to follow regulation. Failure to comply could mean the company would face penalty payments of up to 5% of Alphabet's average daily worldwide turnover.
According to Brian Wieser, an analyst at Pivotal Research Group, however, “There do not appear to be any signs that should cause a meaningful slow down anytime soon, as fines from the EU are not likely to hamper Alphabet’s growth rate. Conversely, regulatory changes such as GDPR in Europe (and similar laws implemented elsewhere) could have the effect of reinforcing Alphabet’s growth.”

Forming new partnerships

Google has always been very keen to form new partnerships and strategic alliances with a wide variety of companies and startups. It has been very smart in systematically looking for partners that will complement its strengths and bring the end product to the market. Partnering also provides flexibility; instead of developing new solutions and tools in-house, Google can bring interesting innovations into the Google ecosystem simply thanks to its financial clout.

For example, Google has partnered with many electronics companies to expand the number of devices compatible with Google Assistant. Furthermore, its investment in computing platforms and AI has also helped the organization to generate considerable momentum in its Made by Google hardware business across Pixel, Home, Nest, and Chromecast.

Interestingly, we also saw an acceleration in business adoption of Chromebooks. Chromebooks are the most cost-efficient and secure way for businesses to enable their employees to work in the cloud. The unit sales of managed Chromebooks in Q2 grew by more than 175% year-on-year.

“Advertising on YouTube has always been an incredibly strong and growing source of income for its creators. Now Google is also building new ways for creators to source income such as paid channel memberships, merchandise shelves on YouTube channels, and endorsement opportunities through FameBit,” said Pichai. FameBit is a startup Google acquired in 2016 that uses data analytics to build tools to connect brands with the right creators.
This acquisition proved to be quite successful, as almost half of the creators that used FameBit in 2018 doubled their revenue in the first 3 months.

Google has also made significant strides in developing new shopping and commerce partnerships, such as with leading global retailers like Carrefour, designed to give people the power to shop wherever and however they want. Such collaborations are great for Google as they bring its shopping, ads, and cloud products under one roof.

The success of Google Cloud’s vertical strategy and customer-centric approach was illustrated by key wins including Domino's Pizza, SoundCloud, and PwC moving to GCP this quarter. Target, the US department store chain, is also migrating key areas of its business to GCP. AirAsia has also expanded its relationship with Google for using ML and data analytics. This shows that the cloud business is only going to grow further. Moreover, the fact that Google Cloud Platform caters to clients from very different industries and domains signals a robust way to expand its cloud empire.

Supporting future customers

Google is not just thinking about its current customer base but also working on specialized products to support the next wave of people who are coming online for the first time, enabled by the rise in accessibility of mobile devices. It has established high-speed public WiFi in 400 train stations in India in collaboration with Indian Railways and proposed the system in Indonesia and Mexico as well. It has also announced a Google AI research center in Ghana, Africa, to spur AI innovation with researchers and engineers from Africa, and has expanded the Google IT Support Professional Certificate program to more than 25 community colleges in the US.

This strong showing by Alphabet, even in the midst of the EU antitrust case, was the most talked about news among Wall Street analysts. Most of them rate the stock a buy.
For the next quarter, Google wants to continue fueling its growing cloud business. “We are investing for the long run,” Pichai said. The company also doesn’t plan to dramatically alter its Android strategy and will continue to offer the OS for free. Pichai said, “I’m confident that we will find a way to make sure Android is available at scale to users everywhere.”

A quick look at E.U.’s antitrust case against Google’s Android
Is Google planning to replace Android with Project Fuchsia?
Google Cloud Launches Blockchain Toolkit to help developers build apps easily

Sunith Shetty

10 Aug 2018

11 min read

Time series modeling: What is it, Why it matters and How it's used

A series can be defined as a number of events, objects, or people of a similar or related kind coming one after another; if we add the dimension of time, we get a time series. A time series can be defined as a series of data points in time order. In this article, we will understand what a time series is and why it is essential for forecasting.

This article is an excerpt from a book written by Harish Gulati titled SAS for Finance.

The importance of time series

What importance, if any, does time series have and how will it be relevant in the future? These are just a couple of fundamental questions that any user should find answers to before delving further into the subject. Let's try to answer this by posing a question. Have you heard the terms big data, artificial intelligence (AI), and machine learning (ML)? These three terms make learning time series analysis relevant.

Big data is primarily about a large amount of data that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interaction. AI is a kind of technology that is being developed by data scientists, computational experts, and others to enable processes to become more intelligent, while ML is an enabler that is helping to implement AI. All three of these terms are interlinked with the data they use, and a lot of this data is time series in nature. This could be financial transaction data, the behavior pattern of individuals during various parts of the day, or data related to life events that we might experience. An effective mechanism that enables us to capture the data, store it, analyze it, and then build algorithms to predict transactions, behavior (and life events, in this instance) will depend on how big data is utilized and how AI and ML are leveraged.

A common perception in the industry is that time series data is used for forecasting only.
In practice, time series data is used for:

- Pattern recognition
- Forecasting
- Benchmarking
- Evaluating the influence of a single factor on the time series
- Quality control

For example, a retailer may identify a pattern in clothing sales every time it gets a celebrity endorsement, or an analyst may decide to use car sales volume data from 2012 to 2017 to set a selling benchmark in units. An analyst might also build a model to quantify the effect of Lehman's crash at the height of the 2008 financial crisis in pushing up the price of gold. Variance in the success of treatments across time periods can also be used to highlight a problem, the tracking of which may enable a hospital to take remedial measures. These are just some of the examples that showcase how time series analysis isn't limited to just forecasting.

In this chapter, we will review how the financial industry and others use forecasting, discuss what a good and a bad forecast is, and hope to understand the characteristics of time series data and its associated problems.

Forecasting across industries

Since one of the primary uses of time series data is forecasting, it's wise that we learn about some of its fundamental properties. To understand what the industry means by forecasting and the steps involved, let's visit a common misconception about the financial industry: only lending activities require forecasting. We need forecasting in order to grant personal loans, mortgages, overdrafts, or simply assess someone's eligibility for a credit card, as the industry uses forecasting to assess a borrower's affordability and their willingness to repay the debt. Even deposit products such as savings accounts, fixed-term savings, and bonds are priced based on some forecasts. How we forecast, and the rationale for that methodology, differs between borrowing and lending cases, however.
All of these areas are related to time series, as we inevitably end up using time series data as part of the overall analysis that drives financial decisions. Let's understand the forecasts involved here a bit better.

When we are assessing an individual's lending needs and limits, we are forecasting for a single person yet comparing the individual to a pool of good and bad customers who have been offered similar products. We are also assessing the individual's financial circumstances and behavior through industry-available scoring models or by assessing their past behavior, with the financial provider assessing the lending criteria.

In the case of deposit products, as long as the customer is eligible to transact (can open an account and has passed know your customer (KYC), anti-money laundering (AML), and other checks), financial institutions don't perform forecasting at an individual level. However, the behavior of a particular customer is primarily driven by the interest rate offered by the financial institution. The interest rate, in turn, is driven by the forecasts the financial institution has done to assess its overall treasury position.

The treasury is the department that manages the bank's money centrally and is responsible for ensuring that all departments are funded; that funding is generated through lending and by attracting deposits at a lower rate than the bank lends at. The treasury forecasts its requirements for lending and deposits, while various teams within the treasury adhere to those limits. Therefore, a pricing manager for a deposit product will price the product in such a way that it attracts enough deposits to meet the forecasted targets shared by the treasury; the pricing manager also has to ensure that those targets aren't overshot by a significant margin, as the treasury only expects to manage a forecasted target.

In both lending and deposit decisions, financial institutions do tend to use forecasting.
A lot of these forecasts are interlinked, as we saw in the example of the treasury's expectations and the subsequent pricing decision for a deposit product. To decide on its future lending and borrowing positions, the treasury must have used time series data to determine the potential business appetite for lending and borrowing in the market, and would have assessed that against the current cash flow situation within the relevant teams and institutions.

Characteristics of time series data

Any time series analysis has to take into account the following factors:

- Seasonality
- Trend
- Outliers and rare events
- Disruptions and step changes

Seasonality

Seasonality is a phenomenon that occurs each calendar year. The same behavior can be observed each year. A good forecasting model will be able to incorporate the effect of seasonality in its forecasts. Christmas is a great example of seasonality, where retailers have come to expect higher sales over the festive period. Seasonality can extend into months but is usually only observed over days or weeks. When looking at time series where the periodicity is hours, you may find a seasonality effect for certain hours of the day.

Some of the reasons for seasonality include holidays, climate, and changes in social habits. For example, travel companies usually run far fewer services on Christmas Day, citing a lack of demand. During most holidays people love to travel, but this lack of demand on Christmas Day could be attributed to social habits, where people tend to stay at home or have already traveled. Social habit becomes a driving factor in the seasonality of journeys undertaken on Christmas Day.

It's easier for the forecaster when a particular seasonal event occurs on a fixed calendar date each year; the issue comes when some popular holidays depend on lunar movements, such as Easter, Diwali, and Eid. These holidays may occur in different weeks or months over the years, which will shift the seasonality effect.
Also, if some holidays fall closer to other holiday periods, individuals may take extended holidays, and travel sales may increase more than expected in such years. The coffee shop near the office may also experience lower sales for a longer period. Changes in the weather can also impact seasonality; for example, a longer, warmer summer may be welcome in the UK, but this would impact retail sales in the autumn as most shoppers wouldn't need to buy a new wardrobe. In hotter countries, sales of air-conditioners would increase substantially compared to the summer months' usual seasonality. Forecasters could offset this unpredictability in seasonality by building in a weather forecast variable. We will explore similar challenges in the chapters ahead.

Seasonality shouldn't be confused with a cyclic effect. A cyclic effect is observed over a longer period of generally two years or more. The property sector is often associated with having a cyclic effect, where it has long periods of growth or slowdown before the cycle continues.

Trend

A trend is merely a long-term direction of observed behavior that is found by plotting data against a time component. A trend may indicate an increase or decrease in behavior. Trends may not even be linear, but a broad movement can be identified by analyzing plotted data.

Outliers and rare events

Outliers and rare events are terms that are often used interchangeably by businesses. They can have a big impact on data, and some sort of outlier treatment is usually applied to data before it is used for modeling. It is almost impossible to predict an outlier or rare event, but they do affect a trend. An example of an outlier could be a customer walking into a branch to deposit an amount that is 100 times the daily average of that branch. In this case, the forecaster wouldn't expect that trend to continue.

Disruptions

Disruptions and step changes are becoming more common in time series data.
One reason for this is the abundance of available data and the growing ability to store and analyze it. Disruptions could include instances when a business hasn't been able to trade as normal. Flooding at the local pub may lead to reduced sales for a few days, for example. While analyzing daily sales across a pub chain, an analyst may have to make note of a disruptive event and its impact on the chain's revenue.

Step changes are also more common now due to technological shifts, mergers and acquisitions, and business process re-engineering. When two companies announce a merger, they often try to sync their data. They might have been selling x and y quantities individually, but after the merger will expect to sell x + y + c (where c is the positive or negative effect of the merger). Over time, when someone plots sales data in this case, they will probably spot a step change in sales that happened around the time of the merger.

[Figure: three charts of online travel bookings: trend, step change and disruptions, and seasonality]

In the trend graph, we can see that online travel bookings are increasing. In the step change and disruptions chart, we can see that Q1 of 2012 saw a substantive increase in bookings, whereas Q1 of 2014 saw a substantive dip. The increase was due to the merger of two companies that took place in Q1 of 2012. The decrease in Q1 of 2014 was attributed to prolonged snow storms in Europe and the ash cloud disruption from volcanic activity over Iceland. While online bookings kept increasing after the step change, the disruption caused by the snow storm and ash cloud only had an effect on sales in Q1 of 2014. In this case, the modeler will have to treat the merger and the disruption differently while using them in the forecast, as the disruption could be regarded as an outlier and treated accordingly. Also note that the seasonality chart shows that Q4 of each year sees almost a 20% increase in travel bookings, and this pattern continues each calendar year.
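The difference between a step change and a disruption can be made concrete with a small synthetic series. The following is only an illustrative sketch in Python (the book itself works in SAS): every number, and the names `build_series` and `yoy_change`, are assumptions made up for this example. The series combines a linear trend, a 20% Q4 uplift, a permanent step in Q1 2012, and a one-quarter dip in Q1 2014.

```python
# Synthetic quarterly bookings, 2010 Q1 to 2015 Q4 (all figures hypothetical).
YEARS = range(2010, 2016)


def build_series(base=100.0, growth=2.0, q4_uplift=0.20,
                 step=(2012, 1, 30.0), dip=(2014, 1, -25.0)):
    """Return {(year, quarter): value} built from four components."""
    values = {}
    t = 0
    for year in YEARS:
        for quarter in (1, 2, 3, 4):
            value = base + growth * t          # trend: steady growth
            if (year, quarter) >= step[:2]:    # step change: persists afterwards
                value += step[2]
            if (year, quarter) == dip[:2]:     # disruption: one quarter only
                value += dip[2]
            if quarter == 4:                   # seasonality: repeats every year
                value *= 1 + q4_uplift
            values[(year, quarter)] = value
            t += 1
    return values


def yoy_change(values, year, quarter):
    """Year-on-year change for the same quarter, which cancels seasonality."""
    return values[(year, quarter)] - values[(year - 1, quarter)]


values = build_series()
for year in range(2011, 2016):
    print(year, yoy_change(values, year, 1))
```

In the Q1 year-on-year changes, the step change shows up once (2012) and then the series carries on at the new level, while the disruption shows up as a dip (2014) followed by a rebound of similar size (2015). That asymmetry is one simple reason a modeler might keep the step in the model but treat the disruption as an outlier.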
In this article, we defined time series and learned why it is important for forecasting. We also looked at the characteristics of time series data. To learn more about how to leverage the analytical power of SAS to perform financial analysis efficiently, you can check out the book SAS for Finance.

Read more

Getting to know SQL Server options for disaster recovery
Implementing a simple Time Series Data Analysis in R
Training RNNs for Time Series Forecasting

Packt

06 Oct 2015

26 min read

Hand Gesture Recognition Using a Kinect Depth Sensor

Sunith Shetty

29 Dec 2017

5 min read

Getting to know and manipulate Tensors in TensorFlow

Note: This article is a book excerpt written by Rodolfo Bonnin, titled Building Machine Learning Projects with TensorFlow. In this book, you will learn to build powerful machine learning projects to tackle complex data for gaining valuable insights.

Today, you will learn about tensors, their properties, and how they are used to represent data.

What are tensors?

TensorFlow bases its data management on tensors. Tensors are concepts from the field of mathematics, and were developed as a generalization of the linear algebra concepts of vectors and matrices. Talking specifically about TensorFlow, a tensor is just a typed, multidimensional array, with additional operations, modeled in the tensor object.

Tensor properties - ranks, shapes, and types

TensorFlow uses the tensor data structure to represent all data. Any tensor has a static type and dynamic dimensions, so you can change a tensor's internal organization in real time. Another property of tensors is that only objects of the tensor type can be passed between nodes in the computation graph. Let's now see what the properties of tensors are (from now on, every time we use the word tensor, we'll be referring to TensorFlow's tensor objects).

Tensor rank

Tensor rank represents the dimensional aspect of a tensor, but is not the same as matrix rank. It represents the number of dimensions in which the tensor lives, and is not a precise measure of the extension of the tensor in rows/columns or spatial equivalents. A rank one tensor is the equivalent of a vector, and a rank two tensor is a matrix. For a rank two tensor you can access any element with the syntax t[i, j]. For a rank three tensor you would need to address an element with t[i, j, k], and so on.
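As a quick way to build intuition for rank before opening a TensorFlow session, the sketch below uses NumPy, where the `ndim` attribute plays the same role as tensor rank. Using NumPy here is an assumption of this illustration, not part of the book's code; it also shows why tensor rank should not be confused with matrix rank.

```python
import numpy as np

scalar = np.array(1000)                                  # rank 0: no index needed
vector = np.array([2, 8, 3])                             # rank 1: v[i]
matrix = np.array([[4, 2, 1], [5, 3, 2], [5, 5, 6]])     # rank 2: m[i, j]
cube = np.array([[[1, 2], [2, 3]], [[3, 4], [5, 6]]])    # rank 3: t[i, j, k]

# ndim counts how many indices are needed to address a single element.
ranks = [a.ndim for a in (scalar, vector, matrix, cube)]
print(ranks)            # [0, 1, 2, 3]
print(cube[1, 1, 0])    # 5

# Tensor rank is not matrix rank: a 3x3 matrix of ones needs two indices
# (tensor rank 2) but has only one linearly independent row (matrix rank 1).
ones = np.ones((3, 3))
print(ones.ndim, np.linalg.matrix_rank(ones))   # 2 1
```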
In the following example, we will create a tensor, and access one of its components:

    import tensorflow as tf

    sess = tf.Session()
    tens1 = tf.constant([[[1, 2], [2, 3]], [[3, 4], [5, 6]]])
    print(sess.run(tens1)[1, 1, 0])

Output:

    5

This is a tensor of rank three, because in each element of the containing matrix, there is a vector element:

    Rank  Math entity  Code definition example
    0     Scalar       scalar = 1000
    1     Vector       vector = [2, 8, 3]
    2     Matrix       matrix = [[4, 2, 1], [5, 3, 2], [5, 5, 6]]
    3     3-tensor     tensor = [[[4], [3], [2]], [[6], [100], [4]], [[5], [1], [4]]]
    n     n-tensor     …

Tensor shape

The TensorFlow documentation uses three notational conventions to describe tensor dimensionality: rank, shape, and dimension number. The following table shows how these relate to one another:

    Rank  Shape              Dimension number  Example
    0     []                 0                 4
    1     [D0]               1                 [2]
    2     [D0, D1]           2                 [6, 2]
    3     [D0, D1, D2]       3                 [7, 3, 2]
    n     [D0, D1, … Dn-1]   n-D               A tensor with shape [D0, D1, … Dn-1]

In the following example, we create a sample rank three tensor, and print the shape of it:

Tensor data types

In addition to dimensionality, tensors have a fixed data type. You can assign any one of the following data types to a tensor:

    Data type  Python type  Description
    DT_FLOAT   tf.float32   32-bit floating point.
    DT_DOUBLE  tf.float64   64-bit floating point.
    DT_INT8    tf.int8      8-bit signed integer.
    DT_INT16   tf.int16     16-bit signed integer.
    DT_INT32   tf.int32     32-bit signed integer.
    DT_INT64   tf.int64     64-bit signed integer.
    DT_UINT8   tf.uint8     8-bit unsigned integer.
    DT_STRING  tf.string    Variable-length byte arrays. Each element of a tensor is a byte array.
    DT_BOOL    tf.bool      Boolean.

Creating new tensors

We can either create our own tensors, or derive them from the well-known numpy library.
In the following example, we create one tensor from a numpy array and another from a Python list:

    import tensorflow as tf
    import numpy as np

    x = tf.constant(np.random.rand(32).astype(np.float32))
    y = tf.constant([1, 2, 3])
    x
    y

Output:

    <tf.Tensor 'Const_2:0' shape=(3,) dtype=int32>

From numpy to tensors and vice versa

TensorFlow is interoperable with numpy, and normally the eval() function calls will return a numpy object, ready to be worked on with the standard numerical tools.

We must note that the tensor object is a symbolic handle for the result of an operation, so it doesn't hold the resulting values of the structures it contains. For this reason, we must run the eval() method to get the actual values, which is equivalent to Session.run(tensor_to_eval).

In this example, we build a numpy array and convert it to a tensor:

    import tensorflow as tf  # we import tensorflow
    import numpy as np       # we import numpy

    sess = tf.Session()      # start a new Session object
    x_data = np.array([[1., 2., 3.], [3., 2., 6.]])  # 2x3 matrix
    x = tf.convert_to_tensor(x_data, dtype=tf.float32)
    print(x)

Output:

    Tensor("Const_3:0", shape=(2, 3), dtype=float32)

Useful method: tf.convert_to_tensor. This function converts Python objects of various types to tensor objects. It accepts tensor objects, numpy arrays, Python lists, and Python scalars.

Getting things done - interacting with TensorFlow

As with the majority of Python's modules, TensorFlow allows the use of Python's interactive console:

[Figure: creating a constant tensor in the Python interactive console]

In the previous figure, we call the Python interpreter (by simply calling python) and create a tensor of constant type. Then we invoke it again, and the Python interpreter shows the shape and type of the tensor.

We can also use the IPython interpreter, which will allow us to employ a format more compatible with notebook-style tools, such as Jupyter:

[Figure: the same session in the IPython interpreter]

When talking about running TensorFlow Sessions in an interactive manner, it's better to employ the InteractiveSession object.
Unlike the normal tf.Session class, the tf.InteractiveSession class installs itself as the default session on construction. So when you try to eval a tensor, or run an operation, it will not be necessary to pass a Session object to indicate which session it refers to. To summarize, we have learned about tensors, the key data structure in TensorFlow and simple operations we can apply to the data. To know more about different machine learning techniques and algorithms that can be used to build efficient and powerful projects, you can refer to the book Building Machine Learning Projects with TensorFlow.

Richard Gall

25 Sep 2018

5 min read

What is a convolutional neural network (CNN)? [Video]

Amarabha Banerjee

08 Mar 2018

13 min read

4 common challenges in Web Scraping and how to handle them

Note: Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step-by-step tutorials on how to leverage Python programming techniques for ethical web scraping.

In this article, we will explore the primary challenges of web scraping and how to handle them.

Developing a reliable scraper is never easy; there are so many what-ifs that we need to take into account. What if the website goes down? What if the response returns unexpected data? What if your IP is throttled or blocked? What if authentication is required? While we can never predict and cover all what-ifs, we will discuss some common traps, challenges, and workarounds.

Note that several of the recipes require access to a website that I have provided as a Docker container. They require more logic than the simple, static site we used in earlier chapters. Therefore, you will need to pull and run a Docker container using the following Docker commands:

    docker pull mheydt/pywebscrapecookbook
    docker run -p 5001:5001 mheydt/pywebscrapecookbook

Retrying failed page downloads

Failed page requests can be easily handled by Scrapy using retry middleware. When installed, Scrapy will attempt retries when receiving the following HTTP error codes:

    [500, 502, 503, 504, 408]

The process can be further configured using the following parameters:

- RETRY_ENABLED (True/False, default is True)
- RETRY_TIMES (number of times to retry on any errors, default is 2)
- RETRY_HTTP_CODES (a list of HTTP error codes which should be retried, default is [500, 502, 503, 504, 408])

How to do it

The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries.
The script file contains the following configuration for Scrapy:

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess({
        'LOG_LEVEL': 'DEBUG',
        'DOWNLOADER_MIDDLEWARES': {
            "scrapy.downloadermiddlewares.retry.RetryMiddleware": 500
        },
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 3
    })
    process.crawl(Spider)  # Spider is the spider class defined in the script
    process.start()

How it works

Scrapy will pick up the configuration for retries as specified when the spider is run. When encountering errors, Scrapy will retry up to three times before giving up.

Supporting page redirects

Page redirects in Scrapy are handled using redirect middleware, which is enabled by default. The process can be further configured using the following parameters:

- REDIRECT_ENABLED (True/False, default is True)
- REDIRECT_MAX_TIMES (the maximum number of redirections to follow for any single request, default is 20)

How to do it

The script in 06/02_scrapy_redirects.py demonstrates how to configure Scrapy to handle redirects. This configures a maximum of two redirects for any page. Running the script reads the NASA sitemap and crawls that content. This contains a large number of redirects, many of which are redirects from HTTP to HTTPS versions of URLs. There will be a lot of output, but here are a few lines demonstrating the output:

    Parsing: <200 https://www.nasa.gov/content/earth-expeditions-above/>
    ['http://www.nasa.gov/content/earth-expeditions-above', 'https://www.nasa.gov/content/earth-expeditions-above']

This particular URL was processed after one redirection, from an HTTP to an HTTPS version of the URL. The list defines all of the URLs that were involved in the redirection.

You will also be able to see where redirection exceeded the specified level (2) in the output pages.
The following is one example:

    2017-10-22 17:55:00 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://www.nasa.gov/topics/journeytomars/news/index.html>: max redirections reached

How it works

The spider is defined as the following:

    import scrapy

    class Spider(scrapy.spiders.SitemapSpider):
        name = 'spider'
        sitemap_urls = ['https://www.nasa.gov/sitemap.xml']

        def parse(self, response):
            print("Parsing: ", response)
            # URLs that were redirected on the way to this response, if any
            print(response.request.meta.get('redirect_urls'))

This is identical to our previous NASA sitemap-based crawler, with the addition of one line printing the redirect_urls. In any call to parse, this metadata will contain all redirects that occurred to get to this page.

The crawling process is configured with the following code:

    process = CrawlerProcess({
        'LOG_LEVEL': 'DEBUG',
        'DOWNLOADER_MIDDLEWARES': {
            "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 500
        },
        'REDIRECT_ENABLED': True,
        'REDIRECT_MAX_TIMES': 2
    })

Redirect is enabled by default, but this sets the maximum number of redirects to 2 instead of the default of 20.

Waiting for content to be available in Selenium

A common problem with dynamic web pages is that even after the whole page has loaded, and hence the get() method in Selenium has returned, the content we need may still be missing because Ajax requests from the page are still pending completion. An example of this is needing to click a button that is not enabled until all data has been loaded asynchronously to the page after loading.

Take the following page as an example: http://the-internet.herokuapp.com/dynamic_loading/2. This page finishes loading very quickly and presents us with a Start button. When pressing the button, we are presented with a progress bar for five seconds. When this is completed, we are presented with Hello World!

Now suppose we want to scrape this page to get the content that is exposed only after the button is pressed and after the wait?
How do we do this?

How to do it

We can do this using Selenium. We will use two features of Selenium. The first is the ability to click on page elements. The second is the ability to wait until an element with a specific ID is available on the page.

First, we get the button and click it. The button's HTML is the following:

    <div id='start'>
        <button>Start</button>
    </div>

When the button is pressed and the load completes, the following HTML is added to the document:

    <div id='finish'>
        <h4>Hello World!</h4>
    </div>

We will use the Selenium driver to find the Start button, click it, and then wait until a div with an ID of 'finish' is available. Then we get that element and return the text in the enclosed <h4> tag. You can try this by running 06/03_press_and_wait.py. Its output will be the following:

    clicked
    Hello World!

Now let's see how it worked.

How it works

Let us break down the explanation:

We start by importing the required items from Selenium:

    from selenium import webdriver
    from selenium.webdriver.support import ui

Now we load the driver and the page:

    driver = webdriver.PhantomJS()
    driver.get("http://the-internet.herokuapp.com/dynamic_loading/2")

With the page loaded, we can retrieve the button:

    button = driver.find_element_by_xpath("//*/div[@id='start']/button")

And then we can click the button:

    button.click()
    print("clicked")

Next we create a WebDriverWait object:

    wait = ui.WebDriverWait(driver, 10)

With this object, we can ask Selenium to wait for certain events. This also sets a maximum wait of 10 seconds.
Now, using this, we can wait until we meet a criterion: that an element is identifiable using the following XPath:

    wait.until(lambda driver: driver.find_element_by_xpath("//*/div[@id='finish']"))

When this completes, we can retrieve the h4 element and get its enclosed text:

    finish_element = driver.find_element_by_xpath("//*/div[@id='finish']/h4")
    print(finish_element.text)

Limiting crawling to a single domain

We can tell Scrapy to limit the crawl to only pages within a specified set of domains. This is an important task, as links can point to anywhere on the web, and we often want to control where crawls end up going. Scrapy makes this very easy to do. All that needs to be done is to set the allowed_domains field of your scraper class.

How to do it

The code for this example is 06/04_allowed_domains.py. You can run the script with your Python interpreter. It will execute and generate a ton of output, but if you keep an eye on it, you will see that it only processes pages on nasa.gov.

How it works

The code is the same as previous NASA site crawlers except that we include allowed_domains=['nasa.gov']:

    class Spider(scrapy.spiders.SitemapSpider):
        name = 'spider'
        sitemap_urls = ['https://www.nasa.gov/sitemap.xml']
        allowed_domains = ['nasa.gov']

        def parse(self, response):
            print("Parsing: ", response)

The NASA site is fairly consistent about staying within its root domain, but there are occasional links to other sites, such as content on boeing.com. This code will prevent moving to those external sites.

Processing infinitely scrolling pages

Many websites have replaced "previous/next" pagination buttons with an infinite scrolling mechanism. These websites use this technique to load more data when the user has reached the bottom of the page. Because of this, strategies for crawling by following the "next page" link fall apart.
While this would seem to be a case for using browser automation to simulate the scrolling, it's actually quite easy to figure out the web page's Ajax requests and use those for crawling instead of the actual page. Let's look at spidyquotes.herokuapp.com/scroll as an example.

Getting ready

Open http://spidyquotes.herokuapp.com/scroll in your browser. This page will load additional content when you scroll to the bottom of the page:

Screenshot of the quotes to scrape

Once the page is open, go into your developer tools and select the network panel. Then, scroll to the bottom of the page. You will see new content in the network panel. When we click on one of the links, we can see the following JSON:

    {
        "has_next": true,
        "page": 2,
        "quotes": [{
            "author": {
                "goodreads_link": "/author/show/82952.Marilyn_Monroe",
                "name": "Marilyn Monroe",
                "slug": "Marilyn-Monroe"
            },
            "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"],
            "text": "\u201cThis life is what you make it...."
        }, {
            "author": {
                "goodreads_link": "/author/show/1077326.J_K_Rowling",
                "name": "J.K. Rowling",
                "slug": "J-K-Rowling"
            },
            "tags": ["courage", "friends"],
            "text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d"
        },

This is great because all we need to do is continually generate requests to /api/quotes?page=x, increasing x for as long as the reply reports has_next as true. Once there are no more pages, the flag is no longer set and the crawl stops.

How to do it

The 06/05_scrapy_continuous.py file contains a Scrapy spider, which crawls this set of pages. Run it with your Python interpreter and you will see output similar to the following (the following is multiple excerpts from the output):

<200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
{'text': "“This life is what you make it.
No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']}
2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'understand']}

When this gets to page 10 it will stop, as it will see that there is no next page flag set in the content.

How it works

Let's walk through the spider to see how this works.
The spider starts with the following definition of the start URL:

    class Spider(scrapy.Spider):
        name = 'spidyquotes'
        quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes'
        start_urls = [quotes_base_url]
        download_delay = 1.5

The parse method then prints the response and also parses the JSON into the data variable:

        def parse(self, response):
            print(response)
            data = json.loads(response.body)

Then it loops through all the items in the quotes element of the JSON object. For each item, it yields a new Scrapy item back to the Scrapy engine:

            for item in data.get('quotes', []):
                yield {
                    'text': item.get('text'),
                    'author': item.get('author', {}).get('name'),
                    'tags': item.get('tags'),
                }

It then checks whether the 'has_next' property of the parsed JSON is set, and if so it computes the next page number and yields a new request back to Scrapy to parse the next page:

            if data['has_next']:
                next_page = data['page'] + 1
                yield scrapy.Request(self.quotes_base_url + "?page=%s" % next_page)

There's more...

It is also possible to process infinitely scrolling pages using Selenium. The following code is in 06/06_scrape_continuous_twitter.py:

    from selenium import webdriver
    import time

    driver = webdriver.PhantomJS()
    print("Starting")
    driver.get("https://twitter.com")
    scroll_pause_time = 1.5

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        print(last_height)
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(scroll_pause_time)
        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        print(new_height, last_height)
        if new_height == last_height:
            break
        last_height = new_height

The output would be similar to the following:

    Starting
    4882
    8139 4882
    8139
    11630 8139
    11630
    15055 11630
    15055
    15055 15055
    Process finished with exit code 0

This code starts by loading the page from Twitter.
The call to .get() will return when the page is fully loaded. The scrollHeight is then retrieved, and the program scrolls to that height and waits for a moment for the new content to load. The scrollHeight of the browser is retrieved again, and if it differs from last_height, the loop continues. If it is the same as last_height, no new content has loaded, and you can then go on to retrieve the HTML for the completed page.

We have discussed common challenges faced when web scraping with Python, along with their workarounds. If you liked this post, be sure to check out Web Scraping with Python, which consists of useful recipes to work with Python and perform efficient web scraping.
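The scroll-until-stable loop from the Selenium recipe can be pulled into a reusable function. The sketch below is an illustration, not code from the book: scroll_until_stable, max_rounds, and the FakeDriver stand-in are hypothetical names, and the only driver method relied on is execute_script, which the original script already uses. The max_rounds cap is an added assumption so that a feed which never stops growing cannot loop forever.

```python
import time


def scroll_until_stable(driver, pause=1.5, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the list of distinct heights observed, ending with the last one seen.
    """
    heights = [driver.execute_script("return document.body.scrollHeight")]
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give pending Ajax requests time to add content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == heights[-1]:
            break  # nothing new was loaded, so the page is complete
        heights.append(new_height)
    return heights


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, for trying the loop offline."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def execute_script(self, script):
        if script.startswith("window.scrollTo"):
            # Each scroll "loads" the next chunk until the list runs out.
            self._index = min(self._index + 1, len(self._heights) - 1)
            return None
        return self._heights[self._index]


observed = scroll_until_stable(FakeDriver([4882, 8139, 11630, 15055]), pause=0)
print(observed)
```

With a real driver you would pass the Selenium driver object instead of FakeDriver and read the page source once the function returns.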



Guest Contributor

09 May 2019

10 min read

Why DeepMind AlphaGo Zero is a game changer for AI research


DeepMind, a London-based artificial intelligence (AI) company currently owned by Alphabet, recently made great strides in AI with its AlphaGo program. It all began in October 2015, when the program beat the European Go champion Fan Hui 5-0 in a five-game match. This was the very first time an AI defeated a professional Go player. Earlier, computers were only known to have played Go at the "amateur" level. Then, the company made headlines again in 2016 after its AlphaGo program beat Lee Sedol, a professional Go player (a world champion), with a score of 4-1 in a five-game match. Furthermore, in late 2017, an improved version of the program called AlphaGo Zero defeated AlphaGo 100 games to 0. The best part? AlphaGo Zero's strategies were self-taught, i.e., it was trained without any data from human games. AlphaGo Zero was able to defeat its predecessor after only three days of training and with less processing power than AlphaGo. The original AlphaGo, on the other hand, required months to learn how to play.

All these facts raise the questions: what makes AlphaGo Zero so exceptional? Why is it such a big deal? How does it even work? So, without further ado, let’s dive into the what, why, and how of DeepMind’s AlphaGo Zero.

What is DeepMind AlphaGo Zero?

Simply put, AlphaGo Zero is the strongest Go program in the world (with the exception of AlphaZero). As mentioned before, it monumentally outperforms all previous versions of AlphaGo. Just check out the graph below, which compares the Elo rating of the different versions of AlphaGo.

Source: DeepMind

The Elo rating system is a method for calculating the relative skill levels of players in zero-sum games such as chess and Go. It is named after its creator Arpad Elo, a Hungarian-American physics professor.

Now, all previous versions of AlphaGo were trained using human data. The previous versions learned from and improved upon the moves played by human experts and professional Go players. But AlphaGo Zero didn’t use any human data whatsoever.
Instead, it had to learn completely from playing against itself. According to DeepMind's Professor David Silver, the reason that playing against itself enables it to do so much better than using strong human data is that AlphaGo always has an opponent of just the right level. So it starts off extremely naive, with perfectly random play. And yet at every step of the learning process, it has an opponent (a “sparring partner”) that’s exactly calibrated to its current level of performance. That is, to begin with, these players are terribly weak, but over time they become progressively stronger and stronger.

Why is reinforcement learning such a big deal?

People tend to assume that machine learning is all about big data and massive amounts of computation. But actually, with AlphaGo Zero, AI scientists at DeepMind realized that algorithms matter much more than computing power or data availability. AlphaGo Zero required less computation than previous versions and yet it was able to perform at a much higher level because it used much more principled algorithms than before. It is a system that is trained completely from scratch: starting from random behavior, it learns tabula rasa, from first principles, how to play the game of Go. It is, therefore, no longer constrained by the limits of human knowledge. Note that AlphaGo Zero did not use zero-shot learning, which essentially is the ability of a machine to solve a task despite not having received any training for that task.

How does it work?

AlphaGo Zero is able to achieve all this by employing a novel form of reinforcement learning, in which AlphaGo Zero becomes its own teacher. As explained previously, the system starts off with a single neural network that knows absolutely nothing about the game of Go. By combining this neural network with a powerful search algorithm, it then plays games against itself.
As it plays more and more games, the neural network is updated and tuned to predict moves, and even the eventual winner of the games. This revised neural network is then recombined with the search algorithm to generate a new, stronger version of AlphaGo Zero, and the process repeats. With each iteration, the performance of the system improves and the quality of the self-play games advances, leading to increasingly accurate neural networks and ever-more powerful versions of AlphaGo Zero.

Now, let’s dive into some of the technical details that make this version of AlphaGo so much better than all its forerunners. AlphaGo Zero's neural network was trained using TensorFlow, with 64 GPU workers and 19 CPU parameter servers. Only four Tensor Processing Units (TPUs) were used for inference. And of course, the neural network initially knew nothing about Go beyond the rules.

Both AlphaGo and AlphaGo Zero took the same general approach to playing Go. Both evaluated the Go board and chose moves using a combination of two methods:

Conducting a “lookahead” search: This means looking ahead several moves by simulating games, and hence seeing which current move is most likely to lead to a “good” position in the future.
Assessing positions based on an “intuition” of whether a position is “good” or “bad” and is likely to result in a win or a loss.

Go is a truly intricate game, which means computers can’t merely search all possible moves using a brute force approach to discover the best one.

Method 1: Lookahead

Before AlphaGo, all the finest Go programs tackled this issue by using “Monte Carlo Tree Search” or MCTS. This process involves initially exploring numerous possible moves on the board and then focusing this search over time as certain moves are found to be more likely to result in wins than others.
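The "explore many moves, then focus on the promising ones" behaviour is usually driven by a selection rule applied at each node of the search tree. The sketch below uses UCB1, a common textbook rule, purely for illustration. AlphaGo's published selection rule is a different variant that also uses the network's move probabilities, and the names ucb1_select, stats, and the exploration constant c are assumptions made here, not DeepMind's code.

```python
import math


def ucb1_select(stats, c=math.sqrt(2)):
    """Pick the move with the highest UCB1 score.

    stats maps move -> (wins, visits). Unvisited moves are tried first.
    """
    total_visits = sum(visits for _, visits in stats.values())
    best_move, best_score = None, -math.inf
    for move, (wins, visits) in stats.items():
        if visits == 0:
            return move  # always explore an untried move first
        exploit = wins / visits  # observed win rate so far
        explore = c * math.sqrt(math.log(total_visits) / visits)  # bonus for rarely tried moves
        score = exploit + explore
        if score > best_score:
            best_move, best_score = move, score
    return best_move


# Early in the search one move is still untried; later, visit counts are large.
early = {"A": (1, 2), "B": (0, 1), "C": (0, 0)}
late = {"A": (600, 1000), "B": (100, 400), "C": (20, 100)}
print(ucb1_select(early))
print(ucb1_select(late))
```

With these made-up statistics the first call picks the untried move C, while the second picks A: once every move has been sampled often, the exploration bonus shrinks and the search concentrates on the move with the best win rate.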
Source: LOC

Both AlphaGo and AlphaGo Zero apply a fairly elementary version of MCTS for their “lookahead”, using it to manage the tradeoff between exploring new sequences of moves and exploring already-visited sequences more deeply. Although MCTS has been at the heart of all effective Go programs preceding AlphaGo, it was DeepMind’s smart coalescence of this method with a neural network-based “intuition” that enabled it to attain superhuman performance.

Method 2: Intuition

DeepMind’s pivotal innovation with AlphaGo was to utilize deep neural networks to identify the state of the game and then use this knowledge to effectively guide the search of the MCTS. In particular, they trained networks that could record:

The current board position
Which player was playing
The sequence of recent moves (in order to rule out certain moves as “illegal”)

With this data, the neural networks could propose:

Which move should be played
Whether the current player is likely to win or not

So how did DeepMind train neural networks to do this? Well, AlphaGo and AlphaGo Zero used rather different approaches in this case. AlphaGo had two separately trained neural networks: a Policy Network and a Value Network.

Source: AlphaGo’s Nature Paper

DeepMind then fused these two neural networks with MCTS (that is, the program’s “intuition” with its brute force “lookahead” search) in an ingenious way. It used the network trained to predict moves to guide which branches of the game tree to search, and the network trained to predict whether a position was “winning” to assess the positions it encountered during its search. This let AlphaGo intelligently search upcoming moves and eventually beat the world champion Lee Sedol.

AlphaGo Zero, however, took this principle to the next level. Its neural network’s “intuition” was trained entirely differently from that of AlphaGo.
More specifically:

The neural network was trained to play moves that exhibited the improved evaluations from performing the “lookahead” search
The neural network was tweaked so that it was more likely to play moves like those that led to wins and less likely to play moves similar to those that led to losses during the self-play games

Much was made of the fact that no games between humans were used to train AlphaGo Zero. Thus, for a given state of a Go agent, it can constantly be made smarter by performing MCTS-based lookahead and using the results of that lookahead to upgrade the agent. This is how AlphaGo Zero was able to perpetually improve, from when it was an “amateur” all the way up to when it was better than the best human players.

Moreover, AlphaGo Zero’s neural network architecture can be referred to as a “two-headed” architecture.

Source: Hacker Noon

Its first 20 layers were “blocks” of a type typically seen in modern neural net architectures. These layers were followed by two “heads”:

One head that took the output of the first 20 layers and presented probabilities of the Go agent making certain moves
Another head that took the output of the first 20 layers and generated a probability of the current player winning

What’s more, AlphaGo Zero used a more “state of the art” neural network architecture than AlphaGo. In particular, it used a “residual” neural network architecture rather than a plain “convolutional” architecture. Deep residual learning was pioneered by Microsoft Research in late 2015, right around the time work on the first version of AlphaGo would have been concluded. So, it is quite reasonable that DeepMind did not use residual networks in the initial AlphaGo program.
Notably, each of these two changes (switching from a convolutional to the more advanced residual architecture, and using a single “two-headed” network instead of separate neural networks) would have accounted for roughly half of the increase in playing strength that was realized when both were combined.

Source: AlphaGo’s Nature Paper

Wrapping it up

According to DeepMind: “After just three days of self-play training, AlphaGo Zero emphatically defeated the previously published version of AlphaGo - which had itself defeated 18-time world champion Lee Sedol - by 100 games to 0. After 40 days of self-training, AlphaGo Zero became even stronger, outperforming the version of AlphaGo known as “Master”, which has defeated the world's best players and world number one Ke Jie. Over the course of millions of AlphaGo vs AlphaGo games, the system progressively learned the game of Go from scratch, accumulating thousands of years of human knowledge during a period of just a few days. AlphaGo Zero also discovered new knowledge, developing unconventional strategies and creative new moves that echoed and surpassed the novel techniques it played in the games against Lee Sedol and Ke Jie.”

Further, the founder and CEO of DeepMind, Dr. Demis Hassabis, believes AlphaGo's algorithms are likely to be of most benefit in areas that need an intelligent search through an immense space of possibilities.

Author Bio

Gaurav is a Senior SEO and Content Marketing Analyst at The 20 Media, a Content Marketing agency that specializes in data-driven SEO. He has more than seven years of experience in Digital Marketing and, along with that, loves to read and write about AI, Machine Learning, Data Science, and other emerging technologies. In his spare time, he enjoys watching movies and listening to music. Connect with him on Twitter and LinkedIn.
DeepMind researchers provide theoretical analysis on recommender system, ‘echo chamber’ and ‘filter bubble effect’ What if AIs could collaborate using human-like values? DeepMind researchers propose a Hanabi platform. Google DeepMind’s AI AlphaStar beats StarCraft II pros TLO and MaNa; wins 10-1 against the gamers



Amey Varangaonkar

27 Apr 2018

5 min read

Top 10 MySQL 8 performance benchmarking aspects to know


The following excerpt is taken from the book MySQL 8 Administrator’s Guide, co-authored by Chintan Mehta, Ankit Bhavsar, Hetal Oza and Subhash Shah. This book presents an in-depth view of the newly released features of MySQL 8 and how you can leverage them to administer a high-performance MySQL solution.

Following the best practices for configuring MySQL helps us design and manage an efficient database; they are the cherry on top, without which the setup might seem a bit incomplete. In addition to configuration, benchmarking helps us validate the database system, find its bottlenecks, and address them. In this article, we look at specific areas that will help us understand the best practices for configuration and performance benchmarking.

1. Resource utilization

IO activity, CPU, and memory usage are metrics that you should not miss. They help us know how the system is performing during benchmarking and at the time of scaling. They also help us derive the impact per transaction.

2. Stretching your benchmarking timelines

We may often like to have a quick glance at performance metrics; however, ensuring that MySQL behaves in the same way over a longer duration of testing is also a key element. Several basic factors might impact performance when you stretch your benchmark timelines, such as memory fragmentation, degradation of IO, impact after data accumulation, cache management, and so on. We don't want our database to get restarted just to clean up junk items, correct? Therefore, it is suggested to run benchmarking for a long duration for stability and performance validation.

3. Replicating production settings

Let's benchmark in a production-replicated environment. Wait! Let's disable database replication in a replica environment until we are done with benchmarking. Gotcha! We have got some good numbers!
It often happens that we don't completely simulate everything that we are going to configure in the production environment. This could prove to be costly, as we might unintentionally be benchmarking something in an environment that behaves differently, with an adverse impact once it's in production. Replicate production settings, data, workload, and so on in your replicated environment while you do benchmarking.

4. Consistency of throughput and latency

Throughput and latency go hand in hand. It is important to keep your eyes primarily focused on throughput; however, latency over time might be something to look out for. Performance dips, slowness, or stalls were noticed in InnoDB in its earlier days. It has improved a lot since then, but as there might be other cases depending on your workload, it is always good to keep an eye on throughput along with latency.

5. Sysbench can do more

Sysbench is a wonderful tool to simulate your workloads, whether they involve thousands of tables, intensive transactions, in-memory data, and so on. It is a splendid tool for simulation and gives you a nice representation of the results.

6. Virtualization world

I would like to keep this simple: bare metal and virtualization aren't the same. Hence, while doing benchmarking, measure your resources according to your environment. You might be surprised to see the difference in results if you compare both.

7. Concurrency

Big data rests on heavy data workloads, so high concurrency is important. MySQL is extending its maximum CPU core support with every new release; optimizing concurrency based on your requirements and hardware resources should be taken care of.

8. Hidden workloads

Do not miss factors that run in the background, such as reporting for big data analytics, backups, and on-the-fly operations, while you are benchmarking. The impact of such hidden workloads or obsolete benchmarking workloads can make your days (and nights) miserable.

9. Nerves of your query

Oops! Did we miss the optimizer? Not yet.
An optimizer is a powerful tool that will read the nerves of your query and provide recommendations. It's a tool that I use before making changes to a query in production. It's a savior when you have complex queries to be optimized. These are a few areas that we should look out for. Let's now look at a few benchmarks that we did on MySQL 8 and compare them with the ones on MySQL 5.7.

10. Benchmarks

To start with, let's fetch all the column names from all the InnoDB tables. The following is the query that we executed:

    SELECT t.table_schema, t.table_name, c.column_name
    FROM information_schema.tables t, information_schema.columns c
    WHERE t.table_schema = c.table_schema
      AND t.table_name = c.table_name
      AND t.engine='InnoDB';

The following figure shows how MySQL 8 performed a thousand times faster when having four instances:

Following this, we also performed a benchmark to find static table metadata. The following is the query that we executed:

    SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE, ENGINE, ROW_FORMAT
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA LIKE 'chintan%';

The following figure shows how MySQL 8 performed around 30 times faster than MySQL 5.7:

It made us eager to go into a bit more detail. So, we thought of doing one last test to find dynamic table metadata. The following is the query that we executed:

    SELECT TABLE_ROWS
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA LIKE 'chintan%';

The following figure shows how MySQL 8 performed around 30 times faster than MySQL 5.7:

MySQL 8.0 brings enormous performance improvements to the table. Scaling from one to a million tables, a need for big data requirements, is now achievable. We look forward to more benchmarks being officially released once MySQL 8 is generally available.

If you found this post useful, make sure to check out the book MySQL 8 Administrator’s Guide for more tips and tricks to manage MySQL 8 effectively.
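Points 2 and 4 above (long runs, and watching throughput together with latency) can be made concrete with a small timing harness. This is a minimal sketch, not the tooling the authors used: run_benchmark and its parameters are hypothetical, and the workload is any zero-argument callable, for example a function that executes one of the queries above through whatever MySQL client library you use. The fake_query placeholder exists only so the sketch runs without a database.

```python
import math
import statistics
import time


def run_benchmark(workload, iterations=200):
    """Call workload() repeatedly and report throughput and latency figures."""
    latencies = []
    started = time.perf_counter()
    for _ in range(iterations):
        t0 = time.perf_counter()
        workload()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started
    latencies.sort()
    # Nearest-rank 95th percentile of the sorted latencies.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "iterations": iterations,
        "throughput_per_s": iterations / elapsed,
        "mean_latency_s": statistics.mean(latencies),
        "p95_latency_s": latencies[p95_index],
        "max_latency_s": latencies[-1],
    }


def fake_query():
    # Placeholder workload; swap in a real query call in practice.
    sum(range(1000))


report = run_benchmark(fake_query, iterations=200)
for key, value in report.items():
    print(key, value)
```

Reporting the 95th percentile and maximum alongside the mean is a deliberate choice here: a stall that barely moves average throughput still shows up in the tail latencies.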
MySQL 8.0 is generally available with added features New updates to Microsoft Azure services for SQL Server, MySQL, and PostgreSQL



Expert Network

30 Apr 2021

7 min read

Distributed training in TensorFlow 2.x


TensorFlow 2 is a rich development ecosystem composed of two main parts: Training and Serving. Training consists of a set of libraries for dealing with datasets (tf.data), a set of libraries for building models, including high-level libraries (tf.keras and Estimators), low-level libraries (tf.*), and a collection of pretrained models (tf.Hub). Training can happen on CPUs, GPUs, and TPUs via distribution strategies and the result can be saved using the appropriate libraries.

This article is an excerpt from the book, Deep Learning with TensorFlow 2 and Keras, Second Edition by Antonio Gulli, Amita Kapoor, and Sujit Pal. This book teaches deep learning techniques alongside TensorFlow (TF) and Keras. In this article, we’ll review the addition of the powerful new feature, distributed training, in TensorFlow 2.x.

One very useful addition to TensorFlow 2.x is the possibility to train models using distributed GPUs, multiple machines, and TPUs in a very simple way with very few additional lines of code. tf.distribute.Strategy is the TensorFlow API used in this case, and it supports both the tf.keras and tf.estimator APIs as well as eager execution. You can switch between GPUs, TPUs, and multiple machines by just changing the strategy instance. Strategies can be synchronous, where all workers train over different slices of input data in a form of sync data parallel computation, or asynchronous, where updates from the optimizers do not happen in sync. All strategies require that data is loaded in batches via the tf.data.Dataset API.

Note that the distributed training support is still experimental. A roadmap is given in Figure 1:

Figure 1: Distributed training support for different strategies and APIs

Let’s discuss in detail all the different strategies reported in Figure 1.

Multiple GPUs

TensorFlow 2.x can utilize multiple GPUs.
If we want to have synchronous distributed training on multiple GPUs on one machine, there are two things that we need to do: (1) we need to load the data in a way that can be distributed to the GPUs, and (2) we need to distribute some computations to the GPUs too.

In order to load our data in a way that can be distributed to the GPUs, we simply need tf.data.Dataset (which has already been discussed in the previous paragraphs). If we do not have a tf.data.Dataset but we have a normal tensor, then we can easily convert the latter into the former using tf.data.Dataset.from_tensor_slices(). This will take a tensor in memory and return a source dataset, the elements of which are slices of the given tensor. In our toy example, we use NumPy to generate training data x and labels y, and we transform them into a tf.data.Dataset with tf.data.Dataset.from_tensor_slices(). Then we apply a shuffle to avoid bias in training across GPUs and then generate batches of size SIZE_BATCHES:

    import tensorflow as tf
    import numpy as np
    from tensorflow import keras

    N_TRAIN_EXAMPLES = 1024*1024
    N_FEATURES = 10
    SIZE_BATCHES = 256

    # 10 random floats in the half-open interval [0.0, 1.0).
    x = np.random.random((N_TRAIN_EXAMPLES, N_FEATURES))
    y = np.random.randint(2, size=(N_TRAIN_EXAMPLES, 1))
    x = tf.dtypes.cast(x, tf.float32)
    print(x)

    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    dataset = dataset.shuffle(buffer_size=N_TRAIN_EXAMPLES).batch(SIZE_BATCHES)

In order to distribute some computations to GPUs, we instantiate a tf.distribute.MirroredStrategy() object (named distribution in the example), which supports synchronous distributed training on multiple GPUs on one machine. Then, we move the creation and compilation of the Keras model inside distribution.scope(). Note that each variable in the model is mirrored across all the replicas.
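What "mirrored" means in practice can be shown without TensorFlow. The sketch below is a framework-free illustration under simplifying assumptions (a one-parameter linear model, squared-error loss, equal-sized shards); the function names are hypothetical and this is not how MirroredStrategy is implemented internally. Each replica computes a gradient on its own shard of the batch, the gradients are averaged, and every replica applies the same update, so all copies of the variable stay identical.

```python
def gradient(w, shard):
    """Mean gradient of the squared error (w*x - y)**2 over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def split_batch(batch, num_replicas):
    """Divide a global batch into equal per-replica shards."""
    size = len(batch) // num_replicas
    return [batch[i * size:(i + 1) * size] for i in range(num_replicas)]


def mirrored_step(weights, batch, lr):
    """One synchronous step: per-replica gradients, averaged, applied everywhere."""
    shards = split_batch(batch, len(weights))
    grads = [gradient(w, shard) for w, shard in zip(weights, shards)]
    mean_grad = sum(grads) / len(grads)  # stands in for the all-reduce
    return [w - lr * mean_grad for w in weights]


batch = [(float(x), 3.0 * x) for x in range(1, 9)]  # y = 3x, global batch of 8
replicas = [0.0, 0.0]                               # two mirrored copies of w
for _ in range(50):
    replicas = mirrored_step(replicas, batch, lr=0.01)
print(replicas)
```

Because the shards are the same size, the averaged gradient equals the gradient over the whole batch, so a single synchronous step matches what one device would compute on the full batch of 8.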
Let’s see it in our toy example:

    # this is the distribution strategy
    distribution = tf.distribute.MirroredStrategy()

    # this piece of code is distributed to multiple GPUs
    with distribution.scope():
        model = tf.keras.Sequential()
        model.add(tf.keras.layers.Dense(16, activation='relu', input_shape=(N_FEATURES,)))
        model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
        optimizer = tf.keras.optimizers.SGD(0.2)
        model.compile(loss='binary_crossentropy', optimizer=optimizer)

    model.summary()

    # Optimize in the usual way but in reality you are using GPUs.
    model.fit(dataset, epochs=5, steps_per_epoch=10)

Note that each batch of the given input is divided equally among the multiple GPUs. For instance, if using MirroredStrategy() with two GPUs, each batch of size 256 will be divided among the two GPUs, with each of them receiving 128 input examples for each step. In addition, note that each GPU will optimize on the received batches and the TensorFlow backend will combine all these independent optimizations on our behalf. In short, using multiple GPUs is very easy and requires minimal changes to the tf.keras code used for a single server.

MultiWorkerMirroredStrategy

This strategy implements synchronous distributed training across multiple workers, each one with potentially multiple GPUs. As of September 2019 the strategy works only with Estimators and it has experimental support for tf.keras. This strategy should be used if you are aiming at scaling beyond a single machine with high performance. Data must be loaded with tf.data.Dataset and shared across workers so that each worker can read a unique subset.

TPUStrategy

This strategy implements synchronous distributed training on TPUs. TPUs are Google’s specialized ASIC chips designed to significantly accelerate machine learning workloads in a way often more efficient than GPUs.
According to this public information (https://github.com/tensorflow/tensorflow/issues/24412): “the gist is that we intend to announce support for TPUStrategy alongside Tensorflow 2.1. Tensorflow 2.0 will work under limited use-cases but has many improvements (bug fixes, performance improvements) that we’re including in Tensorflow 2.1, so we don’t consider it ready yet.”

ParameterServerStrategy

This strategy implements either multi-GPU synchronous local training or asynchronous multi-machine training. For local training on one machine, the variables of the models are placed on the CPU and operations are replicated across all local GPUs. For multi-machine training, some machines are designated as workers and some as parameter servers, with the variables of the model placed on parameter servers. Computation is replicated across all GPUs of all workers. Multiple workers can be set up with the environment variable TF_CONFIG as in the following example:

    import json
    import os

    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "worker": ["host1:port", "host2:port", "host3:port"],
            "ps": ["host4:port", "host5:port"]
        },
        "task": {"type": "worker", "index": 1}
    })

In this article, we have seen how it is possible to train models using distributed GPUs, multiple machines, and TPUs in a very simple way with very few additional lines of code.

Learn how to build machine and deep learning systems with the newly released TensorFlow 2 and Keras for the lab, production, and mobile devices with Deep Learning with TensorFlow 2 and Keras, Second Edition by Antonio Gulli, Amita Kapoor and Sujit Pal.

About the Authors

Antonio Gulli is a software executive and business leader with a passion for establishing and managing global technological talent, innovation, and execution. He is an expert in search engines, online services, machine learning, information retrieval, analytics, and cloud computing.
Amita Kapoor is an Associate Professor in the Department of Electronics, SRCASW, University of Delhi and has been actively teaching neural networks and artificial intelligence for the last 20 years. She is an active member of ACM, AAAI, IEEE, and INNS. She has co-authored two books. Sujit Pal is a technology research director at Elsevier Labs, working on building intelligent systems around research content and metadata. His primary interests are information retrieval, ontologies, natural language processing, machine learning, and distributed processing. He is currently working on image classification and similarity using deep learning models. He writes about technology on his blog at Salmon Run.
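Returning to the TF_CONFIG example in the ParameterServerStrategy section above: every process in the cluster shares the same "cluster" dictionary but needs its own "task" entry. The following is only a minimal sketch of that idea using the Python standard library. The helper name make_tf_config and the host names and ports are illustrative assumptions, not part of TensorFlow or of the original article.

```python
import json
import os

# Hypothetical cluster layout; host names and ports are placeholders.
CLUSTER = {
    "worker": ["host1:2222", "host2:2222", "host3:2222"],
    "ps": ["host4:2222", "host5:2222"],
}


def make_tf_config(task_type, task_index, cluster=CLUSTER):
    """Build the TF_CONFIG JSON string for one process (illustrative helper)."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type!r}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"no {task_type} with index {task_index}")
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# On the second worker machine, set the variable before creating the strategy.
os.environ["TF_CONFIG"] = make_tf_config("worker", 1)
print(os.environ["TF_CONFIG"])
```

Each machine would run the same script with a different task type and index, so the cluster description is written once and cannot drift between processes.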


Vincy Davis

07 Jun 2019

6 min read

Worried about Deepfakes? Check out the new algorithm that manipulates talking-head videos by altering the transcripts

Last week, a team of researchers from Stanford University, Max Planck Institute for Informatics, Princeton University and Adobe Research published a paper titled “Text-based Editing of Talking-head Video”. The paper proposes a method to edit a talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified. Basically, the editor modifies a video using a text transcript, adding new words, deleting unwanted ones or completely rearranging the pieces by dragging and dropping. The resulting video maintains a seamless audio-visual flow, without any jump cuts, and looks almost flawless to the untrained eye.

The researchers want this kind of text-based editing approach to lay the foundation for better editing tools in post production of movies and television. Actors often botch small bits of performance or leave out a critical word. This algorithm can help video editors fix that, which until now has involved expensive reshoots. It can also help in easy adaptation of audio-visual content to specific target audiences. The tool supports three types of edit operations: adding new words, rearranging existing words, and deleting existing words.

Ohad Fried, one of the researchers on the paper, says that “This technology is really about better storytelling. Instructional videos might be fine-tuned to different languages or cultural backgrounds, for instance, or children’s stories could be adapted to different ages.”

https://youtu.be/0ybLCfVeFL4

How does the application work?

The method uses an input talking-head video and a transcript to perform text-based editing. The first step is to align phonemes to the input audio and track each input frame to construct a parametric head model. Next, a 3D parametric face model is registered with each frame of the input talking-head video. This helps in selectively blending different aspects of the face. Then, a background sequence is selected and is used for pose data and background pixels.
The background sequence allows editors to edit challenging videos with hair movement and slight camera motion. As facial expressions are an important parameter, the researchers have tried to preserve the retrieved expression parameters as much as possible by smoothing out the transitions between them. The output is an edited parameter sequence, which describes the new desired facial motion, and a corresponding retimed background video clip. These are forwarded to a ‘neural face rendering’ step, which changes the facial motion of the retimed background video to match the parameter sequence. The rendering procedure thus produces photo-realistic video frames of the subject appearing to speak the new phrase. These localized edits blend seamlessly into the original video, producing the edited result. Lastly, to add the audio, the resulting video is retimed to match the recording at the level of phones. The researchers have used the performer’s own voice in all their synthesis results.

Image Source: Text-based Editing of Talking-head Video

The researchers have tested the system with a series of complex edits including adding, removing and changing words, as well as translations to different languages. When the application was tried in a crowd-sourced study with 138 participants, the edits were rated as “real” almost 60% of the time. Fried said that “The visual quality is such that it is very close to the original, but there’s plenty of room for improvement.”

Ethical considerations: Erosion of truth, confusion and defamation

Even though the application is quite useful for video editors and producers, it raises important and valid concerns about its potential for misuse. The researchers have also acknowledged that such a technology might be used for illicit purposes. “We acknowledge that bad actors might use such technologies to falsify personal statements and slander prominent individuals.
We are concerned about such deception and misuse.” They recommend certain precautions to avoid deception and misuse, such as watermarking. “The fact that the video is synthesized may be obvious by context, directly stated in the video or signaled via watermarking. We also believe that it is essential to obtain permission from the performers for any alteration before sharing a resulting video with a broad audience.” They urge the community to continue to develop forensics, fingerprinting and verification techniques to identify manipulated video. They also support the creation of appropriate regulations and laws that would balance the risks of misuse of these tools against the importance of creative, consensual use cases.

The public, however, remains dubious, raising valid arguments about why the ‘Ethical Concerns’ discussed in the paper fall short. A user on Hacker News comments, “The "Ethical concerns" section in the article feels like a punt. The author quoting "this technology is really about better storytelling" is aspirational -- the technology's story will be written by those who use it, and you can bet people will use this maliciously.”

https://twitter.com/glenngabe/status/1136667296980701185

Another user feels that this kind of technology will only result in “slow erosion of video evidence being trustworthy”. Others have pointed out that the kind of transformation described in the paper does not fall under the broad category of ‘video-editing’, and that ‘We need more words to describe this new landscape’.

https://twitter.com/BrianRoemmele/status/1136710962348617728

Another common argument is that the algorithm can be used to generate terrifyingly real Deepfake videos. An example of a ‘shallow fake’ is the altered video of Nancy Pelosi that circulated recently, which was slowed down to make it appear that she was slurring her words. Facebook was criticized for not acting faster to slow the video’s spread.
Beyond altering the speeches of politicians, altered videos like these can also, for instance, be used to create fake emergency alerts, or disrupt elections by dropping a fake video of one of the candidates before voting starts. There is also the issue of defaming someone in a personal capacity. Sam Gregory, Program Director at Witness, tweets that one of the main steps in ensuring effective use of such tools would be to “ensure that any commercialization of synthetic media tools has equal $ invested in detection/safeguards as in detection.; and to have a grounded conversation on trade-offs in mitigation”. He has also listed more interesting recommendations.

https://twitter.com/SamGregory/status/1136964998864015361

For more details, we recommend reading the research paper.

OpenAI researchers have developed Sparse Transformers, a neural network which can predict what comes next in a sequence
‘Facial Recognition technology is faulty, racist, biased, abusive to civil rights; act now to restrict misuse’ say experts to House Oversight and Reform Committee
Now there’s a CycleGAN to visualize the effects of climate change. But is this enough to mobilize action?


Richard Gall

04 Sep 2019

6 min read

How to learn data science: from data mining to machine learning

Data science is a field that’s complex and diverse. If you’re trying to learn data science and become a data scientist, it can be easy to fall down a rabbit hole of machine learning or data processing. To a certain extent, that’s good. To be an effective data scientist you need to be curious. You need to be prepared to take on a range of different tasks and challenges. But that’s not always efficient: if you want to learn quickly and effectively, you need a clear structure - a curriculum - that you can follow. This post will show you what you need to learn and how to go about it.

Statistics

Statistics is arguably the cornerstone of data science. Nate Silver called data scientists “sexed up statisticians”, a comment that was perhaps unfair but nevertheless contains a kernel of truth: data scientists are always working in the domain of statistics. Once you understand this, everything else you need to learn will follow easily. Machine learning, data manipulation, data visualization - these are all ultimately technological methods for performing statistical analysis really well.

Best Packt books and video content for learning statistics

Statistics for Data Science
R Statistics Cookbook
Statistical Methods and Applied Mathematics in Data Science [Video]

Before you go any deeper into data science, it’s critical that you gain a solid foundation in statistics.

Data mining and wrangling

This is an important element of data science that often gets overlooked with all the hype about machine learning. However, without effective data collection and cleaning, all your efforts elsewhere are going to be pointless at best. At worst they might even be misleading or problematic. Sometimes called data manipulation or data munging, it's really all about managing and cleaning data from different sources so it can be used for analytics projects. To do it well you need to have a clear sense of where you want to get to - do you need to restructure the data?
Sort or remove certain parts of a data set? Once you understand this, it’s much easier to wrangle data effectively.

Data mining and wrangling tools

There are a number of different tools you can use for data wrangling. Python and R are the two key programming languages, and both have some useful tools for data mining and manipulation. Python in particular has a great range of tools for data mining and wrangling, such as pandas and NLTK (Natural Language Toolkit), but that isn’t to say R isn’t powerful in this domain. Other tools are available too - Weka and Apache Mahout, for example, are popular. Weka is written in Java so is a good option if you have experience with that programming language, while Mahout integrates well with the Hadoop ecosystem.

Data mining and data wrangling books and videos

If you need to learn data mining, wrangling and manipulation, Packt has a range of products. Here are some of the best:

Data Wrangling with R
Data Wrangling with Python
Python Data Mining Quick Start Guide
Machine Learning for Data Mining

Machine learning and artificial intelligence

Although machine learning and artificial intelligence are huge trends in their own right, they are nevertheless closely aligned with data science. Indeed, you might even say that their prominence today has grown out of the excitement around data science that we first witnessed just under a decade ago. It’s a data scientist’s job to use machine learning and artificial intelligence in a way that can drive business value. That could, for example, be to recommend products or services to customers, perhaps to gain a better understanding of existing products, or even to better manage strategic and financial risks through predictive modelling. So, while we can see machine learning in a massive range of digital products and platforms - all of which require smart development and design - for it to work successfully, it needs to be supported by a capable and creative data scientist.
Machine learning and artificial intelligence books for data scientists

Machine Learning Algorithms
Machine Learning with R - Third Edition
Machine Learning with Apache Spark Quick Start Guide
Machine Learning with TensorFlow 1.x
Keras Deep Learning Cookbook

Data visualization

A talented data scientist isn’t just a great statistician and engineer, they’re also a great communicator. This means so-called soft skills are highly valuable - the ability to communicate insights and ideas with key stakeholders is essential. But great communication isn’t just about soft skills, it’s also about data visualization. Data visualization is, at a fundamental level, about organizing and presenting data in a way that tells a story, clarifies a problem, or illustrates a solution. It’s essential that you don’t overlook this step. Indeed, spending time learning about effective data visualization can also help you to develop your soft skills. The principles behind storytelling and communication through visualization are, in truth, exactly the same when applied to other scenarios.

Data visualization tools

There is a huge range of data visualization tools available. As with machine learning, understanding the differences between them and working out what solution will work for you is actually an important part of the learning process. For that reason, don’t be afraid to spend a little bit of time with a range of data visualization tools. Many of the most popular data visualization tools are paid-for products. Perhaps the best known of these is Tableau (which, incidentally, was bought by Salesforce earlier this year). Tableau and its competitors are very user friendly, which means the barrier to entry is pretty low. They allow you to create some pretty sophisticated data visualizations fairly easily. However, sticking to these tools is not only expensive, it can also limit your abilities.
We’d recommend trying a number of different data visualization tools, such as Seaborn, D3.js, Matplotlib, and ggplot2.

Data visualization books and videos for data scientists

Applied Data Visualization with R and ggplot2
Tableau 2019.1 for Data Scientists [Video]
D3.js Data Visualization Projects [Video]
Tableau in 7 Steps [Video]
Data Visualization with Python

If you want to learn data science, just get started!

As we've seen, data science requires a number of very different skills and takes in a huge breadth of tools. That means that if you're going to be a data scientist, you need to be prepared to commit to learning forever: you're never going to reach a point where you know everything. While that might sound intimidating, it's important to have confidence. With a sense of direction and purpose, and a learning structure that works for you, it's possible to develop and build your data science capabilities in a way that could unlock new opportunities and act as the basis for some really exciting projects.


Natasha Mathur

12 Mar 2019

4 min read

Google confirms it paid $135 million as exit packages to senior execs accused of sexual harassment

According to a complaint filed in a lawsuit yesterday, Google paid $135 million in total as exit packages to two top senior execs, namely Andy Rubin (creator of Android) and Amit Singhal (former senior VP of Google search), after they were accused of sexual misconduct at the company. The lawsuit was filed by an Alphabet shareholder, James Martin, in the Santa Clara, California Court. Google also confirmed paying the exit packages to senior execs to The Verge yesterday.

The complaint is against certain directors and officers of Alphabet, Google’s parent company, for their active and direct participation in a “multi-year scheme” to hide sexual harassment and discrimination at Alphabet. It also states that the misconduct by these directors has caused severe financial and reputational damage to Alphabet. The exit packages for Rubin and Singhal were approved by the Leadership Development and Compensation Committee (LDCC).

The news of Google paying high exit packages to its top execs first came to light last October, after the New York Times released a report on Google stating that the firm paid $90 million to Rubin and $15 million to Singhal. Rubin had previously also received an offer for a $150 million stock grant, which he then used to negotiate the $90 million in severance pay, even though he should have been fired for cause without any pay, states the lawsuit.

To protest against the handling of sexual misconduct within Google, more than 20,000 Google employees, along with vendors, contractors and temps, organized the Google “walkout for real change” and walked out of their offices in November 2018. Googlers also launched an industry-wide awareness campaign to fight against forced arbitration in January, where they shared information about arbitration on their Twitter and Instagram accounts throughout the day.
Last year in November, Google ended its forced arbitration (a move that was soon followed by Facebook) for its employees (excluding temps, vendors, etc.), and only in cases of sexual harassment. This led contractors to write an open letter on Medium to Sundar Pichai, Google's CEO, in December, demanding better conditions and equal benefits for contractors. In response to the Google walkout and the growing public pressure, Google finally decided last month to end its forced arbitration policy for all employees (including contractors) and for all kinds of discrimination within Google. The changes will go into effect for all Google employees starting March 21st, 2019.

Yesterday, the Google walkout for real change group tweeted a condemnation of the multi-million dollar payouts and asked people to use the hashtag #Googlepayoutsforall to highlight better ways that money could have been used.

https://twitter.com/GoogleWalkout/status/1105450565193121792

“The conduct of Rubin and other executives was disgusting, illegal, immoral, degrading to women and contrary to every principle that Google claims it abides by”, reads the lawsuit. James Martin also filed a lawsuit against Alphabet’s board members, Larry Page, Sergey Brin, and Eric Schmidt, earlier this year in January for covering up the sexual harassment allegations against the former top execs at Google. Martin had sued Alphabet for breaching its fiduciary duty to shareholders, unjust enrichment, abuse of power, and corporate waste. “The directors’ wrongful conduct allowed illegal conduct to proliferate and continue. As such, members of the Alphabet’s board were knowing direct enablers of sexual harassment and discrimination”, reads the lawsuit. It also states that the board members violated not only California and federal law but also the ethical standards and guidelines set by Alphabet.
Public reaction to the news is largely negative, with people condemning Google’s handling of sexual misconduct:

https://twitter.com/awesome/status/1105295877487263744
https://twitter.com/justkelly_ok/status/1105456081663225856
https://twitter.com/justkelly_ok/status/1105457965790707713
https://twitter.com/conradwt/status/1105386882135875584
https://twitter.com/mer__edith/status/1105464808831361025

For more information, check out the official lawsuit here.

Recode Decode #GoogleWalkout interview shows why data and evidence don’t always lead to right decisions in even the world’s most data-driven company
Liz Fong Jones, prominent ex-Googler shares her experience at Google and ‘grave concerns’ for the company
Google’s pay equity analysis finds men, not women, are underpaid; critics call out design flaws in the analysis


Sugandha Lahoti

05 Sep 2019

7 min read

Key skills every database programmer should have

According to Robert Half Technology’s 2019 IT salary report, ‘Database programmer’ is one of the 13 most in-demand tech jobs for 2019. For an entry-level programmer, the average salary is $98,250, which goes up to $167,750 for a seasoned expert. A typical database programmer is responsible for designing, developing, testing, deploying, and maintaining databases. In this article, we will list the critical tech skills essential to database programmers.

#1 Ability to perform Data Modelling

The first step is to learn to model the data. In data modeling, you create a conceptual model of how data items relate to each other. In order to efficiently plan a database design, you should know the organization you are designing the database for. This is because data models describe real-world entities such as ‘customer’, ‘service’, ‘products’, and the relations between these entities. Data models provide an abstraction for the relations in the database. They aid programmers in modeling business requirements and in translating business requirements into relations. They are also used for exchanging information between the developers and business owners. During the design phase, the database developer should pay great attention to the underlying design principles, run a benchmark stack to ensure performance, and validate user requirements. They should also avoid pitfalls such as data redundancy, null saturation, and tight coupling.

#2 Know a database programming language, preferably SQL

Database programmers need to design, write and modify programs to improve their databases. SQL is one of the top languages used to manipulate the data in a database and to query the database. It's also used to define and change the structure of the data, in other words, to implement the data model. Therefore it is essential that you learn SQL.
In general, SQL has three parts:

Data Definition Language (DDL): used to create and manage the structure of the data
Data Manipulation Language (DML): used to manage the data itself
Data Control Language (DCL): controls access to the data

Because data is constantly inserted into the database, changed, or retrieved, DML is used more often in day-to-day operations than DDL, so you should have a strong grasp of DML. If you plan to grow into a database architect role in the near future, then having a good grasp of DDL will go a long way. Another reason why you should learn SQL is that almost every modern relational database supports SQL. Although different databases might support different features and implement their own dialect of SQL, the basics of the language remain the same. If you know SQL, you can quickly adapt to MySQL, for example.

At present, there are a number of categories of database models, predominantly relational, object-relational, and NoSQL databases. All of these are meant for different purposes. Relational databases often adhere to SQL. Object-relational databases (ORDs) are similar to relational databases. NoSQL, which stands for "not only SQL," is an alternative to traditional relational databases that is useful for working with large sets of distributed data. NoSQL databases provide benefits such as availability, schema-free design, and horizontal scaling, but also have limitations such as performance, data retrieval constraints, and learning time. For beginners, it is advisable to start by experimenting with relational databases and learning SQL, then gradually transition to NoSQL DBMS.

#3 Know how to Extract, Transform, Load various data types and sources

A database programmer should have a good working knowledge of ETL (Extract, Transform, Load) programming. ETL developers basically extract data from different databases, transform it and then load the data into the Data Warehouse system.
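The extract, transform, load flow just described can be sketched in a few lines. This is only an illustration under stated assumptions: the CSV content, the sales table, and the function names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse. Note that the CREATE TABLE statement is DDL, while INSERT and SELECT are DML.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a CSV export with inconsistent formatting.
RAW_CSV = """customer,amount,currency
 alice ,10.50,usd
BOB,3,USD
carol,,usd
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise names and currency codes, drop rows with no amount."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append((
            row["customer"].strip().title(),
            float(row["amount"]),
            row["currency"].upper(),
        ))
    return cleaned


def load(conn, records):
    """Load: create the target table (DDL) and insert the records (DML)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "customer TEXT NOT NULL, amount REAL NOT NULL, currency TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
load(conn, transform(extract(RAW_CSV)))
rows = conn.execute(
    "SELECT customer, amount, currency FROM sales ORDER BY customer"
).fetchall()
print(rows)
```

Keeping the three stages as separate functions makes each one easy to unit test before a change is applied to an existing ETL process.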
A Data Warehouse provides a common data repository that is essential for business needs. A database programmer should know how to tune existing packages, tables, and queries for faster ETL processing. They should conduct unit tests before applying any change to the existing ETL process. Since ETL takes data from different data sources (SQL Server, CSV, and flat files), a database developer should know how to deal with different data sources.

#4 Design and test Database plans

Database programmers perform regular tests to identify ways to solve database usage concerns and malfunctions. As databases are usually found at the lowest level of the software architecture, testing is done in an extremely cautious fashion. This is because changes in the database schema affect many other software components. A database developer should make sure that when changing the database structure, they do not break existing applications and that the applications use the new structures properly. You should be proficient in unit testing your database. Unit tests are typically used to check if small units of code are functioning properly. For databases, unit testing can be difficult, so the easiest approach is often to write the tests as SQL scripts. You should also know about System Integration Testing (SIT), which is done on the complete system after the hardware and software modules of that system have been integrated. SIT validates the behavior of the system and ensures that modules in the system are functioning suitably.

#5 Secure your Database

Data protection and security are essential for the continuity of business. Databases often store sensitive data, such as user information, email addresses, geographical addresses, and payment information. A robust security system to protect your database against any data breach is therefore necessary.
While a database architect is responsible for designing and implementing secure design options, a database admin must ensure that the right security and privacy policies are in place and are being observed. However, this does not absolve database programmers from adopting secure coding practices. Database programmers need to ensure that data integrity is maintained over time and that data is secure from unauthorized changes or theft. They need to be especially careful about table permissions, i.e. who can read and write to which tables. You should be aware of who is allowed to perform the four basic operations of INSERT, UPDATE, DELETE and SELECT against which tables. Database programmers should also adopt authentication best practices depending on the infrastructure setup, the application's nature, the user's characteristics, and data sensitivity. If the database server is accessed from the outside world, it is beneficial to encrypt sessions using SSL certificates to avoid packet sniffing. Also, you should secure database servers that trust all localhost connections, as anyone who accesses the localhost can access the database server.

#6 Optimize your database performance

A database programmer should also be aware of how to optimize their database performance to achieve the best results. At the basic level, they should know how to rewrite SQL queries and maintain indexes. Other aspects of optimizing database performance include hardware configuration, network settings, and database configuration. Generally speaking, tuning database performance requires knowledge about the system's nature. Once the database server is configured, you should calculate the number of transactions per second (TPS) for the database server setup. Once the system is up and running, you should set up a monitoring system or log analysis, which periodically finds slow queries, the most time-consuming queries, etc.
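To make the point about maintaining indexes concrete, here is a minimal sketch that asks a database how it plans to run a query before and after an index is added. It assumes SQLite (through Python's built-in sqlite3 module) purely for convenience; the orders table, the index name, and the query_plan helper are hypothetical, and other database systems expose their plans through their own EXPLAIN variants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(10_000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = ?"


def query_plan(conn, sql, params):
    """Return SQLite's plan description for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


# Without an index, the filter on customer_id requires a full table scan.
plan_before = query_plan(conn, QUERY, (42,))

# After adding an index on the filtered column, SQLite can search the index.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, QUERY, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing plans in this way is a cheap first check when a monitoring system flags a slow query, before turning to hardware or configuration changes.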
#7 Develop your soft skills

Apart from the above technical skills, a database programmer needs to be comfortable communicating with developers, testers and project managers while working on any software project. A keen eye for detail and critical thinking can often spot malfunctions and errors that may otherwise be overlooked. A database programmer should be able to quickly fix issues within the database and streamline the code. They should also be able to think quickly in order to prioritize tasks and meet deadlines effectively. Database programmers are often required to work on documentation and technical user guides, so strong writing and technical skills are a must.

Get started

If you want to get started with becoming a database programmer, Packt has a range of products. Here are some of the best:

PostgreSQL 11 Administration Cookbook
Learning PostgreSQL 11 - Third Edition
PostgreSQL 11 in 7 days [ Video ]
Using MySQL Databases With Python [ Video ]
Basic Relational Database Design [ Video ]

How to learn data science: from data mining to machine learning
How to ace a data science interview
5 barriers to learning and technology training for small software development teams


Fatema Patrawala

21 Aug 2019

6 min read

Bitbucket to no longer support Mercurial, users must migrate to Git by May 2020

Yesterday marked the end of an era for Mercurial users, as Bitbucket announced that it will no longer support Mercurial repositories after May 2020. Bitbucket, owned by Atlassian, is a web-based version control repository hosting service for source code and development projects. It has supported Mercurial since its beginning in 2008 and Git since October 2011. Now, after almost ten years of sharing its journey with Mercurial, the Bitbucket team has decided to remove Mercurial support from Bitbucket Cloud and its API. The official announcement reads, “Mercurial features and repositories will be officially removed from Bitbucket and its API on June 1, 2020.”

The Bitbucket team also communicated the timeline for the sunsetting of the Mercurial functionality. After February 1, 2020, users will no longer be able to create new Mercurial repositories. After June 1, 2020, users will not be able to use Mercurial features in Bitbucket or via its API, and all Mercurial repositories will be removed. All current Mercurial functionality in Bitbucket will remain available through May 31, 2020.

The team said the decision was not an easy one for them and that Mercurial held a special place in their heart. But according to a Stack Overflow Developer Survey, almost 90% of developers use Git, while Mercurial is the least popular version control system with only about 3% developer adoption. Apart from this, Mercurial usage on Bitbucket saw a steady decline, and the percentage of new Bitbucket users choosing Mercurial fell to less than 1%. Hence they decided to remove the Mercurial repos.

How can users migrate and export their Mercurial repos

The Bitbucket team recommends that users migrate their existing Mercurial repos to Git. They have also extended support for migration, and kept the available options open for discussion in their dedicated Community thread. Users can discuss conversion tools, migration and tips, and also offer troubleshooting help.
If users prefer to continue using Mercurial, there are a number of free and paid Mercurial hosting services available to them. The Bitbucket team has also created a Git tutorial that covers everything from the basics of creating pull requests to rebasing and Git hooks.

Community shows anger and sadness over decision to discontinue Mercurial support

Mercurial users are outraged and saddened by Bitbucket's decision, and they have voiced their anger across multiple forums and community discussions. Users feel that Bitbucket’s decision to stop offering Mercurial support is bad, but the decision to also delete the repos is evil. On Hacker News, users speculated that the decision was driven by market forces rather than by technically superior architecture and ease of use. They feel GitHub has marketed Git so successfully that the two have become synonymous in the developer community. One of them comments, “It's very sad to see bitbucket dropping mercurial support. Now only Facebook and volunteers are keeping mercurial alive. Sometimes technically better architecture and user interface lose to a non user friendly hard solutions due to inertia of mass adoption. So a lesson in Software development is similar to betamax and VHS, so marketing is still a winner over technically superior architecture and ease of use. GitHub successfully marketed git, so git and GitHub are synonymous for most developers. Now majority of open source projects are reliant on a single proprietary solution Github by Microsoft, for managing code and project. Can understand the difficulty of bitbucket, when Python language itself moved out of mercurial due to the same inertia. Hopefully gitlab can come out with mercurial support to migrate projects using it from bitbucket.”

Another user comments that Mercurial support was the only reason he still used Bitbucket, given that GitHub is miles ahead of it.
Now that Bitbucket is dropping Mercurial support too, the user expects the service to die soon. The comment reads, “Mercurial support was the one reason for me to still use Bitbucket: there is no other Bitbucket feature I can think of that Github doesn't already have, while Github's community is miles ahead since everyone and their dog is already there. More importantly, Bitbucket leaves the migration to you (if I read the article correctly). Once I download my repo and convert it to git, why would I stay with the company that just made me go through an annoying (and often painful) process, when I can migrate to Github with the exact same command? And why isn't there a "migrate this repo to git" button right there? I want to believe that Bitbucket has smart people and that this choice is a good one. But I'm with you there - to me, this definitely looks like Bitbucket will die.”

On Reddit, programmers see this as a big change, since Bitbucket is the major Mercurial hosting provider. They also feel that Bitbucket announced the change at fairly short notice and that they need more time to migrate.

Beyond the developer community forums, users have also expressed displeasure on the Atlassian community blog. A team of scientists commented, “Let's get this straight : Bitbucket (offering hosting support for Mercurial projects) was acquired by Atlassian in September 2010. Nine years later Atlassian decides to drop Mercurial support and delete all Mercurial repositories. Atlassian, I hate you :-) The image you have for me is that of a harmful predator. We are a team of scientists working in a university. We don't have computer scientists, we managed to use a version control simple as Mercurial, and it was a hard work to make all scientists in our team to use a version control system (even as simple as Mercurial). We don't have the time nor the energy to switch to another version control system. But we will, forced and obliged.
I really don't want to check out Github or something else to migrate our projects there, but we will, forced and obliged.”

Atlassian Bitbucket, GitHub, and GitLab take collective steps against the Git ransomware attack

Attackers wiped many GitHub, GitLab, and Bitbucket repos with ‘compromised’ valid credentials leaving behind a ransom note

BitBucket goes down for over an hour

Vincy Davis

12 Dec 2019

8 min read

Master the art of face swapping with OpenCV and Python by Sylwek Brzęczkowski, developer at TrustStamp

No discussion of image processing is complete without talking about OpenCV. Its 2500+ algorithms, extensive documentation and sample code are considered world-class for exploring real-time computer vision. OpenCV supports a wide variety of programming languages such as C++, Python, and Java, and is available on different platforms including Windows, Linux, OS X, Android, and iOS.

OpenCV-Python, the Python API for OpenCV, is one of the most popular libraries used to solve computer vision problems. It combines the best qualities of the OpenCV C++ API and the Python language. The OpenCV-Python library uses NumPy, a highly optimized library for numerical operations with a MATLAB-style syntax. This makes it easier to integrate the Python API with other libraries that use NumPy, such as SciPy and Matplotlib, which is why many developers use it for computer vision experiments.

Want to know more about OpenCV with Python?

If you are interested in developing your computer vision skills, you should definitely master the algorithms in OpenCV 4 and Python explained in our book ‘Mastering OpenCV 4 with Python’ written by Alberto Fernández Villán. This book will help you build complete projects in relation to image processing, motion detection, image segmentation, and many other tasks by exploring the deep learning Python libraries and also by learning the OpenCV deep learning capabilities.

At the PyData Warsaw 2018 conference, Sylwek Brzęczkowski walked through how to implement a face swap using OpenCV and Python. Face swaps are used by apps like Snapchat to offer various face filters. Brzęczkowski is a Python developer at TrustStamp.
Steps to implement face swapping with OpenCV and Python

#1 Face detection using histogram of oriented gradients (HOG)

Histogram of oriented gradients (HOG) is a feature descriptor used to detect objects in computer vision and image processing. Brzęczkowski demonstrated how HOG works using square patches that are slid over the image, producing a histogram-of-oriented-gradients feature vector for each patch. These feature vectors are then passed to a classifier, which reports the best-matching regions. To implement face detection using HOG in Python, OpenCV is imported first (in Python, import cv2) and the image is loaded. Next, a frontal face detector object is created with detector = dlib.get_frontal_face_detector(). Running the detector on the loaded image then returns the detected faces.

#2 Facial landmark detection aka face alignment

Facial landmark detection is the process of finding points of interest in an image of a human face. When dlib is used for facial landmark detection, it returns 68 facial landmarks for the whole face. The algorithm refines its estimate iteratively: the iteration counter T starts at 0 and increases step by step until it reaches 10. By that point the estimated landmarks closely match the ‘ground truth’, so the iteration can stop. Because the estimate is progressively aligned to the face, this stage is also called face alignment. To implement this stage, Brzęczkowski showed how to add a predictor to the Python program using the model file shape_predictor_68_face_landmarks.dat, which is around 100 megabytes in size. The process generally takes a long time, since we tend to pick the largest, clearest image for detection.

#3 Finding face border using convex hull

The convex hull of a set of points is the smallest convex polygon that encloses all of the points in the set.
This means that, for a given set of points, the convex hull is described by a subset of those points (its vertices) such that every given point lies inside or on the resulting polygon. To find the face border in an image, the structure needs to change a little. The landmark structure is first passed to the convex hull function with returnPoints set to false, which means that the output is a list of indexes rather than coordinates. Brzęczkowski then displayed the face border on the image in blue using his find_convex_hull.py script.

#4 Approximating nonlinear operations with linear operations

In linear filtering of an image, the value of an output pixel is a linear combination of the values of the input pixels. Brzęczkowski gave the example of the affine transformation, a type of linear mapping that preserves points, straight lines, and planes. Non-linear filtering, on the other hand, produces an output that is not a linear function of its input. He then illustrated both transformations using his own image and advised users to check the website learnOpenCV.com to learn how to approximate a nonlinear operation with linear ones.

#5 Finding triangles in an image using Delaunay triangulation

A Delaunay triangulation subdivides a set of points in a plane into triangles such that the points become the vertices of the triangles. The surface is subdivided in such a way that no point lies inside the circumcircle of any triangle, so in particular no triangle on the image has another point inside it. Brzęczkowski then demonstrates how the image developed in the previous stage contained “face points from which you can identify my teeth and then create sub div to the object, insert all these points that I created or all detected.” Next, he applies Delaunay triangulation to produce a list of triangles. Finally, he uses his delaunay_triangulation.py script to draw these triangles on the images.
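
To make steps #3 and #5 more concrete, here is a minimal pure-Python sketch. It is not the code shown in the talk, and the function names and sample points are illustrative assumptions. The first function returns hull vertices as indexes into the input list, which mirrors the index-style output described for the convex hull step. The second checks the empty-circumcircle condition that characterises a Delaunay triangle.

```python
def cross(o, a, b):
    """Cross product of (a - o) and (b - o); positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_indices(points):
    """Indices of the hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    if len(order) < 3:
        return order

    def half_hull(indices):
        chain = []
        for i in indices:
            # Drop earlier points that would make the chain turn right or go straight.
            while len(chain) >= 2 and cross(points[chain[-2]], points[chain[-1]], points[i]) <= 0:
                chain.pop()
            chain.append(i)
        return chain

    lower = half_hull(order)
    upper = half_hull(reversed(order))
    # Each half ends where the other begins, so drop the duplicated endpoints.
    return lower[:-1] + upper[:-1]


def in_circumcircle(a, b, c, p):
    """True if p is strictly inside the circle through a, b, c (given counter-clockwise)."""
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cx, cy = c[0] - p[0], c[1] - p[1]
    det = ((ax * ax + ay * ay) * (bx * cy - by * cx)
           - (bx * bx + by * by) * (ax * cy - ay * cx)
           + (cx * cx + cy * cy) * (ax * by - ay * bx))
    return det > 0


# Hypothetical sample points standing in for detected facial landmarks.
landmarks = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
hull = convex_hull_indices(landmarks)
print(hull)  # → [0, 1, 2, 3]

# The triangle (0, 1, 3) is not a valid Delaunay triangle here,
# because landmark 4 lies inside its circumcircle.
print(in_circumcircle(landmarks[0], landmarks[1], landmarks[3], landmarks[4]))  # → True
```

With OpenCV installed, cv2.convexHull called with returnPoints=False plays the role of convex_hull_indices; the sketch above only avoids the dependency so that the idea can be run anywhere.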
#6 Blending one face into another

To recap, we started by detecting a face using HOG and finding its border using the convex hull, and then added mouth points to indicate specific indexes. Next, Delaunay triangulation was applied to obtain all the triangles on the images. Brzęczkowski then begins blending the images using seamless cloning. Seamless cloning, which is based on Poisson image editing, inserts a region of one image into another so that the boundary between the two is not visible. Because the blend is driven by gradients, the pasted face takes on the skin tone and lighting of the destination image. Brzęczkowski explains that Poisson image editing uses the values of the gradients instead of the intensities, that is, the values of the pixels of the image. To implement the same method in OpenCV, he further demonstrates that the source image, destination image, mask and center (the location where the cloned part should be placed) are required to blend the two faces. Brzęczkowski then shows a series of illustrations in which his face is swapped with those of popular artists like Jamie Foxx, Clint Eastwood, and others.

#7 Stabilization using optical flow with the Lucas-Kanade method

In computer vision, the Lucas-Kanade method is a widely used differential method for optical flow estimation. It assumes that the flow is essentially constant in a local neighborhood of the pixel under consideration, and solves the basic optical flow equations for all the pixels in that neighborhood by the least-squares criterion. Thus, by combining information from several nearby pixels, the Lucas-Kanade method resolves the inherent ambiguity of the optical flow equation. This method is also less sensitive to image noise.
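
The least-squares step described above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the implementation from the talk: the function name lucas_kanade_window and the sample derivative values are made up, and a real pipeline would more likely call OpenCV's cv2.calcOpticalFlowPyrLK. Each pixel in the window contributes one equation Ix*u + Iy*v = -It, so a 3x3 window gives nine equations for the two unknowns u and v.

```python
def lucas_kanade_window(ix, iy, it):
    """Least-squares flow (u, v) shared by every pixel in one window.

    ix, iy: spatial image derivatives, one value per pixel in the window.
    it:     temporal derivative (current frame minus previous frame) per pixel.
    """
    # Entries of the 2x2 normal equations  A^T A [u, v]^T = -A^T it.
    sxx = sum(x * x for x in ix)
    sxy = sum(x * y for x, y in zip(ix, iy))
    syy = sum(y * y for y in iy)
    sxt = sum(x * t for x, t in zip(ix, it))
    syt = sum(y * t for y, t in zip(iy, it))

    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        # Flat or edge-only windows do not determine both flow components.
        raise ValueError("window does not have enough texture to estimate flow")

    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


# Hypothetical derivatives for a 3x3 window (nine pixels, row by row).
ix = [1, 2, 0, 1, 3, 2, 0, 1, 2]
iy = [0, 1, 2, 1, 0, 2, 1, 3, 1]

# Build temporal derivatives that are consistent with a known flow,
# then check that the solver recovers it.
true_u, true_v = 1.5, -0.5
it = [-(gx * true_u + gy * true_v) for gx, gy in zip(ix, iy)]

u, v = lucas_kanade_window(ix, iy, it)
print(round(u, 6), round(v, 6))  # → 1.5 -0.5
```

The determinant check reflects the usual caveat of the method: when all gradients in the window point the same way, the two unknowns cannot be separated, which is why trackers prefer corner-like points.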
When this method is used to stabilize the face-swapped image, the assumption, in plain language, is that the optical flow is essentially constant in a local neighborhood of the pixel under consideration. This means that “if we have a red point in the center we assume that all the points around, let's say in this example is three on three pixels we assume that all of them have the same optical flow and thanks to that assumption we have nine equations and only two unknowns.” This makes the system fairly easy to solve. Under this assumption, the optical flow can be computed as long as we have the previous grayscale frame. This means that, for face swapping with OpenCV, a user needs the points from the previous frame along with the points from the current frame. By combining all this information, the final point becomes a combination of the detected landmark and the predicted landmark. Thus, by using the Lucas-Kanade method for stabilization, Brzęczkowski obtains a non-shaky version of his face-swapped image.

Watch Brzęczkowski’s full video to see a step-by-step implementation of a face-swapping task. You can learn advanced applications like facial recognition, target tracking, or augmented reality from our book, ‘Mastering OpenCV 4 with Python’ written by Alberto Fernández Villán. This book will also help you understand the application of artificial intelligence and deep learning techniques using popular Python libraries like TensorFlow and Keras.

Getting to know PyMC3, a probabilistic programming framework for Bayesian Analysis in Python

How to perform exception handling in Python with ‘try, catch and finally’

Implementing color and shape-based object detection and tracking with OpenCV and CUDA [Tutorial]

OpenCV 4.0 releases with experimental Vulcan, G-API module and QR-code detector among others

How-To Tutorials - Data

Decoding the reasons behind Alphabet’s record high earnings in Q2 2018

Time series modeling: What is it, Why it matters and How it's used

Hand Gesture Recognition Using a Kinect Depth Sensor

Getting to know and manipulate Tensors in TensorFlow

What is a convolutional neural network (CNN)? [Video]

4 common challenges in Web Scraping and how to handle them

Why DeepMind AlphaGo Zero is a game changer for AI research

Top 10 MySQL 8 performance benchmarking aspects to know

Distributed training in TensorFlow 2.x

Worried about Deepfakes? Check out the new algorithm that manipulate talking-head videos by altering the transcripts

Trending Topics

How to learn data science: from data mining to machine learning

Google confirms it paid $135 million as exit packages to senior execs accused of sexual harassment

Key skills every database programmer should have

Bitbucket to no longer support Mercurial, users must migrate to Git by May 2020

Master the art of face swapping with OpenCV and Python by Sylwek Brzęczkowski, developer at TrustStamp
