Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-threejs-materials-and-texture

06 Feb 2015

11 min read

Three.js - Materials and Texture

06 Feb 2015

In this article by Jos Dirksen author of the book Three.js Cookbook, we will learn how Three.js offers a large number of different materials and supports many different types of textures. These textures provide a great way to create interesting effects and graphics. In this article, we'll show you recipes that allow you to get the most out of these components provided by Three.js. (For more resources related to this topic, see here.) Using HTML canvas as a texture Most often when you use textures, you use static images. With Three.js, however, it is also possible to create interactive textures. In this recipe, we will show you how you can use an HTML5 canvas element as an input for your texture. Any change to this canvas is automatically reflected after you inform Three.js about this change in the texture used on the geometry. Getting ready For this recipe, we need an HTML5 canvas element that can be displayed as a texture. We can create one ourselves and add some output, but for this recipe, we've chosen something else. We will use a simple JavaScript library, which outputs a clock to a canvas element. The resulting mesh will look like this (see the 04.03-use-html-canvas-as-texture.html example): The JavaScript used to render the clock was based on the code from this site: http://saturnboy.com/2013/10/html5-canvas-clock/. To include the code that renders the clock in our page, we need to add the following to the head element: <script src="../libs/clock.js"></script> How to do it... To use a canvas as a texture, we need to perform a couple of steps: The first thing we need to do is create the canvas element: var canvas = document.createElement('canvas'); canvas.width=512; canvas.height=512; Here, we create an HTML canvas element programmatically and define a fixed width. Now that we've got a canvas, we need to render the clock that we use as the input for this recipe on it. The library is very easy to use; all you have to do is pass in the canvas element we just created: clock(canvas); At this point, we've got a canvas that renders and updates an image of a clock. What we need to do now is create a geometry and a material and use this canvas element as a texture for this material: var cubeGeometry = new THREE.BoxGeometry(10, 10, 10); var cubeMaterial = new THREE.MeshLambertMaterial(); cubeMaterial.map = new THREE.Texture(canvas); var cube = new THREE.Mesh(cubeGeometry, cubeMaterial); To create a texture from a canvas element, all we need to do is create a new instance of THREE.Texture and pass in the canvas element we created in step 1. We assign this texture to the cubeMaterial.map property, and that's it. If you run the recipe at this step, you might see the clock rendered on the sides of the cubes. However, the clock won't update itself. We need to tell Three.js that the canvas element has been changed. We do this by adding the following to the rendering loop: cubeMaterial.map.needsUpdate = true; This informs Three.js that our canvas texture has changed and needs to be updated the next time the scene is rendered. With these four simple steps, you can easily create interactive textures and use everything you can create on a canvas element as a texture in Three.js. How it works... How this works is actually pretty simple. Three.js uses WebGL to render scenes and apply textures. WebGL has native support for using HTML canvas element as textures, so Three.js just passes on the provided canvas element to WebGL and it is processed as any other texture. Making part of an object transparent You can create a lot of interesting visualizations using the various materials available with Three.js. In this recipe, we'll look at how you can use the materials available with Three.js to make part of an object transparent. This will allow you to create complex-looking geometries with relative ease. Getting ready Before we dive into the required steps in Three.js, we first need to have the texture that we will use to make an object partially transparent. For this recipe, we will use the following texture, which was created in Photoshop: You don't have to use Photoshop; the only thing you need to keep in mind is that you use an image with a transparent background. Using this texture, in this recipe, we'll show you how you can create the following (04.08-make-part-of-object-transparent.html): As you can see in the preceeding, only part of the sphere is visible, and you can look through the sphere to see the back at the other side of the sphere. How to do it... Let's look at the steps you need to take to accomplish this: The first thing we do is create the geometry. For this recipe, we use THREE.SphereGeometry: var sphereGeometry = new THREE.SphereGeometry(6, 20, 20); Just like all the other recipes, you can use whatever geometry you want. In the second step, we create the material: var mat = new THREE.MeshPhongMaterial(); mat.map = new THREE.ImageUtils.loadTexture( "../assets/textures/partial-transparency.png"); mat.transparent = true; mat.side = THREE.DoubleSide; mat.depthWrite = false; mat.color = new THREE.Color(0xff0000); As you can see in this fragment, we create THREE.MeshPhongMaterial and load the texture we saw in the Getting ready section of this recipe. To render this correctly, we also need to set the side property to THREE.DoubleSide so that the inside of the sphere is also rendered, and we need to set the depthWrite property to false. This will tell WebGL that we still want to test our vertices against the WebGL depth buffer, but we don't write to it. Often, you need to set this to false when working with more complex transparent objects or particles. Finally, add the sphere to the scene: var sphere = new THREE.Mesh(sphereGeometry, mat); scene.add(sphere); With these simple steps, you can create really interesting effects by just experimenting with textures and geometries. There's more With Three.js, it is possible to repeat textures (refer to the Setup repeating textures recipe). You can use this to create interesting-looking objects such as this: The code required to set a texture to repeat is the following: var mat = new THREE.MeshPhongMaterial(); mat.map = new THREE.ImageUtils.loadTexture( "../assets/textures/partial-transparency.png"); mat.transparent = true; mat.map.wrapS = mat.map.wrapT = THREE.RepeatWrapping; mat.map.repeat.set( 4, 4 ); mat.depthWrite = false; mat.color = new THREE.Color(0x00ff00); By changing the mat.map.repeat.set values, you define how often the texture is repeated. Using a cubemap to create reflective materials With the approach Three.js uses to render scenes in real time, it is difficult and very computationally intensive to create reflective materials. Three.js, however, provides a way you can cheat and approximate reflectivity. For this, Three.js uses cubemaps. In this recipe, we'll explain how to create cubemaps and use them to create reflective materials. Getting ready A cubemap is a set of six images that can be mapped to the inside of a cube. They can be created from a panorama picture and look something like this: In Three.js, we map such a map on the inside of a cube or sphere and use that information to calculate reflections. The following screenshot (example 04.10-use-reflections.html) shows what this looks like when rendered in Three.js: As you can see in the preceeding screenshot, the objects in the center of the scene reflect the environment they are in. This is something often called a skybox. To get ready, the first thing we need to do is get a cubemap. If you search on the Internet, you can find some ready-to-use cubemaps, but it is also very easy to create one yourself. For this, go to http://gonchar.me/panorama/. On this page, you can upload a panoramic picture and it will be converted to a set of pictures you can use as a cubemap. For this, perform the following steps: First, get a 360 degrees panoramic picture. Once you have one, upload it to the http://gonchar.me/panorama/ website by clicking on the large OPEN button: Once uploaded, the tool will convert the panorama picture to a cubemap as shown in the following screenshot: When the conversion is done, you can download the various cube map sites. The recipe in this book uses the naming convention provided by Cube map sides option, so download them. You'll end up with six images with names such as right.png, left.png, top.png, bottom.png, front.png, and back.png. Once you've got the sides of the cubemap, you're ready to perform the steps in the recipe. How to do it... To use the cubemap we created in the previous section and create reflecting material,we need to perform a fair number of steps, but it isn't that complex: The first thing you need to do is create an array from the cubemap images you downloaded: var urls = [ '../assets/cubemap/flowers/right.png', '../assets/cubemap/flowers/left.png', '../assets/cubemap/flowers/top.png', '../assets/cubemap/flowers/bottom.png', '../assets/cubemap/flowers/front.png', '../assets/cubemap/flowers/back.png' ]; With this array, we can create a cubemap texture like this: var cubemap = THREE.ImageUtils.loadTextureCube(urls); cubemap.format = THREE.RGBFormat; From this cubemap, we can use THREE.BoxGeometry and a custom THREE.ShaderMaterial object to create a skybox (the environment surrounding our meshes): var shader = THREE.ShaderLib[ "cube" ]; shader.uniforms[ "tCube" ].value = cubemap; var material = new THREE.ShaderMaterial( { fragmentShader: shader.fragmentShader, vertexShader: shader.vertexShader, uniforms: shader.uniforms, depthWrite: false, side: THREE.DoubleSide }); // create the skybox var skybox = new THREE.Mesh( new THREE.BoxGeometry( 10000, 10000, 10000 ), material ); scene.add(skybox); Three.js provides a custom shader (a piece of WebGL code) that we can use for this. As you can see in the code snippet, to use this WebGL code, we need to define a THREE.ShaderMaterial object. With this material, we create a giant THREE.BoxGeometry object that we add to scene. Now that we've created the skybox, we can define the reflecting objects: var sphereGeometry = new THREE.SphereGeometry(4,15,15); var envMaterial = new THREE.MeshBasicMaterial( {envMap:cubemap}); var sphere = new THREE.Mesh(sphereGeometry, envMaterial); As you can see, we also pass in the cubemap we created as a property (envmap) to the material. This informs Three.js that this object is positioned inside a skybox, defined by the images that make up cubemap. The last step is to add the object to the scene, and that's it: scene.add(sphere); In the example in the beginning of this recipe, you saw three geometries. You can use this approach with all different types of geometries. Three.js will determine how to render the reflective area. How it works... Three.js itself doesn't really do that much to render the cubemap object. It relies on a standard functionality provided by WebGL. In WebGL, there is a construct called samplerCube. With samplerCube, you can sample, based on a specific direction, which color matches the cubemap object. Three.js uses this to determine the color value for each part of the geometry. The result is that on each mesh, you can see a reflection of the surrounding cubemap using the WebGL textureCube function. In Three.js, this results in the following call (taken from the WebGL shader in GLSL): vec4 cubeColor = textureCube( tCube, vec3( -vReflect.x, vReflect.yz ) ); A more in-depth explanation on how this works can be found at http://codeflow.org/entries/2011/apr/18/advanced-webgl-part-3-irradiance-environment-map/#cubemap-lookup. There's more... In this recipe, we created the cubemap object by providing six separate images. There is, however, an alternative way to create the cubemap object. If you've got a 360 degrees panoramic image, you can use the following code to directly create a cubemap object from that image: var texture = THREE.ImageUtils.loadTexture( 360-degrees.png', new THREE.UVMapping()); Normally when you create a cubemap object, you use the code shown in this recipe to map it to a skybox. This usually gives the best results but requires some extra code. You can also use THREE.SphereGeometry to create a skybox like this: var mesh = new THREE.Mesh( new THREE.SphereGeometry( 500, 60, 40 ), new THREE.MeshBasicMaterial( { map: texture })); mesh.scale.x = -1; This applies the texture to a sphere and with mesh.scale, turns this sphere inside out. Besides reflection, you can also use a cubemap object for refraction (think about light bending through water drops or glass objects): All you have to do to make a refractive material is load the cubemap object like this: var cubemap = THREE.ImageUtils.loadTextureCube(urls, new THREE.CubeRefractionMapping()); And define the material in the following way: var envMaterial = new THREE.MeshBasicMaterial({envMap:cubemap}); envMaterial.refractionRatio = 0.95; Summary In this article, we learned about the different textures and materials supported by Three.js Resources for Article: Further resources on this subject: Creating the maze and animating the cube [article] Working with the Basic Components That Make Up a Three.js Scene [article] Mesh animation [article]

0
0
25991

article-image-reinforcement-learning-mdp-markov-decision-process-tutorial

Fatema Patrawala

09 Jul 2018

11 min read

Implement Reinforcement learning using Markov Decision Process [Tutorial]

Fatema Patrawala

09 Jul 2018

11 min read

0
0
25990

article-image-datacamp-reckons-in-metoo-movement-ceo-steps-down-from-his-role-indefinitely

Fatema Patrawala

25 Apr 2019

7 min read

DataCamp reckons with its #MeToo movement; CEO steps down from his role indefinitely

Fatema Patrawala

25 Apr 2019

7 min read

The data science community is reeling after data science learning startup DataCamp penned a blog post acknowledging that an unnamed company executive made "uninvited physical contact" with one of its employees. DataCamp, which operates an e-platform where aspiring data scientists can take courses in coding and data analysis is a startup valued at $184 million. It has additionally raised over $30 million in funding. The company disclosed in a blog post published on 4th April that this incident occurred at an "informal employee gathering" at a bar in October 2017. The unnamed DataCamp executive had "danced inappropriately and made uninvited physical contact" with the employee on the dance floor, the post read. The company didn't name the executive involved in the incident in its post. But called the executive's behavior on the dance floor "entirely inappropriate" and "inconsistent" with employee expectations and policies. When Buisness Insider reached out to one of the course instructors OS Keyes familiar with this matter, Keyes said that the executive in question is DataCamp's co-founder and CEO Jonathan Cornelissen. Yesterday Motherboard also reported that the company did not adequately address sexual misconduct by a senior executive there and instructors at DataCamp have begun boycotting the service and asking the company to delete their courses following allegations. What actually happened and how did DataCamp respond? On April 4, DataCamp shared a statement on its blog titled “a note to our community.” In it, the startup addressed the accusations against one of the company’s executives: “In October 2017, at an informal employee gathering at a bar after a week-long company offsite, one of DataCamp’s executives danced inappropriately and made uninvited physical contact with another employee while on the dance floor.” DataCamp got the complaint reviewed by a “third party not involved in DataCamp’s day-to-day business,” and said it took several “corrective actions,” including “extensive sensitivity training, personal coaching, and a strong warning that the company will not tolerate any such behavior in the future.” DataCamp only posted its blog a day after more than 100 DataCamp instructors signed a letter and sent it to DataCamp executives. “We are unable to cooperate with continued silence and lack of transparency on this issue,” the letter said. “The situation has not been acknowledged adequately to the data science community, leading to harmful rumors and uncertainty.” But as instructors read the statement from DataCamp following the letter, many found the actions taken to be insufficient. https://twitter.com/hugobowne/status/1120733436346605568 https://twitter.com/NickSolomon10/status/1120837738004140038 Motherboard reported this case in detail taking notes from Julia Silge, a data scientist who co-authored the letter to DataCamp. Julia says that going public with our demands for accountability was the last resort. Julia spoke about the incident in detail and says she remembered seeing the victim of the assault start working at DataCamp and then leave abruptly. This raised “red flags” but she did not reach out to her. Then Silge heard about the incident from a mutual friend and she began to raise the issue with internal people at DataCamp. “There were various responses from the rank and file. It seemed like after a few months of that there was not a lot of change, so I escalated a little bit,” she said. DataCamp finally responded to Silge by saying “I think you have misconceptions about what happened,” and they also mentioned that “there was alcohol involved” to explain the behavior of the executive. DataCamp further explained that “We also heard over and over again, ‘This has been thoroughly handled.’” But according to Silge and other instructors who have spoken out, say that DataCamp hasn’t properly handled the situation and has tried to sweep it under the rug. Silge also created a private Slack group to communicate and coordinate their efforts to confront this issue. She along with the group got into a group video conference with DataCamp, which was put into “listen-only” mode for all the other participants except DataCamp, meaning they could not speak in the meeting, and were effectively silenced. “It felt like 30 minutes of the DataCamp leadership saying what they wanted to say to us,” Silge said. “The content of it was largely them saying how much they valued diversity and inclusion, which is hard to find credible given the particular ways DataCamp has acted over the past.” Following that meeting, instructors began to boycott DataCamp more blatantly, with one instructor refusing to make necessary upgrades to her course until DataCamp addressed the situation. Silge and two other instructors eventually drafted and sent the letter, at first to the small group involved in accountability efforts, then to almost every DataCamp instructor. All told, the letter received more than 100 signatures (of about 200 total instructors). A DataCamp spokesperson said in response to this, “When we became aware of this matter, we conducted a thorough investigation and took actions we believe were necessary and appropriate. However, recent inquiries have made us aware of mischaracterizations of what occurred and we felt it necessary to make a public statement. As a matter of policy, we do not disclose details on matters like this, to protect the privacy of the individuals involved.” “We do not retaliate against employees, contractors or instructors or other members of our community, under any circumstances, for reporting concerns about behavior or conduct,” the company added. The response received from DataCamp was not only inadequate, but technologically faulty, as per one of the contractors Noam Ross who pointed out in his blog post that DataCamp had published the blog with a “no-index” tag, meaning it would not show up in aggregated searches like Google results. Thus adding this tag knowingly represents DataCamp’s continued lack of public accountability. OS Keyes said to Business Insider that at this point, the best course of action for DataCamp is a blatant change in leadership. “The investors need to get together and fire the [executive], and follow that by publicly explaining why, apologising, compensating the victim and instituting a much more rigorous set of work expectations,” Keyes said. #Rstats and other data science communities and DataCamp instructors take action One of the contractors Ines Montani expressed this by saying, “I was pretty disappointed, appalled and frustrated by DataCamp's reaction and non-action, especially as more and more details came out about how they essentially tried to sweep this under the rug for almost two years,” Due to their contracts, many instructors cannot take down their DataCamp courses. Instead of removing the courses, many contractors for DataCamp, including Montani, took to Twitter after DataCamp published the blog, urging students to boycott the very courses they designed. https://twitter.com/noamross/status/1116667602741485571 https://twitter.com/daniellequinn88/status/1117860833499832321 https://twitter.com/_tetration_/status/1118987968293875714 Instructors put financial pressures on the company by boycotting their own courses. They also wanted to get the executive responsible for such misbehaviour account for his actions, compensate the victim and compensate those who were fired for complaining—this may ultimately undercut DataCamp’s bottom line. Influential open-source communities, including RStudio, SatRdays, and R-Ladies, have cut all ties with DataCamp to show disappointment with the lack of serious accountability.. CEO steps down “indefinitely” from his role and accepts his mistakes Today Jonathan Cornelissen, accepted his mistake and wrote a public apology for his inappropriate behaviour. He writes, “I want to apologize to a former employee, our employees, and our community. I have failed you twice. First in my behavior and second in my failure to speak clearly and unequivocally to you in a timely manner. I am sorry.” He has also stepped down from his position as the company CEO indefinitely until there is complete review of company’s environment and culture. While it is in the right direction, unfortunately this apology comes to the community very late and is seen as a PR move to appease the backlash from the data science community and other instructors. https://twitter.com/mrsnoms/status/1121235830381645824 9 Data Science Myths Debunked 30 common data science terms explained Why is data science important?

0
0
25950

article-image-mysql-data-transfer-using-sql-server-integration-services-ssis

Packt Editorial Staff

12 Aug 2009

8 min read

MySQL Data Transfer using Sql Server Integration Services (SSIS)

Packt Editorial Staff

12 Aug 2009

8 min read

There are a large number of posts on various difficulties experienced while transferring data from MySQL using Microsoft SQL Server Integration Services. While the transfer of data from MySQL to Microsoft SQL Server 2008 is not fraught with any blocking issues, transfer of data from SQL Server 2008 to MySQL has presented various problems. There are some workarounds suggested. In this article by Dr. Jay Krishnaswamy, data transfer to MySQL using SQL Server Integration Services will be described. If you are new to SQL Server Integration Services (SSIS) you may want to read a book by the same author on Beginners Guide to SQL Server Integration Services Using Visual Studio 2005, published by Packt. Connectivity with MySQL For data interchange with MySQL there are two options one of which can be accessed in the connection wizards of SQL Server Integration Services assuming you have installed the programs. The other can be used to set up a ODBC DSN as described further down. The two connection options are: MySQL Connector/ODBC 5.1 Connector/Net 5.2 New versions 6.0 & 6.1 In this article we will be using the ODBC connector for MySQL which can be downloaded from the MySQL Site. The connector will be used to create an ODBC DSN. Transferring a table from SQL Server 2008 to MySQL We will transfer a table in the TestNorthwind database on SQL Server 2008 (Enterprise & Evaluation) to MySQL server database. The MySQL database we are using is described in the article on Exporting data from MS Access 2003 to MySQL. In another article, MySQL Linked Server on SQL Server 2008, creating an ODBC DSN for MySQL was described. We will be using the DSN created in that article. Creating an Integration Services project in Visual Studio 2008 Start the Visual Studio 2008 program from its shortcut. Click File | New | Project... to open the New Project window and select an integration services template from the business intelligence projects by providing a suitable name. The project folder will have a file called Package.dtsx which can be renamed with a custom name. Add and configure an ADO.NET Source The Project's package designer will be open displaying the Control Flow tab. Drag and drop a Data Flow Task on to the control flow tabbed page. Click next on the Data Flow tab in the designer to display the Data Flow page. Read the instructions on this page. Drag and drop a ADO.NET Source from the Data Flow Sources items in the Toolbox. It is assumed that you can set up a connection manager to the resident SQL Server 2008 on your machine. The next figure shows the configured connection manager to the SQL Server 2008. The table (PrincetonTemp) that will be transferred is in the TestNorthwind database. The authentication is Windows and a .NET provider is used to access the data. You may also test the connection by clicking the Test Connection button. If the connection shown above is correctly configured, the test should indicate a successful connection. Right click the ADO.NET source and from the drop-down click Edit. The ADO.NET Source Editor gets displayed. As mentioned earlier you should be able to access the table and view objects on the database as shown in the next figure. We have chosen to transfer a simple table, PrincetonTemp from the TestNorthwind database on SQL Server 2008. It has a only couple of columns as shown in the Columns page of the ADO.NET Source Editor. The default for the Error page setting has been assumed, that is, if there is an error or truncation of data the task will fail. Add an ADO.NET destination and port the data from the source Drag and drop an ADO.NET destination item from under Data Flow Destinations items in the Toolbox on to the data flow page of the designer. There are two ways to arrange for the data to flow from source to the destination. The easy way is just drag the green dangling line from the source with your mouse and let go on the ADO.NET destination. A solid line will connect the source and the destination as shown. (For more resources on Microsoft, see here.) Configure a connection manager to connect to MySQL In the Connection Manager's pane under the Package designer right click to display a pop-up menu which allows you to make a new connection. When you agree to make a new ADO.NET Connection the Configure ADO.NET connection Manager's window shows up and click on New... button on this page. The connection manager's page gets displayed as shown. In the Providers drop-down you will see a number of providers. There are the two providers that you can use, the ODBC through the connector and the MySQL Data Provider. Click on the Odbc Data Provider. As mentioned previously we will be using the System DSN MySQL_Link created earlier for the other article shown in the drop-down list of available ODBC DSN's. Provide the USERID and Password; click the Test Connection button. If all the information is correct you should get a success message as shown. Close out of the message as well as the Configure ADO.NET Connection Manager windows. Right click the ADO.NET Destination to display its editor window. In the drop-down for connection manager you should be able to pick the connection Manager you created in the previous step (MySQL_INK.root) as shown. Click on the New... button to create a Table or View. You will get a warning message regarding not knowing the mapping to SSIS as shown. Click OK. The create table window gets displayed as shown. Notice that the table is displaying all the columns from the table that the source is sending out. If you were to click OK, you would get an error that the syntax is not correct as shown. Modify the table as shown to change the destination table name (your choice) and the data type. CREATE TABLE From2k8( "Id" INT, "Month" VARCHAR(10), "Temperature" DOUBLE PRECISION, "RecordHigh" DOUBLE PRECISION ) Click OK. Again you get the same error regarding syntax not being correct. Modify the Create Table statement further as shown. CREATE TABLE From2k8 ( Id INT, Month VARCHAR(10), Temperature DOUBLE PRECISION, RecordHigh DOUBLE PRECISION ) Click OK after the above modification. The table gets added to the ADO.NET Destination Manager Editor as shown. Click on the Mappings on the left side of the ADO.NET Destination Editor. The column mappings page gets displayed as shown. We accept the default settings for Error Output page. Click OK. Build the project and execute the package by right clicking the package and choosing Execute Package. The program runs and processes the package and ends up being unsuccessful with the error message in the Progress tab of the project as shown (only relevant message is shown here). .... ..... [SSIS.Pipeline] Information: Execute phase is beginning. [ADO NET Destination 1 [165]] Error: An exception has occurred during data insertion, the message returned from the provider is: ERROR [42000] [MySQL][ODBC 5.1 Driver] [mysqld-5.1.30-community]You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '"Id", "Month", "Temperature", "RecordHigh") VALUES (1, 'Jan ', 4.000000000' at line 1 [SSIS.Pipeline] Error: SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component "ADO NET Destination 1" (165) failed with error code 0xC020844B while processing input "ADO NET Destination Input" (168). The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure. [SSIS.Pipeline] Information: Post Execute phase is beginning. ...... .... Task Data Flow Task failed .... Start the MySQL Server and login to it. Run the following commands as shown in the next figure. By setting the mode to 'ANSI' makes the syntax more standard like as MySQL can cater to clients using other SQL modes. This is why the above error is returned although the syntax itself appears correct. In fact a create statement run on command line to create a table directly on MySQL could not create a table and returned an error when SSIS was used to create the same table. After running the above statements, build the BI project and execute the package. This time the execution is will be successful and you can query the MySQL Server as in the following: Summary The article describes step by step transferring a table from SQL Server 2008 to MySQL using ODBC connectivity. For successful transfer, the data type differences between SQL Server 2008 and the MySQL version must be properly taken into consideration as well as correctly setting the SQL_Mode property of MySQL Server. Further resources on this subject: Easy guide to understand WCF in Visual Studio 2008 SP1 and Visual Studio 2010 Express MySQL Linked Server on SQL Server 2008 Displaying MySQL data on an ASP.NET Web Page Exporting data from MS Access 2003 to MySQL Transferring Data from MS Access 2003 to SQL Server 2008

0
0
25808

article-image-write-effective-stored-procedures-postgresql

Amey Varangaonkar

12 Dec 2017

8 min read

How to write effective Stored Procedures in PostgreSQL

Amey Varangaonkar

12 Dec 2017

8 min read

[box type="note" align="" class="" width=""]This article is an excerpt from the book Mastering PostgreSQL 9.6 written by Hans-Jürgen Schönig. PostgreSQL is an open source database used not only for handling large datasets (Big Data) but also as a JSON document database. It finds applications in the software as well as the web domains. This book will enable you to build better PostgreSQL applications and administer databases more efficiently. [/box] In this article, we explain the concept of Stored Procedures, and how to write them effectively in PostgreSQL 9.6. When it comes to stored procedures, PostgreSQL differs quite significantly from other database systems. Most database engines force you to use a certain programming language to write server-side code. Microsoft SQL Server offers Transact-SQL while Oracle encourages you to use PL/SQL. PostgreSQL does not force you to use a certain language but allows you to decide on what you know best and what you like best. The reason PostgreSQL is so flexible is actually quite interesting too in a historical sense. Many years ago, one of the most well-known PostgreSQL developers (Jan Wieck), who had written countless patches back in its early days, came up with the idea of using TCL as the server-side programming language. The trouble was simple—nobody wanted to use TCL and nobody wanted to have this stuff in the database engine. The solution to the problem was to make the language interface so flexible that basically any language can be integrated with PostgreSQL easily. Then, the CREATE LANGUAGE clause was born: test=# h CREATE LANGUAGE Command: CREATE LANGUAGE Description: define a new procedural language Syntax: CREATE [ OR REPLACE ] [ PROCEDURAL ] LANGUAGE name CREATE [ OR REPLACE ] [ TRUSTED ] [ PROCEDURAL ] LANGUAGE name HANDLER call_handler [ INLINE inline_handler ] [ VALIDATOR valfunction ] Nowadays, many different languages can be used to write stored procedures. The flexibility added to PostgreSQL back in the early days has really paid off, and so you can choose from a rich set of programming languages. How exactly does PostgreSQL handle languages? If you take a look at the syntax of the CREATE LANGUAGE clause, you will see a couple of keywords: HANDLER: This function is actually the glue between PostgreSQL and any external language you want to use. It is in charge of mapping PostgreSQL data structures to whatever is needed by the language and helps to pass the code around. VALIDATOR: This is the policeman of the infrastructure. If it is available, it will be in charge of delivering tasty syntax errors to the end user. Many languages are able to parse the code before actually executing it. PostgreSQL can use that and tell you whether a function is correct or not when you create it. Unfortunately, not all languages can do this, so in some cases, you will still be left with problems showing up at runtime. INLINE: If it is present, PostgreSQL will be able to run anonymous code blocks in this function. The anatomy of a stored procedure Before actually digging into a specific language, I want to talk a bit about the anatomy of a typical stored procedure. For demo purposes, I have written a function that just adds up two numbers: test=# CREATE OR REPLACE FUNCTION mysum(int, int) RETURNS int AS ' SELECT $1 + $2; ' LANGUAGE 'sql'; CREATE FUNCTION The first thing you can see is that the procedure is written in SQL. PostgreSQL has to know which language we are using, so we have to specify that in the definition. Note that the code of the function is passed to PostgreSQL as a string ('). That is somewhat noteworthy because it allows a function to become a black box to the execution machinery. In other database engines, the code of the function is not a string but is directly attached to the statement. This simple abstraction layer is what gives the PostgreSQL function manager all its power. Inside the string, you can basically use all that the programming language of your choice has to offer. In my example, I am simply adding up two numbers passed to the function. For this example, two integer variables are in use. The important part here is that PostgreSQL provides you with function overloading. In other words, the mysum(int, int) function is not the same as the mysum(int8, int8) function. PostgreSQL sees these things as two distinct functions. Function overloading is a nice feature; however, you have to be very careful not to accidentally deploy too many functions if your parameter list happens to change from time to time. Always make sure that functions that are not needed anymore are really deleted. The CREATE OR REPLACE FUNCTION clause will not change the parameter list. You can, therefore, use it only if the signature does not change. It will either error out or simply deploy a new function. Let's run the function: test=# SELECT mysum(10, 20); Mysum ------- 30 (1 row) The result is not really surprising. Introducing dollar quoting Passing code to PostgreSQL as a string is very flexible. However, using single quotes can be an issue. In many programming languages, single quotes show up frequently. To be able to use quotes, people have to escape them when passing the string to PostgreSQL. For many years this has been the standard procedure. Fortunately, those old times have passed by and new means to pass the code to PostgreSQL are available: test=# CREATE OR REPLACE FUNCTION mysum(int, int) RETURNS int AS $$ SELECT $1 + $2; $$ LANGUAGE 'sql'; CREATE FUNCTION The solution to the problem of quoting strings is called dollar quoting. Instead of using quotes to start and end strings, you can simple use $$. Currently, I am only aware of two languages that have assigned a meaning to $$. In Perl as well as in bash scripts, $$ represents the process ID. To overcome even this little obstacle, you can use $ almost anything $ to start and end the string. The following example shows how that works: test=# CREATE OR REPLACE FUNCTION mysum(int, int) RETURNS int AS $ $ SELECT $1 + $2; $ $ LANGUAGE 'sql'; CREATE FUNCTION All this flexibility allows you to really overcome the problem of quoting once and for all. As long as the start string and the end string match, there won't be any problems left. Making use of anonymous code blocks So far, you have learned to write the most simplistic stored procedures possible, and you have learned to execute code. However, there is more to code execution than just full-blown stored procedures. In addition to full-blown procedures, PostgreSQL allows the use of anonymous code blocks. The idea is to run code that is needed only once. This kind of code execution is especially useful to deal with administrative tasks. Anonymous code blocks don't take parameters and are not permanently stored in the database as they don't have a name anyway. Here is a simple example: test=# DO $$ BEGIN RAISE NOTICE 'current time: %', now(); END; $$ LANGUAGE 'plpgsql'; NOTICE: current time: 2016-12-12 15:25:50.678922+01 CONTEXT: PL/pgSQL function inline_code_block line 3 at RAISE DO In this example, the code only issues a message and quits. Again, the code block has to know which language it uses. The string is again passed to PostgreSQL using simple dollar quoting. Using functions and transactions As you know, everything that PostgreSQL exposes in user land is a transaction. The same, of course, applies if you are writing stored procedures. The procedure is always part of the transaction you are in. It is not autonomous—it is just like an operator or any other operation. Here is an example: All three function calls happen in the same transaction. This is important to understand because it implies that you cannot do too much transactional flow control inside a function. Suppose the second function call commits. What happens in such a case anyway? It cannot work. However, Oracle has a mechanism that allows for autonomous transactions. The idea is that even if a transaction rolls back, some parts might still be needed and should be kept. The classical example is as follows: Start a function to look up secret data Add a log line to the document that somebody has modified this important secret data Commit the log line but roll back the change You still want to know that somebody attempted to change data To solve problems like this one, autonomous transactions can be used. The idea is to be able to commit a transaction inside the main transaction independently. In this case, the entry in the log table will prevail while the change will be rolled back. As of PostgreSQL 9.6, autonomous transactions are not happening. However, I have already seen patches floating around that implement this feature. We will see when these features make it to the core. To give you an impression of how things will most likely work, here is a code snippet based on the first patches: AS $$ DECLARE PRAGMA AUTONOMOUS_TRANSACTION; BEGIN FOR i IN 0..9 LOOP START TRANSACTION; INSERT INTO test1 VALUES (i); IF i % 2 = 0 THEN COMMIT; ELSE ROLLBACK; END IF; END LOOP; RETURN 42; END; $$; ... The point in this example is that we can decide on the fly whether to commit or to roll back the autonomous transaction. More about stored procedures - on how to write them in different languages and how you can improve their performance - can be found in the book Mastering PostgreSQL 9.6. Make sure you order your copy now!

0
0
25784

article-image-troubleshooting-in-sql-server

Sunith Shetty

15 Mar 2018

16 min read

Troubleshooting in SQL Server

Sunith Shetty

15 Mar 2018

16 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book SQL Server 2017 Administrator's Guide written by Marek Chmel and Vladimír Mužný. This book will help you learn to implement and administer successful database solution with SQL Server 2017.[/box] Today, we will perform SQL Server analysis, and also learn ways for efficient performance monitoring and tuning. Performance monitoring and tuning Performance monitoring and tuning is a crucial part of your database administration skill set so as to keep the performance and stability of your server great, and to be able to find and fix the possible issues. The overall system performance can decrease over time; your system may work with more data or even become totally unresponsive. In such cases, you need the skills and tools to find the issue to bring the server back to normal as fast as possible. We can use several tools on the operating system layer and, then, inside the SQL Server to verify the performance and the possible root cause of the issue. The first tool that we can use is the performance monitor, which is available on your Windows Server: Performance monitor can be used to track important SQL Server counters, which can be very helpful in evaluating the SQL Server Performance. To add a counter, simply right-click on the monitoring screen in the Performance monitoring and tuning section and use the Add Counters item. If the SQL Server instance that you're monitoring is a default instance, you will find all the performance objects listed as SQL Server. If your instance is named, then the performance objects will be listed as MSSQL$InstanceName in the list of performance objects. We can split the important counters to watch between the system counters for the whole server and specific SQL Server counters. The list of system counters to watch include the following: Processor(_Total)% Processor Time: This is a counter to display the CPU load. Constant high values should be investigated to verify whether the current load does not exceed the performance limits of your HW or VM server, or if your workload is not running with proper indexes and statistics, and is generating bad query plans. MemoryAvailable MBytes: This counter displays the available memory on the operating system. There should always be enough memory for the operating system. If this counter drops below 64MB, it will send a notification of low memory and the SQL Server will reduce the memory usage. Physical Disk—Avg. Disk sec/Read: This disk counter provides the average latency information for your storage system; be careful if your storage is made of several different disks to monitor the proper storage system. Physical Disk: This indicates the average disk writes per second. Physical Disk: This indicates the average disk reads per second. Physical Disk: This indicates the number of disk writes per second. System—Processor Queue Length: This counter displays the number of threads waiting on a system CPU. If the counter is above 0, this means that there are more requests than the CPU can handle, and if the counter is constantly above 0, this may signal performance issues. Network interface: This indicates the total number of bytes per second. Once you have added all these system counters, you can see the values real time or you can configure a data collection, which will run for a specified selected time and periodically collect the information: With SQL Server-specific counters, we can dig deeper into the CPU, memory, and storage utilization to see what the SQL Server is doing and how the SQL Server is utilizing the subsystems. SQL Server memory monitoring and troubleshooting Important counters to watch for SQL Server memory utilization include counters from the SQL Server: Buffer Manager performance object and from SQL Server:Memory Manager: Important counters to watch for SQL Server memory utilization include counters from the SQL Server: Buffer Manager performance object and from SQL Server:Memory Manager: SQLServer-Buffer Manager—buffer cache hit ratio: This counter displays the ratio of how often the SQL Server can find the proper data in the cache when a query returns such data. If the data is not found in the cache, it has to be read from the disk. The higher the counter, the better the overall performance, since the memory access is usually faster compared to the disk subsystem. SQLServer-Buffer Manager—page life expectancy: This counter can measure how long a page can stay in the memory in seconds. The longer a page can stay in the memory, the less likely it will be for the SQL Server to need to access the disk in order to get the data into the memory again. SQL Server-Memory Manager—total server memory (KB): This is the amount of memory the server has committed using the memory manager. SQL Server-Memory Manager—target server memory (KB): This is the ideal amount of memory the server can consume. On a stable system, the target and total should be equal unless you face a memory pressure. Once the memory is utilized after the warm-up of your server, these two counters should not drop significantly, which would be another indication of system-level memory pressure, where the SQL Server memory manager has to deallocate memory. SQL Server-Memory Manager—memory grants pending: This counter displays the total number of SQL Server processes that are waiting to be granted memory from the memory manager. To check the performance counters, you can also use a T-SQL query, where you can query the sys.dm_os_performance_counters DMV: SELECT [counter_name] as [Counter Name], [cntr_value]/1024 as [Server Memory (MB)] FROM sys.dm_os_performance_counters WHERE [object_name] LIKE '%Memory Manager%' AND [counter_name] IN ('Total Server Memory (KB)', 'Target Server Memory (KB)') This query will return two values—one for target memory and one for total memory. These two should be close to each other on a warmed up system. Another query you can use is to get the information from a DMV named sys.dm_0s_sys_memory: SELECT total_physical_memory_kb/1024/1024 AS [Physical Memory (GB)], available_physical_memory_kb/1024/1024 AS [Available Memory (GB)], system_memory_state_desc AS [System Memory State] FROM sys.dm_os_sys_memory WITH (NOLOCK) OPTION (RECOMPILE) This query will display the available physical memory and the total physical memory of your server with several possible memory states: Available physical memory is high (this is a state you would like to see on your system, indicating there is no lack of memory) Physical memory usage is steady Available physical memory is getting low Available physical memory is low The memory grants can be verified with a T-SQL query: SELECT [object_name] as [Object name] , cntr_value AS [Memory Grants Pending] FROM sys.dm_os_performance_counters WITH (NOLOCK) WHERE [object_name] LIKE N'%Memory Manager%' AND counter_name = N'Memory Grants Pending' OPTION (RECOMPILE); If you face memory issues, there are several steps you can take for improvements: Check and configure your SQL Server max memory usage Add more RAM to your server; the limit for Standard Edition is 128 GB and there is no limit for SQL Server with Enterprise Use Lock Pages in Memory Optimize your queries SQL Server storage monitoring and troubleshooting The important counters to watch for SQL Server storage utilization would include counters from the SQL Server:Access Methods performance object: SQL Server-Access Methods—Full Scans/sec: This counter displays the number of full scans per second, which can be either table or full-index scans SQL Server-Access Methods—Index Searches/sec: This counter displays the number of searches in the index per second SQL Server-Access Methods—Forwarded Records/sec: This counter displays the number of forwarded records per second Monitoring the disk system is crucial, since your disk is used as a storage for the following: Data files Log files tempDB database Page file Backup files To verify the disk latency and IOPS metric of your drives, you can use the Performance monitor, or the T-SQL commands, which will query the sys.dm_os_volume_stats and sys.dm_io_virtual_file_stats DMF. Simple code to start with would be a T-SQL script utilizing the DMF to check the space available within the database files: SELECT f.database_id, f.file_id, volume_mount_point, total_bytes, available_bytes FROM sys.master_files AS f CROSS APPLY sys.dm_os_volume_stats(f.database_id, f.file_id); To check the I/O file stats with the second provided DMF, you can use a T-SQL code for checking the information about tempDB data files: SELECT * FROM sys.dm_io_virtual_file_stats (NULL, NULL) vfs join sys.master_files mf on mf.database_id = vfs.database_id and mf.file_id = vfs.file_id WHERE mf.database_id = 2 and mf.type = 0 To measure the disk performance, we can use a tool named DiskSpeed, which is a replacement for older SQLIO tool, which was used for a long time. DiskSpeed is an external utility, which is not available on the operating system by default. This tool can be downloaded from GitHub or the Technet Gallery at https://github.com/microsoft/diskspd. The following example runs a test for 15 seconds using a single thread to drive 100 percent random 64 KB reads at a depth of 15 overlapped (outstanding) I/Os to a regular file: DiskSpd –d300 -F1 -w0 -r –b64k -o15 d:datafile.dat Troubleshooting wait statistics We can use the whole Wait Statistics approach for a thorough understanding of the SQL Server workload and undertake performance troubleshooting based on the collected data. Wait Statistics are based on the fact that, any time a request has to wait for a resource, the SQL Server tracks this information, and we can use this information for further analysis. When we consider any user process, it can include several threads. A thread is a single unit of execution on SQL Server, where SQLOS controls the thread scheduling instead of relying on the operating system layer. Each processor core has it's own scheduler component responsible for executing such threads. To see the available schedulers in your SQL Server, you can use the following query: SELECT * FROM sys.dm_os_schedulers Such code will return all the schedulers in your SQL Server; some will be displayed as visible online and some as hidden online. The hidden ones are for internal system tasks while the visible ones are used by user tasks running on the SQL Server. There is one more scheduler, which is displayed as Visible Online (DAC). This one is used for dedicated administration connection, which comes in handy when the SQL Server stops responding. To use a dedicated admin connection, you can modify your SSMS connection to use the DAC, or you can use a switch with the sqlcmd.exe utility, to connect to the DAC. To connect to the default instance with DAC on your server, you can use the following command: sqlcmd.exe -E -A Each thread can be in three possible states: running: This indicates that the thread is running on the processor suspended: This indicates that the thread is waiting for a resource on a waiter list runnable: This indicates that the thread is waiting for execution on a runnable queue Each running thread runs until it has to wait for a resource to become available or until it has exhausted the CPU time for a running thread, which is set to 4 ms. This 4 ms time can be visible in the output of the previous query to sys.dm_os_schedulers and is called a quantum. When a thread requires any resource, it is moved away from the processor to a waiter list, where the thread waits for the resource to become available. Once the resource is available, the thread is notified about the resource availability and moves to the bottom of the runnable queue. Any waiting thread can be found via the following code, which will display the waiting threads and the resource they are waiting for: SELECT * FROM sys.dm_os_waiting_tasks The threads then transition between the execution at the CPU, waiter list, and runnable queue. There is a special case when a thread does not need to wait for any resource and has already run for 4 ms on the CPU, then the thread will be moved directly to the runnable queue instead of the waiter list. In the following image, we can see the thread states and the objects where the thread resides: When the thread is waiting on the waiter list, we can talk about a resource wait time. When the thread is waiting on the runnable queue to get on the CPU for execution, we can talk about the signal time. The total wait time is, then, the sum of the signal and resource wait times. You can find the ratio of the signal to resource wait times with the following script: Select signalWaitTimeMs=sum(signal_wait_time_ms) ,'%signal waits' = cast(100.0 * sum(signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2)) ,resourceWaitTimeMs=sum(wait_time_ms - signal_wait_time_ms) ,'%resource waits'= cast(100.0 * sum(wait_time_ms - signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2)) from sys.dm_os_wait_stats When the ratio goes over 30 percent for the signal waits, then there will be a serious CPU pressure and your processor(s) will have a hard time handling all the incoming requests from the threads. The following query can then grab the wait statistics and display the most frequent wait types, which were recorded through the thread executions, or actually during the time the threads were waiting on the waiter list for any particular resource: WITH [Waits] AS (SELECT [wait_type], [wait_time_ms] / 1000.0 AS [WaitS], ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS], [signal_wait_time_ms] / 1000.0 AS [SignalS], [waiting_tasks_count] AS [WaitCount], 100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage], ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum] FROM sys.dm_os_wait_stats WHERE [wait_type] NOT IN ( N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP', N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE', N'CHKPT', N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'CLR_SEMAPHORE', N'DIRTY_PAGE_POLL', N'DISPATCHER_QUEUE_SEMAPHORE', N'EXECSYNC', N'FSAGENT', N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX', N'HADR_CLUSAPI_CALL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION', N'HADR_LOGCAPTURE_WAIT', N'HADR_NOTIFICATION_DEQUEUE', N'HADR_TIMER_TASK', N'HADR_WORK_QUEUE', N'KSOURCE_WAKEUP', N'LAZYWRITER_SLEEP', N'LOGMGR_QUEUE', N'MEMORY_ALLOCATION_EXT', N'ONDEMAND_TASK_QUEUE', N'PREEMPTIVE_XE_GETTARGETSTATE', N'PWAIT_ALL_COMPONENTS_INITIALIZED', N'PWAIT_DIRECTLOGCONSUMER_GETNEXT', N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP', N'QDS_ASYNC_QUEUE', N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP', N'QDS_SHUTDOWN_QUEUE', N'REDO_THREAD_PENDING_WORK', N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE', N'SERVER_IDLE_CHECK', N'SLEEP_BPOOL_FLUSH', N'SLEEP_DBSTARTUP', N'SLEEP_DCOMSTARTUP', N'SLEEP_MASTERDBREADY', N'SLEEP_MASTERMDREADY', N'SLEEP_MASTERUPGRADED', N'SLEEP_MSDBSTARTUP', N'SLEEP_SYSTEMTASK', N'SLEEP_TASK', N'SLEEP_TEMPDBSTARTUP', N'SNI_HTTP_ACCEPT', N'SP_SERVER_DIAGNOSTICS_SLEEP', N'SQLTRACE_BUFFER_FLUSH', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', N'SQLTRACE_WAIT_ENTRIES', N'WAIT_FOR_RESULTS', N'WAITFOR', N'WAITFOR_TASKSHUTDOWN', N'WAIT_XTP_RECOVERY', N'WAIT_XTP_HOST_WAIT', N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG', N'WAIT_XTP_CKPT_CLOSE', N'XE_DISPATCHER_JOIN', N'XE_DISPATCHER_WAIT', N'XE_TIMER_EVENT' ) AND [waiting_tasks_count] > 0 ) SELECT MAX ([W1].[wait_type]) AS [WaitType], CAST (MAX ([W1].[WaitS]) AS DECIMAL (16,2)) AS [Wait_S], CAST (MAX ([W1].[ResourceS]) AS DECIMAL (16,2)) AS [Resource_S], CAST (MAX ([W1].[SignalS]) AS DECIMAL (16,2)) AS [Signal_S], MAX ([W1].[WaitCount]) AS [WaitCount], CAST (MAX ([W1].[Percentage]) AS DECIMAL (5,2)) AS [Percentage], CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgWait_S], CAST ((MAX ([W1].[ResourceS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgRes_S], CAST ((MAX ([W1].[SignalS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgSig_S] FROM [Waits] AS [W1] INNER JOIN [Waits] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum] GROUP BY [W1].[RowNum] HAVING SUM ([W2].[Percentage]) - MAX( [W1].[Percentage] ) < 95 GO This code is available on the whitepaper, published by SQLSkills, named SQL Server Performance Tuning Using Wait Statistics by Erin Stellato and Jonathan Kehayias, which then refers the URL on SQL Skills and uses the full query by Paul Randal available at https://www.sqlskills.com/ blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/. Some of the typical wait stats you can see are: PAGEIOLATCH The PAGEIOLATCH wait type is used when the thread is waiting for a page to be read into the buffer pool from the disk. This wait type comes with two main forms: PAGEIOLATCH_SH: This page will be read from the disk PAGEIOLATCH_EX: This page will be modified You may quickly assume that the storage has to be the problem, but that may not be the case. Like any other wait, they need to be considered in correlation with other wait types and other counters available to correctly find the root cause of the slow SQL Server operations. The page may be read into the buffer pool, because it was previously removed due to memory pressure and is needed again. So, you may also investigate the following: Buffer Manager: Page life expectancy Buffer Manager: Buffer cache hit ratio Also, you need to consider the following as a possible factor to the PAGEIOLATCH wait types: Large scans versus seeks on the indexes Implicit conversions Inundated statistics Missing indexes PAGELATCH This wait type is quite frequently misplaced with PAGEIOLATCH but PAGELATCH is used for pages already present in the memory. The thread waits for the access to such a page again with possible PAGELATCH_SH and PAGELATCH_EX wait types. A pretty common situation with this wait type is a tempDB contention, where you need to analyze what page is being waited for and what type of query is actually waiting for such a resource. As a solution to the tempDB, contention you can do the following: Add more tempDB data files Use traceflags 1118 and 1117 for tempDB on systems older than SQL Server 2016 CXPACKET This wait type is encountered when any thread is running in parallel. The CXPACKET wait type itself does not mean that there is really any problem on the SQL Server. But if such a wait type is accumulated very quickly, it may be a signal of skewed statistics, which require an update or a parallel scan on the table where proper indexes are missing. The option for parallelism is controlled via MAX DOP setting, which can be configured on the following: The server level The database level A query level with a hint We learned about SQL Server analysis with the Wait Statistics troubleshooting methodology and possible DMVs to get more insight on the problems occurring in the SQL Server. To know more about how to successfully create, design, and deploy databases using SQL Server 2017, do checkout the book SQL Server 2017 Administrator's Guide.

0
0
25776

article-image-can-a-modified-mit-hippocratic-license-to-restrict-misuse-of-open-source-software-prompt-a-wave-of-ethical-innovation-in-tech

Savia Lobo

24 Sep 2019

5 min read

Can a modified MIT ‘Hippocratic License’ to restrict misuse of open source software prompt a wave of ethical innovation in tech?

Savia Lobo

24 Sep 2019

5 min read

Open source licenses allow software to be freely distributed, modified, and used. These licenses give developers an additional advantage of allowing others to use their software as per their own rules and conditions. Recently, software developer and open-source advocate Coraline Ada Ehmke has caused a stir in the software engineering community with ‘The Hippocratic License.’ Ehmke was also the original author of Contributor Covenant, a “code of conduct" for open source projects that encourages participants to use inclusive language and to refrain from personal attacks and harassment. In a tweet posted in September last year, following the code of conduct, she mentioned, “40,000 open source projects, including Linux, Rails, Golang, and everything OSS produced by Google, Microsoft, and Apple have adopted my code of conduct.” [box type="shadow" align="" class="" width=""]The term ‘Hippocratic’ is derived from the Hippocratic Oath, the most widely known of Greek medical texts. The Hippocratic Oath in literal terms requires a new physician to swear upon a number of healing gods that he will uphold a number of professional ethical standards.[/box] Ehmke explained the license in more detail in a post published on Sunday. In it, she highlights how the idea that writing software with the goals of clarity, conciseness, readability, performance, and elegance are limiting, and potentially dangerous.“All of these technologies are inherently political,” she writes. “There is no neutral political position in technology. You can’t build systems that can be weaponized against marginalized people and take no responsibility for them.”The concept of the Hippocratic license is relatively simple. In a tweet, Ehmke said that it “specifically prohibits the use of open-source software to harm others.” Open source software and the associated harm Out of the many privileges that open source software allows such as free redistribution of the software as well as the source code, the OSI also defines there is no discrimination against who uses it or where it will be put to use. A few days ago, a software engineer, Seth Vargo pulled his open-source software, Chef-Sugar, offline after finding out that Chef (a popular open source DevOps company using the software) had recently signed a contract selling $95,000-worth of licenses to the US Immigrations and Customs Enforcement (ICE), which has faced widespread condemnation for separating children from their parents at the U.S. border and other abuses. Vargo took down the Chef Sugar library from both GitHub and RubyGems, the main Ruby package repository, as a sign of protest. In May, this year, Mijente, an advocacy organization released documents stating that Palantir was responsible for the 2017 ICE operation that targeted and arrested family members of children crossing the border alone. Also, in May 2018, Amazon employees, in a letter to Jeff Bezos, protested against the sale of its facial recognition tech to Palantir where they “refuse to contribute to tools that violate human rights”, citing the mistreatment of refugees and immigrants by ICE. Also, in July, the WYNC revealed that Palantir’s mobile app FALCON was being used by ICE to carry out raids on immigrant communities as well as enable workplace raids in New York City in 2017. Founder of OSI responds to Ehmke’s Hippocratic License Bruce Perens, one of the founders of the Open Source movement in software, responded to Ehmke in a post titled “Sorry, Ms. Ehmke, The “Hippocratic License” Can’t Work” . “The software may not be used by individuals, corporations, governments, or other groups for systems or activities that actively and knowingly endanger harm, or otherwise threaten the physical, mental, economic, or general well-being of underprivileged individuals or groups,” he highlights in his post. “The terms are simply far more than could be enforced in a copyright license,” he further adds. “Nobody could enforce Ms. Ehmke’s license without harming someone, or at least threatening to do so. And it would be easy to make a case for that person being underprivileged,” he continued. He concluded saying that, though the terms mentioned in Ehmke’s license were unagreeable, he will “happily support Ms. Ehmke in pursuit of legal reforms meant to achieve the protection of underprivileged people.” Many have welcomed Ehmke's idea of an open source license with an ethical clause. However, the license is not OSI approved yet and chances are slim after Perens’ response. There are many users who do not agree with the license. Reaching a consensus will be hard. https://twitter.com/seannalexander/status/1175853429325008896 https://twitter.com/AdamFrisby/status/1175867432411336704 https://twitter.com/rishmishra/status/1175862512509685760 Even though developers host their source code on open source repositories, a license may bring certain level of restrictions on who is allowed to use the code. However, as Perens mentions, many of the terms in Ehmke’s license hard to implement. Irrespective of the outcome of this license’s approval process, Coraline Ehmke has widely opened up the topic of the need for long overdue FOSS licensing reforms in the open source community. It would be interesting to see if such a license would boost ethical reformation by giving more authority to the developers in imbibing their values and preventing the misuse of their software. Read the Hippocratic license to know more in detail. Other interesting news Tech ImageNet Roulette: New viral app trained using ImageNet exposes racial biases in artificial intelligent system Machine learning ethics: what you need to know and what you can do Facebook suspends tens of thousands of apps amid an ongoing investigation into how apps use personal data

0
0
25773

article-image-6-index-types-in-postgresql-10-you-should-know

Sugandha Lahoti

28 Feb 2018

13 min read

6 index types in PostgreSQL 10 you should know

Sugandha Lahoti

28 Feb 2018

13 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book Mastering PostgreSQL 10 written by Hans-Jürgen Schönig. This book will help you master the capabilities of PostgreSQL 10 to efficiently manage and maintain your database.[/box] In today’s post, we will learn about the different index types available for sorting in PostgreSQL and also understand how they function. What are index types and why you need them Data types can be sorted in a useful way. Just imagine a polygon. How would you sort these objects in a useful way? Sure, you can sort by the area covered, its length or so, but doing this won't allow you to actually find them using a geometric search. The solution to the problem is to provide more than just one index type. Each index will serve a special purpose and do exactly what is needed. The following six index types are available (as of PostgreSQL 10.0): test=# SELECT * FROM pg_am; amname | amhandler | amtype ---------+-------------+-------- btree | bthandler | i hash | hashhandler | i GiST | GiSThandler | i Gin | ginhandler | i spGiST | spghandler | i brin | brinhandler | i (6 rows) A closer look at the 6 index types in PostgreSQL 10 The following sections will outline the purpose of each index type available in PostgreSQL. Note that there are some extensions that can be used on top of what you can see here. Additional index types available on the web are rum, vodka, and in the future, cognac. Hash indexes Hash indexes have been around for many years. The idea is to hash the input value and store it for later lookups. Having hash indexes actually makes sense. However, before PostgreSQL 10.0, it was not advised to use hash indexes because PostgreSQL had no WAL support for them. In PostgreSQL 10.0, this has changed. Hash indexes are now fully logged and are therefore ready for replication and are considered to be a 100% crash safe. Hash indexes are generally a bit larger than b-tree indexes. Suppose you want to index 4 million integer values. A btree will need around 90 MB of storage to do this. A hash index will need around 125 MB on disk. The assumption made by many people that a hash is super small on the disk is therefore, in many cases, just wrong. GiST indexes Generalized Search Tree (GiST) indexes are highly important index types because they are used for a variety of different things. GiST indexes can be used to implement R-tree behavior and it is even possible to act as b-tree. However, abusing GiST for b-tree indexes is not recommended. Typical use cases for GiST are as follows: Range types Geometric indexes (for example, used by the highly popular PostGIS extension) Fuzzy searching Understanding how GiST works To many people, GiST is still a black box. We will now discuss how GiST works internally. Consider the following diagram: Source: http://leopard.in.ua/assets/images/postgresql/pg_indexes/pg_indexes2.jpg Take a look at the tree. You will see that R1 and R2 are on top. R1 and R2 are the bounding boxes containing everything else. R3, R4, and R5 are contained by R1. R8, R9, and R10 are contained by R3, and so on. A GiST index is therefore hierarchically organized. What you can see in the diagram is that some operations, which are not available in b-trees are supported. Some of those operations are overlaps, left of, right of, and so on. The layout of a GiST tree is ideal for geometric indexing. Extending GiST Of course, it is also possible to come up with your own operator classes. The following strategies are supported: Operation Strategy number Strictly left of 1 Does not extend to right of 2 Overlaps 3 Does not extend to left of 4 Strictly right of 5 Same 6 Contains 7 Contained by 8 Does not extend above 9 Strictly below 10 Strictly above 11 Does not extend below 12 If you want to write operator classes for GiST, a couple of support functions have to be provided. In the case of a b-tree, there is only the same function - GiST indexes provide a lot more: Function Description Support function number consistent The functions determine whether a key satisfies the query qualifier. Internally, strategies are looked up and checked. 1 union Calculate the union of a set of keys. In case of numeric values, simply the upper and lower values or a range are computed. It is especially important to geometries. 2 compress Compute a compressed representation of a key or value. 3 decompress This is the counterpart of the compress function. 4 penalty During insertion, the cost of inserting into the tree will be calculated. The cost determines where the new entry will go inside the tree. Therefore, a good penalty function is key to the good overall performance of the index. 5 picksplit Determines where to move entries in case of a page split. Some entries have to stay on the old page while others will go to the new page being created. Having a good picksplit function is essential to a good index performance. 6 equal The equal function is similar to the same function you have already seen in b-trees. 7 distance Calculates the distance (a number) between a key and the query value. The distance function is optional and is needed in case KNN search is supported. 8 fetch Determine the original representation of a compressed key. This function is needed to handle index only scans as supported by the recent version of PostgreSQL. 9 Implementing operator classes for GiST indexes is usually done in C. If you are interested in a good example, I advise you to check out the btree_GiST module in the contrib directory. It shows how to index standard data types using GiST and is a good source of information as well as inspiration. GIN indexes Generalized inverted (GIN) indexes are a good way to index text. Suppose you want to index a million text documents. A certain word may occur millions of times. In a normal b- tree, this would mean that the key is stored millions of times. Not so in a GIN. Each key (or word) is stored once and assigned to a document list. Keys are organized in a standard b- tree. Each entry will have a document list pointing to all entries in the table having the same key. A GIN index is very small and compact. However, it lacks an important feature found in the b-trees-sorted data. In a GIN, the list of item pointers associated with a certain key is sorted by the position of the row in the table and not by some arbitrary criteria. Extending GIN Just like any other index, GIN can be extended. The following strategies are available: Operation Strategy number Overlap 1 Contains 2 Is contained by 3 Equal 4 On top of this, the following support functions are available: Function Description Support function number compare The compare function is similar to the same function you have seen in b-trees. If two keys are compared, it returns -1 (lower), 0 (equal), or 1 (higher). 1 extractValue Extract keys from a value to be indexed. A value can have many keys. For example, a text value might consist of more than one word. 2 extractQuery Extract keys from a query condition. 3 consistent Check whether a value matches a query condition. 4 comparePartial Compare a partial key from a query and a key from the index. Returns -1, 0, or 1 (similar to the same function supported by b-trees). 5 triConsistent Determine whether a value matches a query condition (ternary variant). It is optional if the consistent function is present. 6 If you are looking for a good example of how to extend GIN, consider looking at the btree_gin module in the PostgreSQL contrib directory. It is a valuable source of information and a good way to start your own implementation. SP-GiST indexes Space partitioned GiST (SP-GiST) has mainly been designed for in-memory use. The reason for this is an SP-GiST stored on disk needs a fairly high number of disk hits to function. Disk hits are way more expensive than just following a couple of pointers in RAM. The beauty is that SP-GiST can be used to implement various types of trees such as quad- trees, k-d trees, and radix trees (tries). The following strategies are provided: Operation Strategy number Strictly left of 1 Strictly right of 5 Same 6 Contained by 8 Strictly below 10 Strictly above 11 To write your own operator classes for SP-GiST, a couple of functions have to be provided: Function Description Support function number config Provides information about the operator class in use 1 choose Figures out how to insert a new value into an inner tuple 2 picksplit Figures out how to partition/split a set of values 3 inner_consistent Determine which subpartitions need to be searched for a query 4 leaf_consistent Determine whether key satisfies the query qualifier 5 BRIN indexes Block range indexes (BRIN) are of great practical use. All indexes discussed until now need quite a lot of disk space. Although a lot of work has gone into shrinking GIN indexes and the like, they still need quite a lot because an index pointer is needed for each entry. So, if there are 10 million entries, there will be 10 million index pointers. Space is the main concern addressed by the BRIN indexes. A BRIN index does not keep an index entry for each tuple but will store the minimum and the maximum value of 128 (default) blocks of data (1 MB). The index is therefore very small but lossy. Scanning the index will return more data than we asked for. PostgreSQL has to filter out these additional rows in a later step. The following example demonstrates how small a BRIN index really is: test=# CREATE INDEX idx_brin ON t_test USING brin(id); CREATE INDEX test=# di+ idx_brin List of relations Schema | Name | Type | Owner | Table | Size --------+----------+-------+-------+--------+-------+------------- public | idx_brin | index | hs | t_test | 48 KB (1 row) In my example, the BRIN index is 2,000 times smaller than a standard b-tree. The question naturally arising now is, why don't we always use BRIN indexes? To answer this kind of question, it is important to reflect on the layout of BRIN; the minimum and maximum value for 1 MB are stored. If the data is sorted (high correlation), BRIN is pretty efficient because we can fetch 1 MB of data, scan it, and we are done. However, what if the data is shuffled? In this case, BRIN won't be able to exclude chunks of data anymore because it is very likely that something close to the overall high and the overall low is within 1 MB of data. Therefore, BRIN is mostly made for highly correlated data. In reality, correlated data is quite likely in data warehousing applications. Often, data is loaded every day and therefore dates can be highly correlated. Extending BRIN indexes BRIN supports the same strategies as a b-tree and therefore needs the same set of operators. The code can be reused nicely: Operation Strategy number Less than 1 Less than or equal 2 Equal 3 Greater than or equal 4 Greater than 5 The support functions needed by BRIN are as follows: Function Description Support function number opcInfo Provide internal information about the indexed columns 1 add_value Add an entry to an existing summary tuple 2 consistent Check whether a value matches a condition 3 union Calculate the union of two summary entries (minimum/maximum values) 4 Adding additional indexes Since PostgreSQL 9.6, there has been an easy way to deploy entirely new index types as extensions. This is pretty cool because if those index types provided by PostgreSQL are not enough, it is possible to add additional ones serving precisely your purpose. The instruction to do this is CREATE ACCESS METHOD: test=# h CREATE ACCESS METHOD Command: CREATE ACCESS METHOD Description: define a new access method Syntax: CREATE ACCESS METHOD name TYPE access_method_type HANDLER handler_function Don't worry too much about this command—just in case you ever deploy your own index type, it will come as a ready-to-use extension. One of these extensions implements bloom filters. Bloom filters are probabilistic data structures. They sometimes return too many rows but never too few. Therefore, a bloom filter is a good method to pre-filter data. How does it work? A bloom filter is defined on a couple of columns. A bitmask is calculated based on the input values, which is then compared to your query. The upside of a bloom filter is that you can index as many columns as you want. The downside is that the entire bloom filter has to be read. Of course, the bloom filter is smaller than the underlying data and so it is, in many cases, very beneficial. To use bloom filters, just activate the extension, which is a part of the PostgreSQL contrib package: test=# CREATE EXTENSION bloom; CREATE EXTENSION As stated previously, the idea behind a bloom filter is that it allows you to index as many columns as you want. In many real-world applications, the challenge is to index many columns without knowing which combinations the user will actually need at runtime. In the case of a large table, it is totally impossible to create standard b-tree indexes on, say, 80 fields or more. A bloom filter might be an alternative in this case: test=# CREATE TABLE t_bloom (x1 int, x2 int, x3 int, x4 int, x5 int, x6 int, x7 int); CREATE TABLE Creating the index is easy: test=# CREATE INDEX idx_bloom ON t_bloom USING bloom(x1, x2, x3, x4, x5, x6, x7); CREATE INDEX If sequential scans are turned off, the index can be seen in action: test=# SET enable_seqscan TO off; SET test=# explain SELECT * FROM t_bloom WHERE x5 = 9 AND x3 = 7; QUERY PLAN ------------------------------------------------------------------------- Bitmap Heap Scan on t_bloom (cost=18.50..22.52 rows=1 width=28) Recheck Cond: ((x3 = 7) AND (x5 = 9)) -> Bitmap Index Scan on idx_bloom (cost=0.00..18.50 rows=1 width=0) Index Cond: ((x3 = 7) AND (x5 = 9)) Note that I have queried a combination of random columns; they are not related to the actual order in the index. The bloom filter will still be beneficial. If you are interested in bloom filters, consider checking out the website: https://en.wikipedia.org/wiki/Bloom_filter. We learnt how to use the indexing features in PostgreSQL and fine-tune the performance of our queries. If you liked our article, check out the book Mastering PostgreSQL 10 to implement advanced administrative tasks such as server maintenance and monitoring, replication, recovery, high availability, etc in PostgreSQL 10.

0
0
25771

article-image-20-lessons-bias-machine-learning-systems-nips-2017

Aarthi Kumaraswamy

08 Dec 2017

9 min read

20 lessons on bias in machine learning systems by Kate Crawford at NIPS 2017

Aarthi Kumaraswamy

08 Dec 2017

9 min read

Kate Crawford is a Principal Researcher at Microsoft Research and a Distinguished Research Professor at New York University. She has spent the last decade studying the social implications of data systems, machine learning, and artificial intelligence. Her recent publications address data bias and fairness, and social impacts of artificial intelligence among others. This article attempts to bring our readers to Kate’s brilliant Keynote speech at NIPS 2017. It talks about different forms of bias in Machine Learning systems and the ways to tackle such problems. By the end of this article, we are sure you would want to listen to her complete talk on the NIPS Facebook page. All images in this article come from Kate's presentation slides and do not belong to us. The rise of Machine Learning is every bit as far reaching as the rise of computing itself. A vast new ecosystem of techniques and infrastructure are emerging in the field of machine learning and we are just beginning to learn their full capabilities. But with the exciting things that people can do, there are some really concerning problems arising. Forms of bias, stereotyping and unfair determination are being found in machine vision systems, object recognition models, and in natural language processing and word embeddings. High profile news stories about bias have been on the rise, from women being less likely to be shown high paying jobs to gender bias and object recognition datasets like MS COCO, to racial disparities in education AI systems. 20 lessons on bias in machine learning systems Interest in the study of bias in ML systems has grown exponentially in just the last 3 years. It has more than doubled in the last year alone. We are speaking different languages when we talk about bias. I.e., it means different things to different people/groups. Eg: in law, in machine learning, in geometry etc. Read more on this in the ‘What is bias?’ section below. In the simplest terms, for the purpose of understanding fairness in machine learning systems, we can consider ‘bias’ as a skew that produces a type of harm. Bias in MLaaS is harder to identify and also correct as we do not build them from scratch and are not always privy to how it works under the hood. Data is not neutral. Data cannot always be neutralized. There is no silver bullet for solving bias in ML & AI systems. There are two main kinds of harms caused by bias: Harms of allocation and harms of representation. The former takes an economically oriented view while the latter is more cultural. Allocative harm is when a system allocates or withholds certain groups an opportunity or resource. To know more, jump to the ‘harms of allocation’ section. When systems reinforce the subordination of certain groups along the lines of identity like race, class, gender etc., they cause representative harm. This is further elaborated in the ‘Harms of representation’ section. Harm can further be classified into five types: stereotyping, recognition, denigration, under-representation and ex-nomination. There are many technical approaches to dealing with the problem of bias in a training dataset such as scrubbing to neutral, demographic sampling etc among others. But they all still suffer from bias. Eg: who decides what is ‘neutral’. When we consider bias purely as a technical problem, which is hard enough, we are already missing part of the picture. Bias in systems is commonly caused by bias in training data. We can only gather data about the world we have which has a long history of discrimination. So, the default tendency of these systems would be to reflect our darkest biases. Structural bias is a social issue first and a technical issue second. If we are unable to consider both and see it as inherently socio-technical, then these problems of bias are going to continue to plague the ML field. Instead of just thinking about ML contributing to decision making in say hiring or criminal justice, we also need to think of the role of ML in the harmful representation of human identity. While technical responses to bias are very important and we need more of them, they won’t get us all the way to addressing representational harms to group identity. Representational harms often exceed the scope of individual technical interventions. Developing theoretical fixes that come from the tech world for allocational harms is necessary but not sufficient. The ability to move outside our disciplinary boundaries is paramount to cracking the problem of bias in ML systems. Every design decision has consequences and powerful social implications. Datasets reflect not only the culture but also the hierarchy of the world that they were made in. Our current datasets stand on the shoulder of older datasets building on earlier corpora. Classifications can be sticky and sometimes they stick around longer than we intend them to, even when they are harmful. ML can be deployed easily in contentious forms of categorization that could have serious repercussions. Eg: free-of-bias criminality detector that has Physiognomy at the heart of how it predicts the likelihood of a person being a criminal based on his appearance. What is bias? 14th century: an oblique or diagonal line 16th century: undue prejudice 20th century: systematic differences between the sample and a population In ML: underfitting (low variance and high bias) vs overfitting (high variance and low bias) In Law: judgments based on preconceived notions or prejudices as opposed to the impartial evaluation of facts. Impartiality underpins jury selection, due process, limitations placed on judges etc. Bias is hard to fix with model validation techniques alone. So you can have an unbiased system in an ML sense producing a biased result in a legal sense. Bias is a skew that produces a type of harm. Where does bias come from? Commonly from Training data. It can be incomplete, biased or otherwise skewed. It can draw from non-representative samples that are wholly defined before use. Sometimes it is not obvious because it was constructed in a non-transparent way. In addition to human labeling, other ways that human biases and cultural assumptions can creep in ending up in exclusion or overrepresentation of subpopulation. Case in point: stop-and-frisk program data used as training data by an ML system. This dataset was biased due to systemic racial discrimination in policing. Harms of allocation Majority of the literature understand bias as harms of allocation. Allocative harm is when a system allocates or withholds certain groups, an opportunity or resource. It is an economically oriented view primarily. Eg: who gets a mortgage, loan etc. Allocation is immediate, it is a time-bound moment of decision making. It is readily quantifiable. In other words, it raises questions of fairness and justice in discrete and specific transactions. Harms of representation It gets tricky when it comes to systems that represent society but don't allocate resources. These are representational harms. When systems reinforce the subordination of certain groups along the lines of identity like race, class, gender etc. It is a long-term process that affects attitudes and beliefs. It is harder to formalize and track. It is a diffused depiction of humans and society. It is at the root of all of the other forms of allocative harm. 5 types of allocative harms Source: Kate Crawford’s NIPS 2017 Keynote presentation: Trouble with Bias Stereotyping 2016 paper on word embedding that looked at Gender stereotypical associations and the distances between gender pronouns and occupations. Google translate swaps the genders of pronouns even in a gender-neutral language like Turkish Recognition When a group is erased or made invisible by a system In a narrow sense, it is purely a technical problem. i.e., does a system recognize a face inside an image or video? Failure to recognize someone’s humanity. In the broader sense, it is about respect, dignity, and personhood. The broader harm is whether the system works for you. Eg: system could not process darker skin tones, Nikon’s camera s/w mischaracterized Asian faces as blinking, HP's algorithms had difficulty recognizing anyone with a darker shade of pale. Denigration When people use culturally offensive or inappropriate labels Eg: autosuggestions when people typed ‘jews should’ Under-representation An image search of 'CEOs' yielded only one woman CEO at the bottom-most part of the page. The majority were white male. ex-nomination Technical responses to the problem of biases Improve accuracy Blacklist Scrub to neutral Demographics or equal representation Awareness Politics of classification Where did identity categories come from? What if bias is a deeper and more consistent issue with classification? Source: Kate Crawford’s NIPS 2017 Keynote presentation: Trouble with Bias The fact that bias issues keep creeping into our systems and manifesting in new ways, suggests that we must understand that classification is not simply a technical issue but a social issue as well. One that has real consequences for people that are being classified. There are two themes: Classification is always a product of its time We are currently in the biggest experiment of classification in human history Eg: labeled faces in the wild dataset has 77.5% men, and 83.5% white. An ML system trained on this dataset will work best for that group. What can we do to tackle these problems? Start working on fairness forensics Test our systems: eg: build pre-release trials to see how a system is working across different populations How do we track the life cycle of a training dataset to know who built it and what the demographics skews might be in that dataset Start taking interdisciplinarity seriously Working with people who are not in our field but have deep expertise in other areas Eg: FATE (Fairness Accountability Transparency Ethics) group at Microsoft Research Build spaces for collaboration like the AI now institute. Think harder on the ethics of classification The ultimate question for fairness in machine learning is this. Who is going to benefit from the system we are building? And who might be harmed?

0
0
25748

article-image-alarming-ways-governments-use-surveillance-tech

Neil Aitken

14 Jun 2018

12 min read

Alarming ways governments are using surveillance tech to watch you

Neil Aitken

14 Jun 2018

12 min read

Mapquest, part of the Verizon company, is the second largest provider of mapping services in the world, after Google Maps. It provides advanced cartography services to companies like Snap and PapaJohns pizza. The company is about to release an app that users can install on their smartphone. Their new application will record and transmit video images of what’s happening in front of your vehicle, as you travel. Data can be sent from any phone with a camera – using the most common of tools – a simple mobile data plan, for example. In exchange, you’ll get live traffic updates, among other things. Mapquest will use the video image data they gather to provide more accurate and up to date maps to their partners. The real world is changing all the time – roads get added, cities re-route traffic from time to time. The new AI based technology Mapquest employ could well improve the reliability of driverless cars, which have to engage with this ever changing landscape, in a safe manner. No-one disagrees with safety improvements. Mapquests solution is impressive technology. The fact that they can use AI to interpret the images they see and upload the information they receive to update maps is incredible. And, in this regard, the company is just one of the myriad daily news stories which excite and astound us. These stories do, however, often have another side to them which is rarely acknowledged. In the wrong hands, Mapquest’s solution could create a surveillance database which tracked people in real time. Surveillance technology involves the use of data and information products to capture details about individuals. The act of surveillance is usually undertaken with a view to achieving a goal. The principle is simple. The more ‘they’ know about you, the easier it will be to influence you towards their ends. Surveillance information can be used to find you, apprehend you or potentially, to change your mind, without even realising that you had been watched. Mapquest’s innovation is just a single example of surveillance technology in government hands which has expanded in capability far beyond what most people realise. Read also: What does the US government know about you? The truth beyond the Facebook scandal Facebook’s share price fell 14% in early 2018 as a result of public outcry related to the Cambridge Analytica announcements the company made. The idea that a private company had allowed detailed information about individuals to be provided to a third party without their consent appeared to genuinely shock and appall people. Technology tools like Mapquest’s tracking capabilities and Facebook’s profiling techniques are being taken and used by police forces and corporate entities around the world. The reality of current private and public surveillance capabilities is that facilities exist, and are in use, to collect and analyse data on most people in the developing world. The known limits of these services may surprise even those who are on the cutting edge of technology. There are so many examples from all over the world listed below that will genuinely make you want to consider going off grid! Innovative, Ingenious overlords: US companies have a flare for surveillance The US is the centre for information based technology companies. Much of what they develop is exported as well as used domestically. The police are using human genome matching to track down criminals and can find ‘any family in the country’ There have been 2 recent examples of police arresting a suspect after using human genome databases to investigate crimes. A growing number of private individuals have now used publicly available services such as 23andme to sequence their genome (DNA) either to investigate further their family tree, or to determine the potential of a pre-disposition to the gene based component of a disease. In one example, The Golden State Killer, an ex cop, was arrested 32 years after the last reported rape in a series of 45 (in addition to 12 murders) which occurred between 1976 and 1986. To track him down, police approached sites like 23andme with DNA found at crime scenes, established a family match and then progressed the investigation using conventional means. More than 12 million Americans have now used a genetic sequencing service and it is believed that investigators could find a family match for the DNA of anyone who has committed a crime in America. In simple terms, whether you want it or not, the law enforcement has the DNA of every individual in the country available to them. Domain Awareness Centers (DAC) bring the Truman Show to life The 400,000 Residents of Oakland, California discovered in 2012, that they had been the subject of an undisclosed mass surveillance project, by the local police force, for many years. Feeds from CCTV cameras installed in Oakland’s suburbs were augmented with weather information feeds, social media feeds and extracted email conversations, as well as a variety of other sources. The scheme began at Oakland’s port with Federal funding as part of a national response to the events of 9.11.2001 but was extended to cover the near half million residents of the city. Hundreds of additional video cameras were installed, along with gunshot recognition microphones and some of the other surveillance technologies provided in this article. The police force conducting the surveillance had no policy on what information was recorded or for how long it was kept. Internet connected toys spy on children The FBI has warned Americans that children’s toys connected to the internet ‘could put the privacy and safety of children at risk.' Children’s toy Hello Barbie was specifically admonished for poor privacy controls as part of the FBI’s press release. Internet connected toys could be used to record video of children at any point in the day or, conceivably, to relay a human voice, making it appear to the child that the toy was talking to them. Oracle suggest Google’s Android operating system routinely tracks users’ position even when maps are turned off In Australia, two American companies have been involved in a disagreement about the potential monitoring of Android phones. Oracle accused Google of monitoring users’ location (including altitude), even when mapping software is turned off on the device. The tracking is performed in the background of their phone. In Australia alone, Oracle suggested that Google’s monitoring could involve around 1GB of additional mobile data every month, costing users nearly half a billion dollars a year, collectively. Amazon facial recognition in real time helps US law enforcement services Amazon are providing facial recognition services which take a feed from public video cameras, to a number of US Police Forces. Amazon can match images taken in real time to a database containing ‘millions of faces.’ Are there any state or Federal rules in place to govern police facial recognition? Wired reported that there are ‘more or less none.’ Amazon’s scheme is a trial taking place in Florida. There are at least 2 other companies offering similar schemes in the US to law enforcement services. Big glass microphone can help agencies keep an ear on the ground Project ‘Big Glass Microphone’ uses the vibrations that the movements of cars (among other things) cause in the buried fiber optic telecommunications links. A successful test of the technology has been undertaken on the fiber optic cables which run underground on the Stanford University Campus, to record vehicle movements. Fiber optic links now make up the backbone of much data transport infrastructure - the way your phone and computer connect to the internet. Big glass microphone as it stands is the first step towards ‘invisible’ monitoring of people and their assets. It appears the FBI now have the ability to crack/access any phone Those in the know suggest that Apple’s iPhone is the most secure smart device against government surveillance. In 2016, this was put to the test. The Justice Department came into possession of an iPhone allegedly belonging to one of the San Bernadino shooters and ultimately sued Apple in an attempt to force the company to grant access to it, as part of their investigation. The case was ultimately dropped leading some to speculate that NAND mirroring techniques were used to gain access to the phone without Apple’s assistance, implying that even the most secure phones can now be accessed by authorities. Cornell University’s lie detecting algorithm Groundbreaking work by Cornell University will provide ‘at a distance’ access to information that previously required close personal access to an accused subject. Cornell’s solution interprets feeds from a number of video cameras on subjects and analyses the results to judge their heart rate. They believe the system can be used to determine if someone is lying from behind a screen. University of Southern California can anticipate social unrest with social media feeds Researchers at the University Of Southern California have developed an AI tool to study Social Media posts and determine whether those writing them are likely to cause Social Unrest. The software claims to have identified an association between both the volume of tweets written / the content of those tweets and protests turning physical. They can now offer advice to law enforcement on the likelihood of a protest turning violent so they can be properly prepared based on this information. The UK, an epicenter of AI progress, is not far behind in tracking people The UK has a similarly impressive array of tools at its disposal to watch the people that representatives of the country feels may be required. Given the close levels of cooperation between the UK and US governments, it is likely that many of these UK facilities are shared with the US and other NATO partners. Project stingray – fake cell phone/mobile phone ‘towers’ to intercept communications Stingray is a brand name for an IMSI (the unique identifier on a SIM card) tracker. They ‘spoof’ real towers, presenting themselves as the closest mobile phone tower. This ‘fools’ phones in to connecting to them. The technology has been used to spy on criminals in the UK but it is not just the UK government which use Stingray or its equivalents. The Washington Post reported in June 2018 that a number of domestically compiled intelligence reports suggest that foreign governments acting on US soil, including China and Russia, have been eavesdropping on the Whitehouse, using the same technology. UK developed Spyware is being used by authoritarian regimes Gamma International is a company based in Hampshire UK, which provided the (notably authoritarian) Egyptian government with a facility to install what was effectively spyware delivered with a virus on to computers in their country. Once installed, the software permitted the government to monitor private digital interactions, without the need to engage the phone company or ISP offering those services. Any internet based technology could be tracked, assisting in tracking down individuals who may have negative feelings about the Egyptian government. Individual arrested when his fingerprint was taken from a WhatsApp picture of his hand A Drug Dealer was pictured holding an assortment of pills in the UK two months ago. The image of his hand was used to extract an image of his fingerprint. From that, forensic scientists used by UK police, confirmed that officers had arrested the correct person and associated him with drugs. AI solutions to speed up evidence processing including scanning laptops and phones UK police forces are trying out AI software to speed up processing evidence from digital devices. A dozen departments around the UK are using software, called Cellebrite, which employs AI algorithms to search through data found on devices, including phones and laptops. Cellbrite can recognize images that contain child abuse, accepts feeds from multiple devices to see when multiple owners were in the same physical location at the same time and can read text from screenshots. Officers can even feed it photos of suspects to see if a picture of them show up on someone’s hard drive. China takes the surveillance biscuit and may show us a glimpse of the future There are 600 million mobile phone users in China, each producing a great deal of information about their users. China has a notorious record of human rights abuses and the ruling Communist Party takes a controlling interest (a board seat) in many of their largest technology companies, to ensure the work done is in the interest of the party as well as profitable for the corporate. As a result, China is on the front foot when it comes to both AI and surveillance technology. China’s surveillance tools could be a harbinger of the future in the Western world. Chinese cities will be run by a private company Alibaba, China’s equivalent of Amazon, already has control over the traffic lights in one Chinese city, Hangzhou. Alibaba is far from shy about it’s ambitions. It has 120,000 developers working on the problem and intends to commercialise and sell the data it gathers about citizens. The AI based product they’re using is called CityBrain. In the future, all Chinese cities could well all be run by AI from the Alibaba corporation the idea is to use this trial as a template for every city. The technology is likely to be placed in Kuala Lumpur next. In the areas under CityBrain’s control, traffic speeds have increased by 15% already. However, some of those observing the situation have expressed concerns not just about (the lack of) oversight on CityBrain’s current capabilities but the potential for future abuse. What to make of this incredible list of surveillance capabilities Facilities like Mapquest’s new mapping service are beguiling. They’re clever ideas which create a better works. Similar technology, however, behind the scenes, is being adopted by law enforcement bodies in an ever growing list of countries. Even for someone who understands cutting edge technology, the sum of those facilities may be surprising. Literally any aspect of your behaviour, from the way you walk, to your face, your heatmap and, of course, the contents of your phone and laptops can now be monitored. Law enforcement can access and review information feeds with Artificial Intelligence software, to process and summarise findings quickly. In some cases, this is being done without the need for a warrant. Concerningly, these advances seem to be coming without policy or, in many cases any form of oversight. We must change how we think about AI, urge AI founding fathers

0
0
25733

article-image-deep-learning-models-have-massive-carbon-footprints-can-photonic-chips-help-reduce-power-consumption

Sugandha Lahoti

11 Jun 2019

10 min read

Deep learning models have massive carbon footprints, can photonic chips help reduce power consumption?

Sugandha Lahoti

11 Jun 2019

10 min read

Most of the recent breakthroughs in Artificial Intelligence are driven by data and computation. What is essentially missing is the energy cost. Most large AI networks require huge number of training data to ensure accuracy. However, these accuracy improvements depend on the availability of exceptionally large computational resources. The larger the computation resource, the more energy it consumes. This not only is costly financially (due to the cost of hardware, cloud compute, and electricity) but is also straining the environment, due to the carbon footprint required to fuel modern tensor processing hardware. Considering the climate change repercussions we are facing on a daily basis, consensus is building on the need for AI research ethics to include a focus on minimizing and offsetting the carbon footprint of research. Researchers should also put energy cost in results of research papers alongside time, accuracy, etc. The process of deep learning outsizing environmental impact was further highlighted in a recent research paper published by MIT researchers. In the paper titled “Energy and Policy Considerations for Deep Learning in NLP”, researchers performed a life cycle assessment for training several common large AI models. They quantified the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP and provided recommendations to reduce costs and improve equity in NLP research and practice. They have also provided recommendations to reduce costs and improve equity in NLP research and practice. Per the paper, training AI models can emit more than 626,000 pounds of carbon dioxide equivalent—nearly five times the lifetime emissions of the average American car (and that includes the manufacture of the car itself). It is estimated that we must cut carbon emissions by half over the next decade to deter escalating rates of natural disaster. Source This speaks volumes about the carbon offset and brings conversation to the returns on heavy (carbon) investment of deep learning and if it is really worth the marginal improvement in predictive accuracy over cheaper, alternative methods. This news alarmed people tremendously. https://twitter.com/sakthigeek/status/1137555650718908416 https://twitter.com/vinodkpg/status/1129605865760149504 https://twitter.com/Kobotic/status/1137681505541484545 Even if some of this energy may come from renewable or carbon credit-offset resources, the high energy demands of these models are still a concern. This is because the current energy is derived from carbon-neural sources in many locations, and even when renewable energy is available, it is limited to the equipment produced to store it. The carbon footprint of NLP models The researchers in this paper adhere specifically to NLP models. They looked at four models, the Transformer, ELMo, BERT, and GPT-2, and trained each on a single GPU for up to a day to measure its power draw. Next, they used the number of training hours listed in the model’s original papers to calculate the total energy consumed over the complete training process. This number was then converted into pounds of carbon dioxide equivalent based on the average energy mix in the US, which closely matches the energy mix used by Amazon’s AWS, the largest cloud services provider. Source The researchers found that environmental costs of training grew proportionally to model size. It exponentially increased when additional tuning steps were used to increase the model’s final accuracy. In particular, neural architecture search had high associated costs for little performance benefit. Neural architecture search is a tuning process which tries to optimize a model by incrementally tweaking a neural network’s design through exhaustive trial and error. The researchers also noted that these figures should only be considered as baseline. In practice, AI researchers mostly develop a new model from scratch or adapt an existing model to a new data set, both require many more rounds of training and tuning. Based on their findings, the authors recommend certain proposals to heighten the awareness of this issue to the NLP community and promote mindful practice and policy: Researchers should report training time and sensitivity to hyperparameters. There should be a standard, hardware independent measurement of training time, such as gigaflops required to convergence. There should also be a standard measurement of model sensitivity to data and hyperparameters, such as variance with respect to hyperparameters searched. Academic researchers should get equitable access to computation resources. This trend toward training huge models on tons of data is not feasible for academics, because they don’t have the computational resources. It will be more cost effective for academic researchers to pool resources to build shared compute centers at the level of funding agencies, such as the U.S. National Science Foundation. Researchers should prioritize computationally efficient hardware and algorithms. For instance, developers could aid in reducing the energy associated with model tuning by providing easy-to-use APIs implementing more efficient alternatives to brute-force. The next step is to introduce energy costs as a standard metric, that researchers are expected to report their findings. They should also try to minimise carbon footprint by developing compute efficient training methods such as new ML algos, or new engineering tools to make existing ones more compute efficient. Above all, we need to formulate strict public policies that steer digital technologies toward speeding a clean energy transition while mitigating the risks. Another factor which contributes to high energy consumptions are Optical neural networks which are used for most deep learning tasks. To tackle that issue, researchers and major tech companies — including Google, IBM, and Tesla — have developed “AI accelerators,” specialized chips that improve the speed and efficiency of training and testing neural networks. However, these AI accelerators use electricity and have a theoretical minimum limit for energy consumption. Also, most present day ASICs are based on CMOS technology and suffer from the interconnect problem. Even in highly optimized architectures where data are stored in register files close to the logic units, a majority of the energy consumption comes from data movement, not logic. Analog crossbar arrays based on CMOS gates or memristors promise better performance, but as analog electronic devices, they suffer from calibration issues and limited accuracy. Implementing chips that use light instead of electricity Another group of MIT researchers have developed a “photonic” chip that uses light instead of electricity, and consumes relatively little power in the process. The photonic accelerator uses more compact optical components and optical signal-processing techniques, to drastically reduce both power consumption and chip area. Practical applications for such chips can also include reducing energy consumption in data centers. “In response to vast increases in data storage and computational capacity in the last decade, the amount of energy used by data centers has doubled every four years, and is expected to triple in the next 10 years.” https://twitter.com/profwernimont/status/1137402420823306240 The chip could be used to process massive neural networks millions of times more efficiently than today’s classical computers. How the photonic chip works? The researchers have given a detailed explanation of the chip’s working in their research paper, “Large-Scale Optical Neural Networks Based on Photoelectric Multiplication”. The chip relies on a compact, energy efficient “optoelectronic” scheme that encodes data with optical signals, but uses “balanced homodyne detection” for matrix multiplication. This technique that produces a measurable electrical signal after calculating the product of the amplitudes (wave heights) of two optical signals. Pulses of light encoded with information about the input and output neurons for each neural network layer — which are needed to train the network — flow through a single channel. Optical signals carrying the neuron and weight data fan out to grid of homodyne photodetectors. The photodetectors use the amplitude of the signals to compute an output value for each neuron. Each detector feeds an electrical output signal for each neuron into a modulator, which converts the signal back into a light pulse. That optical signal becomes the input for the next layer, and so on. Limitation of Photonic accelerators Photonic accelerators generally have an unavoidable noise in the signal. The more light that’s fed into the chip, the less noise and greater accuracy. Less input light increases efficiency but negatively impacts the neural network’s performance. The ideal condition is achieved when AI accelerators is measured in how many joules it takes to perform a single operation of multiplying two numbers. Traditional accelerators are measured in picojoules, or one-trillionth of a joule. Photonic accelerators measure in attojoules, which is a million times more efficient. In their simulations, the researchers found their photonic accelerator could operate with sub-attojoule efficiency. Tech companies are the largest contributors of carbon footprint The realization that training an AI model can produce emissions equivalent to a five cars, should make carbon footprint of artificial intelligence an important consideration for researchers and companies going forward. UMass Amherst’s Emma Strubell, one of the research team and co-author of the paper said, “I’m not against energy use in the name of advancing science, obviously, but I think we could do better in terms of considering the trade off between required energy and resulting model improvement.” “I think large tech companies that use AI throughout their products are likely the largest contributors to this type of energy use,” Strubell said. “I do think that they are increasingly aware of these issues, and there are also financial incentives for them to curb energy use.” In 2016, Google’s ‘DeepMind’ was able to reduce the energy required to cool Google Data Centers by 30%. This full-fledged AI system has features including continuous monitoring and human override. Recently Microsoft doubled its internal carbon fee to $15 per metric ton on all carbon emissions. The funds from this higher fee will maintain Microsoft’s carbon neutrality and help meet their sustainability goals. On the other hand, Microsoft is also two years into a seven-year deal—rumored to be worth over a billion dollars—to help Chevron, one of the world’s largest oil companies, better extract and distribute oil. https://twitter.com/AkwyZ/status/1137020554567987200 Amazon had announced that it would power data centers with 100 percent renewable energy without a dedicated timeline. Since 2018 Amazon has reportedly slowed down its efforts to use renewable energy using only 50 percent. It has also not announced any new deals to supply clean energy to its data centers since 2016, according to a report by Greenpeace, and it quietly abandoned plans for one of its last scheduled wind farms last year. In April, over 4,520 Amazon employees organized against Amazon’s continued profiting from climate devastation. However, Amazon rejected all 11 shareholder proposals including the employee-led climate resolution at Annual shareholder meeting. Both these studies’ researchers illustrate the dire need to change our outlook towards building Artificial Intelligence models and chips that have an impact on the carbon footprint. However, this does not mean halting the research of AI altogether. Instead there should be an awareness of the environmental impact that training AI models might have. Which in turn can inspire researchers to develop more efficient hardware and algorithms for the future. Responsible tech leadership or climate washing? Microsoft hikes its carbon tax and announces new initiatives to tackle climate change. Microsoft researchers introduce a new climate forecasting model and a public dataset to train these models. Now there’s a CycleGAN to visualize the effects of climate change. But is this enough to mobilize action?

0
0
25447

Packt

11 Jul 2017

19 min read

Massive Graphs on Big Data

Packt

11 Jul 2017

19 min read

In this article by Rajat Mehta, author of the book Big Data Analytics with Java, we will learn about graphs. Graphs theory is one of the most important and interesting concepts of computer science. Graphs have been implemented in real life in a lot of use cases. If you use a GPS on your phone or a GPS device and it shows you a driving direction to a place, behind the scene there is an efficient graph that is working for you to give you the best possible direction. In a social network you are connected to your friends and your friends are connected to other friends and so on. This is a massive graph running in production in all the social networks that you use. You can send messages to your friends or follow them or get followed all in this graph. Social networks or a database storing driving directions all involve massive amounts of data and this is not data that can be stored on a single machine, instead this is distributed across a cluster of thousands of nodes or machines. This massive data is nothing but big data and in this article we will learn how data can be represented in the form of a graph so that we make analysis or deductions on top of these massive graphs. In this article, we will cover: A small refresher on the graphs and its basic concepts A small introduction on graph analytics, its advantages and how Apache Spark fits in Introduction on the GraphFrames library that is used on top of Apache Spark Before we dive deeply into each individual section, let's look at the basic graph concepts in brief. (For more resources related to this topic, see here.) Refresher on graphs In this section, we will cover some of the basic concepts of graphs, and this is supposed to be a refresher section on graphs. This is a basic section, hence if some users already know this information they can skip this section. Graphs are used in many important concepts in our day-to-day lives. Before we dive into the ways of representing a graph, let's look at some of the popular use cases of graphs (though this is not a complete list) Graphs are used heavily in social networks In finding driving directions via GPS In many recommendation engines In fraud detection in many financial companies In search engines and in network traffic flows In biological analysis As you must have noted earlier, graphs are used in many applications that we might be using on a daily basis. Graphs is a form of a data structure in computer science that helps in depicting entities and the connection between them. So, if there are two entities such as Airport A and Airport B and they are connected by a flight that takes, for example, say a few hours then Airport A and Airport B are the two entities and the flight connecting between them that takes those specific hours depict the weightage between them or their connection. In formal terms, these entities are called as vertexes and the relationship between them are called as edges. So, in mathematical terms, graph G = {V, E},that is, graph is a function of vertexes and edges. Let's look at the following diagram for the simple example of a graph: As you can see, the preceding graph is a set of six vertexes and eight edges,as shown next: Vertexes = {A, B, C, D, E, F} Edges = { AB, AF, BC, CD, CF, DF, DE, EF} These vertexes can represent any entities, for example, they can be places with the edges being 'distances' between the places or they could be people in a social network with the edges being the type of relationship, for example, friends or followers. Thus, graphs can represent real-world entities like this. The preceding graph is also called a bidirected graph because in this graph the edges go in either direction that is the edge from A to B can be traversed both ways from A to B as well as from B to A. Thus, the edges in the preceding diagrams that is AB can be BA or AF can be FA too. There are other types of graphs called as directed graphs and in these graphs the direction of the edges go in one way only and does not retrace back. A simple example of a directed graph is shown as follows:. As seen in the preceding graph, the edge A to B goes only in one direction as well as the edge B to C. Hence, this is a directed graph. A simple linked list data structure or a tree datastructure are also forms of graph only. In a tree, nodes can have children only and there are no loops, while there is no such rule in a general graph. Representing graphs Visualizing a graph makes it easily comprehensible but depicting it using a program requires two different approaches Adjacency matrix: Representing a graph as a matrix is easy and it has its own advantages and disadvantages. Let's look at the bidirected graph that we showed in the preceding diagram. If you would represent this graph as a matrix, it would like this: The preceding diagram is a simple representation of our graph in matrix form. The concept of matrix representation of graph is simple—if there is an edge to a node we mark the value as 1, else, if the edge is not present, we mark it as 0. As this is a bi-directed graph, it has edges flowing in one direction only. Thus, from the matrix, the rows and columns depict the vertices. There if you look at the vertex A, it has an edge to vertex B and the corresponding matrix value is 1. As you can see, it takes just one step or O[1]to figure out an edge between two nodes. We just need the index (rows and columns) in the matrix and we can extract that value from it. Also, if you would have looked at the matrix closely, you would have seen that most of the entries are zero, hence this is a sparse matrix. Thus, this approach eats a lot of space in computer memory in marking even those elements that do not have an edge to each other, and this is its main disadvantage. Adjacency list: Adjacency list solves the problem of space wastage of adjacency matrix. To solve this problem, it stores the node and its neighbors in a list (linked list) as shown in the following diagram: For maintaining brevity, we have not shown all the vertices but you can make out from the diagram that each vertex is storing its neighbors in a linked list. So when you want to figure out the neighbors of a particular vertex, you can directly iterate over the list. Of course this has the disadvantage of iterating when you have to figure out whether an edge exists between two nodes or not. This approach is also widely used in many algorithms in computer science. We have briefly seen how graphs can be represented, let's now see some important terms that are used heavily on graphs. Common terminology on graphs We will now introduce you to some common terms and concepts in graphs that you can use in your analytics on top of graphs: Vertices:As we mentioned earlier, vertices are the mathematical terms for the nodes in a graph. For analytic purposes, thevertices count shows the number of nodes in the system, for example, the number of people involved in a social graph. Edges: As we mentioned earlier, edges are the connection between vertices and edges can carry weights. The number of edges represent the number of relations in a system of graph. The weight on a graph represents the intensity of the relationship between the nodes involved; for example, in a social network, the relationship of friends is a stronger relationship than followed between nodes. Degrees: Represent the total number of connections flowing into as well as out of a node. For example, in the previous diagram the degree of node F is four. The degree count is useful, for example, in a social network graph it can represent how well a person is connected if his degree count is very high. Indegrees: This represents the number of connections flowing into a node. For example, in the previous diagram, for node F the indegree value is three. In a social network graph, this might represent how many people can send messages to this person or node. Oudegrees: This represents the number of connections flowing out of a node. For example, in the previous diagram, for node F the outdegree value is one. In a social network graph, this might represent how many people can send messages to this person or node. Common algorithms on graphs Let's look at the three common algorithms that are run on graphs frequently and some of their uses: Breadth first search: Breadth first search is an algorithm for graph traversal or searching. As the name suggests, the traversal occurs across the breadth of the graphs that is to say the neighbors of the node form where traversal starts are searched first before exploring further in the same manner. We will refer to the same graph we used earlier: If we start at vertex A, then according to breadth first search next we search or go at the neighbors of A that's B and F. After that, we will go at the neighbors of B and that will be C. Next we will go to the neighbours of F and those will be E and D. We only go through each node once and this mimics real-life travel as well, as to reach from a point to another point we seldom cover the same road or path again. Thus, our breadth first traversal starting from A will be {A , B , F , C , D , E }. Breadth first search is very useful in graph analytics and can tell us things such as the friends that are not your immediate friends but just at the next level after your immediate friends in a social network or in the case of a graph of a flights network it can show flights with just a single stop or two stops to the destination. Depth first search: This is another way of searching where we start from the source vertice and keep on searching until we reach the end node or the leaf node and then we backtrack. This algorithm is not as performant as the bread first search as it requires lots of traversals. So if you want to know if a node A is connected to node B, you might end up searching along a lot of wasteful nodes that do not have anything to do with the original nodes A and B before coming at the appropriate solution. Dijkstra's shortest path: This is a greedy algorithm to find the shortest path in a graph network. So in a weighted graph, if you need to find the shortest path between two nodes, you can start from the starting node and keep on picking the next node in path greedily to be the one with the least weight (in the case of weights being distances between nodes like in city graphs depicting interconnecting cities and roads).So in a road network, you can find the shortest path between two cities using this algorithm. PageRank algorithm: This is a very popular algorithm that came out from Google and it essentially is used to find the importance of a web page by figuring out how connected it is to other important websites. It gives a page rank score to each of the websites based on this approach and finally the search results are built based on this score. The best part about this algorithm is it can be applied to other areas in life too, for example, in figuring out the important airports in a flight graph, or figuring out the most important people in a social network group. So much for the basics and refresher on graphs, in the next section, we will now see how graphs can be used in real world in massive datasets such as social network data or in data used in the field of biology. We will also study how graph analytics can be used on top of these graphs to derive exclusive deductions. Plotting graphs There is a handy open source Java library called GraphStream, which can be used to plot graphs and this is very useful specially if you want to view the structure of your graphs. While viewing, you can also figure out if some of the vertices are very close to each other (clustered) or in general how they are placed. Using the GraphStream library is easy. Just download the jar from http://graphstream-project.org and put it in the classpath of your project. Next, we will show a simple example demonstrating how easy it is to plot a graph using this library. Just create an instance of a graph. For our example, we will create a simple DefaultGraph and name it SimpleGraph. Next, we will add the nodes or vertices of the graph. We will also add the attribute of the label that is displayed on the vertice. Graph graph = newDefaultGraph("SimpleGraph"); graph.addNode("A" ).setAttribute("ui.label", "A"); graph.addNode("B" ).setAttribute("ui.label", "B"); graph.addNode("C" ).setAttribute("ui.label", "C"); After building the nodes, it's now time to connect these nodes using the edges. The API is simple to use and on the graph instance we can define the edges, provided an ID is given to them and the starting and ending nodes are also given. graph.addEdge("AB", "A", "B"); graph.addEdge("BC", "B", "C"); graph.addEdge("CA", "C", "A"); All the information of nodes and edges is present on the graph instance. It's now time to plot this graph on the UI and we can just invoke the display method on the graph instance as shown next and display it on the UI. graph.display(); This would plot the graph on the UI as follows: This library is extensive and it will be good learning experience to explore this library further and we would urge the readers to further explore this library on their own. Massive graphs on big data Big data comprises huge amount of data distributed across a cluster of thousands (if not more) of machines. Building graphs based on this massive data has different challenges shown as follows: Due to the vast amount of data involved, the data for the graph is distributed across a cluster of machines. Hence, in actuality, it's not a single node graph and we have to build a graph that spans across a cluster of machines. A graph that spans across a cluster of machines would have vertices and edges spread across different machines and this data in a graph won't fit into the memory of one single machine. Consider your friend's list on Facebook; some of your friend's data in your Facebook friend list graph might lie on different machines and this data might be just tremendous in size. Look at an example diagram of a graph of 10 Facebook friends and their network shown as follows: As you can see in the preceding diagram, when for just 10 friends the data can be huge, and here since the graph is drawn by hand we have not even shown a lot of connections to make the image comprehensible, but in real life each person can have say more than thousands of connections. So imagine what will happen to a graph with say thousands if not more people on the list. As shown in the reasons we just saw, building massive graphs on big data is a different ball game altogether and there are few main approaches for building this massive graphs. From the perspective of big data building the massive graphs involve running and storing data parallely on many nodes. The two main approaches are bulk synchronous parallely and the pregel approach. Apache Spark follows the pregel approach. Covering these approaches in detail is out of scope of this book and if the users are interested more on these topics they should refer to other books and the Wikipedia for the same. Graph analytics The biggest advantage to using graphs is you can analyze these graphs and use them for analyzing complex datasets. You might ask what is so special about graph analytics that we can't do by relational databases. Let's try to understand this using an example, suppose we want to analyze your friends network on Facebook and pull information about your friends such as their name, their birth date, their recent likes, and so on. If Facebook had a relational database, then this would mean firing a query on some table using the foreign key of the user requesting this info. From the perspective of relational database, this first level query is easy. But what if we now ask you to go to the friends at level four in your network and fetch their data (as shown in the following diagram). The query to get this becomes more and more complicated from a relational database perspective but this is a trivial task on a graph or graphical database (such as Neo4j). Graphs are extremely good on operations where you want to pull information from one end of the node to another, where the other node lies after a lot of joins and hops. As such, graph analytics is good for certain use cases (but not for all use cases, relation database are still good on many other use cases). As you can see, the preceding diagram depicts a huge social network (though the preceding diagram might just be depicting a network of a few friends only). The dots represent actual people in a social network. So if somebody asks to pick one user on the left-most side of the diagram and see and follow host connections to the right-most side and pull the friends at the say 10th level or more, this is something very difficult to do in a normal relational database and doing it and maintaining it could easily go out of hand. There are four particular use cases where graph analytics is extremely useful and used frequently (though there are plenty more use cases too) Path analytics: As the name suggests, this analytics approach is used to figure out the paths as you traverse along the nodes of a graph. There are many fields where this can be used—simplest being road networks and figuring out details such as shortest path between cities, or in flight analytics to figure out the shortest time taking flight or direct flights. Connectivity analytics: As the name suggests, this approach outlines how the nodes within a graph are connected to each other. So using this you can figure out how many edges are flowing into a node and how many are flowing out of the node. This kind of information is very useful in analysis. For example, in a social network if there is a person who receives just one message but gives out say ten messages within his network then this person can be used to market his favorite products as he is very good in responding to messages. Community Analytics: Some graphs on big data are huge. But within these huge graphs there might be nodes that are very close to each other and are almost stacked in a cluster of their own. This is useful information as based on this you can extract out communities from your data. For example, in a social network if there are people who are part of some community, say marathon runners, then they can be clubbed into a single community and further tracked. Centrality Analytics: This kind of analytical approach is useful in finding central nodes in a network or graph. This is useful in figuring out sources that are single handedly connected to many other sources. It is helpful in figuring out influential people in a social network, or a central computer in a computer network. From the perspective of this article, we will be covering some of these use cases in our sample case studies and for this we will be using a library on Apache Spark called GraphFrames. GraphFrames GraphX library is advanced and performs well on massive graphs, but, unfortunately, it's currently only implemented in Scala and does not have any direct Java API. GraphFrames is a relatively new library that is built on top of Apache Spark and provides support for dataframe (now dataset) based graphs.It contains a lot of methods that are direct wrappers over the underlying sparkx methods. As such it provides similar functionality as GraphX except that GraphX acts on the Spark SRDD and GraphFrame works on the dataframe so GraphFrame is more user friendly (as dataframes are simpler to use). All the advantages of firing Spark SQL queries, joining datasets, filtering queries are all supported on this. To understand GraphFrames and representing massive big data graphs, we will take small baby steps first by building some simple programs using GraphFrames before building full-fledged case studies. Summary In this article, we learned about graph analytics. We saw how graphs can be built even on top of massive big datasets. We learned how Apache Spark can be used to build these massive graphs and in the process we learned about the new library GraphFrames that helps us in building these graphs. Resources for Article: Further resources on this subject: Saying Hello to Java EE [article] Object-Oriented JavaScript [article] Introduction to JavaScript [article]

0
0
25378

article-image-key-takeaways-from-sundar-pichais-congress-hearing-over-user-data-political-bias-and-project-dragonfly

Natasha Mathur

14 Dec 2018

12 min read

Key Takeaways from Sundar Pichai’s Congress hearing over user data, political bias, and Project Dragonfly

Natasha Mathur

14 Dec 2018

12 min read

Google CEO, Sundar Pichai testified before the House Judiciary Committee earlier this week. The hearing titled “Transparency & Accountability: Examining Google and its Data Collection, Use, and Filtering Practices” was a three-and-a-half-hour question-answer session that centered mainly around user data collection at Google, allegations of political bias in its search algorithms, and Google’s controversial plans with China. “All of these topics, competition, censorship, bias, and others..point to one fundamental question that demands the nation’s attention. Are America’s technology companies serving as instruments of freedom or instruments of control?,” said Representative Kevin McCarthy of California, the House Republican leader. The committee members could have engaged with Pichai on more important topics had they not been busy focussing on opposing each other’s opinions over whether Google search and its other products are biased against conservatives. Also, most of Pichai’s responses were unsatisfactory as he cleverly dodged questions regarding its Project Dragonfly and user data. Here are the key highlights from the testimony. Allegations of Political Bias One common theme throughout the long hearing session was Republicans asking questions based around alleged bias against conservatives on Google's platforms. Google search Bias Rep. Lamar Smith asked questions regarding the alleged political bias that is “imbibed” in Google’s search algorithms and its culture. Smith presented an example of a study by Robert Epstein, a Harvard trained psychologist. As per the study’s results, Google’s search bias likely swung 2.6 million votes to Hillary Clinton during the 2016 elections. To this Pichai’s reply was that Google has investigated some of the studies including the one by Dr. Epstein, and found that there were issues with the methodology and its sample size. He also mentioned how Google evaluates their search results for accuracy by using a “robust methodology” that it has been using for the past 20 years. Pichai also added that “providing users with high quality, accurate, and trusted information is sacrosanct to us. It’s what our principles are and our business interests and our natural long-term incentives are aligned with that. We need to serve users everywhere and we need to earn their trust in order to do so.” Google employees’ bias, the reason for biased search algorithms, say Republicans Smith also presented examples of pro-Trump content and immigration laws being tagged as hate speech on Google search results posing threat to the democratic form of government. He also alleged that people at Google were biased and intentionally transferred their biases into these search algorithms to get the results they want and management allows it. Pichai clarified that Google doesn't manually intervene on any particular search result. “Google doesn’t choose conservative voices over liberal voices. There’s no political bias and Google operates in a neutral way,” added Pichai. Would Google allow an independent third party to study its search results to determine the degree of political bias? Pichai responded to this question saying that they already have third parties that are completely independent and haven’t been appointed by Google in place for evaluating its search algorithms. “We’re transparent as to how we evaluate our search. We publish our rater guidelines. We publish it externally and raters evaluate it, we’re trying hard to understand what users want and this is what we think is right. It’s not possible for an employee or a group of employees to manipulate our search algorithm”. Political advertising bias The Committee Chairman Bob Goodlatte, a Republican from Virginia also asked Pichai about political advertising bias on Google’s ad platforms that offer different rates for different political candidates to reach prospective voters. This is largely different than how other competitive media platforms like TV and radio operate - offering the lowest rate to all political candidates. He asked if Google should charge the same effective ad rates to political candidates. Pichai explained that their advertising products are built without any bias and the rates are competitive and set by a live auction process. The prices are calculated automatically based on the keywords that you’re bidding for, and on the demand in the auction. There won’t be a difference in rates based on any political reasons unless there are keywords that are of particular interest. He referred the whole situation to a demand-supply equilibrium, where the rates can differ but that will vary from time to time. There could be occasions when there is a substantial difference in rates based on the time of the day, location, how keywords are chosen etc, and it’s a process that Google has been using for over 20 years. Pichai further added that “anything to do with the civic process, we make sure to do it in a non-partisan way and it's really important for us”. User data collection and security Another highlight of the hearing was Google’s practices around user data collection and security. “Google is able to collect an amount of information about its users that would even make the NSA blush. Americans have no idea the sheer volume of information that is collected”, said Goodlatte. Location tracking data related privacy concerns During Mr. Pichai’s testimony, the first question from Rep. Goodlatte was about whether consumers understand the frequency and amount of location data that Google collects from its Android operating system. Goodlatte asked Pichai about the collection of location data and apps running on Android. To this Pichai replied that Google offers users controls for limiting location data collection. “We go to great lengths to protect their privacy, we give them transparency, choice, and control,” says Pichai. Pichai highlighted that Android is a powerful smartphone that offers services to over 2 billion people. User data that is collected via Android depends on the applications that users choose to use. He also pointed out that Google makes it very clear to its users about what information is collected. He pointed out that there are terms of service and also a “privacy checkup”. Going to “my account” settings on Gmail gives you a clear picture of what user data they have. He also says that users can take that data to other platforms if they choose to stop using Google. On Google+ data breach Another Rep. Jerrold Nadler talked about the recent Google plus data breach that affected some 52.5 million users. He asked Pichai about the legal obligations that the company is under to publicly expose the security issues. Pichai responded to this saying that Google “takes privacy seriously,” and that Google needs to alert the users and the necessary authorities of any kind of data breach or bugs within 72 hours. He also mentions "building software inevitably has bugs associated as part of the process”. Google undertakes a lot of efforts to find bugs and the root cause of it, and make sure to take care of it. He also says how they have advanced protection in Gmail to offer a stronger layer of security to its users. Google’s commitment to protecting U.S. elections from foreign interference It was last year when Google discovered that Russian operatives spent tens of thousands of dollars on ads on its YouTube, Gmail and Google Search products in an effort to meddle in the 2016 US presidential election. “Does Google now know the full extent to which its online platforms were exploited by Russian actors in the election 2 years ago?” asked Nadler. Pichai responded that Google conducted a thorough investigation in 2016. It found out that there were two main ads accounts linked to Russia which advertised on google for about 4700 dollars in advertising. “We found a limited activity, improper activity, we learned from that and have increased the protections dramatically we have around our elections offering”, says Pichai. He also added that to protect the US elections, Google will do a significant review of how ads are bought, it will look for the origin of these accounts, share and collaborate with law enforcement, and other tech companies. “Protecting our elections is foundational to our democracy and you have my full commitment that we will do that,” said Pichai. Google’s plans with China Rep. Sheila Jackson Lee was the first person to directly ask Pichai about the company’s Project Dragonfly i.e. its plans of building a censored search engine with China. “We applauded you in 2010 when Google took a very powerful stand principle and democratic values over profits and came out of China,” said Jackson. Other who asked Pichai regarding Google's China plans were Rep. Tom Marino and Rep. David Cicilline. Google left China in 2010 because of concerns regarding hacking, attacks, censorship, and how the Chinese government was gaining access to its data. How is working with the Chinese govt to censor search results a part of Google’s core values? Pichai repeatedly said that Google has no plans currently to launch in China. “We don't have a search product there. Our core mission is to provide users with access to information and getting access to information is an important right (of users) so we try hard to provide that information”, says Pichai. He added that Google always has evidence based on every country that it has operated in. “Us reaching out and giving users more information has a very positive impact and we feel that calling but right now there are no plans to launch in China,” says Pichai. He also mentioned that if Google ever approaches a decision like that he’ll be fully transparent with US policymakers and “engage in consult widely”. He further added that Google only provides Android services in China for which it has partners and manufacturers all around the world. “We don't have any special agreements on user data with the Chinese government”, said Pichai. On being asked by Rep. Marino about a report from The Intercept that said Google created a prototype for a search engine to censor content in China, Pichai replied, “we designed what a search could look like if it were to be launched in a country like China and that’s what we explored”. Rep. Cicilline asked Pichai whether any employees within Google are currently attending product meetings on Dragonfly. Pichai replied evasively saying that Google has “undertaken an internal effort, but right now there are no plans to launch a search service in China necessarily”. Cicilline shot another question at Pichai asking if Google employees are talking to members of the Chinese government, which Pichai dodged by responding with "Currently we are not in discussions around launching a search product in China," instead. Lastly, when Pichai was asked if he would rule out "launching a tool for surveillance and censorship in China”, he replied that Google’s mission is providing users with information, and that “we always think it’s in our duty to explore possibilities to give users access to information. I have a commitment, but as I’ve said earlier we’ll be very thoughtful and we’ll engage widely as we make progress”. On ending forced arbitration for all forms of discrimination Last month 20,000 Google employees along with Temps, Vendors, and Contractors walked out of their respective Google offices to protest discrimination and sexual harassment in the workplace. As part of the walkout, Google employees laid out five demands urging Google to bring about structural changes within the workplace. One of the demands was ending forced arbitration meaning that Google should no longer require people to waive their right to sue. Also, that every co-worker should have the right to bring a representative, or supporter of their choice when meeting with HR for filing a harassment claim. Rep. Pramila Jayapal asked Pichai if he can commit to expanding the policy of ending forced arbitration for any violation of an employee’s (also contractors) right not just sexual harassment. To this Pichai replied that Google is currently definitely looking into this further. “It’s an area where I’ve gotten feedback personally from our employees so we’re currently reviewing what we could do and I’m looking forward to consulting, and I’m happy to think about more changes here. I’m happy to have my office follow up to get your thoughts on it and we are definitely committed to looking into this more and making changes”, said Pichai. Managing misinformation and hate speech During the hearing, Pichai was questioned about how Google is handling misinformation and hate speech. Rep. Jamie Raskin asked why videos promoting conspiracy theory known as “Frazzledrip,” ( Hillary Clinton kills young women and drinks their blood) are still allowed on YouTube. To this Pichai responded with, “We would need to validate whether that specific video violates our policies”. Rep. Jerry Nadler also asked Pichai about Google’s actions to "combat white supremacy and right-wing extremism." Pichai said Google has defined policies against hate speech and that if Google finds violations, it takes down the content. “We feel a tremendous sense of responsibility to moderate hate speech, define hate speech clearly inciting violence or hatred towards a group of people. It's absolutely something we need to take a strict line on. We’ve stated our policies strictly and we’re working hard to make our enforcement better and we’ve gotten a lot better but it's not enough so yeah we’re committed to doing a lot more here”, said Pichai. Our Take Hearings between tech companies and legislators, in the current form, are an utter failure. In addition to making tech reforms, there is an urgent need to also make reforms in how policy hearings are conducted. It is high time we upgraded ourselves to the 21st century. These were the key highlights of the hearing held on 11th December 2018. We recommend you watch the complete hearing for a more comprehensive context. As Pichai defends Google’s “integrity” ahead of today’s Congress hearing, over 60 NGOs ask him to defend human rights by dropping Drag Google bypassed its own security and privacy teams for Project Dragonfly reveals Intercept Google employees join hands with Amnesty International urging Google to drop Project Dragonfly

0
0
25368

article-image-scientific-analysis-of-donald-trumps-tweets-on-covid-19-with-transformers

Expert Network

19 May 2021

7 min read

Scientific Analysis of Donald Trump’s Tweets on COVID-19 with Transformers

Expert Network

19 May 2021

7 min read

It takes time and effort to figure out what is fake news and what isn't. Like children, we have to work our way through something we perceive as fake news. This article is an excerpt from the book Transformers for Natural Language Processing by Denis Rothman – A comprehensive guide for deep learning & NLP practitioners, data analysts and data scientists who want an introduction to AI language understanding to process the increasing amounts of language-driven functions. In this article, we will focus on the logic of fake news. We will run the BERT model on SRL and visualize the results on AllenNLP.org. Now, let's go through some presidential tweets on COVID-19. Our goal is certainly not to judge anybody or anything. Fake news involves both opinion and facts. News often depends on the perception of facts by local culture. We will provide ideas and tools to help others gather more information on a topic and find their way in the jungle of information we receive every day. Semantic Role Labeling (SRL) SRL is an excellent educational tool for all of us. We tend just to read Tweets passively and listen to what others say about them. Breaking messages down with SRL is a good way to develop social media analytical skills to distinguish fake from accurate information. I recommend using SRL transformers for educational purposes in class. A young student can enter a Tweet and analyze each verb and its arguments. It could help younger generations become active readers on social media. We will first analyze a relatively undivided Tweet and then a conflictual Tweet. Analyzing the undivided Tweet Let's analyze the latest Tweet found on July 4 while writing the book, Transformers for Natural Language Processing. I took the name of the person who is referred to as a "Black American" out and paraphrased some of the former President's text: "X is a great American, is hospitalized with coronavirus, and has requested prayer. Would you join me in praying for him today, as well as all those who are suffering from COVID-19?" Let's go to AllenNLP.org, visualize our SRL using https://demo.allennlp.org/semantic-role-labeling, run the sentence, and look at the result. The verb "hospitalized" shows the member is staying close to the facts: Figure: SRL arguments of the verb "hospitalized" The message is simple: "X" + "hospitalized" + "coronavirus." The verb "requested" shows that the message is becoming political: Figure: SRL arguments of the verb "requested" We don't know if the person requested the former President to pray or he decided he would be the center of the request. A good exercise would be to display an HTML page and ask the users what they think. For example, the users could be asked to look at the results of the SRL task and answer the two following questions: "Was former President Trump asked to pray, or did he deviate a request made to others for political reasons?" "Is the fact that former President Trump states that he was indirectly asked to pray for X fake news or not?" You can think about it and decide for yourself! Analyzing the Banned Tweet Let's have a look at one that was banned from Twitter. I took the names out and paraphrased it and toned it down. Still, when we run it on AllenNLP.org and visualize the results, we get some surprising SRL outputs. Here is the toned-down and paraphrased Tweet: These thugs are dishonoring the memory of X. When the looting starts, actions must be taken. Although I suppressed the main part of the original Tweet, we can see that the SRL task shows the bad associations made in the Tweet: Figure: SRL arguments of the verb "dishonoring" An educational approach to this would be to explain that we should not associate the arguments "thugs" and "memory" and "looting." They do not fit together at all. An important exercise would be to ask a user why the SRL arguments do not fit together. I recommend many such exercises so that the transformer model users develop SRL skills to have a critical view of any topic presented to them. Critical thinking is the best way to stop the propagation of the fake news pandemic! We have gone through rational approaches to fake news with transformers, heuristics, and instructive websites. However, in the end, a lot of the heat in fake news debates boils down to emotional and irrational reactions. In a world of opinion, you will never find an entirely objective transformer model that detects fake news since opposing sides never agree on what the truth is in the first place! One side will agree with the transformer model's output. Another will say that the model is biased and built by enemies of their opinion! The best approach is to listen to others and try to keep the heat down! Looking for the silver bullet Looking for a silver bullet transformer model can be time-consuming or rewarding, depending on how much time and money you want to spend on continually changing models. For example, a new approach to transformers can be found through disentanglement. Disentanglement in AI allows you to separate the features of a representation to make the training process more flexible. Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen designed DeBERTa, a disentangled version of a transformer, and described the model in an interesting article: DeBERTa: Decoding-enhanced BERT with Disentangled Attention, https://arxiv.org/ abs/2006.03654 The two main ideas implemented in DeBERTa are: Disentangle the content and position in the transformer model to train the two vectors separately. Use an absolute position in thedecoderto predict masked tokens in the pretraining process. The authors provide the code on GitHub: https://github.com/microsoft/DeBERTa DeBERTa exceeds the human baseline on the SuperGLUE leaderboard in December 2020 using 1.5B parameters. Should you stop everything you are doing on transformers and rush to this model, integrate your data, train the model, test it, and implement it? It is very probable that by the end of 2021, another model will beat this one and so on. Should you change models all of the time in production? That will be your decision. You can also choose to design better training methods. Looking for reliable training methods Looking for reliable training methods with smaller models such as the PET designed by Timo Schick can also be a solution. Why? Being in a good position on the SuperGLUE leaderboard does not mean that the model will provide a high quality of decision-making for medical, legal, and other critical areas for sequence predications. Looking for customized training solutions for a specific topic could be more productive than trying all the best transformers on the SuperGLUE leaderboard. Take your time to think about implementing transformers to find the best approach for your project. We will now conclude the article. Summary Fake news begins deep inside our emotional history as humans. When an event occurs, emotions take over to help us react quickly to a situation. We are hardwired to react strongly when we are threatened. We went through raging conflicts over COVID-19, former President Trump, and climate change. In each case, we saw that emotional reactions are the fastest ones to build up into conflicts. We then designed a roadmap to take the emotional perception of fake news to a rational level. We showed that it is possible to find key information in Tweets, Facebook messages, and other media. The news used in this article is perceived by some as real news and others as fake news to create a rationale for teachers, parents, friends, co-workers, or just people talking. About the Author Denis Rothman graduated from Sorbonne University and Paris-Diderot University, patenting one of the very first word2matrix embedding solutions. Denis Rothman is the author of three cutting-edge AI solutions: one of the first AI cognitive chatbots more than 30 years ago; a profit-orientated AI resource optimizing system; and an AI APS (Advanced Planning and Scheduling) solution based on cognitive patterns used worldwide in aerospace, rail, energy, apparel, and many other fields. Designed initially as a cognitive AI bot for IBM, it then went on to become a robust APS solution used to this day.

0
0
25196

article-image-uber-ai-labs-senior-research-scientist-ankit-jain-tensorflow-updates-learning-machine-learning

Sugandha Lahoti

19 Dec 2019

10 min read

Uber AI Labs senior research scientist, Ankit Jain on TensorFlow updates and learning machine learning by doing [Interview]

Sugandha Lahoti

19 Dec 2019

10 min read

No doubt, TensorFlow is one of the most popular machine learning libraries right now. However, newbie developers who want to experiment with TensorFlow often face difficulties in learning TensorFlow, relying just on tutorials. Recently, we sat down with Ankit Jain, senior research scientist at Uber AI Labs and one of the authors of the book, TensorFlow Machine Learning Projects. Ankit talked about how real-world implementations can be a good way to learn for those developing TF models, specifically the ‘learn by doing’ approach. Talking about TensorFlow 2.0, he considers ‘eager execution by default’ a major paradigm shift and is all game for interoperability between TF 2.0 and other machine learning frameworks. He also gave us an insight into the limitations of AI algorithms (generalization, AI ethics, labeled data to name a few). Continue reading the full interview for a detailed perspective. On why TensorFlow 2 upgrade is paradigm-shifting in more ways than one TensorFlow 2 was released last month. What are some of your top features in TensorFlow 2.0? How do you think it has upgraded the machine learning ecosystem? TF 2.0 is a major upgrade from its predecessor in many ways. It addressed many of the shortcomings of TF 1.x and with this release, the difference between Pytorch and TF has narrowed. One of the biggest paradigm shifts in TF 2.0 is eager execution by default. This means you don’t have to pre-define a static computation graph, create sessions, deal with the unintuitive interface or have painful experience in debugging your deep learning model code. However, you lose on some performance in run time when you switch to complete eager mode. For that purpose, they have introduced tf.function decorator which can help you translate your Python functions to Tensorflow graphs. This way you can retain both code readability and ease of debugging while getting the performance of TensorFlow graphs. Another major update is that many confusing redundancies have been consolidated and many functions are now integrated with Keras API. This will help to standardize the communication of data/models among various components of TensorFlow ecosystem. TF 2.0 also comes with backward compatibility to TF 1.X with an easy optional way to convert your TF 1.X code into TF 2.0. TF 1.X suffered from a lack of standardization in how we load/save trained machine learning models. TF 2.0 fixed this by defining a single API SavedModels. As SavedModels is integrated with the Tensorflow ecosystem, it becomes much easier to deploy models using Tensorflow Lite, Tensorflow.js to other devices/applications. With the onset of TensorFlow 2, Tensorflow and Keras are integrated into one module (tf.keras). TF 2.0 now delivers Keras as the central high-level API used to build and train models. What is the future/benefits of TensorFlow + Keras? Keras has been a very popular high-level API for faster prototyping and production and even for research. As the field of AI/ML is in nascent stages, ease of development can have a huge impact for people getting started in machine learning. Previously, a developer new to machine learning started from Keras while an experienced researcher used only Tensorflow 1.x due to its flexibility to build custom models. With Keras integrated as a high level API for TF 2.0, we can expect both beginners and experts working on the same framework which can lead to better collaboration and better exchange of ideas in the community. Additionally, a single high level easy to use API reduces confusion and streamlines consistency across use cases of production and research. Overall, I think it’s a great step in the right direction by Google which will enable more developers to hop on the Tensorflow ecosystem. On TensorFlow, NLP and structured learning Recently, Transformers 2.0, a popular OS NLP library, was released that provides TF 2.0 and PyTorch deep interoperability. What are your views on this development? One of the areas where deep learning has made an immense impact is Natural Language Processing (NLP). Research in NLP is moving very fast and it is hard to keep up with all the papers and code releases by various research groups around the world. Hugging Face, the company behind the library “Transformers” has really eased the usage of state of the art (SOTA) models and process of building new models by simplifying the preprocessing and model building pipeline through an easy to use Keras like interface. “Transformers 2.0” is the recent release from the company and the most important feature is the interoperability between Pytorch and TF 2.0. TF 2.0 is more production-ready while Pytorch is more oriented towards research. With this upgrade, you can pretty much move from one framework to another for training, validation, and deployment of the model. Interoperability between frameworks is very important for the AI community as it enables development velocity. Moreover, as none of the frameworks can be perfect at everything, it makes the framework developers focus more on their strengths and make those features seamless. This will create greater efficiency going forward. Overall, I think this is a great development and I expect other libraries in domains like Computer Vision, Graph Learning etc. to follow suit. This will enable a lot more application of state of the art models to production. Google recently launched Neural Structured Learning (NSL), an open-source Tensorflow based framework for training neural networks with graphs and structured data. What are some of the potential applications of NSL? What do you think can be some Machine Learning Projects based around NSL? Neural structured learning is a concept of learning neural network parameters with structured signals other than features. Many real-world datasets contain some structured information like Knowledge graphs or molecular graphs in biology. Incorporating these signals can lead to a more accurate and robust model. From an implementation perspective, it boils down to adding a regularizer to the loss function such that the representation of neighboring nodes in the graph is similar. Any application where the amount of labeled data is limited but has structural information like Knowledge Graph that can be exploited is a good candidate for these types of models. A possible example could be fraud detection in online systems. Fraud data generally has sparse labels and fraudsters create multiple accounts that are connected to each other through some information like devices etc. This structured information can be utilized to learn a better representation of fraud accounts. There can be other applications is molecular data and other problems involving the knowledge graph. On Ankit’s experience working on his book, TensorFlow Machine Learning Project Tell us the motivation behind writing your book TensorFlow Machine Learning Projects. Why is TensorFlow ideal for building ML projects? What are some of your favorite machine learning projects from this book? When I started learning Tensorflow, I stumbled upon many tutorials (including the official ones) which explained various concepts on how Tensorflow works. While that was helpful in understanding the basics, most of my learning came from building projects with Tensorflow. That is when I realized the need for a resource that teaches using a ‘learn by doing’ approach. This book is unique in the way that it teaches machine learning theory, Tensorflow utilities and programming concepts all while developing a project in which you can have fun building and is also of practical use. My favorite chapter from the book is “Generating Uncertainty in Traffic Signs Classifier using Bayesian Neural Networks”. With the development of self-driving cars, traffic signs detection is a major problem that needs to be solved. This chapter explains an advanced AI concept of Bayesian Neural Networks and shows step by step how to use those to detect traffic signs using Tensorflow. Some of the readers of the book have started to use this concept in their practical applications already. Machine Learning challenges and advice to those developing TensorFlow models What are the biggest challenges today in the field of Machine Learning and AI? What do you see as the greatest technology disruptors in the next 5 years? While AI and machine learning has seen huge success in recent years, there are few limitations of AI algorithms as we see today. Some of the major ones are: Labeled Data: Most of the success of AI has come from supervised learning. Many of the recent supervised deep learning algorithms require huge quantities of labeled data which is expensive to obtain. For example, obtaining huge amounts of clinical trial data for healthcare prediction is very challenging. The good news is that there is some research around building good ML models using sparse data labels. Explainability: Deep learning models are essentially a “black box” where you don’t know what factor(s) led to the prediction. For some applications like money lending, disease diagnosis, fraud detection etc. the explanations of predictions become very important. Currently, we see some nascent work in this direction with LIME and SHAP libraries. Generalization: In the current state of AI, we build one model for each application. We still don’t have good generality of models from one task to another. Generalization, if solved, can lead us to truly Artificial General Intelligence (AGI). Thankfully approaches like transfer learning and meta-learning are trying to solve this challenge. Bias, Fairness, and Ethics: An output of the machine learning model is heavily based on the input training data. Many a time, training data can have biases towards particular ethnicities, classes, religions, etc. We need more solutions in this direction to build trust in AI algorithms. Overall, I feel, AI is becoming mainstream and in the next 5 years we will see many traditional industries adopt AI to solve critical business problems and achieve more automation. At the same time, tooling for AI will keep on improving which will also help in its adoption. What is your advice for those developing machine learning projects on TensorFlow? Building projects with new techniques and technologies is a hard process. It requires patience, dealing with failures and hard work. For that reason, it is very important to pick up a project that you are passionate about. This way, you will continue building even if you are stuck somewhere. The selection of the right projects is by far the most important criterion in the project-based learning method. About the Author Ankit currently works as a Senior Research Scientist at Uber AI Labs, the machine learning research arm of Uber. His work primarily involves the application of Deep Learning methods to a variety of Uber’s problems ranging from food recommendation system, forecasting to self-driving cars. Previously, he has worked in a variety of data science roles at Bank of America, Facebook and other startups. Additionally, he has been a featured speaker in many of the top AI conferences and universities across the US, including UC Berkeley, OReilly AI conference etc. He completed his MS from UC Berkeley and a BS from IIT Bombay (India). You can find him on Linkedin, Twitter, and GitHub. About the Book With the help of this book, TensorFlow Machine Learning Projects you’ll not only learn how to build advanced projects using different datasets but also be able to tackle common challenges using a range of libraries from the TensorFlow ecosystem. To start with, you’ll get to grips with using TensorFlow for machine learning projects; you’ll explore a wide range of projects using TensorForest and TensorBoard for detecting exoplanets, TensorFlow.js for sentiment analysis, and TensorFlow Lite for digit classification. As you make your way through the book, you’ll build projects in various real-world domains. By the end of this book, you’ll have gained the required expertise to build full-fledged machine learning projects at work.

0
0
25113

How-To Tutorials - Data

Three.js - Materials and Texture

Implement Reinforcement learning using Markov Decision Process [Tutorial]

DataCamp reckons with its #MeToo movement; CEO steps down from his role indefinitely

MySQL Data Transfer using Sql Server Integration Services (SSIS)

How to write effective Stored Procedures in PostgreSQL

Troubleshooting in SQL Server

Can a modified MIT ‘Hippocratic License’ to restrict misuse of open source software prompt a wave of ethical innovation in tech?

6 index types in PostgreSQL 10 you should know

20 lessons on bias in machine learning systems by Kate Crawford at NIPS 2017

Alarming ways governments are using surveillance tech to watch you

Trending Topics

Deep learning models have massive carbon footprints, can photonic chips help reduce power consumption?

Massive Graphs on Big Data

Key Takeaways from Sundar Pichai’s Congress hearing over user data, political bias, and Project Dragonfly

Scientific Analysis of Donald Trump’s Tweets on COVID-19 with Transformers

Uber AI Labs senior research scientist, Ankit Jain on TensorFlow updates and learning machine learning by doing [Interview]

Create a Free Account To Continue Reading

SignIn Free Account To Continue Reading