Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-building-recommendation-system-azure
Packt
19 Feb 2016
7 min read
Save for later

Building A Recommendation System with Azure

Packt
19 Feb 2016
7 min read
Recommender systems are common these days. You may not have noticed, but you might already be a user or receiver of such a system somewhere. Most of the well-performing e-commerce platforms use recommendation systems to recommend items to their users. When you see on the Amazon website that a book is recommended to you based on your earlier preferences, purchases, and browse history, Amazon is actually using such a recommendation system. Similarly, Netflix uses its recommendation system to suggest movies for you. (For more resources related to this topic, see here.) A recommender or recommendation system is used to recommend a product or information often based on user characteristics, preferences, history, and so on. So, a recommendation is always personalized. Until recently, it was not so easy or straightforward to build a recommender, but Azure ML makes it really easy to build one as long as you have your data ready. This article introduces you to the concept of recommendation systems and also the model available in ML Studio for you to build your own recommender system. It then walks you through the process of building a recommendation system with a simple example. The Matchbox recommender Microsoft has developed a large-scale recommender system based on a probabilistic model (Bayesian) called Matchbox. This model can learn about a user's preferences through observations made on how they rate items, such as movies, content, or other products. Based on those observations, it recommends new items to the users when requested. Matchbox uses the available data for each user in the most efficient way possible. The learning algorithm it uses is designed specifically for big data. However, its main feature is that Matchbox takes advantage of metadata available for both users and items. This means that the things it learns about one user or item can be transferred across to other users or items. You can find more information about the Matchbox model at the Microsoft Research project link. Kinds of recommendations The Matchbox recommender supports the building of four kinds of recommenders, which will include most of the scenarios. Let's take a look at the following list: Rating Prediction: This predicts ratings for a given user and item, for example, if a new movie is released, the system will predict what will be your rating for that movie out of 1-5. Item Recommendation: This recommends items to a given user, for example, Amazon suggests you books or YouTube suggests you videos to watch on its home page (especially when you are logged in). Related Users: This finds users that are related to a given user, for example, LinkedIn suggests people that you can get connected to or Facebook suggests friends to you. Related Items: This finds the items related to a given item, for example, a blog site suggests you related posts when you are reading a blog post. Understanding the recommender modules The Matchbox recommender comes with three components; as you might have guessed, a module each to train, score, and evaluate the data. The modules are described as follows. The train Matchbox recommender This module contains the algorithm and generates the trained algorithm, as shown in the following screenshot: This module takes the values for the following two parameters. The number of traits This value decides how many implicit features (traits) the algorithm will learn about that are related to every user and item. The higher this value, the precise it would be as it would lead to better prediction. Typically, it takes a value in the range of 2 to 20. The number of recommendation algorithm iterations It is the number of times the algorithm iterates over the data. The higher this value, the better would the predictions be. Typically, it takes a value in the range of 1 to 10. The score matchbox recommender This module lets you specify the kind of recommendation and corresponding parameters you want: Rating Prediction Item Prediction Related Users Related Items Let's take a look at the following screenshot: The ML Studio help page for the module provides details of all the corresponding parameters. The evaluate recommender This module takes a test and a scored dataset and generates evaluation metrics, as shown in the following screenshot: It also lets you specify the kind of recommendation, such as the score module and corresponding parameters. Building a recommendation system Now, it would be worthwhile that you learn to build one by yourself. We will build a simple recommender system to recommend restaurants to a given user. ML Studio includes three sample datasets, described as follows: Restaurant customer data: This is a set of metadata about customers, including demographics and preferences, for example, latitude, longitude, interest, and personality. Restaurant feature data: This is a set of metadata about restaurants and their features, such as food type, dining style, and location, for example, placeID, latitude, longitude, price. Restaurant ratings: This contains the ratings given by users to restaurants on a scale of 0 to 2. It contains the columns: userID, placeID, and rating. Now, we will build a recommender that will recommend a given number of restaurants to a user (userID). To build a recommender perform the following steps: Create a new experiment. In the Search box in the modules palette, type Restaurant. The preceding three datasets get listed. Drag them all to the canvas one after another. Drag a Split module and connect it to the output port of the Restaurant ratings module. On the properties section to the right, choose Splitting mode as Recommender Split. Leave the other parameters at their default values. Drag a Project Columns module to the canvas and select the columns: userID, latitude, longitude, interest, and personality. Similarly, drag another Project Columns module and connect it to the Restaurant feature data module and select the columns: placeID, latitude, longitude, price, the_geom_meter, and address, zip. Drag a Train Matchbox Recommender module to the canvas and make connections to the three input ports, as shown in the following screenshot: Drag a Score Matchbox Recommender module to the canvas and make connections to the three input ports and set the property's values, as shown in the following screenshot: Run the experiment and when it gets completed, right-click on the output of the Score Matchbox Recommender module and click on Visualize to explore the scored data. You can note the different restaurants (IDs) recommended as items for a user from the test dataset. The next step is to evaluate the scored prediction. Drag the Evaluate Recommender module to the canvas and connect the second output of the Split module to its first input port and connect the output of the Score Matchbox Recommender module to its second input. Leave the module at its default properties. Run the experiment again and when finished, right-click on the output port of the Evaluate Recommender module and click on Visualize to find the evaluation metric. The evaluation metric Normalized Discounted Cumulative Gain (NDCG) is estimated from the ground truth ratings given in the test set. Its value ranges from 0.0 to 1.0, where 1.0 represents the most ideal ranking of the entities. Summary You started with gaining the basic knowledge about a recommender system. You then understood the Matchbox recommender that comes with ML Studio along with its components. You also explored different kinds of recommendations that you can make with it. Finally, you ended up building a simple recommendation system to recommend restaurants to a given user. For more information on Azure, take a look at the following books also by Packt Publishing: Learning Microsoft Azure (https://www.packtpub.com/networking-and-servers/learning-microsoft-azure) Microsoft Windows Azure Development Cookbook (https://www.packtpub.com/application-development/microsoft-windows-azure-development-cookbook) Resources for Article: Further resources on this subject: Introduction to Microsoft Azure Cloud Services[article] Microsoft Azure – Developing Web API for Mobile Apps[article] Security in Microsoft Azure[article]
Read more
  • 0
  • 0
  • 19211

article-image-ai-now-institute-publishes-a-report-on-the-diversity-crisis-in-ai-and-offers-12-solutions-to-fix-it
Bhagyashree R
22 Apr 2019
7 min read
Save for later

AI Now Institute publishes a report on the diversity crisis in AI and offers 12 solutions to fix it

Bhagyashree R
22 Apr 2019
7 min read
Earlier this month, the AI Now Institute published a report, authored by Sarah Myers West, Meredith Whittaker, and Kate Crawford, highlighting the link between the diversity issue in the current AI industry and the discriminating behavior of AI systems. The report further recommends some solutions to these problems that companies and the researchers behind these systems need to adopt to address these issues. Sarah Myers West is a postdoc researcher at the AI Now Institute and an affiliate researcher at the Berkman-Klein Center for Internet and Society. Meredith Whittaker is the co-founder of the AI Now Institute and leads Google's Open Research Group and the Google Measurement Lab.  Kate Crawford is a Principal Researcher at Microsoft Research and the co-founder and Director of Research at the AI Now Institute. Kate Crawford tweeted about this study. https://twitter.com/katecrawford/status/1118509988392112128 The AI industry lacks diversity, gender neutrality, and bias-free systems In recent years, we have come across several cases of “discriminating systems”. Facial recognition systems miscategorize black people and sometimes fails to work for trans drivers. When trained in online discourse, chatbots easily learn racist and misogynistic language. This type of behavior by machines is actually a reflection of society. “In most cases, such bias mirrors and replicates existing structures of inequality in the society,” says the report. The study also sheds light on gender bias in the current workforce. According to the report, only 18% of authors at some of the biggest AI conferences are women. On the other side of the spectrum are men who cover 80%. The tech giants, Facebook and Google, have a meager 15% and 10% women as their AI research staff. The situation for black workers in the AI industry looks even worse. While Facebook and Microsoft have 4% of their current workforce as black workers, Google stands at just 2.5%. Also, vast majority of AI studies assume gender is binary, and commonly assigns people as ‘male’ or ‘female’ based on physical appearance and stereotypical assumptions, erasing all other forms of gender identity. The report further reveals that, though there have been various “pipeline studies” to check the flow of diverse job candidates, they have failed to show substantial progress in bringing diversity in the AI industry. “The focus on the pipeline has not addressed deeper issues with workplace cultures, power asymmetries, harassment, exclusionary hiring practices, unfair compensation, and tokenization that are causing people to leave or avoid working in the AI sector altogether,” the report reads. What steps can industries take to address bias and discrimination in AI Systems The report lists 12 recommendations that AI researchers and companies should employ to improve workplace diversity and address bias and discrimination in AI systems. Publish compensation levels, including bonuses and equity, across all roles and job categories, broken down by race and gender. End pay and opportunity inequality, and set pay and benefit equity goals that include contract workers, temps, and vendors. Publish harassment and discrimination transparency reports, including the number of claims over time, the types of claims submitted, and actions taken. Change hiring practices to maximize diversity: include targeted recruitment beyond elite universities, ensure more equitable focus on under-represented groups, and create more pathways for contractors, temps, and vendors to become full-time employees. Commit to transparency around hiring practices, especially regarding how candidates are leveled, compensated, and promoted. Increase the number of people of color, women and other under-represented groups at senior leadership levels of AI companies across all departments. Ensure executive incentive structures are tied to increases in hiring and retention of underrepresented groups. For academic workplaces, ensure greater diversity in all spaces where AI research is conducted, including AI-related departments and conference committees. Remedying bias in AI systems is almost impossible when these systems are opaque. Transparency is essential, and begins with tracking and publicizing where AI systems are used, and for what purpose. Rigorous testing should be required across the lifecycle of AI systems in sensitive domains. Pre-release trials, independent auditing, and ongoing monitoring are necessary to test for bias, discrimination, and other harms. The field of research on bias and fairness needs to go beyond technical debiasing to include a wider social analysis of how AI is used in context. This necessitates including a wider range of disciplinary expertise. The methods for addressing bias and discrimination in AI need to expand to include assessments of whether certain systems should be designed at all, based on a thorough risk assessment. AI-related departments and conference committees. Credits: AI Now Institute Bringing diversity in the AI workforce In order to address the diversity issue in the AI industry, companies need to make changes in the current hiring practices. They should have a more equitable focus on under-represented groups. People of color, women, and other under-represented groups should get fair chance to get into senior leadership level of AI companies across all departments. Further opportunities should be created for contractors, temps, and vendors to become full-time employees. To bridge the gender pay gap in the AI industry, it is important that companies maintain transparency regarding the compensation levels, including bonuses and equity, regardless of gender or race. In the past few years, several cases of sexual misconducts involving some of the biggest companies like Google, Microsoft, have come into light because of movements like #MeToo, Google Walkout, and more. These movements gave the victims and other supporting employees  the courage to speak against employees at higher positions who were taking undue advantage of their power. There are cases were the sexual harassment complaints were not taken seriously by the HRs and victims were told to just “get over it”. This is why, companies should  publish harassment and discrimination transparency reports that include information like the number and types of claims made and the actions taken by the company. Academic workplaces should ensure diversity in all AI-related departments and conference committees. In the past, some of the biggest AI conferences like Neural Information Processing Systems conference has failed to provide a welcoming and safer environment for women. In a survey conducted last year, many respondents shared that they have experienced sexual harassment. Women reported persistent advances from men at the conference. The organizers of such conferences should ensure an inclusive and welcoming environment for everyone. Addressing bias and discrimination in AI systems To address bias and discrimination in AI systems, the report recommends to do rigorous testing across the lifecycle of these systems. These systems should have pre-release trials, independent auditing, and monitoring to check bias, discrimination, and other harms. Looking at the social implications of AI systems, just addressing the algorithmic bias is not enough. “The field of research on bias and fairness needs to go beyond technical debiasing to include a wider social analysis of how AI is used in context. This necessitates including a wider range of disciplinary expertise,” says the report. While assessing a AI system, researchers and developers should also check whether designing a certain system is required at all, considering the risks it poses. The study calls for re-evaluating the current AI systems used for classifying, detecting, and predicting the race and gender. The idea of identifying a race or gender just by appearance is flawed and can be easily abused. Especially, systems that use physical appearance to find interior states, for instance, those that claim to detect sexuality from headshots. These systems are urgently in need to be checked. To know more in detail, read the full report: Discriminating Systems. Microsoft’s #MeToo reckoning: female employees speak out against workplace harassment and discrimination Desmond U. Patton, Director of SAFElab shares why AI systems should be a product of interdisciplinary research and diverse teams Google’s Chief Diversity Officer, Danielle Brown resigns to join HR tech firm Gusto
Read more
  • 0
  • 0
  • 19011

article-image-connecting-data-mongodb-using-pymongo-php
Amey Varangaonkar
16 Mar 2018
6 min read
Save for later

Connecting your data to MongoDB using PyMongo and PHP

Amey Varangaonkar
16 Mar 2018
6 min read
[box type="note" align="" class="" width=""]The following book excerpt is taken from the title Mastering MongoDB 3.x written by Alex Giamas. This book covers the fundamental as well as advanced tasks related to working with MongoDB.[/box] There are two ways to connect your data to MongoDB. The first is using the driver for your programming language. The second is by using an ODM (Object Document Mapper) layer to map your model objects to MongoDB in a transparent way. In this post, we will cover both methods using two of the most popular languages for web application development: Python, using the official MongoDB low level driver, PyMongo, and PHP, using the official PHP driver for MongoDB Connect using Python Installing PyMongo can be done using pip or easy_install: python -m pip install pymongo python -m easy_install pymongo Then in our class we can connect to a database: >>> from pymongo import MongoClient >>> client = MongoClient() Connecting to a replica set needs a set of seed servers for the client to find out what the primary, secondary, or arbiter nodes in the set are: client = pymongo.MongoClient('mongodb://user:passwd@node1:p1,node2:p2/?replicaSet=rs name') Using the connection string URL we can pass a username/password and replicaSet name all in a single string. Some of the most interesting options for the connection string URL are presented later. Connecting to a shard requires the server host and IP for the mongo router, which is the mongos process. PyMODM ODM Similar to Ruby's Mongoid, PyMODM is an ODM for Python that follows closely on Django's built-in ORM. Installing it can be done via pip: pip install pymodm Then we need to edit settings.py and replace the database engine with a dummy database: DATABASES = { 'default': { 'ENGINE': 'django.db.backends.dummy' } } And add our connection string anywhere in settings.py: from pymodm import connect connect("mongodb://localhost:27017/myDatabase", alias="MyApplication") Here we have to use a connection string that has the following structure: mongodb://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:port N]]][/[database][?options]] Options have to be pairs of name=value with an & between each pair. Some interesting pairs are: Model classes need to inherit from MongoModel. A sample class will look like this: from pymodm import MongoModel, fields class User(MongoModel): email = fields.EmailField(primary_key=True) first_name = fields.CharField() last_name = fields.CharField() This has a User class with first_name, last_name, and email fields where email is the primary field. Inheritance with PyMODM models Handling one-one and one-many relationships in MongoDB can be done using references or embedding. This example shows both ways: references for the model user and embedding for the comment model: from pymodm import EmbeddedMongoModel, MongoModel, fields class Comment(EmbeddedMongoModel): author = fields.ReferenceField(User) content = fields.CharField() class Post(MongoModel): title = fields.CharField() author = fields.ReferenceField(User) revised_on = fields.DateTimeField() content = fields.CharField() comments = fields.EmbeddedDocumentListField(Comment) Connecting with PHP The MongoDB PHP driver was rewritten from scratch two years ago to support the PHP 5, PHP 7, and HHVM architectures. The current architecture is shown in the following diagram: Currently we have official drivers for all three architectures with full support for the underlying functionality. Installation is a two-step process. First we need to install the MongoDB extension. This extension is dependent on the version of PHP (or HHVM) that we have installed and can be done using brew in Mac. For example with PHP 7.0: brew install php70-mongodb Then, using composer (a widely used dependency manager for PHP): composer require mongodb/mongodb Connecting to the database can then be done by using the connection string URI or by passing an array of options. Using the connection string URI we have: $client = new MongoDBClient($uri = 'mongodb://127.0.0.1/', array $uriOptions = [], array $driverOptions = []) For example, to connect to a replica set using SSL authentication: $client = new MongoDBClient('mongodb://myUsername:myPassword@rs1.example.com,rs2.example.com/?ssl=true&replicaSet=myReplicaSet&authSource=admin'); Or we can use the $uriOptions parameter to pass in parameters without using the connection string URL, like this: $client = new MongoDBClient( 'mongodb://rs1.example.com,rs2.example.com/' [ 'username' => 'myUsername', 'password' => 'myPassword', 'ssl' => true, 'replicaSet' => 'myReplicaSet', 'authSource' => 'admin', ], ); The set of $uriOptions and the connection string URL options available are analogous to the ones used for Ruby and Python. Doctrine ODM Laravel is one of the most widely used MVC frameworks for PHP, similar in architecture to Django and Rails from the Python and Ruby worlds respectively. We will follow through configuring our models using a stack of Laravel, Doctrine, and MongoDB. This section assumes that Doctrine is installed and working with Laravel 5.x. Doctrine entities are POPO (Plain Old PHP Objects) that, unlike Eloquent, Laravel's default ORM doesn't need to inherit from the Model class. Doctrine uses the Data Mapper pattern, whereas Eloquent uses Active Record. Skipping the get() set() methods, a simple class would look like: use DoctrineORMMapping AS ORM; use DoctrineCommonCollectionsArrayCollection; /** * @ORMEntity * @ORMTable(name="scientist") */ class Scientist { /** * @ORMId * @ORMGeneratedValue * @ORMColumn(type="integer") */ protected $id; /** * @ORMColumn(type="string") */ protected $firstname; /** * @ORMColumn(type="string") */ protected $lastname; /** * @ORMOneToMany(targetEntity="Theory", mappedBy="scientist", cascade={"persist"}) * @var ArrayCollection|Theory[] */ protected $theories; /** * @param $firstname * @param $lastname */ public function __construct($firstname, $lastname) { $this->firstname = $firstname; $this->lastname = $lastname; $this->theories = new ArrayCollection; } … public function addTheory(Theory $theory) { if(!$this->theories->contains($theory)) { $theory->setScientist($this); $this->theories->add($theory); } } This POPO-based model used annotations to define field types that need to be persisted in MongoDB. For example, @ORMColumn(type="string") defines a field in MongoDB with the string type firstname and lastname as the attribute names, in the respective lines. There is a whole set of annotations available here http://doctrine-orm.readthedocs.io/en/latest/reference/annotations- reference.html . If we want to separate the POPO structure from annotations, we can also define them using YAML or XML instead of inlining them with annotations in our POPO model classes. Inheritance with Doctrine Modeling one-one and one-many relationships can be done via annotations, YAML, or XML. Using annotations, we can define multiple embedded subdocuments within our document: /** @Document */ class User { // … /** @EmbedMany(targetDocument="Phonenumber") */ private $phonenumbers = array(); // … } /** @EmbeddedDocument */ class Phonenumber { // … } Here a User document embeds many PhoneNumbers. @EmbedOne() will embed one subdocument to be used for modeling one-one relationships. Referencing is similar to embedding: /** @Document */ class User { // … /** * @ReferenceMany(targetDocument="Account") */ private $accounts = array(); // … } /** @Document */ class Account { // … } @ReferenceMany() and @ReferenceOne() are used to model one-many and one-one relationships via referencing into a separate collection. We saw that the process of connecting data to MongoDB using Python and PHP is quite similar. We can accordingly define relationships as being embedded or referenced, depending on our design decision. If you found this post useful, check out our book Mastering MongoDB 3.x for more tips and techniques on working with MongoDB efficiently.  
Read more
  • 0
  • 0
  • 18911

article-image-oracle-12c-sql-plsql-new-features
Oli Huggins
16 Aug 2015
30 min read
Save for later

Oracle 12c SQL and PL/SQL New Features

Oli Huggins
16 Aug 2015
30 min read
In this article by Saurabh K. Gupta, author of the book Oracle Advanced PL/SQL Developer Professional Guide, Second Edition you will learn new features in Oracle 12c SQL and PL/SQL. (For more resources related to this topic, see here.) Oracle 12c SQL and PL/SQL new features SQL is the most widely used data access language while PL/SQL is an exigent language that can integrate seamlessly with SQL commands. The biggest benefit of running PL/SQL is that the code processing happens natively within the Oracle Database. In the past, there have been debates and discussions on server side programming while the client invokes the PL/SQL routines to perform a task. The server side programming approach has many benefits. It reduces the network round trips between the client and the database. It reduces the code size and eases the code portability because PL/SQL can run on all platforms, wherever Oracle Database is supported. Oracle Database 12c introduces many language features and enhancements which are focused on SQL to PL/SQL integration, code migration, and ANSI compliance. This section discusses the SQL and PL/SQL new features in Oracle database 12c. IDENTITY columns Oracle Database 12c Release 1 introduces identity columns in the tables in compliance with the American National Standard Institute (ANSI) SQL standard. A table column, marked as IDENTITY, automatically generate an incremental numeric value at the time of record creation. Before the release of Oracle 12c, developers had to create an additional sequence object in the schema and assign its value to the column. The new feature simplifies code writing and benefits the migration of a non-Oracle database to Oracle. The following script declares an identity column in the table T_ID_COL: /*Create a table for demonstration purpose*/ CREATE TABLE t_id_col (id NUMBER GENERATED AS IDENTITY, name VARCHAR2(20)) / The identity column metadata can be queried from the dictionary views USER_TAB_COLSand USER_TAB_IDENTITY_COLS. Note that Oracle implicitly creates a sequence to generate the number values for the column. However, Oracle allows the configuration of the sequence attributes of an identity column. The custom sequence configuration is listed under IDENTITY_OPTIONS in USER_TAB_IDENTITY_COLS view: /*Query identity column information in USER_TAB_COLS*/ SELECT column_name, data_default, user_generated, identity_column FROM user_tab_cols WHERE table_name='T_ID_COL' / COLUMN_NAME DATA_DEFAULT USE IDE -------------- ------------------------------ --- --- ID "SCOTT"."ISEQ$$_93001".nextval YES YES NAME YES NO Let us check the attributes of the preceding sequence that Oracle has implicitly created. Note that the query uses REGEXP_SUBSTR to print the sequence configuration in multiple rows: /*Check the sequence configuration from USER_TAB_IDENTITY_COLS view*/ SELECT table_name,column_name, generation_type, REGEXP_SUBSTR(identity_options,'[^,]+', 1, LEVEL) identity_options FROM user_tab_identity_cols WHERE table_name = 'T_ID_COL' CONNECT BY REGEXP_SUBSTR(identity_options,'[^,]+',1,level) IS NOT NULL / TABLE_NAME COLUMN_NAME GENERATION IDENTITY_OPTIONS ---------- ---------------------- --------------------------------- T_ID_COL ID ALWAYS START WITH: 1 T_ID_COL ID ALWAYS INCREMENT BY: 1 T_ID_COL ID ALWAYS MAX_VALUE: 9999999999999999999999999999 T_ID_COL ID ALWAYS MIN_VALUE: 1 T_ID_COL ID ALWAYS CYCLE_FLAG: N T_ID_COL ID ALWAYS CACHE_SIZE: 20 T_ID_COL ID ALWAYS ORDER_FLAG: N 7 rows selected While inserting data in the table T_ID_COL, do not include the identity column as its value is automatically generated: /*Insert test data in the table*/ BEGIN INSERT INTO t_id_col (name) VALUES ('Allen'); INSERT INTO t_id_col (name) VALUES ('Matthew'); INSERT INTO t_id_col (name) VALUES ('Peter'); COMMIT; END; / Let us check the data in the table. Note the identity column values: /*Query the table*/ SELECT id, name FROM t_id_col / ID NAME ----- -------------------- 1 Allen 2 Matthew 3 Peter The sequence created under the covers for identity columns is tightly coupled with the column. If a user tries to insert a user-defined input for the identity column, the operation throws an exception ORA-32795: INSERT INTO t_id_col VALUES (7,'Steyn'); insert into t_id_col values (7,'Steyn') * ERROR at line 1: ORA-32795: cannot insert into a generated always identity column Default column value to a sequence in Oracle 12c Oracle Database 12c allows developers to default a column directly to a sequence‑generated value. The DEFAULT clause of a table column can be assigned to SEQUENCE.CURRVAL or SEQUENCE.NEXTVAL. The feature will be useful while migrating non-Oracle data definitions to Oracle. The DEFAULT ON NULL clause Starting with Oracle Database 12c, a column can be assigned a default non-null value whenever the user tries to insert NULL into the column. The default value will be specified in the DEFAULT clause of the column with a new ON NULL extension. Note that the DEFAULT ON NULL cannot be used with an object type column. The following script creates a table t_def_cols. A column ID has been defaulted to a sequence while the column DOJ will always have a non-null value: /*Create a sequence*/ CREATE SEQUENCE seq START WITH 100 INCREMENT BY 10 / /*Create a table with a column defaulted to the sequence value*/ CREATE TABLE t_def_cols ( id number default seq.nextval primary key, name varchar2(30), doj date default on null '01-Jan-2000' ) / The following PL/SQL block inserts the test data: /*Insert the test data in the table*/ BEGIN INSERT INTO t_def_cols (name, doj) values ('KATE', '27-FEB-2001'); INSERT INTO t_def_cols (name, doj) values ('NANCY', '17-JUN-1998'); INSERT INTO t_def_cols (name, doj) values ('LANCE', '03-JAN-2004'); INSERT INTO t_def_cols (name) values ('MARY'); COMMIT; END; / Query the table and check the values for the ID and DOJ columns. ID gets the value from the sequence SEQ while DOJ for MARY has been defaulted to 01-JAN-2000. /*Query the table to verify sequence and default on null values*/ SELECT * FROM t_def_cols / ID NAME DOJ ---------- -------- --------- 100 KATE 27-FEB-01 110 NANCY 17-JUN-98 120 LANCE 03-JAN-04 130 MARY 01-JAN-00 Support for 32K VARCHAR2 Oracle Database 12c supports the VARCHAR2, NVARCHAR2, and RAW datatypes up to 32,767 bytes in size. The previous maximum limit for the VARCHAR2 (and NVARCHAR2) and RAW datatypes was 4,000 bytes and 2,000 bytes respectively. The support for extended string datatypes will benefit the non-Oracle to Oracle migrations. The feature can be controlled using the initialization parameter MAX_STRING_SIZE. It accepts two values: STANDARD (default)—The maximum size prior to the release of Oracle Database 12c will apply. EXTENDED—The new size limit for string datatypes apply. Note that, after the parameter is set to EXTENDED, the setting cannot be rolled back. The steps to increase the maximum string size in a database are: Restart the database in UPGRADE mode. In the case of a pluggable database, the PDB must be opened in MIGRATEmode. Use the ALTER SYSTEM command to set MAX_STRING_SIZE to EXTENDED. As SYSDBA, execute the $ORACLE_HOME/rdbms/admin/utl32k.sql script. The script is used to increase the maximum size limit of VARCHAR2, NVARCHAR2, and RAW wherever required. Restart the database in NORMAL mode. As SYSDBA, execute utlrp.sql to recompile the schema objects with invalid status. The points to be considered while working with the 32k support for string types are: COMPATIBLE must be 12.0.0.0 After the parameter is set to EXTENDED, the parameter cannot be rolled back to STANDARD In RAC environments, all the instances of the database comply with the setting of MAX_STRING_SIZE Row limiting using FETCH FIRST For Top-N queries, Oracle Database 12c introduces a new clause, FETCH FIRST, to simplify the code and comply with ANSI SQL standard guidelines. The clause is used to limit the number of rows returned by a query. The new clause can be used in conjunction with ORDER BY to retrieve Top-N results. The row limiting clause can be used with the FOR UPDATE clause in a SQL query. In the case of a materialized view, the defining query should not contain the FETCH clause. Another new clause, OFFSET, can be used to skip the records from the top or middle, before limiting the number of rows. For consistent results, the offset value must be a positive number, less than the total number of rows returned by the query. For all other offset values, the value is counted as zero. Keywords with the FETCH FIRST clause are: FIRST | NEXT—Specify FIRST to begin row limiting from the top. Use NEXT with OFFSET to skip certain rows. ROWS | PERCENT—Specify the size of the result set as a fixed number of rows or percentage of total number of rows returned by the query. ONLY | WITH TIES—Use ONLY to fix the size of the result set, irrespective of duplicate sort keys. If you want records with matching sort keys, specify WITH TIES. The following query demonstrates the use of the FETCH FIRST and OFFSET clauses in Top-N queries: /*Create the test table*/ CREATE TABLE t_fetch_first (empno VARCHAR2(30), deptno NUMBER, sal NUMBER, hiredate DATE) / The following PL/SQL block inserts sample data for testing: /*Insert the test data in T_FETCH_FIRST table*/ BEGIN INSERT INTO t_fetch_first VALUES (101, 10, 1500, '01-FEB-2011'); INSERT INTO t_fetch_first VALUES (102, 20, 1100, '15-JUN-2001'); INSERT INTO t_fetch_first VALUES (103, 20, 1300, '20-JUN-2000'); INSERT INTO t_fetch_first VALUES (104, 30, 1550, '30-DEC-2001'); INSERT INTO t_fetch_first VALUES (105, 10, 1200, '11-JUL-2012'); INSERT INTO t_fetch_first VALUES (106, 30, 1400, '16-AUG-2004'); INSERT INTO t_fetch_first VALUES (107, 20, 1350, '05-JAN-2007'); INSERT INTO t_fetch_first VALUES (108, 20, 1000, '18-JAN-2009'); COMMIT; END; / The SELECT query pulls in the top-5 rows when sorted by their salary: /*Query to list top-5 employees by salary*/ SELECT * FROM t_fetch_first ORDER BY sal DESC FETCH FIRST 5 ROWS ONLY / EMPNO DEPTNO SAL HIREDATE -------- ------ ------- --------- 104 30 1550 30-DEC-01 101 10 1500 01-FEB-11 106 30 1400 16-AUG-04 107 20 1350 05-JAN-07 103 20 1300 20-JUN-00 The SELECT query lists the top 25% of employees (2) when sorted by their hiredate: /*Query to list top-25% employees by hiredate*/ SELECT * FROM t_fetch_first ORDER BY hiredate FETCH FIRST 25 PERCENT ROW ONLY / EMPNO DEPTNO SAL HIREDATE -------- ------ ----- --------- 103 20 1300 20-JUN-00 102 20 1100 15-JUN-01 The SELECT query skips the first five employees and displays the next two—the 6th and 7th employee data: /*Query to list 2 employees after skipping first 5 employees*/ SELECT * FROM t_fetch_first ORDER BY SAL DESC OFFSET 5 ROWS FETCH NEXT 2 ROWS ONLY / Invisible columns Oracle Database 12c supports invisible columns, which implies that the visibility of a column. A column marked invisible does not appear in the following operations: SELECT * FROM queries on the table SQL* Plus DESCRIBE command Local records of %ROWTYPE Oracle Call Interface (OCI) description A column can be made invisible by specifying the INVISIBLE clause against the column. Columns of all types (except user-defined types), including virtual columns, can be marked invisible, provided the tables are not temporary tables, external tables, or clustered ones. An invisible column can be explicitly selected by the SELECT statement. Similarly, the INSERTstatement will not insert values in an invisible column unless explicitly specified. Furthermore, a table can be partitioned based on an invisible column. A column retains its nullity feature even after it is made invisible. An invisible column can be made visible, but the ordering of the column in the table may change. In the following script, the column NICKNAME is set as invisible in the table t_inv_col: /*Create a table to demonstrate invisible columns*/ CREATE TABLE t_inv_col (id NUMBER, name VARCHAR2(30), nickname VARCHAR2 (10) INVISIBLE, dob DATE ) / The information about the invisible columns can be found in user_tab_cols. Note that the invisible column is marked as hidden: /*Query the USER_TAB_COLS for metadata information*/ SELECT column_id, column_name, hidden_column FROM user_tab_cols WHERE table_name = 'T_INV_COL' ORDER BY column_id / COLUMN_ID COLUMN_NAME HID ---------- ------------ --- 1 ID NO 2 NAME NO 3 DOB NO NICKNAME YES Hidden columns are different from invisible columns. Invisible columns can be made visible and vice versa, but hidden columns cannot be made visible. If we try to make the NICKNAME visible and NAME invisible, observe the change in column ordering: /*Script to change visibility of NICKNAME column*/ ALTER TABLE t_inv_col MODIFY nickname VISIBLE / /*Script to change visibility of NAME column*/ ALTER TABLE t_inv_col MODIFY name INVISIBLE / /*Query the USER_TAB_COLS for metadata information*/ SELECT column_id, column_name, hidden_column FROM user_tab_cols WHERE table_name = 'T_INV_COL' ORDER BY column_id / COLUMN_ID COLUMN_NAME HID ---------- ------------ --- 1 ID NO 2 DOB NO 3 NICKNAME NO NAME YES Temporal databases Temporal databases were released as a new feature in ANSI SQL:2011. The term temporal data can be understood as a piece of information that can be associated with a period within which the information is valid. Before the feature was included in Oracle Database 12c, data whose validity is linked with a time period had to be handled either by the application or using multiple predicates in the queries. Oracle 12c partially inherits the feature from the ANSI SQL:2011 standard to support the entities whose business validity can be bracketed with a time dimension. If you're relating a temporal database with the Oracle Database 11g Total Recall feature, you're wrong. The total recall feature records the transaction time of the data in the database to secure the transaction validity and not the functional validity. For example, an investment scheme is active between January to December. The date recorded in the database at the time of data loading is the transaction timestamp. [box type="info" align="" class="" width=""]Starting from Oracle 12c, the Total Recall feature has been rebranded as Flashback Data Archive and has been made available for all versions of Oracle Database.[/box] The valid time temporal can be enabled for a table by adding a time dimension using the PERIOD FOR clause on the date or timestamp columns of the table. The following script creates a table t_tmp_db with the valid time temporal: /*Create table with valid time temporal*/ CREATE TABLE t_tmp_db( id NUMBER, name VARCHAR2(30), policy_no VARCHAR2(50), policy_term number, pol_st_date date, pol_end_date date, PERIOD FOR pol_valid_time (pol_st_date, pol_end_date)) / Create some sample data in the table: /*Insert test data in the table*/ BEGIN INSERT INTO t_tmp_db VALUES (100, 'Packt', 'PACKT_POL1', 1, '01-JAN-2015', '31-DEC-2015'); INSERT INTO t_tmp_db VALUES (110, 'Packt', 'PACKT_POL2', 2, '01-JAN-2015', '30-JUN-2015'); INSERT INTO t_tmp_db VALUES (120, 'Packt', 'PACKT_POL3', 3, '01-JUL-2015', '31-DEC-2015'); COMMIT; END; / Let us set the current time period window using DBMS_FLASHBACK_ARCHIVE. Grant the EXECUTE privilege on the package to the scott user. /*Connect to sysdba to grant execute privilege to scott*/ conn / as sysdba GRANT EXECUTE ON dbms_flashback_archive to scott / Grant succeeded. /*Connect to scott*/ conn scott/tiger /*Set the valid time period as CURRENT*/ EXEC DBMS_FLASHBACK_ARCHIVE.ENABLE_AT_VALID_TIME('CURRENT'); PL/SQL procedure successfully completed. Setting the valid time period as CURRENT means that all the tables with a valid time temporal will only list the rows that are valid with respect to today's date. You can set the valid time to a particular date too. /*Query the table*/ SELECT * from t_tmp_db / ID POLICY_NO POL_ST_DATE POL_END_DATE --------- ---------- ----------------------- ------------------- 100 PACKT_POL1 01-JAN-15 31-DEC-15 110 PACKT_POL2 01-JAN-15 30-JUN-15 [box type="info" align="" class="" width=""]Due to dependency on the current date, the result may vary when the reader runs the preceding queries.[/box] The query lists only those policies that are active as of March 2015. Since the third policy starts in July 2015, it is currently not active. In-Database Archiving Oracle Database 12c introduces In-Database Archiving to archive the low priority data in a table. The inactive data remains in the database but is not visible to the application. You can mark old data for archival, which is not actively required in the application except for regulatory purposes. Although the archived data is not visible to the application, it is available for querying and manipulation. In addition, the archived data can be compressed to improve backup performance. A table can be enabled by specifying the ROW ARCHIVAL clause at the table level, which adds a hidden column ORA_ARCHIVE_STATE to the table structure. The column value must be updated to mark a row for archival. For example: /*Create a table with row archiving*/ CREATE TABLE t_row_arch( x number, y number, z number) ROW ARCHIVAL / When we query the table structure in the USER_TAB_COLS view, we find an additional hidden column, which Oracle implicitly adds to the table: /*Query the columns information from user_tab_cols view*/ SELECT column_id,column_name,data_type, hidden_column FROM user_tab_cols WHERE table_name='T_ROW_ARCH' / COLUMN_ID COLUMN_NAME DATA_TYPE HID ---------- ------------------ ---------- --- ORA_ARCHIVE_STATE VARCHAR2 YES 1 X NUMBER NO 2 Y NUMBER NO 3 Z NUMBER NO Let us create test data in the table: /Insert test data in the table*/ BEGIN INSERT INTO t_row_arch VALUES (10,20,30); INSERT INTO t_row_arch VALUES (11,22,33); INSERT INTO t_row_arch VALUES (21,32,43); INSERT INTO t_row_arch VALUES (51,82,13); commit; END; / For testing purpose, let us archive the rows in the table where X > 50 by updating the ora_archive_state column: /*Update ORA_ARCHIVE_STATE column in the table*/ UPDATE t_row_arch SET ora_archive_state = 1 WHERE x > 50 / COMMIT / By default, the session displays only the active records from an archival-enabled table: /*Query the table*/ SELECT * FROM t_row_arch / X Y Z ------ -------- ---------- 10 20 30 11 22 33 21 32 43 If you wish to display all the records, change the session setting: /*Change the session parameter to display the archived records*/ ALTER SESSION SET ROW ARCHIVAL VISIBILITY = ALL / Session altered. /*Query the table*/ SELECT * FROM t_row_arch / X Y Z ---------- ---------- ---------- 10 20 30 11 22 33 21 32 43 51 82 13 Defining a PL/SQL subprogram in the SELECT query and PRAGMA UDF Oracle Database 12c includes two new features to enhance the performance of functions when called from SELECT statements. With Oracle 12c, a PL/SQL subprogram can be created inline with the SELECT query in the WITH clause declaration. The function created in the WITH clause subquery is not stored in the database schema and is available for use only in the current query. Since a procedure created in the WITH clause cannot be called from the SELECT query, it can be called in the function created in the declaration section. The feature can be very handy in read-only databases where the developers were not able to create PL/SQL wrappers. Oracle Database 12c adds the new PRAGMA UDF to create a standalone function with the same objective. Earlier, the SELECT queries could invoke a PL/SQL function, provided the function didn't change the database purity state. The query performance used to degrade because of the context switch from SQL to the PL/SQL engine (and vice versa) and the different memory representations of data type representation in the processing engines. In the following example, the function fun_with_plsql calculates the annual compensation of an employee's monthly salary: /*Create a function in WITH clause declaration*/ WITH FUNCTION fun_with_plsql (p_sal NUMBER) RETURN NUMBER IS BEGIN RETURN (p_sal * 12); END; SELECT ename, deptno, fun_with_plsql (sal) "annual_sal" FROM emp / ENAME DEPTNO annual_sal ---------- --------- ---------- SMITH 20 9600 ALLEN 30 19200 WARD 30 15000 JONES 20 35700 MARTIN 30 15000 BLAKE 30 34200 CLARK 10 29400 SCOTT 20 36000 KING 10 60000 TURNER 30 18000 ADAMS 20 13200 JAMES 30 11400 FORD 20 36000 MILLER 10 15600 14 rows selected. [box type="info" align="" class="" width=""]If the query containing the WITH clause declaration is not a top-level statement, then the top level statement must use the WITH_PLSQL hint. The hint will be used if INSERT, UPDATE, or DELETE statements are trying to use a SELECT with a WITHclause definition. Failure to include the hint results in an exception ORA-32034: unsupported use of WITH clause.[/box] A function can be created with the PRAGMA UDF to inform the compiler that the function is always called in a SELECT statement. Note that the standalone function created in the following code carries the same name as the one in the last example. The local WITH clause declaration takes precedence over the standalone function in the schema. /*Create a function with PRAGMA UDF*/ CREATE OR REPLACE FUNCTION fun_with_plsql (p_sal NUMBER) RETURN NUMBER is PRAGMA UDF; BEGIN RETURN (p_sal *12); END; / Since the objective of the feature is performance, let us go ahead with a case study to compare the performance when using a standalone function, a PRAGMA UDF function, and a WITHclause declared function. Test setup The exercise uses a test table with 1 million rows, loaded with random data. /*Create a table for performance test study*/ CREATE TABLE t_fun_plsql (id number, str varchar2(30)) / /*Generate and load random data in the table*/ INSERT /*+APPEND*/ INTO t_fun_plsql SELECT ROWNUM, DBMS_RANDOM.STRING('X', 20) FROM dual CONNECT BY LEVEL <= 1000000 / COMMIT / Case 1: Create a PL/SQL standalone function as it used to be until Oracle Database 12c. The function counts the numbers in the str column of the table. /*Create a standalone function without Oracle 12c enhancements*/ CREATE OR REPLACE FUNCTION f_count_num (p_str VARCHAR2) RETURN PLS_INTEGER IS BEGIN RETURN (REGEXP_COUNT(p_str,'d')); END; / The PL/SQL block measures the elapsed and CPU time when working with a pre-Oracle 12c standalone function. These numbers will serve as the baseline for our case study. /*Set server output on to display messages*/ SET SERVEROUTPUT ON /*Anonymous block to measure performance of a standalone function*/ DECLARE l_el_time PLS_INTEGER; l_cpu_time PLS_INTEGER; CURSOR C1 IS SELECT f_count_num (str) FROM t_fun_plsql; TYPE t_tab_rec IS TABLE OF PLS_INTEGER; l_tab t_tab_rec; BEGIN l_el_time := DBMS_UTILITY.GET_TIME (); l_cpu_time := DBMS_UTILITY.GET_CPU_TIME (); OPEN c1; FETCH c1 BULK COLLECT INTO l_tab; CLOSE c1; DBMS_OUTPUT.PUT_LINE ('Case 1: Performance of a standalone function'); DBMS_OUTPUT.PUT_LINE ('Total elapsed time:'||to_char(DBMS_UTILITY.GET_TIME () - l_el_time)); DBMS_OUTPUT.PUT_LINE ('Total CPU time:'||to_char(DBMS_UTILITY.GET_CPU_TIME () - l_cpu_time)); END; / Performance of a standalone function: Total elapsed time:1559 Total CPU time:1366 PL/SQL procedure successfully completed. Case 2: Create a PL/SQL function using PRAGMA UDF to count the numbers in the str column. /*Create the function with PRAGMA UDF*/ CREATE OR REPLACE FUNCTION f_count_num_pragma (p_str VARCHAR2) RETURN PLS_INTEGER IS PRAGMA UDF; BEGIN RETURN (REGEXP_COUNT(p_str,'d')); END; / Let us now check the performance of the PRAGMA UDF function using the following PL/SQL block. /*Set server output on to display messages*/ SET SERVEROUTPUT ON /*Anonymous block to measure performance of a PRAGMA UDF function*/ DECLARE l_el_time PLS_INTEGER; l_cpu_time PLS_INTEGER; CURSOR C1 IS SELECT f_count_num_pragma (str) FROM t_fun_plsql; TYPE t_tab_rec IS TABLE OF PLS_INTEGER; l_tab t_tab_rec; BEGIN l_el_time := DBMS_UTILITY.GET_TIME (); l_cpu_time := DBMS_UTILITY.GET_CPU_TIME (); OPEN c1; FETCH c1 BULK COLLECT INTO l_tab; CLOSE c1; DBMS_OUTPUT.PUT_LINE ('Case 2: Performance of a PRAGMA UDF function'); DBMS_OUTPUT.PUT_LINE ('Total elapsed time:'||to_char(DBMS_UTILITY.GET_TIME () - l_el_time)); DBMS_OUTPUT.PUT_LINE ('Total CPU time:'||to_char(DBMS_UTILITY.GET_CPU_TIME () - l_cpu_time)); END; / Performance of a PRAGMA UDF function: Total elapsed time:664 Total CPU time:582 PL/SQL procedure successfully completed. Case 3: The following PL/SQL block dynamically executes the function in the WITH clause subquery. Note that, unlike other SELECT statements, a SELECT query with a WITH clause declaration cannot be executed statically in the body of a PL/SQL block. /*Set server output on to display messages*/ SET SERVEROUTPUT ON /*Anonymous block to measure performance of inline function*/ DECLARE l_el_time PLS_INTEGER; l_cpu_time PLS_INTEGER; l_sql VARCHAR2(32767); c1 sys_refcursor; TYPE t_tab_rec IS TABLE OF PLS_INTEGER; l_tab t_tab_rec; BEGIN l_el_time := DBMS_UTILITY.get_time; l_cpu_time := DBMS_UTILITY.get_cpu_time; l_sql := 'WITH FUNCTION f_count_num_with (p_str VARCHAR2) RETURN NUMBER IS BEGIN RETURN (REGEXP_COUNT(p_str,'''||''||'d'||''')); END; SELECT f_count_num_with(str) FROM t_fun_plsql'; OPEN c1 FOR l_sql; FETCH c1 bulk collect INTO l_tab; CLOSE c1; DBMS_OUTPUT.PUT_LINE ('Case 3: Performance of an inline function'); DBMS_OUTPUT.PUT_LINE ('Total elapsed time:'||to_char(DBMS_UTILITY.GET_TIME () - l_el_time)); DBMS_OUTPUT.PUT_LINE ('Total CPU time:'||to_char(DBMS_UTILITY.GET_CPU_TIME () - l_cpu_time)); END; / Performance of an inline function: Total elapsed time:830 Total CPU time:718 PL/SQL procedure successfully completed. Comparative analysis Comparing the results from the preceding three cases, it's clear that the Oracle 12c flavor of PL/SQL functions out-performs the pre-12c standalone function by a high margin. From the following matrix, it is apparent that the usage of the PRAGMA UDF or WITH clause declaration enhances the code performance by (roughly) a factor of 2. Case Description Elapsed Time CPU time Performance gain factor by CPU time Standalone PL/SQL function in pre-Oracle 12c database 1559 1336 1x Standalone PL/SQL PRAGMA UDF function in Oracle 12c 664 582 2.3x Function created in WITH clause declaration in Oracle 12c 830 718 1.9x   [box type="info" align="" class="" width=""]Note that the numbers may slightly differ in the reader's testing environment but you should be able to draw the same conclusion by comparing them.[/box] The PL/SQL program unit whitelisting Prior to Oracle 12c, a standalone or packaged PL/SQL unit could be invoked by all other programs in the session's schema. Oracle Database 12c allows users to prevent unauthorized access to PL/SQL program units. You can now specify the list of whitelist program units that can invoke a particular program. The PL/SQL program header or the package specification can specify the list of program units in the ACCESSIBLE BY clause in the program header. All other program units, including cross-schema references (even SYS owned objects), trying to access a protected subprogram will receive an exception, PLS-00904: insufficient privileges to access object [object name]. The feature can be very useful in an extremely sensitive development environment. Suppose, a package PKG_FIN_PROC contains the sensitive implementation routines for financial institutions, the packaged subprograms are called by another PL/SQL package PKG_FIN_INTERNALS. The API layer exposes a fixed list of programs through a public API called PKG_CLIENT_ACCESS. In order to restrict access to the packaged routines in PKG_FIN_PROC, the users can build a safety net so as to allow access to only authorized programs. The following PL/SQL package PKG_FIN_PROC contains two subprograms—P_FIN_QTR and P_FIN_ANN. The ACCESSIBLE BY clause includes PKG_FIN_INTERNALS which means that all other program units, including anonymous PL/SQL blocks, are blocked from invoking PKG_FIN_PROC constructs. /*Package with the accessible by clause*/ CREATE OR REPLACE PACKAGE pkg_fin_proc ACCESSIBLE BY (PACKAGE pkg_fin_internals) IS PROCEDURE p_fin_qtr; PROCEDURE p_fin_ann; END; / [box type="info" align="" class="" width=""]The ACCESSIBLE BY clause can be specified for schema-level programs only.[/box] Let's see what happens when we invoke the packaged subprogram from an anonymous PL/SQL block. /*Invoke the packaged subprogram from the PL/SQL block*/ BEGIN pkg_fin_proc.p_fin_qtr; END; / pkg_fin_proc.p_fin_qtr; * ERROR at line 2: ORA-06550: line 2, column 4: PLS-00904: insufficient privilege to access object PKG_FIN_PROC ORA-06550: line 2, column 4: PL/SQL: Statement ignored Well, the compiler throws an exception as invoking the whitelisted package from an anonymous block is not allowed. The ACCESSIBLE BY clause can be included in the header information of PL/SQL procedures and functions, packages, and object types. Granting roles to PL/SQL program units Before Oracle Database 12c, a PL/SQL unit created with the definer's rights (default AUTHID) always executed with the definer's rights, whether or not the invoker has the required privileges. It may lead to an unfair situation where the invoking user may perform unwanted operations without needing the correct set of privileges. Similarly for an invoker's right unit, if the invoking user possesses a higher set of privileges than the definer, he might end up performing unauthorized operations. Oracle Database 12c secures the definer's rights by allowing the defining user to grant complementary roles to individual PL/SQL subprograms and packages. From the security standpoint, the granting of roles to schema level subprograms provides granular control as the privileges of the invoker are validated at the time of execution. In the following example, we will create two users: U1 and U2. The user U1 creates a PL/SQL procedure P_INC_PRICE that adds a surcharge to the price of a product by a certain amount. U1 grants the execute privilege to user U2. Test setup Let's create two users and give them the required privileges. /*Create a user with a password*/ CREATE USER u1 IDENTIFIED BY u1 / User created. /*Grant connect privileges to the user*/ GRANT CONNECT, RESOURCE TO u1 / Grant succeeded. /*Create a user with a password*/ CREATE USER u2 IDENTIFIED BY u2 / User created. /*Grant connect privileges to the user*/ GRANT CONNECT, RESOURCE TO u2 / Grant succeeded. The user U1 contains the PRODUCTS table. Let's create and populate the table. /*Connect to U1*/ CONN u1/u1 /*Create the table PRODUCTS*/ CREATE TABLE products ( prod_id INTEGER, prod_name VARCHAR2(30), prod_cat VARCHAR2(30), price INTEGER ) / /*Insert the test data in the table*/ BEGIN DELETE FROM products; INSERT INTO products VALUES (101, 'Milk', 'Dairy', 20); INSERT INTO products VALUES (102, 'Cheese', 'Dairy', 50); INSERT INTO products VALUES (103, 'Butter', 'Dairy', 75); INSERT INTO products VALUES (104, 'Cream', 'Dairy', 80); INSERT INTO products VALUES (105, 'Curd', 'Dairy', 25); COMMIT; END; / The procedure p_inc_price is designed to increase the price of a product by a given amount. Note that the procedure is created with the definer's rights. /*Create the procedure with the definer's rights*/ CREATE OR REPLACE PROCEDURE p_inc_price (p_prod_id NUMBER, p_amt NUMBER) IS BEGIN UPDATE products SET price = price + p_amt WHERE prod_id = p_prod_id; END; / The user U1 grants execute privilege on p_inc_price to U2. /*Grant execute on the procedure to the user U2*/ GRANT EXECUTE ON p_inc_price TO U2 / The user U2 logs in and executes the procedure P_INC_PRICE to increase the price of Milk by 5 units. /*Connect to U2*/ CONN u2/u2 /*Invoke the procedure P_INC_PRICE in a PL/SQL block*/ BEGIN U1.P_INC_PRICE (101,5); COMMIT; END; / PL/SQL procedure successfully completed. The last code listing exposes a gray area. The user U2, though not authorized to view PRODUCTS data, manipulates its data with the definer's rights. We need a solution to the problem. The first step is to change the procedure from definer's rights to invoker's rights. /*Connect to U1*/ CONN u1/u1 /*Modify the privilege authentication for the procedure to invoker's rights*/ CREATE OR REPLACE PROCEDURE p_inc_price (p_prod_id NUMBER, p_amt NUMBER) AUTHID CURRENT_USER IS BEGIN UPDATE products SET price = price + p_amt WHERE prod_id = p_prod_id; END; / Now, if we execute the procedure from U2, it throws an exception because it couldn't find the PRODUCTS table in its schema. /*Connect to U2*/ CONN u2/u2 /*Invoke the procedure P_INC_PRICE in a PL/SQL block*/ BEGIN U1.P_INC_PRICE (101,5); COMMIT; END; / BEGIN * ERROR at line 1: ORA-00942: table or view does not exist ORA-06512: at "U1.P_INC_PRICE", line 5 ORA-06512: at line 2 In a similar scenario in the past, the database administrators could have easily granted select or updated privileges to U2, which is not an optimal solution from the security standpoint. Oracle 12c allows the users to create program units with invoker's rights but grant the required roles to the program units and not the users. So, an invoker right unit executes with invoker's privileges, plus the PL/SQL program role. Let's check out the steps to create a role and assign it to the procedure. SYSDBA creates the role and assigns it to the user U1. Using the ADMIN or DELEGATE option with the grant enables the user to grant the role to other entities. /*Connect to SYSDBA*/ CONN / as sysdba /*Create a role*/ CREATE ROLE prod_role / /*Grant role to user U1 with delegate option*/ GRANT prod_role TO U1 WITH DELEGATE OPTION / Now, user U1 assigns the required set of privileges to the role. The role is then assigned to the required subprogram. Note that only roles, and not individual privileges, can be assigned to the schema level subprograms. /*Connect to U1*/ CONN u1/u1 /*Grant SELECT and UPDATE privileges on PRODUCTS to the role*/ GRANT SELECT, UPDATE ON PRODUCTS TO prod_role / /*Grant role to the procedure*/ GRANT prod_role TO PROCEDURE p_inc_price / User U2 tries to execute the procedure again. The procedure is successfully executed which means the value of "Milk" has been increased by 5 units. /*Connect to U2*/ CONN u2/u2 /*Invoke the procedure P_INC_PRICE in a PL/SQL block*/ BEGIN U1.P_INC_PRICE (101,5); COMMIT; END; / PL/SQL procedure successfully completed. User U1 verifies the result with a SELECT query. /*Connect to U1*/ CONN u1/u1 /*Query the table to verify the change*/ SELECT * FROM products / PROD_ID PROD_NAME PROD_CAT PRICE ---------- ---------- ---------- ---------- 101 Milk Dairy 25 102 Cheese Dairy 50 103 Butter Dairy 75 104 Cream Dairy 80 105 Curd Dairy 25 Miscellaneous PL/SQL enhancements Besides the preceding key features, there are a lot of new features in Oracle 12c. The list of features is as follows: An invoker rights function can be result-cached—until Oracle 11g, only the definers' programs were allowed to cache their results. Oracle 12c adds the invoking user's identity to the result cache to make it independent of the definer. The compilation parameter PLSQL_DEBUG has been deprecated. Two conditional compilation inquiry directives $$PLSQL_UNIT_OWNER and $$PLSQL_UNIT_TYPE have been implemented. Summary This chapter covers the top rated and new features of Oracle 12c SQL and PL/SQL as well as some miscellaneous PL/SQL enhancements. This chapter covers the top rated and new features of Oracle 12c SQL and PL/SQL as well as some miscellaneous PL/SQL enhancements. Further resources on this subject: What is Oracle Public Cloud? [article] Oracle APEX 4.2 reporting [article] Oracle E-Business Suite with Desktop Integration [article]
Read more
  • 0
  • 0
  • 18800

article-image-how-facebook-is-advancing-artificial-intelligence-video
Richard Gall
14 Sep 2018
4 min read
Save for later

How Facebook is advancing artificial intelligence [Video]

Richard Gall
14 Sep 2018
4 min read
Facebook is playing a huge role in artificial intelligence research. It’s not only a core part of the Facebook platform, it’s central to how the organization works. The company launched its AI research lab - FAIR - back in 2013. Today, led by some of the best minds in the field, it's not only helping Facebook to leverage artificial intelligence, it's also making it more accessible to researchers and engineers around the world. Let’s take a look at some of the tools built by Facebook that are doing just that. PyTorch: Facebook's leading artificial intelligence tool PyTorch is a hugely popular deep learning framework (rivalling Google's TensorFlow) that, by combining flexiblity and dynamism with stability, bridges the gap between research and production. Using a tape-based auto-differentiation system, PyTorch can be modified and changed by engineers without losing speed. That’s good news for everyone. Although PyTorch steals the headlines, there are a range of supporting tools that are making artificial intelligence and deep learning more accessible and achievable for other engineers. Read next: Is PyTorch better than Google’s TensorFlow? Find PyTorch eBooks and videos on the Packt website.  Facebook's computer vision tools Another field that Facebook has revolutionized is computer vision and image processing. Detectron, Facebook’s state-of-the-art object detection software system, has powered many research projects including Mask R-CNN - a simple and flexible way of developing Convolution Neural Networks for image processing. Mask R-CNN has also helped to power DensePose, a tool that map all human pixels of an RGB image to a 3D surface-based representation of the human body. Facebook has also heavily contributed to research in detecting and recognizing Human-Object interactions as well. Their contribution to the field of generative modeling is equally very important, with tasks such as minimizing variations in the quality of images, JPEG compression as well as image quantization now becoming easier and more accessible. Facebook, language and artificial intelligence We share updates, we send messages - language is a cornerstone of Facebook. This is why it's such an important area for Facebook’s AI researchers. There are a whole host of libraries and tools that are built for language problems. FastText is a library for text representation and classification, while ParlAI is a platform pushing the boundaries of dialog research. The platform is focused on tackling 5 key AI tasks: question answering, sentence completion, goal-oriented dialog, chit-chat dialog, and visual dialog. The ultimate aim for ParlAI is to develop a general dialog AI. There are also a few more language tools in Facebook’s AI toolkit - Fairseq and Translate are helping with translation and text generation, while Wav2Letter is an Automatic Speech Recognition system that can be used for transcription tasks. Rational artificial intelligence for gaming and smart decision making Although Facebook isn’t known for gaming, its interest in developing artificial intelligence that can reason could have an impact on the way games are built in the future. ELF is a tool developed by Facebook that allows game developers to train and test AI algorithms in a gaming environment. ELF was used by Facebook researchers to recreate DeepMind’s AlphaGo Zero, the AI bot that has defeated Go champions. Running on a single GPU, the ELF OpenGo bot defeated four professional Go players 14-0. Impressive, right? There are other tools built by Facebook that aim to build AI into game reasoning. Torchcraft is probably the most notable example - its a library that’s making AI research on Starcraft - a strategy game - accessible to game developers and AI specialists alike. Facebook is defining the future of artificial intelligence As you can see, Facebook is doing a lot to push the boundaries of artificial intelligence. However, it’s not just keeping these tools for itself - all these tools are open source, which means they can be used by anyone.
Read more
  • 0
  • 0
  • 18689

article-image-nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh
Savia Lobo
15 Dec 2017
8 min read
Save for later

NIPS 2017 Special: A deep dive into Deep Bayesian and Bayesian Deep Learning with Yee Whye Teh

Savia Lobo
15 Dec 2017
8 min read
Yee Whye Teh is a professor at the department of Statistics of the University of Oxford and also a research scientist at DeepMind. He works on statistical machine learning, focussing on Bayesian nonparametrics, probabilistic learning, and deep learning. The motive of this article aims to bring our readers to Yee’s keynote speech at the NIPS 2017. Yee’s keynote ponders deeply on the interface between two perspectives on machine learning: Bayesian learning and Deep learning by exploring questions like: How can probabilistic thinking help us understand deep learning methods or lead us to interesting new methods? Conversely, how can deep learning technologies help us develop advanced probabilistic methods? For a more comprehensive and in-depth understanding of this novel approach, be sure to watch the complete keynote address by Yee Whye Teh on  NIPS facebook page. All images in this article come from Yee’s presentation slides and do not belong to us. The history of machine learning has shown a growth in both model complexity and in model flexibility. The theory led models have started to lose their shine. This is because machine learning is at the forefront of a revolution that could be called as data led models or the data revolution. As opposed to theory led models, data-led models try not to impose too many assumptions on the processes that have to be modeled and are rather superflexible non-parametric models that can capture the complexities but they require large amount of data to operate.   On the model flexibility side, we have various approaches that have been explored over the years. We have kernel methods, Gaussian processes, Bayesian nonparametrics and now we have deep learning as well. The community has also developed evermore complex frameworks both graphical and programmatic to compose large complex models from simpler building blocks. In the 90’s we had graphical models, later we had probabilistic programming systems, followed by deep learning systems like TensorFlow, Theano, and Torch. A recent addition is probabilistic Torch, which brings together ideas from both the probabilistic Bayesian learning and deep learning. On one hand we have Bayesian learning, which deals with learning as inference in some probabilistic models. On the other hand we have deep learning models, which view learning as optimization functions parametrized by neural networks. In recent years there has been an explosion of exciting research at this interface of these two popular approaches resulting in increasingly complex and exciting models. What is Bayesian theory of learning Bayesian learning describes an ideal learner as one who interacts with the world in order to know its state, which is given by θ. He/she makes some observations about the world by deducing a model in Bayesian context. This model is a joint distribution of both the unknown state of the world θ and the observation about the world x. The model consists of prior distribution and marginal distribution, combining which gives a reverse conditional distribution also known as posterior, which describes the totality of the agent's knowledge about the world after he/she sees x. This posterior can also be used for predicting future observations and act accordingly. Issues associated with Bayesian learning Rigidity Learning can be wrong if model is wrong Not all prior knowledge can be encoded as joint distribution Simple analytic forms are limiting for conditional distributions 2. Scalability: Intractable to compute this posterior and approximations have to be made, which then introduces trade offs between efficiency and accuracy. As a result, it is often assumed that Bayesian techniques are not scalable. To address these issues, the speaker highlights some of his recent projects which showcase scenarios where deep learning ideas are applied to Bayesian models (Deep Bayesian learning) or in the reverse applying Bayesian ideas to Neural Networks ( i.e. Bayesian Deep learning) Deep Bayesian learning: Deep learning assists Bayesian learning Deep learning can improve Bayesian learning in the following ways: Improve the modeling flexibility by using neural networks in the construction of Bayesian models Improve the inference and scalability of these methods by parameterizing the posterior way of using neural networks Empathizing inference over multiple runs These can be seen in the following projects showcased by Yee: Concrete VAEs(Variational Autoencoders) FIVO: Filtered Variational Objectives Concrete VAEs What are VAEs? All the qualities mentioned above, i.e. improving modeling flexibility, improving inference and scalability, and empathizing inference over multiple runs by using neural networks can be seen in a class of deep generative models known as VAE (Variational Autoencoders). Fig: Variational Autoencoders VAEs include latent variables that describe the contents of a scene i.e objects, pose. The relationship between these latent variables and the pixels have to be highly complex and nonlinear. So, in short, VAEs are used to parameterize generative and variable posterior distribution that allows for greater scope flexible modeling. The key that makes VAEs work is the reparameterization trick Fig: Adding reparameterization to VAEs The reparameterization trick is crucial to the continuous latent variables in the VAEs. But many models naturally include discrete latent variables. Yee suggests application of the reparameterization on the discrete latent variables as a work around. This brings us to the concept of Concrete VAEs.. CONtinuous relaxation of disCRETE distributions.Also, the density can be further calculated: This concrete distribution is the reparameterization trick for discrete variables which helps in calculating the KL divergence that is needed for variational inference. FIVO: Filtered Variational Objectives FIVO extends VAEs towards models for sequential and time series data. It is built upon another extension of VAEs known as Importance Weighted Autoencoder, a generative model with a similar as that of the VAE, but which uses a strictly tighter log-likelihood lower bound. Variational lower bound: Rederivation from importance sampling: Better to use multiple samples: Using Importance Weighted Autoencoders we can use multiple sampling, with which we can get a tighter lower bound and optimizing this lower bound should lead to better learning. Let’s have a look at the FIVO objectives: We can use any unbiased estimator p(X) of marginal probabilityTightness of bound related to variance of estimatorFor sequential models, we can use particle filters which produce unbiased estimator of marginal probability. They can also have much lower variance than importance samplers. Bayesian Deep learning: Bayesian approach for deep learning gives us counterintuitive and surprising ways to make deep learning scalable. In order to explore the potential of Bayesian learning with deep neural networks, Yee introduced a project named, The posterior server. The Posterior server The posterior server is a distributed server for deep learning. It makes use of the Bayesian approach in order to make neural networks highly scalable. This project focuses on Distributed learning, where both the data and the computations can be spread across the network. The figure above shows that there are a bunch of workers and each communicates with the parameter server, which effectively maintains the authoritative copy of the parameters of the network. At each iteration, each worker obtains the latest copy of the parameter from the server, computes the gradient update based on its data and sends it back to the server which then updates it to the authoritative copy. So, communications on the network tend to be slower than the computations that can be done on the network. Hence, one might consider multiple gradient steps on each iteration before it sends the accumulated update back to the parameter server. The problem is that the parameter and the worker quickly get out of sync with the authoritative copy on the parameter server. As a result, this leads to stale updates which allow noise into the system and we often need frequent synchronizations across the network for the algorithm to learn in a stable fashion. The main idea here in Bayesian context is that we don't just want a single parameter, we want a whole distribution over them. This will then relax the need for frequent synchronizations across the network and hopefully lead to algorithms that are robust to last frequent communication. Each worker is simply going to construct its own tractable approximation to his own likelihood function and send this information to the posterior server which then combines these approximations together to form the full posterior or an approximation of it. Further, the approximations that are constructed would be based on the statistics of some sampling algorithms that happens locally on that worker. The actual algorithm includes a combination of the variational algorithms, Stochastic Gradient EP and the Markov chain Monte Carlo on the workers themselves. So the variational part in the algorithm handles the communication part in the network whereas the MCMC part handles the sampling part that is posterior to construct the statistics that the variational part needs. For scalability, a stochastic gradient Langevin algorithm which is a simple generalization of the SGT, which includes additional injected noise, to sample from posterior noise. To experiment with this server, it was trained densely connected neural networks with 500 reLU units on MNIST dataset. You can have a detailed understanding of these examples in the keynote video. This interface between Bayesian learning and deep learning is a very exciting frontier. Researchers have brought management of uncertainties within deep learning. Also, flexibility and scalability in Bayesian modeling. Yee concludes with two questions for the audience to think about. Does being Bayesian in the space of functions makes more sense than being Bayesian in the sense of parameters? How to deal with uncertainties under model misspecification?    
Read more
  • 0
  • 0
  • 18628
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-thomas-munro-from-enterprisedb-on-parallelism-in-postgresql
Bhagyashree R
17 Dec 2019
7 min read
Save for later

Thomas Munro from EnterpriseDB on parallelism in PostgreSQL

Bhagyashree R
17 Dec 2019
7 min read
PostgreSQL is a powerful, open-source object-relational database system. Since its introduction, it has been well-received by developers for its reliability, feature robustness, data-integrity, better licensing, and much more. However, one of its limitations has been the lack of support for parallelism, which changed in the subsequent releases. At PostgresOpen 2018, Thomas Munro, a programmer at EnterpriseDB and PostgreSQL contributor talked about how parallelism has evolved in PostgreSQL over the years. In this article, we will see some of the key parallelism-specific features that Munro discussed in his talk. [box type="shadow" align="" class="" width=""] Further Learning This article gives you a glimpse of query parallelism in PostgreSQL. If you want to explore it further along with other concepts like data replication, and database performance, check out our book Mastering PostgreSQL 11 - Second Edition by Hans-Jürgen Schönig. This second edition of Mastering PostgreSQL 11 helps you build dynamic database solutions for enterprise applications using PostgreSQL, which enables database analysts to design both the physical and technical aspects of the system architecture with ease. [/box] Evolution of parallelism in PostgreSQL PostgreSQL uses a process-based architecture instead of a thread-based one. On startup, it launches a “postmaster” process and after that creates a new process for every database session. Previously, it did not support parallelism in a single connection and each query used to run serially. The absence of “intra-query parallelism” in PostgreSQL was a huge limitation for answering the queries faster. Parallelism here means allowing a single process to have multiple threads to query the system and utilize the increasing CPU core counts. The foundation for parallelism in PostgreSQL was laid out in the 9.4 and 9.5 releases. These came with infrastructure updates like dynamic shared memory segments, shared memory queues, and background workers. PostgreSQL 9.6 was actually the first release that came with user-visible features for parallel query execution. It supported executor nodes: gather, parallel sequential scan, partial aggregate, and finalize aggregate. However, this was not enabled by default. Then in 2017, PostgreSQL 10 was released, which had parallelism enabled by default. It had a few more executor nodes including gather merge, parallel index scan, and parallel bitmap heap scan. Last year, PostgreSQL 11 came out with a couple of more executor nodes including parallel append and parallel hash join. It also introduced partition-wise joins and parallel CREATE INDEX. Key parallelism-specific features in PostgreSQL Parallel sequential scans Parallel sequential scans was the very first feature for parallel query execution. Introduced in PostgreSQL 9.6, this scan distributes blocks of a table among different processes. This assignment is done one after the other to ensure that the access to the table remains sequential. The processes that run in parallel and scan the tuples of a table are called parallel workers. There is one special worker called leader, which is responsible for coordinating and collecting the output of the scan from each of the worker. The leader may or may not participate in scanning the database depending on its load in dividing and combining processes. Parallel index scan Parallel index scan is based on the same concept as parallel sequential scan, but it involves more communication and waiting. Currently, the parallel index scans are supported only for B-Tree indexes. In a parallel index scan, index pages are scanned in parallel. Each process will scan a single index block and return all tuples referenced by that block. Meanwhile, other processes will also scan different index blocks and return the tuples. The results of a parallel B-Tree scan are then returned in sorted order. Parallel bitmap heap scan Again, this also has the same concept as the parallel sequential scan. Explaining the difference, Munro said, “You’ve got a big bitmap and you are skipping ahead to the pages that contain interesting tuples.” In parallel bitmap heap scan, one process is chosen as the leader, who performs a scan of one or more indexes and creates bitmap indicating which table blocks need to be visited. These table blocks are then divided among the worker processes as in a parallel sequential scan. Here the heap scan is done in parallel, but the underlying index scan is not. Parallel joins PostgreSQL supports all three join strategies in parallel query plans: nested loop join, hash join, or merge join. However, there is no parallelism supported in the inner loop. The entire loop is scanned as a whole, and the parallelism comes into play when each worker executes the inner loop as a whole. The results of each join are sent to gather node to produce the final results. Nested loop join: The nested loop is the most basic way for PostgreSQL to perform a join. Though it is considered to be slow, it can be efficient if the inner side is an index scan. This is because the outer tuples and hence the loops that loop up values in the index will be divided among worker processes. Merge join: The inner side is executed in full. It can be inefficient when sort needs to be performed because the work and resulting data are duplicated in every cooperating process. Hash join: In this join as well, the inner side is executed in full by every worker process to build identical copies of the hash table. It is inefficient in cases when the hash table is large or the plan is expensive. However, in parallel hash join, the inner side is a parallel hash that divides the work of building a shared hash table over the cooperating processes. This is the only join in which we can have parallelism on both sides. Partition-wise join Partition-wise join is a new feature introduced in PostgreSQL 11. In partition-wise join, the planner knows that both sides of the join have matching partition schemes. Here a join between two similarly partitioned tables are broken down into joins between their matching partitions if there is an equi-join condition between the partition key of joining tables. Munro explains, “It becomes parallelizable with the advent of parallel append, which can then run different branches of that query plan in different processes. But if you do that then granularity of parallelism is partitioned, which is in some ways good and in some ways bad compared to block-based granularity.” He further adds, “It means when the last worker runs out of work to do everyone else has to wait for that before the query is finished. Whereas, if you use block-based parallelism you don’t have the problem but there are some advantages as a result of that as well.” Parallel aggregation in PostgreSQL Calculating aggregates can be very expensive and when evaluated in a single process it could take a considerable amount of time. This problem was solved in PostgreSQL 9.6 with the introduction of parallel aggregation. This is essentially a divide and conquer strategy where multiple workers calculate a part of aggregate before the final value based on these calculations is calculated by the leader. This article walked you through some of the parallelism-specific features in PostgreSQL presented by Munro in his PostgresOpen 2018 talk.  If you want to get to grips with other advanced PostgreSQL features and SQL functions, do have a look at our Mastering PostgreSQL 11 - Second Edition book by Hans-Jürgen Schönig. By the end of this book, you will be able to use your database to its utmost capacity by implementing advanced administrative tasks with ease. PostgreSQL committer Stephen Frost shares his vision for PostgreSQL version 12 and beyond Introducing PostgREST, a REST API for any PostgreSQL database written in Haskell Percona announces Percona Distribution for PostgreSQL to support open source databases 
Read more
  • 0
  • 0
  • 18587

article-image-why-are-experts-worried-about-microsofts-billion-dollar-bet-in-openais-agi-pipe-dream
Sugandha Lahoti
23 Jul 2019
6 min read
Save for later

Why are experts worried about Microsoft's billion dollar bet in OpenAI's AGI pipe dream?

Sugandha Lahoti
23 Jul 2019
6 min read
Microsoft has invested $1 billion in OpenAI with the goal of building next-generation supercomputers and a platform within Microsoft Azure which will scale to AGI (Artificial General Intelligence). This is a multiyear partnership with Microsoft becoming OpenAI’s preferred partner for commercializing new AI technologies. Open AI will become a big Azure customer, porting its services to run on Microsoft Azure. The $1 billion is a cash investment into OpenAI LP, which is Open AI’s for-profit corporate subsidiary. The investment will follow a standard capital commitment structure which means OpenAI can call for it, as they need it. But the company plans to spend it in less than five years. Per the official press release, “The companies will focus on building a computational platform in Azure for training and running advanced AI models, including hardware technologies that build on Microsoft’s supercomputing technology. These will be implemented in a safe, secure and trustworthy way and is a critical reason the companies chose to partner together.” They intend to license some of their pre-AGI technologies, with Microsoft becoming their preferred partner. “My goal in running OpenAI is to successfully create broadly beneficial A.G.I.,” Sam Altman, who co-founded Open AI with Elon Musk, said in a recent interview. “And this partnership is the most important milestone so far on that path.” Musk left the company in February 2019, to focus on Tesla and because he didn’t agree with some of what OpenAI team wanted to do. What does this partnership mean for Microsoft and Open AI OpenAI may benefit from this deal by keeping their innovations private which may help commercialization, raise more funds and get to AGI faster. For OpenAI this means the availability of resources for AGI, while potentially allowing founders and other investors with the opportunity to either double-down on OpenAI or reallocate resources to other initiatives However, this may also lead to them not disclosing progress, papers with details, and open source code as much as in the past. https://twitter.com/Pinboard/status/1153380118582054912 As for Microsoft, this deal is another attempt in quietly taking over open source. First, with the acquisition of GitHub and the subsequent launch of GitHub Sponsors, and now with becoming OpenAI’s ‘preferred partner’ for commercialization. Last year at an Investor conference, Nadella said, “AI is going to be one of the trends that is going to be the next big shift in technology. It's going to be AI at the edge, AI in the cloud, AI as part of SaaS applications, AI as part of in fact even infrastructure. And to me, to be the leader in it, it's not enough just to sort of have AI capability that we can exercise—you also need the ability to democratize it so that every business can truly benefit from it. That to me is our identity around AI.” Partnership with OpenAI seems to be a part of this plan. This deal can also possibly help Azure catch up with Google and Amazon both in hardware scalability and Artificial Intelligence offerings. A hacker news user comments, “OpenAI will adopt and make Azure their preferred platform. And Microsoft and Azure will jointly "develop new Azure AI supercomputing technologies", which I assume is advancing their FGPA-based deep learning offering. Google has a lead with TensorFlow + TPUs and this is a move to "buy their way in", which is a very Microsoft thing to do.” https://twitter.com/soumithchintala/status/1153308199610511360 It is also likely that Microsoft is investing money which will eventually be pumped back into its own company, as OpenAI buys computing power from the tech giant. Under the terms of the contract, Microsoft will eventually become the sole cloud computing provider for OpenAI, and most of that $1 billion will be spent on computing power, Altman says. OpenAI, who were previously into building ethical AI will now pivot to build cutting edge AI and move towards AGI. Sometimes even neglecting ethical ramifications, wanting to deploy tech at the earliest which is what Microsoft would be interested in monetizing. https://twitter.com/CadeMetz/status/1153291410994532352 I see two primary motivations: For OpenAI—to secure funding and to gain some control over hardware which in turn helps differentiate software. For MSFT—to elevate Azure in the minds of developers for AI training. - James Wang, Analyst at ARKInvest https://twitter.com/jwangARK/status/1153338174871154689 However, the news of this investment did not go down well with some experts in the field who saw this as a pure commercial deal and questioned whether OpenAI’s switch to for-profit research undermines its claims to be “democratizing” AI. https://twitter.com/fchollet/status/1153489165595504640 “I can't really parse its conversion into an LP—and Microsoft's huge investment—as anything but a victory for capital” - Robin Sloan, Author https://twitter.com/robinsloan/status/1153346647339876352 “What is OpenAI? I don't know anymore.” - Stephen Merity, Deep learning researcher https://twitter.com/Smerity/status/1153364705777311745 https://twitter.com/SamNazarius/status/1153290666413383682 People are also speculating whether creating AGI is really even possible. In a recent survey experts estimated that there was a 50 percent chance of creating AGI by the year 2099. Pet New York Times, most experts believe A.G.I. will not arrive for decades or even centuries. Even Altman admits OpenAI may never get there. But the race is on nonetheless. Then why is Microsoft delivering the $1 billion over five years considering that is neither enough money nor enough time to produce AGI. Although, OpenAI has certainly impressed the tech community with its AI innovations. In April, OpenAI’s new algorithm that is trained to play the complex strategy game, Dota 2, beat the world champion e-sports team OG at an event in San Francisco, winning the first two matches of the ‘best-of-three’ series. The competition included a human team of five professional Dota 2 players and AI team of five OpenAI bots. In February, they released a new AI model GPT-2, capable of generating coherent paragraphs of text without needing any task-specific training. However experts felt that the move signalled towards ‘closed AI’ and propagated the ‘fear of AI’ for its ability to write convincing fake news from just a few words. Github Sponsors: Could corporate strategy eat FOSS culture for dinner? Microsoft is seeking membership to Linux-distros mailing list for early access to security vulnerabilities OpenAI: Two new versions and the output dataset of GPT-2 out!
Read more
  • 0
  • 0
  • 18493

article-image-facebook-research-suggests-chatbots-and-conversational-ai-will-empathize-humans
Fatema Patrawala
06 Aug 2019
6 min read
Save for later

Facebook research suggests chatbots and conversational AI are on the verge of empathizing with humans

Fatema Patrawala
06 Aug 2019
6 min read
Last week, the Facebook AI research team published a progress report on dialogue research that is fundamentally building more engageable and personalized AI systems. According to the team, “Dialogue research is a crucial component of building the next generation of intelligent agents. While there’s been progress with chatbots in single-domain dialogue, agents today are far from capable of carrying an open-domain conversation across a multitude of topics. Agents that can chat with humans in the way that people talk to each other will be easier and more enjoyable to use in our day-to-day lives — going beyond simple tasks like playing a song or booking an appointment.” In their blog post, they have described new open source data sets, algorithms, and models that improve five common weaknesses of open-domain chatbots today. The weaknesses identified are maintaining consistency, specificity, empathy, knowledgeability, and multimodal understanding. Let us look at each one in detail: Dataset called Dialogue NLI introduced for maintaining consistency Inconsistencies are a common issue for chatbots partly because most models lack explicit long-term memory and semantic understanding. Facebook team in collaboration with their colleagues at NYU, developed a new way of framing consistency of dialogue agents as natural language inference (NLI) and created a new NLI data set called Dialogue NLI, used to improve and evaluate the consistency of dialogue models. The team showcased an example in the Dialogue NLI model, where in they considered two utterances in a dialogue as the premise and hypothesis, respectively. Each pair was labeled to indicate whether the premise entails, contradicts, or is neutral with respect to the hypothesis. Training an NLI model on this data set and using it to rerank the model’s responses to entail previous dialogues — or maintain consistency with them — improved the overall consistency of the dialogue agent. Across these tests they say they saw 3x lesser contradictions in the sentences. Several conversational attributes were studied to balance specificity As per the team, generative dialogue models frequently default to generic, safe responses, like “I don’t know” to some query which needs specific responses. Hence, the Facebook team in collaboration with Stanford’s AI researcher Abigail See, studied how to fix this by controlling several conversational attributes, like the level of specificity. In one experiment, they conditioned a bot on character information and asked “What do you do for a living?” A typical chatbot responds with the generic statement “I’m a construction worker.” With control methods, the chatbots proposed more specific and engaging responses, like “I build antique homes and refurbish houses." In addition to specificity, the team mentioned, “that balancing question-asking and answering and controlling how repetitive our models are make significant differences. The better the overall conversation flow, the more engaging and personable the chatbots and dialogue agents of the future will be.” Chatbot’s ability to display empathy while responding was measured The team worked with researchers from the University of Washington to introduce the first benchmark task of human-written empathetic dialogues centered on specific emotional labels to measure a chatbot’s ability to display empathy. In addition to improving on automatic metrics, the team showed that using this data for both fine-tuning and as retrieval candidates leads to responses that are evaluated by humans as more empathetic, with an average improvement of 0.95 points (on a 1-to-5 scale) across three different retrieval and generative models. The next challenge for the team is that empathy-focused models should perform well in complex dialogue situations, where agents may require balancing empathy with staying on topic or providing information. Wikipedia dataset used to make dialogue models more knowledgeable The research team has improved dialogue models’ capability of demonstrating knowledge by collecting a data set with conversations from Wikipedia, and creating new model architectures that retrieve knowledge, read it, and condition responses on it. This generative model has yielded the most pronounced improvement and it is rated by humans as 26% more engaging than their knowledgeless counterparts. To engage with images, personality based captions were used To engage with humans, agents should not only comprehend dialogue but also understand images. In this research, the team focused on image captioning that is engaging for humans by incorporating personality. They collected a data set of human comments grounded in images, and trained models capable of discussing images with given personalities, which makes the system interesting for humans to talk to. 64% humans preferred these personality-based captions over traditional captions. To build strong models, the team considered both retrieval and generative variants, and leveraged modules from both the vision and language domains. They defined a powerful retrieval architecture, named TransResNet, that works by projecting the image, personality, and caption in the same space using image, personality, and text encoders. The team showed that their system was able to produce captions that are close to matching human performance in terms of engagement and relevance. And annotators preferred their retrieval model’s captions over captions written by people 49.5% of the time. Apart from this, Facebook team has released a new data collection and model evaluation tool, a Messenger-based Chatbot game called Beat the Bot, that allows people to interact directly with bots and other humans in real time, creating rich examples to help train models. To conclude, the Facebook AI team mentions, “Our research has shown that it is possible to train models to improve on some of the most common weaknesses of chatbots today. Over time, we’ll work toward bringing these subtasks together into one unified intelligent agent by narrowing and eventually closing the gap with human performance. In the future, intelligent chatbots will be capable of open-domain dialogue in a way that’s personable, consistent, empathetic, and engaging.” On Hacker News, this research has gained positive and negative reviews. Some of them discuss that if AI will converse like humans, it will do a lot of bad. While other users say that this is an impressive improvement in the field of conversational AI. A user comment reads, “I gotta say, when AI is able to converse like humans, a lot of bad stuff will happen. People are so used to the other conversation partner having self-interest, empathy, being reasonable. When enough bots all have a “swarm” program to move conversations in a particular direction, they will overwhelm any public conversation. Moreover, in individual conversations, you won’t be able to trust anything anyone says or negotiates. Just like playing chess or poker online now. And with deepfakes, you won’t be able to trust audio or video either. The ultimate shock will come when software can render deepfakes in realtime to carry on a conversation, as your friend but not. As a politician who “said crazy stuff” but really didn’t, but it’s in the realm of believability. I would give it about 20 years until it all goes to shit. If you thought fake news was bad, realtime deepfakes and AI conversations with “friends” will be worse.  Scroll Snapping and other cool CSS features come to Firefox 68 Google Chrome to simplify URLs by hiding special-case subdomains Lyft releases an autonomous driving dataset “Level 5” and sponsors research competition
Read more
  • 0
  • 0
  • 18468

article-image-introduction-clustering-and-unsupervised-learning
Packt
23 Feb 2016
16 min read
Save for later

Introduction to Clustering and Unsupervised Learning

Packt
23 Feb 2016
16 min read
The act of clustering, or spotting patterns in data, is not much different from spotting patterns in groups of people. In this article, you will learn: The ways clustering tasks differ from the classification tasks How clustering defines a group, and how such groups are identified by k-means, a classic and easy-to-understand clustering algorithm The steps needed to apply clustering to a real-world task of identifying marketing segments among teenage social media users Before jumping into action, we'll begin by taking an in-depth look at exactly what clustering entails. (For more resources related to this topic, see here.) Understanding clustering Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items. It does this without having been told how the groups should look ahead of time. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction. It provides an insight into the natural groupings found within data. Without advance knowledge of what comprises a cluster, how can a computer possibly know where one group ends and another begins? The answer is simple. Clustering is guided by the principle that items inside a cluster should be very similar to each other, but very different from those outside. The definition of similarity might vary across applications, but the basic idea is always the same—group the data so that the related elements are placed together. The resulting clusters can then be used for action. For instance, you might find clustering methods employed in the following applications: Segmenting customers into groups with similar demographics or buying patterns for targeted marketing campaigns Detecting anomalous behavior, such as unauthorized network intrusions, by identifying patterns of use falling outside the known clusters Simplifying extremely large datasets by grouping features with similar values into a smaller number of homogeneous categories Overall, clustering is useful whenever diverse and varied data can be exemplified by a much smaller number of groups. It results in meaningful and actionable data structures that reduce complexity and provide insight into patterns of relationships. Clustering as a machine learning task Clustering is somewhat different from the classification, numeric prediction, and pattern detection tasks we examined so far. In each of these cases, the result is a model that relates features to an outcome or features to other features; conceptually, the model describes the existing patterns within data. In contrast, clustering creates new data. Unlabeled examples are given a cluster label that has been inferred entirely from the relationships within the data. For this reason, you will, sometimes, see the clustering task referred to as unsupervised classification because, in a sense, it classifies unlabeled examples. The catch is that the class labels obtained from an unsupervised classifier are without intrinsic meaning. Clustering will tell you which groups of examples are closely related—for instance, it might return the groups A, B, and C—but it's up to you to apply an actionable and meaningful label. To see how this impacts the clustering task, let's consider a hypothetical example. Suppose you were organizing a conference on the topic of data science. To facilitate professional networking and collaboration, you planned to seat people in groups according to one of three research specialties: computer and/or database science, math and statistics, and machine learning. Unfortunately, after sending out the conference invitations, you realize that you had forgotten to include a survey asking which discipline the attendee would prefer to be seated with. In a stroke of brilliance, you realize that you might be able to infer each scholar's research specialty by examining his or her publication history. To this end, you begin collecting data on the number of articles each attendee published in computer science-related journals and the number of articles published in math or statistics-related journals. Using the data collected for several scholars, you create a scatterplot: As expected, there seems to be a pattern. We might guess that the upper-left corner, which represents people with many computer science publications but few articles on math, could be a cluster of computer scientists. Following this logic, the lower-right corner might be a group of mathematicians. Similarly, the upper-right corner, those with both math and computer science experience, may be machine learning experts. Our groupings were formed visually; we simply identified clusters as closely grouped data points. Yet in spite of the seemingly obvious groupings, we unfortunately have no way to know whether they are truly homogeneous without personally asking each scholar about his/her academic specialty. The labels we applied required us to make qualitative, presumptive judgments about the types of people that would fall into the group. For this reason, you might imagine the cluster labels in uncertain terms, as follows: Rather than defining the group boundaries subjectively, it would be nice to use machine learning to define them objectively. This might provide us with a rule in the form if a scholar has few math publications, then he/she is a computer science expert. Unfortunately, there's a problem with this plan. As we do not have data on the true class value for each point, a supervised learning algorithm would have no ability to learn such a pattern, as it would have no way of knowing what splits would result in homogenous groups. On the other hand, clustering algorithms use a process very similar to what we did by visually inspecting the scatterplot. Using a measure of how closely the examples are related, homogeneous groups can be identified. In the next section, we'll start looking at how clustering algorithms are implemented. This example highlights an interesting application of clustering. If you begin with unlabeled data, you can use clustering to create class labels. From there, you could apply a supervised learner such as decision trees to find the most important predictors of these classes. This is called semi-supervised learning. The k-means clustering algorithm The k-means algorithm is perhaps the most commonly used clustering method. Having been studied for several decades, it serves as the foundation for many more sophisticated clustering techniques. If you understand the simple principles it uses, you will have the knowledge needed to understand nearly any clustering algorithm in use today. Many such methods are listed on the following site, the CRAN Task View for clustering at http://cran.r-project.org/web/views/Cluster.html. As k-means has evolved over time, there are many implementations of the algorithm. One popular approach is described in : Hartigan JA, Wong MA. A k-means clustering algorithm. Applied Statistics. 1979; 28:100-108. Even though clustering methods have advanced since the inception of k-means, this is not to imply that k-means is obsolete. In fact, the method may be more popular now than ever. The following table lists some reasons why k-means is still used widely: Strengths Weaknesses Uses simple principles that can be explained in non-statistical terms Highly flexible, and can be adapted with simple adjustments to address nearly all of its shortcomings Performs well enough under many real-world use cases Not as sophisticated as more modern clustering algorithms Because it uses an element of random chance, it is not guaranteed to find the optimal set of clusters Requires a reasonable guess as to how many clusters naturally exist in the data Not ideal for non-spherical clusters or clusters of widely varying density The k-means algorithm assigns each of the n examples to one of the k clusters, where k is a number that has been determined ahead of time. The goal is to minimize the differences within each cluster and maximize the differences between the clusters. Unless k and n are extremely small, it is not feasible to compute the optimal clusters across all the possible combinations of examples. Instead, the algorithm uses a heuristic process that finds locally optimal solutions. Put simply, this means that it starts with an initial guess for the cluster assignments, and then modifies the assignments slightly to see whether the changes improve the homogeneity within the clusters. We will cover the process in depth shortly, but the algorithm essentially involves two phases. First, it assigns examples to an initial set of k clusters. Then, it updates the assignments by adjusting the cluster boundaries according to the examples that currently fall into the cluster. The process of updating and assigning occurs several times until changes no longer improve the cluster fit. At this point, the process stops and the clusters are finalized. Due to the heuristic nature of k-means, you may end up with somewhat different final results by making only slight changes to the starting conditions. If the results vary dramatically, this could indicate a problem. For instance, the data may not have natural groupings or the value of k has been poorly chosen. With this in mind, it's a good idea to try a cluster analysis more than once to test the robustness of your findings. To see how the process of assigning and updating works in practice, let's revisit the case of the hypothetical data science conference. Though this is a simple example, it will illustrate the basics of how k-means operates under the hood. Using distance to assign and update clusters As with k-NN, k-means treats feature values as coordinates in a multidimensional feature space. For the conference data, there are only two features, so we can represent the feature space as a two-dimensional scatterplot as depicted previously. The k-means algorithm begins by choosing k points in the feature space to serve as the cluster centers. These centers are the catalyst that spurs the remaining examples to fall into place. Often, the points are chosen by selecting k random examples from the training dataset. As we hope to identify three clusters, according to this method, k = 3 points will be selected at random. These points are indicated by the star, triangle, and diamond in the following diagram: It's worth noting that although the three cluster centers in the preceding diagram happen to be widely spaced apart, this is not always necessarily the case. Since they are selected at random, the three centers could have just as easily been three adjacent points. As the k-means algorithm is highly sensitive to the starting position of the cluster centers, this means that random chance may have a substantial impact on the final set of clusters. To address this problem, k-means can be modified to use different methods for choosing the initial centers. For example, one variant chooses random values occurring anywhere in the feature space (rather than only selecting among the values observed in the data). Another option is to skip this step altogether; by randomly assigning each example to a cluster, the algorithm can jump ahead immediately to the update phase. Each of these approaches adds a particular bias to the final set of clusters, which you may be able to use to improve your results. In 2007, an algorithm called k-means++ was introduced, which proposes an alternative method for selecting the initial cluster centers. It purports to be an efficient way to get much closer to the optimal clustering solution while reducing the impact of random chance. For more information, refer to Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. 2007:1027–1035. After choosing the initial cluster centers, the other examples are assigned to the cluster center that is nearest according to the distance function. You will remember that we studied distance functions while learning about k-Nearest Neighbors. Traditionally, k-means uses Euclidean distance, but Manhattan distance or Minkowski distance are also sometimes used. Recall that if n indicates the number of features, the formula for Euclidean distance between example x and example y is: For instance, if we are comparing a guest with five computer science publications and one math publication to a guest with zero computer science papers and two math papers, we could compute this in R as follows: > sqrt((5 - 0)^2 + (1 - 2)^2) [1] 5.09902 Using this distance function, we find the distance between each example and each cluster center. The example is then assigned to the nearest cluster center. Keep in mind that as we are using distance calculations, all the features need to be numeric, and the values should be normalized to a standard range ahead of time. As shown in the following diagram, the three cluster centers partition the examples into three segments labeled Cluster A, Cluster B, and Cluster C. The dashed lines indicate the boundaries for the Voronoi diagram created by the cluster centers. The Voronoi diagram indicates the areas that are closer to one cluster center than any other; the vertex where all the three boundaries meet is the maximal distance from all three cluster centers. Using these boundaries, we can easily see the regions claimed by each of the initial k-means seeds: Now that the initial assignment phase has been completed, the k-means algorithm proceeds to the update phase. The first step of updating the clusters involves shifting the initial centers to a new location, known as the centroid, which is calculated as the average position of the points currently assigned to that cluster. The following diagram illustrates how as the cluster centers shift to the new centroids, the boundaries in the Voronoi diagram also shift and a point that was once in Cluster B (indicated by an arrow) is added to Cluster A: As a result of this reassignment, the k-means algorithm will continue through another update phase. After shifting the cluster centroids, updating the cluster boundaries, and reassigning points into new clusters (as indicated by arrows), the figure looks like this: Because two more points were reassigned, another update must occur, which moves the centroids and updates the cluster boundaries. However, because these changes result in no reassignments, the k-means algorithm stops. The cluster assignments are now final: The final clusters can be reported in one of the two ways. First, you might simply report the cluster assignments such as A, B, or C for each example. Alternatively, you could report the coordinates of the cluster centroids after the final update. Given either reporting method, you are able to define the cluster boundaries by calculating the centroids or assigning each example to its nearest cluster. Choosing the appropriate number of clusters In the introduction to k-means, we learned that the algorithm is sensitive to the randomly-chosen cluster centers. Indeed, if we had selected a different combination of three starting points in the previous example, we may have found clusters that split the data differently from what we had expected. Similarly, k-means is sensitive to the number of clusters; the choice requires a delicate balance. Setting k to be very large will improve the homogeneity of the clusters, and at the same time, it risks overfitting the data. Ideally, you will have a priori knowledge (a prior belief) about the true groupings and you can apply this information to choosing the number of clusters. For instance, if you were clustering movies, you might begin by setting k equal to the number of genres considered for the Academy Awards. In the data science conference seating problem that we worked through previously, k might reflect the number of academic fields of study that were invited. Sometimes the number of clusters is dictated by business requirements or the motivation for the analysis. For example, the number of tables in the meeting hall could dictate how many groups of people should be created from the data science attendee list. Extending this idea to another business case, if the marketing department only has resources to create three distinct advertising campaigns, it might make sense to set k = 3 to assign all the potential customers to one of the three appeals. Without any prior knowledge, one rule of thumb suggests setting k equal to the square root of (n / 2), where n is the number of examples in the dataset. However, this rule of thumb is likely to result in an unwieldy number of clusters for large datasets. Luckily, there are other statistical methods that can assist in finding a suitable k-means cluster set. A technique known as the elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k. As illustrated in the following diagrams, the homogeneity within clusters is expected to increase as additional clusters are added; similarly, heterogeneity will also continue to decrease with more clusters. As you could continue to see improvements until each example is in its own cluster, the goal is not to maximize homogeneity or minimize heterogeneity, but rather to find k so that there are diminishing returns beyond that point. This value of k is known as the elbow point because it looks like an elbow. There are numerous statistics to measure homogeneity and heterogeneity within the clusters that can be used with the elbow method (the following information box provides a citation for more detail). Still, in practice, it is not always feasible to iteratively test a large number of k values. This is in part because clustering large datasets can be fairly time consuming; clustering the data repeatedly is even worse. Regardless, applications requiring the exact optimal set of clusters are fairly rare. In most clustering applications, it suffices to choose a k value based on convenience rather than strict performance requirements. For a very thorough review of the vast assortment of cluster performance measures, refer to: Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. Journal of Intelligent Information Systems. 2001; 17:107-145. The process of setting k itself can sometimes lead to interesting insights. By observing how the characteristics of the clusters change as k is varied, one might infer where the data have naturally defined boundaries. Groups that are more tightly clustered will change a little, while less homogeneous groups will form and disband over time. In general, it may be wise to spend little time worrying about getting k exactly right. The next example will demonstrate how even a tiny bit of subject-matter knowledge borrowed from a Hollywood film can be used to set k such that actionable and interesting clusters are found. As clustering is unsupervised, the task is really about what you make of it; the value is in the insights you take away from the algorithm's findings. Summary This article covered only the fundamentals of clustering. As a very mature machine learning method, there are many variants of the k-means algorithm as well as many other clustering algorithms that bring unique biases and heuristics to the task. Based on the foundation in this article, you will be able to understand and apply other clustering methods to new problems. To learn more about different machine learning techniques, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r) Mastering Scientific Computing with R (https://www.packtpub.com/application-development/mastering-scientific-computing-r) R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science) Resources for Article:   Further resources on this subject: Displaying SQL Server Data using a Linq Data Source [article] Probability of R? [article] Working with Commands and Plugins [article]
Read more
  • 0
  • 0
  • 18351
article-image-getting-started-with-the-confluent-platform-apache-kafka-for-enterprise
Amarabha Banerjee
27 Feb 2018
9 min read
Save for later

Getting started with the Confluent Platform: Apache Kafka for enterprise

Amarabha Banerjee
27 Feb 2018
9 min read
This article is a book excerpt from Apache Kafka 1.0 Cookbook written by Raúl Estrada. This book will show how to use Kafka efficiently with practical solutions to the common problems that developers and administrators usually face while working with it. In today’s tutorial, we will talk about the confluent platform and how to get started with organizing and managing data from several sources in one high-performance and reliable system. The Confluent Platform is a full stream data system. It enables you to organize and manage data from several sources in one high-performance and reliable system. As mentioned in the first few chapters, the goal of an enterprise service bus is not only to provide the system a means to transport messages and data but also to provide all the tools that are required to connect the data origins (data sources), applications, and data destinations (data sinks) to the platform. The Confluent Platform has these parts: Confluent Platform open source Confluent Platform enterprise Confluent Cloud The Confluent Platform open source has the following components: Apache Kafka core Kafka Streams Kafka Connect Kafka clients Kafka REST Proxy Kafka Schema Registry The Confluent Platform enterprise has the following components: Confluent Control Center Confluent support, professional services, and consulting All the components are open source except the Confluent Control Center, which is a proprietary of Confluent Inc. An explanation of each component is as follows: Kafka core: The Kafka brokers discussed at the moment in this book. Kafka Streams: The Kafka library used to build stream processing systems. Kafka Connect: The framework used to connect Kafka with databases, stores, and filesystems. Kafka clients: The libraries for writing/reading messages to/from Kafka. Note that there clients for these languages: Java, Scala, C/C++, Python, and Go. Kafka REST Proxy: If the application doesn't run in the Kafka clients' programming languages, this proxy allows connecting to Kafka through HTTP. Kafka Schema Registry: Recall that an enterprise service bus should have a message template repository. The Schema Registry is the repository of all the schemas and their historical versions, made to ensure that if an endpoint changes, then all the involved parts are acknowledged. Confluent Control Center: A powerful web graphic user interface for managing and monitoring Kafka systems. Confluent Cloud: Kafka as a service—a cloud service to reduce the burden of operations. Installing the Confluent Platform In order to use the REST proxy and the Schema Registry, we need to install the Confluent Platform. Also, the Confluent Platform has important administration, operation, and monitoring features fundamental for modern Kafka production systems. Getting ready At the time of writing this book, the Confluent Platform Version is 4.0.0. Currently, the supported operating systems are: Debian 8 Red Hat Enterprise Linux CentOS 6.8 or 7.2 Ubuntu 14.04 LTS and 16.04 LTS macOS currently is just supported for testing and development purposes, not for production environments. Windows is not yet supported. Oracle Java 1.7 or higher is required. The default ports for the components are: 2181: Apache ZooKeeper 8081: Schema Registry (REST API) 8082: Kafka REST Proxy 8083: Kafka Connect (REST API) 9021: Confluent Control Center 9092: Apache Kafka brokers It is important to have these ports, or the ports where the components are going to run, Open How to do it There are two ways to install: downloading the compressed files or with apt-get command. To install the compressed files: Download the Confluent open source v4.0 or Confluent Enterprise v4.0 TAR files from https://www.confluent.io/download/ Uncompress the archive file (the recommended path for installation is under /opt) To start the Confluent Platform, run this command: $ <confluent-path>/bin/confluent start The output should be as follows: Starting zookeeper zookeeper is [UP] Starting kafka kafka is [UP] Starting schema-registry schema-registry is [UP] Starting kafka-rest kafka-rest is [UP] Starting connect connect is [UP] To install with the apt-get command (in Debian and Ubuntu): Install the Confluent public key used to sign the packages in the APT repository: $ wget -qO - http://packages.confluent.io/deb/4.0/archive.key |sudo apt-key add - Add the repository to the sources list: $ sudo add-apt-repository "deb [arch=amd64] http://packages.confluent.io/deb/4.0 stable main" Finally, run the apt-get update to install the Confluent Platform To install Confluent open source: $ sudo apt-get update && sudo apt-get install confluent-platformoss- 2.11 To install Confluent Enterprise: $ sudo apt-get update && sudo apt-get install confluentplatform-2.11 The end of the package name specifies the Scala version. Currently, the supported versions are 2.11 (recommended) and 2.10. There's more The Confluent Platform provides the system and component packages. The commands in this recipe are for installing all components of the platform. To install individual components, follow the instructions on this page: https://docs.confluent.io/current/installation/available_packages.html#avaiIable-packages. Using Kafka operations With the Confluent Platform installed, the administration, operation, and monitoring of Kafka become very simple. Let's review how to operate Kafka with the Confluent Platform. Getting ready For this recipe, Confluent should be installed, up, and running. How to do it The commands in this section should be executed from the directory where the Confluent Platform is installed: To start ZooKeeper, Kafka, and the Schema Registry with one command, run: $ confluent start schema-registry The output of this command should be: Starting zookeeper zookeeper is [UP] Starting kafka kafka is [UP] Starting schema-registry schema-registry is [UP] To execute the commands outside the installation directory, add Confluent's bin directory to PATH: export PATH=<path_to_confluent>/bin:$PATH To manually start each service with its own command, run: $ ./bin/zookeeper-server-start ./etc/kafka/zookeeper.properties $ ./bin/kafka-server-start ./etc/kafka/server.properties $ ./bin/schema-registry-start ./etc/schema-registry/schemaregistry. properties Note that the syntax of all the commands is exactly the same as always but without the .sh extension. To create a topic called test_topic, run the following command: $ ./bin/kafka-topics --zookeeper localhost:2181 --create --topic test_topic --partitions 1 --replication-factor 1 To send an Avro message to test_topic in the broker without writing a single line of code, use the following command: $ ./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic test_topic --property value.schema='{"name":"person","type":"record", "fields":[{"name":"name","type":"string"},{"name":"age","type":"int "}]}' Send some messages and press Enter after each line: {"name": "Alice", "age": 27} {"name": "Bob", "age": 30} {"name": "Charles", "age":57} Enter with an empty line is interpreted as null. To shut down the process, press Ctrl + C. To consume the Avro messages from test_topic since the beginning, type: $ ./bin/kafka-avro-console-consumer --topic test_topic --zookeeper localhost:2181 --from-beginning The messages created in the previous step will be written to the console in the format they were introduced. To shut down the consumer, press Ctrl + C. To test the Avro schema validation, try to produce data on the same topic using an incompatible schema, for example, with this producer: $ ./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic test_topic --property value.schema='{"type":"string"}' After you've hit Enter on the first message, the following exception is raised: org.apache.kafka.common.errors.SerializationException: Error registering Avro schema: "string" Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClient Exception: Schema being registered is incompatible with the latest schema; error code: 409 at io.confluent.kafka.schemaregistry.client.rest.utils.RestUtils.httpR equest(RestUtils.java:146) To shut down the services (Schema Registry, broker, and ZooKeeper) run: confluent stop To delete all the producer messages stored in the broker, run this: confluent destroy There's more With the Confluent Platform, it is possible to manage all of the Kafka system through the Kafka operations, which are classified as follows: Production deployment: Hardware configuration, file descriptors, and ZooKeeper configuration Post deployment: Admin operations, rolling restart, backup, and restoration Auto data balancing: Rebalancer execution and decommissioning brokers Monitoring: Metrics for each concept—broker, ZooKeeper, topics, producers, and consumers Metrics reporter: Message size, security, authentication, authorization, and verification Monitoring with the Confluent Control Center This recipe shows you how to use the metrics reporter of the Confluent Control Center. Getting ready The execution of the previous recipe is needed. Before starting the Control Center, configure the metrics reporter: Back up the server.properties file located at: <confluent_path>/etc/kafka/server.properties In the server.properties file, uncomment the following lines: metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsRepo rter confluent.metrics.reporter.bootstrap.servers=localhost:9092 confluent.metrics.reporter.topic.replicas=1 Back up the Kafka Connect configuration located in: <confluent_path>/etc/schema-registry/connect-avrodistributed.properties Add the following lines at the end of the connect-avrodistributed.properties file: consumer.interceptor.classes=io.confluent.monitoring.clients.interc eptor.MonitoringConsumerInterceptor producer.interceptor.classes=io.confluent.monitoring.clients.interc eptor.MonitoringProducerInterceptor Start the Confluent Platform: $ <confluent_path>/bin/confluent start Before starting the Control Center, change its configuration: Back up the control-center.properties file located in: <confluent_path>/etc/confluent-control-center/controlcenter.properties Add the following lines at the end of the control-center.properties file: confluent.controlcenter.internal.topics.partitions=1 confluent.controlcenter.internal.topics.replication=1 confluent.controlcenter.command.topic.replication=1 confluent.monitoring.interceptor.topic.partitions=1 confluent.monitoring.interceptor.topic.replication=1 confluent.metrics.topic.partitions=1 confluent.metrics.topic.replication=1 Start the Control Center: <confluent_path>/bin/control-center-start How to do it Open the Control Center web graphic user interface at the following URL: http://localhost:9021/. The test_topic created in the previous recipe is needed: $ <confluent_path>/bin/kafka-topics --zookeeper localhost:2181 -- create --test_topic --partitions 1 --replication-factor 1 From the Control Center, click on the Kafka Connect button on the left. Click on the New source button: 4. From the connector class, drop down the menu and select SchemaSourceConnector. Specify Connection Name as Schema-Avro-Source. 5. In the topic name, specify test_topic. 6. Click on Continue, and then click on the Save & Finish button to apply the configuration. To create a new sink follow these steps: From Kafka Connect, click on the SINKS button and then on the New sink button: From the topics list, choose test_topic and click on the Continue button In the SINKS tab, set the connection class to SchemaSourceConnector; specify Connection Name as Schema-Avro-Source Click on the Continue button and then on Save & Finish to apply the new configuration How it works Click on the Data streams tab and a chart shows the total number of messages produced and consumed on the cluster: To summarize, we discussed how to get started with the Apache Kafka confluent platform. If you liked our post, please be sure to check out Apache Kafka 1.0 Cookbook  which consists of useful recipes to work with your Apache Kafka installation.
Read more
  • 0
  • 0
  • 18216

article-image-how-to-prevent-errors-while-using-utilities-for-loading-data-in-teradata
Pravin Dhandre
11 Jun 2018
9 min read
Save for later

How to prevent errors while using utilities for loading data in Teradata

Pravin Dhandre
11 Jun 2018
9 min read
In today’s tutorial we will assist you to overcome the errors that arise while loading, deleting or updating large volumes of data using Teradata Utilities. [box type="note" align="" class="" width=""]This article is an excerpt from Teradata Cookbook co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati. This book provides recipes to simplify the daily tasks performed by database administrators (DBA) along with providing efficient data warehousing solutions in Teradata database system.[/box] Resolving FastLoad error 2652 When data is being loaded via FastLoad, a table lock is placed on the target table. This means that the table is unavailable for any other operation. A lock on a table is only released when FastLoad encounters the END LOADING command, which terminates phase 2, the so-called application phase. FastLoad may get terminated in phase 1 due to any of the following reasons: Load script results in failure (error code 8 or 12) Load script is aborted by admin or some other session FastLoad fails due to bad record or file Forgetting to add end loading statement in script If so, it keeps a lock on the table, which needs to be released manually. In this recipe, we will see the steps to release FastLoad locks. Getting ready Identify the table on which FastLoad is been ended prematurely and tables are in locked state. You need to have valid credentials for the Teradata Database. Execute the dummy FastLoad script from the same user or the user which has write access to the lock table. A user requires the following privileges/rights in order to execute the FastLoad: SELECT and INSERT (CREATE and DROP or DELETE) access to the target or loading table CREATE and DROP TABLE on error tables SELECT, INSERT, UPDATE, and DELETE are required privileges for the user PUBLIC on the restart log table (SYSADMIN.FASTLOG). There will be a row in the FASTLOG table for each FastLoad job that has not completed in the system. How to do it... Open a notepad and create the following script: .LOGON 127.0.0.1/dbc, dbc; /* Vaild system name and credentials to your system */ .DATABASE Database_Name; /* database under which locked table is */ erorfiles errortable_name, uv_tablename /* same error table name as in script */ begin loading locked_table; /* table which is getting 2652 error */ .END LOADING; /* to end pahse 2 and release the lock */ .LOGOFF; Save it as dummy_fl.txt. Open the windows Command Prompt and execute this using the FastLoad command, as shown in the following screenshot: This dummy script with no insert statement should release the lock on the target Table. Execute Select on the locked table to see if the lock is released on the table. How it works... As FastLoad is designed to work only on empty tables, it becomes necessary that the loading of the table finishes in one go. If the load script is errored out prematurely in phase 2, without encountering the END loading command, it leaves a lock on loading the table. Fastload locks can't be released via the HUT utility, as there are no technical lock on the table. To execute FastLoad, the following are some requirements: Log table: FastLoad puts its progress information in the fastlog table. EMPTY TABLE: FastLoad needs the table to be empty before inserting rows into that table. TWO ERROR TABLES: FastLoad requires two error tables to be created; you just need to name them, and no ddl is required. The first error table records any translation or constraint violation error, whereas the second error table captures errors related to the duplication of values for Unique Primary Indexes (UPI). After the completion of FastLoad, you can analyze these error tables as to why the records got rejected. There's more... If this does not fix the issue, you need to drop the target table and error tables associated with it. Before proceeding with dropping tables, check with the administrator to abort any FastLoad sessions associated with this table. Resolving MLOAD error 2571 MLOAD works in five phases, unlike FastLoad, which only works in two phases. MLOAD can fail in either phase three or four. Figure shows 5 stages of MLOAD. Preliminary: Basic setup. Syntax checking, establishing session with the Teradata Database, creation of error tables (two error tables per target table), and the creation of work tables and log tables are done in this phase. DML Transaction phase: Request is parse through PE and a step plan is generated. Steps and DML are then sent to AMP and stored in appropriate work tables for each target table. Input data sent will be stored in these work tables, which will be applied to the target table later on. Acquisition phase: Unsorted data is sent to AMP in blocks of 64K. Rows are hashed by PI and sent to appropriate AMPs. Utility places locks on target tables in preparation for the application phase to apply rows in target tables. Application phase: Changes are applied to target tables and NUSI subtables. Lock on table is held in this phase. Cleanup phase: If the error code of all the steps is 0, MLOAD successfully completes and releases all the locks on the specified table. This being the case, all empty error tables, worktables, and the log table are dropped. Getting ready Identify the table which is getting affected by error 2571. Make sure no host utility is running on this table and the load job is in a failed state for this table. How to do it... Check on viewpoint for any active utility job for this table. If you find any active job, let it complete. If there is a reason that you need to release the lock, first abort all the sessions of the host utility from viewpoint. Ask your administrator to do it. Execute the following command: RELEASE MLOAD <databasename.tablename>; > If you get a Not able to release MLOAD Lock error, execute the following Command: /* Release lock in application phase */ RELEASE MLOAD <databasename.tablename> in apply; Once the locks are released you need to drop all the associated error tables, the log table, and work tables with it. Re-execute MLOAD after correcting the error. How it works... The Mload utility places a lock in table headers to alert other utilities that a MultiLoad is in session for this table. They include: Acquisition lock: DML allows all DDL allows DROP only Application lock: DML allows SELECT with ACCESS only DDL allows DROP only There's more... If the release lock statement still gives an error and does not release the lock on the table, you need to use SELECT with the ACCESS lock to copy the content of the locked table to a new one and drop the locked tables. If you start receiving the error 7446 Mload table %ID cannot be released because NUSI exists, you need to drop all the NUSI on the table and use ALTER Table to nonfallback to accomplish the task. Resolving failure 7547 This error is associated with the UPDATE statement, which could be SQL based or could be in MLOAD. Various times, while updating the set of rows in a table, the update fails on Failure 7547 Target row updated by multiple source rows. This error will happen when you update the target with multiple rows from the source. This means there are duplicated values present in the source tables. Getting ready Let's create sample volatile tables and insert values into them. After that, we will execute the UPDATE command, which will fail to result in 7547: Create a TARGET TABLE with the following DDL and insert values into it: ** TARGET TABLE** create volatile table accounts ( CUST_ID, CUST_NAME, Sal )with data primary index(cust_id) insert values (1,'will',2000); insert values (2,'bekky',2800); insert values (3,'himesh',4000); Create a SOURCE TABLE with the following DDL and insert values into it: ** SOURCE TABLE** create volatile table Hr_payhike ( CUST_ID, CUST_NAME, Sal_hike ) with data primary index(cust_id) insert values (1,'will',2030); insert values (1,'bekky',3800); insert values (3,'himesh',7000); Execute the MLOAD script. Following the snippet from the MLOAD script, only update part (which will fail): /* Snippet from MLOAD update */ UPDATE ACC FROM ACCOUNTS ACC , Hr_payhike SUPD SET Sal= TUPD.Sal_hike WHERE Acc.CUST_ID = SUPD.CUST_ID; Failure: Target row updated by multiple source rows How to do it... Check for duplicate values in the source table using the following: /*Check for duplicate values in source table*/ SELECT cust_id,count(*) from Hr_payhike group by 1 order by 2 desc The output will be generated with CUST_ID =1 and has two values which are causing errors. The reason for this is that while updating the TARGET table, the optimizer won't be able to understand from which row it should update the TARGET row. Who's salary will be updated Will or Bekky? To resolve the error, execute the following update query: /* Update part of MLOAD */ UPDATE ACC FROM ACCOUNTS ACC , ( SELECT CUST_ID, CUST_NAME, SAL_HIKE FROM Hr_payhike QUALIFY ROW_NUMBER() OVER (PARTITION BY CUST_ID ORDER BY CUST_NAME,SAL_HIKE DESC)=1) SUPD SET Sal= SUPD.Sal_hike WHERE Acc.CUST_ID = SUPD.CUST_ID; Now, the update will run without error. How it works... Failure will happen when you update the target with multiple rows from the source. If you defined a primary index column for your target, and if those columns are in an update query condition, this error will occur. To further resolve this, you can delete the duplicate from the source table itself and execute the original update without any modification. But if the source data can't be changed, then you need to change the update statement. To summarize, we have successfully learned how to overcome or prevent errors while using utilities for loading data into database. You could also check out the Teradata Cookbook  for more than 100 recipes on enterprise data warehousing solutions. 2018 is the year of graph databases. Here’s why. 6 reasons to choose MySQL 8 for designing database solutions Amazon Neptune, AWS’ cloud graph database, is now generally available
Read more
  • 0
  • 0
  • 18203

article-image-accountability-and-algorithmic-bias-why-diversity-and-inclusion-matters-neurips-invited-talk
Sugandha Lahoti
08 Dec 2018
4 min read
Save for later

Accountability and algorithmic bias: Why diversity and inclusion matters [NeurIPS Invited Talk]

Sugandha Lahoti
08 Dec 2018
4 min read
One of the most awaited machine learning conference, NeurIPS 2018 is happening throughout this week in Montreal, Canada. It will feature a series of tutorials, invited talks, product releases, demonstrations, presentations, and announcements related to machine learning research. For the first time, NeurIPS invited a diversity and inclusion (D&I) speaker Laura Gomez to talk about the lack of diversity in the tech industry, which leads to biased algorithms, faulty products, and unethical tech. Laura Gomez is the CEO of Atipica that helps tech companies find and hire diverse candidates. Being a Latina woman herself, she had to face oppression when seeking capital and funds for her startup trying to establish herself in Silicon Valley. This experience led to her realization that there is a strong need to talk about why diversity and inclusion matters. Her efforts were not in vain and recently, she raised $2M in seed funding led by True Ventures. “At Atipica, we think of Inclusive AI in terms of data science, algorithms, and their ethical implications. This way you can rest assure our models are not replicating the biases of humans that hinder diversity while getting patent-pending aggregate demographic insights of your talent pool,” reads the website. She talks about her journey as a Latina woman in the tech industry. She reminisced on how she was the only one like her who got an internship with Hewlett Packard and the fact that she hated it. Nevertheless, she still decided to stay, determined not to let the industry turn her into a victim. She believes she made the right choice going forward with tech; now, years later, diversity is dominating the conversation in the industry. After HP, she also worked at Twitter and YouTube, helping them translate and localize their applications for a global audience. She is also a founding advisor of Project Include, which is a non-profit organization run by women, that uses data and advocacy to accelerate diversity and inclusion solutions in the tech industry. She opened her talk by agreeing to a quote from Safiya Noble, who wrote Algorithms of Oppression. “Artificial Intelligence will become a major human rights issue in the twenty-first century.” She believes we need to talk about difficult questions such as where AI is heading? And where should we hold ourselves and each other accountable.” She urges people to evaluate their role in AI, bias, and inclusion, to find the empathy and value in difficult conversations, and to go beyond your immediate surroundings to consider the broader consequences. It is important to build accountable AI in a way that allows humanity to triumph. She touched upon discriminatory moves by tech giants like Amazon and Google. Amazon recently killed off its AI recruitment tool because it couldn’t stop discriminating against women. She also criticized upon Facebook’s Myanmar operation where Facebook data scientists were building algorithms for hate speech. They didn’t understand the importance of localization or language or actually internationalize their own algorithms to be inclusive towards all the countries. She also talked about algorithmic bias in library discovery systems, as well as how even ‘black robots’ are being impacted by racism. She also condemned Palmer Luckey's work who is helping U.S. immigration agents on the border wall identify Latin refugees. Finally, she urged people to take three major steps to progress towards being inclusive: Be an ally Think of inclusion as an approach, not a feature Work towards an Ethical AI Head over to NeurIPS facebook page for the entire talk and other sessions happening at the conference this week. NeurIPS 2018: Deep learning experts discuss how to build adversarially robust machine learning models NeurIPS 2018 paper: DeepMind researchers explore autoregressive discrete autoencoders (ADAs) to model music in raw audio at scale NeurIPS 2018: A quick look at data visualization for Machine learning by Google PAIR researchers [Tutorial]
Read more
  • 0
  • 0
  • 18166
article-image-experts-present-most-pressing-issues-facing-global-lawmakers-on-citizens-privacy-democracy-and-rights-to-freedom-of-speech
Sugandha Lahoti
31 May 2019
17 min read
Save for later

Experts present most pressing issues facing global lawmakers on citizens’ privacy, democracy and rights to freedom of speech

Sugandha Lahoti
31 May 2019
17 min read
The Canadian Parliament's Standing Committee on Access to Information, Privacy, and Ethics are hosting a hearing on Big Data, Privacy and Democracy from Monday, May 27 to Wednesday, May 29 as a series of discussions with experts, and tech execs over the three days. The committee invited expert witnesses to testify before representatives from 12 countries ( Canada, United Kingdom, Singapore, Ireland, Germany, Chile, Estonia, Mexico, Morocco, Ecuador, St. Lucia, and Costa Rica) on how governments can protect democracy and citizen rights in the age of big data. The committee opened with a round table discussion where expert witnesses spoke about what they believe to be the most pressing issues facing lawmakers when it comes to protecting the rights of citizens in the digital age. Expert witnesses that took part were: Professor Heidi Tworek, University of British Columbia Jason Kint, CEO of Digital Content Next Taylor Owen, McGill University Ben Scott, The Center for Internet and Society, Stanford Law School Roger McNamee, Author of Zucked: Waking up to the Facebook Catastrophe Shoshana Zuboff, Author of The Age of Surveillance Capitalism Maria Ressa, Chief Executive Officer and Executive Editor of Rappler Inc. Jim Balsillie, Chair, Centre for International Governance Innovation The session was led by Bob Zimmer, M.P. and Chair of the Standing Committee on Access to Information, Privacy and Ethics. Other members included Nathaniel Erskine-Smith, and Charlie Angus, M.P. and Vice-Chair of the Standing Committee on Access to Information, Privacy and Ethics. Also present was Damian Collins, M.P. and Chair of the UK Digital, Culture, Media and Sport Committee. Testimonies from the witnesses “Personal data matters more than context”, Jason Kint, CEO of Digital Content Next The presentation started with Mr. Jason Kint, CEO of Digital Content Next, a US based Trade association, who thanked the committee and appreciated the opportunity to speak on behalf of 80 high-quality digital publishers globally. He begins by saying how DCN has prioritized shining a light on issues that erode trust in the digital marketplace, including a troubling data ecosystem that has developed with very few legitimate constraints on the collection and use of data about consumers. As a result personal data is now valued more highly than context, consumer expectations, copyright, and even facts themselves. He believes it is vital that policymakers begin to connect the dots between the three topics of the committee's inquiry, data privacy, platform dominance and, societal impact. He says that today personal data is frequently collected by unknown third parties without consumer knowledge or control. This data is then used to target consumers across the web as cheaply as possible. This dynamic creates incentives for bad actors, particularly on unmanaged platforms, like social media, which rely on user-generated content mostly with no liability. Here the site owners are paid on the click whether it is from an actual person or a bot on trusted information or on disinformation. He says that he is optimistic about regulations like the GDPR in the EU which contain narrow purpose limitations to ensure companies do not use data for secondary uses. He recommends exploring whether large tech platforms that are able to collect data across millions of devices, websites, and apps should even be allowed to use this data for secondary purposes. He also applauds the decision of the German cartel office to limit Facebook's ability to collect and use data across its apps and the web. He further says that issues such as bot fraud, malware, ad blockers, clickbait, privacy violations and now disinformation are just symptoms. The root cause is unbridled data collection at the most personal level.  Four years ago DC ended the original financial analysis labeling Google and Facebook the duopoly of digital advertising. In a 150+ billion dollar digital ad market across the North America and the EU, 85 to 90 percent of the incremental growth is going to just these two companies. DNC dug deeper and connected the revenue concentration to the ability of these two companies to collect data in a way that no one else can. This means both companies know much of your browsing history and your location history. The emergence of this duopoly has created a misalignment between those who create the content and those who profit from it. The scandal involving Facebook and Cambridge analytic underscores the current dysfunctional dynamic. With the power Facebook has over our information ecosystem our lives and our democratic systems it is vital to know whether we can trust the company. He also points out that although, there's been a well documented and exhausting trail of apologies, there's been little or no change in the leadership or governance of Facebook. In fact the company has repeatedly refused to have its CEO offer evidence to pressing international government. He believes there should be a deeper probe as there's still much to learn about what happened and how much Facebook knew about the Cambridge Analytica scandal before it became public. Facebook should be required to have an independent audit of its user account practices and its decisions to preserve or purge real and fake accounts over the past decade. He ends his testimony saying that it is critical to shed light on these issues to understand what steps must be taken to improve data protection. This includes providing consumers with greater transparency and choice over their personal data when using practices that go outside of the normal expectations of consumers. Policy makers globally must hold digital platforms accountable for helping to build a healthy marketplace and for restoring consumer trust and restoring competition. “We need a World Trade Organization 2.0 “, Jim Balsillie, Chair, Centre for International Governance Innovation; Retired Chairman and co-CEO of BlackBerry Jim begins by saying that Data governance is the most important public policy issue of our time. It is cross-cutting with economic, social, and security dimension. It requires both national policy frameworks and international coordination. A specific recommendation he brought forward in this hearing was to create a new institution for like-minded nations to address digital cooperation and stability. “The data driven economies effects cannot be contained within national borders”, he said, “we need new or reformed rules of the road for digitally mediated global commerce, a World Trade Organization 2.0”. He gives the example of Financial Stability Board which was created in the aftermath of the 2008 financial crisis to foster global financial cooperation and stability. He recommends forming a similar global institution, for example, digital stability board, to deal with the challenges posed by digital transformation. The nine countries on this committee plus the five other countries attending, totaling 14 could constitute founding members of this board which would undoubtedly grow over time. “Check business models of Silicon Valley giants”, Roger McNamee, Author of Zucked: Waking up to the Facebook Catastrophe Roger begins by saying that it is imperative that this committee and that nations around the world engage in a new thought process relative to the ways of controlling companies in Silicon Valley, especially to look at their business models. By nature these companies invade privacy and undermine democracy. He assures that there is no way to stop that without ending the business practices as they exist. He then commends Sri Lanka who chose to shut down the platforms in response to a terrorist act. He believes that that is the only way governments are going to gain enough leverage in order to have reasonable conversations. He explains more on this in his formal presentation, which took place yesterday. “Stop outsourcing policies to the private sector”, Taylor Owen, McGill University He begins by making five observations about the policy space that we’re in right now. First, self-regulation and even many of the forms of co-regulation that are being discussed have and will continue to prove insufficient for this problem. The financial incentives are simply powerfully aligned against meaningful reform. These are publicly traded largely unregulated companies whose shareholders and directors expect growth by maximizing a revenue model that it is self part of the problem. This growth may or may not be aligned with the public interest. Second, disinformation, hate speech, election interference, privacy breaches, mental health issues and anti-competitive behavior must be treated as symptoms of the problem not its cause. Public policy should therefore focus on the design and the incentives embedded in the design of the platforms themselves. If democratic governments determine that structure and design is leading to negative social and economic outcomes, then it is their responsibility to govern. Third, governments that are taking this problem seriously are converging on a markedly similar platform governance agenda. This agenda recognizes that there are no silver bullets to this broad set of problems and that instead, policies must be domestically implemented and internationally coordinated across three categories: Content policies which seek to address a wide range of both supply and demand issues about the nature amplification and legality of content in our digital public sphere. Data policies which ensure that public data is used for the public good and that citizens have far greater rights over the use, mobility, and monetization of their data. Competition policies which promote free and competitive markets in the digital economy. Fourth, the propensity when discussing this agenda to overcomplicate solutions serves the interests of the status quo. He then recommends sensible policies that could and should be implemented immediately: The online ad micro targeting market could be made radically more transparent and in many cases suspended entirely. Data privacy regimes could be updated to provide far greater rights to individuals and greater oversight and regulatory power to punish abuses. Tax policy can be modernized to better reflect the consumption of digital goods and to crack down on tax base erosion and profit sharing. Modernized competition policy can be used to restrict and rollback acquisitions and a separate platform ownership from application and product development. Civic media can be supported as a public good. Large-scale and long term civic literacy and critical thinking efforts can be funded at scale by national governments, not by private organizations. He then asks difficult policy questions for which there are neither easy solutions, meaningful consensus nor appropriate existing international institutions. How we regulate harmful speech in the digital public sphere? He says, that at the moment we've largely outsourced the application of national laws as well as the interpretation of difficult trade-offs between free speech and personal and public harms to the platforms themselves. Companies who seek solutions rightly in their perspective that can be implemented at scale globally. In this case, he argues that what is possible technically and financially for the companies might be insufficient for the goals of the public good or the public policy goals. What is liable for content online? He says that we’ve clearly moved beyond the notion of platform neutrality and absolute safe harbor but what legal mechanisms are best suited to holding platforms, their design, and those that run them accountable. Also, he asks how are we going to bring opaque artificial intelligence systems into our laws and norms and regulations? He concludes saying that these difficult conversation should not be outsourced to the private sector. They need to be led by democratically accountable governments and their citizens. “Make commitments to public service journalism”, Ben Scott, The Center for Internet and Society, Stanford Law School Ben states that technology doesn't cause the problem of data misinformation, and irregulation. It infact accelerates it. This calls for policies to be made to limit the exploitation of these technology tools by malignant actors and by companies that place profits over the public interest. He says, “we have to view our technology problem through the lens of the social problems that we're experiencing.” This is why the problem of political fragmentation or hate speech tribalism and digital media looks different in each countries. It looks different because it feeds on the social unrest, the cultural conflict, and the illiberalism that is native to each society. He says we need to look at problems holistically and understand that social media companies are a part of a system and they don't stand alone as the super villains. The entire media market has bent itself to the performance metrics of Google and Facebook. Television, radio, and print have tortured their content production and distribution strategies to get likes shares and and to appear higher in the Google News search results. And so, he says, we need a comprehensive public policy agenda and put red lines around the illegal content. To limit data collection and exploitation we need to modernize competition policy to reduce the power of monopolies. He also says, that we need to publicly educate people on how to help themselves and how to stop being exploited. We need to make commitments to public service journalism to provide alternatives for people, alternatives to the mindless stream of clickbait to which we have become accustomed. “Pay attention to the physical infrastructure”, Professor Heidi Tworek, University of British Columbia Taking inspiration from Germany's vibrant interwar media democracy as it descended into an authoritarian Nazi regime, Heidi lists five brief lessons that she thinks can guide policy discussions in the future. These can enable governments to build robust solutions that can make democracies stronger. Disinformation is also an international relations problem Information warfare has been a feature not a bug of the international system for at least a century. So the question is not if information warfare exists but why and when states engage in it. This happens often when a state feels encircled, weak or aspires to become a greater power than it already is. So if many of the causes of disinformation are geopolitical, we need to remember that many of the solutions will be geopolitical and diplomatic as well, she adds. Pay attention to the physical infrastructure Information warfare disinformation is also enabled by physical infrastructure whether it is the submarine cables a century ago or fiber optic cables today. 95 to 99 percent of international data flows through undersea fiber-optic cables. Google partly owns 8.5 percent of those submarine cables. Content providers also own physical infrastructure She says, Russia and China, for example are surveying European and North American cables. China we know as of investing in 5G but combining that with investments in international news networks. Business models matter more than individual pieces of content Individual harmful content pieces go viral because of the few companies that control the bottleneck of information. Only 29% of Americans or Brits understand that their Facebook newsfeed is algorithmically organized. The most aware are the Finns and there are only 39% of them that understand that. That invisibility can provide social media platforms an enormous amount of power that is not neutral. At a very minimum, she says, we need far more transparency about how algorithms work and whether they are discriminatory. Carefully design robust regulatory institutions She urges governments and the committee to democracy-proof whatever solutions,  come up with. She says, “we need to make sure that we embed civil society or whatever institutions we create.” She suggests an idea of forming social media councils that could meet regularly to actually deal with many such problems. The exact format and the geographical scope are still up for debate but it's an idea supported by many including the UN Special Rapporteur on freedom of expression and opinion, she adds. Address the societal divisions exploited by social media Heidi says, that the seeds of authoritarianism need fertile soil to grow and if we do not attend to the underlying economic and social discontents, better communications cannot obscure those problems forever. “Misinformation is effect of one shared cause, Surveillance Capitalism”, Shoshana Zuboff, Author of The Age of Surveillance Capitalism Shoshana also agrees with the committee about how the themes of platform accountability, data security and privacy, fake news and misinformation are all effects of one shared cause. She identifies this underlying cause as surveillance capitalism and defines  surveillance capitalism as a comprehensive systematic economic logic that is unprecedented. She clarifies that surveillance capitalism is not technology. It is also not a corporation or a group of corporations. This is infact a virus that has infected every economic sector from insurance, retail, publishing, finance all the way through to product and service manufacturing and administration all of these sectors. According to her, Surveillance capitalism cannot also be reduced to a person or a group of persons. Infact surveillance capitalism follows the history of market capitalism in the following way - it takes something that exists outside the marketplace and it brings it into the market dynamic for production and sale. It claims private human experience for the market dynamic. Private human experience is repurposed as free raw material which are rendered as behavioral data. Some of these behavioral data are certainly fed back into product and service improvement but the rest are declared of behavioral surplus identified for their rich predictive value. These behavioral surplus flows are then channeled into the new means of production what we call machine intelligence or artificial intelligence. From these come out prediction products. Surveillance capitalists own and control not one text but two. First is the public facing text which is derived from the data that we have provided to these entities. What comes out of these, the prediction products, is the proprietary text, a shadow text from which these companies have amassed high market capitalization and revenue in a very short period of time. These prediction products are then sold into a new kind of marketplace that trades exclusively in human futures. The first name of this marketplace was called online targeted advertising and the human predictions that were sold in those markets were called click-through rates. By now that these markets are no more confined to that kind of marketplace. This new logic of surveillance capitalism is being applied to anything and everything. She promises to discuss on more of this in further sessions. “If you have no facts then you have no truth. If you have no truth you have no trust”, Maria Ressa, Chief Executive Officer and Executive Editor of Rappler Inc. Maria believes that in the end it comes down to the battle for truth and journalists are on the front line of this along with activists. Information is power and if you can make people believe lies, then you can control them. Information can be used for commercial benefits as well as a means to gain geopolitical power. She says,  If you have no facts then you have no truth. If you have no truth you have no trust. She then goes on to introduce a bit about her formal presentation tomorrow saying that she will show exactly how quickly a nation, a democracy can crumble because of information operations. She says she will provide data that shows it is systematic and that it is an erosion of truth and trust.  She thanks the committee saying that what is so interesting about these types of discussions is that the countries that are most affected are democracies that are most vulnerable. Bob Zimmer concluded the meeting saying that the agenda today was to get the conversation going and more of how to make our data world a better place will be continued in further sessions. He said, “as we prepare for the next two days of testimony, it was important for us to have this discussion with those who have been studying these issues for years and have seen firsthand the effect digital platforms can have on our everyday lives. The knowledge we have gained tonight will no doubt help guide our committee as we seek solutions and answers to the questions we have on behalf of those we represent. My biggest concerns are for our citizens’ privacy, our democracy and that our rights to freedom of speech are maintained according to our Constitution.” Although, we have covered most of the important conversations, you can watch the full hearing here. Time for data privacy: DuckDuckGo CEO Gabe Weinberg in an interview with Kara Swisher ‘Facial Recognition technology is faulty, racist, biased, abusive to civil rights; act now to restrict misuse’ say experts to House Oversight and Reform Committee. A brief list of drafts bills in US legislation for protecting consumer data privacy
Read more
  • 0
  • 0
  • 18054

article-image-tinkering-with-ticks-in-matplotlib-2-0
Sugandha Lahoti
13 Dec 2017
6 min read
Save for later

Tinkering with ticks in Matplotlib 2.0

Sugandha Lahoti
13 Dec 2017
6 min read
[box type="note" align="" class="" width=""]This is an excerpt from the book titled Matplotlib 2.x By Example written by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim,. The book covers basic know-how on how to create and customize plots by Matplotlib. It will help you learn to visualize geographical data on maps and implement interactive charts. [/box] The article talks about how you can manipulate ticks in Matplotlib 2.0. It includes steps to adjust tick spacing, customizing tick formats, trying out the ticker locator and formatter, and rotating tick labels. What are Ticks Ticks are dividers on an axis that help readers locate the coordinates. Tick labels allow estimation of values or, sometimes, labeling of a data series as in bar charts and box plots. Adjusting tick spacing Tick spacing can be adjusted by calling the locator methods: ax.xaxis.set_major_locator(xmajorLocator) ax.xaxis.set_minor_locator(xminorLocator) ax.yaxis.set_major_locator(ymajorLocator) ax.yaxis.set_minor_locator(yminorLocator) Here, ax refers to axes in a Matplotlib figure. Since set_major_locator() or set_minor_locator() cannot be called from the pyplot interface but requires an axis, we call pyplot.gca() to get the current axes. We can also store a figure and axes as variables at initiation, which is especially useful when we want multiple axes. Removing ticks NullLocator: No ticks Drawing ticks in multiples Spacing ticks in multiples of a given number is the most intuitive way. This can be done by using MultipleLocator space ticks in multiples of a given value. Automatic tick settings MaxNLocator: This finds the maximum number of ticks that will display nicely AutoLocator: MaxNLocator with simple defaults AutoMinorLocator: Adds minor ticks uniformly when the axis is linear Setting ticks by the number of data points IndexLocator: Sets ticks by index (x = range(len(y)) Set scaling of ticks by mathematical functions LinearLocator: Linear scale LogLocator: Log scale SymmetricalLogLocator: Symmetrical log scale, log with a range of linearity LogitLocator: Logit scaling Locating ticks by datetime There is a series of locators dedicated to displaying date and time: MinuteLocator: Locate minutes HourLocator: Locate hours DayLocator: Locate days of the month WeekdayLocator: Locate days of the week MonthLocator: Locate months, for example, 8 for August YearLocator: Locate years that in multiples RRuleLocator: Locate using matplotlib.dates.rrulewrapper The rrulewrapper is a simple wrapper around a dateutil.rrule (dateutil) that allows almost arbitrary date tick specifications AutoDateLocator: On autoscale, this class picks the best MultipleDateLocator to set the view limits and the tick locations Customizing tick formats Tick formatters control the style of tick labels. They can be called to set the major and minor tick formats on the x and y axes as follows: ax.xaxis.set_major_formatter( xmajorFormatter ) ax.xaxis.set_minor_formatter( xminorFormatter ) ax.yaxis.set_major_formatter( ymajorFormatter ) ax.yaxis.set_minor_formatter( yminorFormatter ) Removing tick labels NullFormatter: No tick labels Fixing labels FixedFormatter: Labels are set manually Setting labels with strings IndexFormatter: Take labels from a list of strings StrMethodFormatter: Use the string format method Setting labels with user-defined functions FuncFormatter: Labels are set by a user-defined function Formatting axes by numerical values ScalarFormatter: The format string is automatically selected for scalars by default The following formatters set values for log axes: LogFormatter: Basic log axis LogFormatterExponent: Log axis using exponent = log_base(value) LogFormatterMathtext: Log axis using exponent = log_base(value) using Math text LogFormatterSciNotation: Log axis with scientific notation LogitFormatter: Probability formatter Trying out the ticker locator and formatter To demonstrate the ticker locator and formatter, here we use Netflix subscriber data as an example. Business performance is often measured seasonally. Television shows are even more "seasonal". Can we better show it in the timeline? import numpy as np import matplotlib.pyplot as plt import matplotlib.ticker as ticker """ Number for Netflix streaming subscribers from 2012-2017 Data were obtained from Statista on https://www.statista.com/statistics/250934/quarterly-number-of-netflix-stre aming-subscribers-worldwide/ on May 10, 2017. The data were originally published by Netflix in April 2017. """ # Prepare the data set x = range(2011,2018) y = [26.48,27.56,29.41,33.27,36.32,37.55,40.28,44.35, 48.36,50.05,53.06,57.39,62.27,65.55,69.17,74.76,81.5, 83.18,86.74,93.8,98.75] # quarterly subscriber count in millions # Plot lines with different line styles plt.plot(y,'^',label = 'Netflix subscribers',ls='-') # get current axes and store it to ax ax = plt.gca() # set ticks in multiples for both labels ax.xaxis.set_major_locator(ticker.MultipleLocator(4)) # set major marks # every 4 quarters, ie once a year ax.xaxis.set_minor_locator(ticker.MultipleLocator(1)) # set minor marks # for each quarter ax.yaxis.set_major_locator(ticker.MultipleLocator(10)) # ax.yaxis.set_minor_locator(ticker.MultipleLocator(2)) # label the start of each year by FixedFormatter  ax.get_xaxis().set_major_formatter(ticker.FixedFormatter(x)) plt.legend() plt.show() From this plot, we see that Netflix has a pretty linear growth of subscribers from the year 2012 to 2017. We can tell the seasonal growth better after formatting the x axis in a quarterly manner. In 2016, Netflix was doing better in the latter half of the year. Any TV shows you watched in each season? Rotating tick labels A figure can get too crowded or some tick labels may get skipped when we have too many tick labels or if the label strings are too long. We can solve this by rotating the ticks, for example, by pyplot.xticks(rotation=60): import matplotlib.pyplot as plt import numpy as np import matplotlib as mpl mpl.style.use('seaborn') techs = ['Google Adsense','DoubleClick.Net','Facebook Custom Audiences','Google Publisher Tag', 'App Nexus'] y_pos = np.arange(len(techs)) # Number of websites using the advertising technologies # Data were quoted from builtwith.com on May 8th 2017 websites = [14409195,1821385,948344,176310,283766] plt.bar(y_pos, websites, align='center', alpha=0.5) # set x-axis tick rotation plt.xticks(y_pos, techs, rotation=25) plt.ylabel('Live site count') plt.title('Online advertising technologies usage') plt.show() Use pyplot.tight_layout() to avoid image clipping. Using rotated labels can sometimes result in image clipping, as follows, if you save the figure by pyplot.savefig(). You can call pyplot.tight_layout() before pyplot.savefig() to ensure a complete image output. We saw how ticks can be adjusted, customized, rotated and formatted in Matplotlib 2.0 for easy readability, labelling and estimation of values. To become well-versed with Matplotlib for your day to day work, check out this book Matplotlib 2.x By Example.  
Read more
  • 0
  • 0
  • 18047
Modal Close icon
Modal Close icon