Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-google-confirms-it-paid-135-million-as-exit-packages-to-senior-execs-accused-of-sexual-harassment
Natasha Mathur
12 Mar 2019
4 min read
Save for later

Google confirms it paid $135 million as exit packages to senior execs accused of sexual harassment

Natasha Mathur
12 Mar 2019
4 min read
According to a complaint filed in a lawsuit yesterday, Google paid $135 million in total as exit packages to top two senior execs, namely Andy Rubin (creator of Android) and Amit Singhal (former senior VP of Google search) after they were accused of sexual misconduct in the company. The lawsuit was filed by an Alphabet shareholder, James Martin, in the Santa Clara, California Court. Google also confirmed paying the exit packages to senior execs to The Verge, yesterday. Speaking of the lawsuit, the complaint is against certain directors and officers of Alphabet, Google’s parent company, for their active and direct participation in “multi-year scheme” to hide sexual harassment and discrimination at Alphabet. It also states that the misconduct by these directors has caused severe financial and reputational damage to Alphabet. The exit packages for Rubin and Singhal were approved by the Leadership Development and Compensation Committee (LLDC). The news of Google paying high exit packages to its top execs first came to light last October, after the New York Times released a report on Google, stating that the firm paid $90 million to Rubin and $15 million to Singhal. Rubin had previously also received an offer for a $150 million stock grant, which he then further use to negotiate the $90 million in severance pay, even though he should have been fired for cause without any pay, states the lawsuit. To protest against the handling of sexual misconduct within Google, more than 20,000 Google employees along with vendors, and contractors, temps, organized Google “walkout for real change” and walked out of their offices in November 2018. Googlers also launched an industry-wide awareness campaign to fight against forced arbitration in January, where they shared information about arbitration on their Twitter and Instagram accounts throughout the day.   Last year in November, Google ended its forced arbitration ( a move that was soon followed by Facebook) for its employees (excluding temps, vendors, etc) and only in the case of sexual harassment. This led to contractors writing an open letter on Medium to Sundar Pichai, CEO, Google, in December, demanding him to address their demands of better conditions and equal benefits for contractors. In response to the Google walkout and the growing public pressure, Google finally decided to end its forced arbitration policy for all employees (including contractors) and for all kinds of discrimination within Google, last month. The changes will go into effect for all the Google employees starting March 21st, 2019. Yesterday, the Google walkout for real change group tweeted condemning the multi-million dollar payouts and has asked people to use the hashtag #Googlepayoutsforall to highlight other better ways that money could have been used. https://twitter.com/GoogleWalkout/status/1105450565193121792 “The conduct of Rubin and other executives was disgusting, illegal, immoral, degrading to women and contrary to every principle that Google claims it abides by”, reads the lawsuit. James Martin also filed a lawsuit against Alphabet’s board members, Larry Page, Sergey Brin, and Eric Schmidt earlier this year in January for covering up the sexual harassment allegations against the former top execs at Google. Martin had sued Alphabet for breaching its fiduciary duty to shareholders, unjust enrichment, abuse of power, and corporate waste. “The directors’ wrongful conduct allowed illegal conduct to proliferate and continue. As such, members of the Alphabet’s board were knowing direct enables of sexual harassment and discrimination”, reads the lawsuit. It also states that the board members not only violated the California and federal law but it also violated the ethical standards and guidelines set by Alphabet. Public reaction to the news is largely negative with people condemning Google’s handling of sexual misconduct: https://twitter.com/awesome/status/1105295877487263744 https://twitter.com/justkelly_ok/status/1105456081663225856 https://twitter.com/justkelly_ok/status/1105457965790707713 https://twitter.com/conradwt/status/1105386882135875584 https://twitter.com/mer__edith/status/1105464808831361025 For more information, check out the official lawsuit here. Recode Decode #GoogleWalkout interview shows why data and evidence don’t always lead to right decisions in even the world’s most data-driven company Liz Fong Jones, prominent ex-Googler shares her experience at Google and ‘grave concerns’ for the company Google’s pay equity analysis finds men, not women, are underpaid; critics call out design flaws in the analysis
Read more
  • 0
  • 0
  • 35139

article-image-key-skills-every-database-programmer-should-have
Sugandha Lahoti
05 Sep 2019
7 min read
Save for later

Key skills every database programmer should have

Sugandha Lahoti
05 Sep 2019
7 min read
According to Robert Half Technology’s 2019 IT salary report, ‘Database programmer’ is one of the 13 most in-demand tech jobs for 2019. For an entry-level programmer, the average salary is $98,250 which goes up to $167,750 for a seasoned expert. A typical database programmer is responsible for designing, developing, testing, deploying, and maintaining databases. In this article, we will list down the top critical tech skills essential to database programmers. #1 Ability to perform Data Modelling The first step is to learn to model the data. In Data modeling, you create a conceptual model of how data items relate to each other. In order to efficiently plan a database design, you should know the organization you are designing the database from. This is because Data models describe real-world entities such as ‘customer’, ‘service’, ‘products’, and the relation between these entities. Data models provide an abstraction for the relations in the database. They aid programmers in modeling business requirements and in translating business requirements into relations. They are also used for exchanging information between the developers and business owners. During the design phase, the database developer should pay great attention to the underlying design principles, run a benchmark stack to ensure performance, and validate user requirements. They should also avoid pitfalls such as data redundancy, null saturation, and tight coupling. #2 Know a database programming language, preferably SQL Database programmers need to design, write and modify programs to improve their databases. SQL is one of the top languages that are used to manipulate the data in a database and to query the database. It's also used to define and change the structure of the data—in other words, to implement the data model. Therefore it is essential that you learn SQL. In general, SQL has three parts: Data Definition Language (DDL): used to create and manage the structure of the data Data Manipulation Language (DML): used to manage the data itself Data Control Language (DCL): controls access to the data Considering, data is constantly inserted into the database, changed, or retrieved DML is used more often in day-to-day operations than the DDL, so you should have a strong grasp on DML. If you plan to grow in a database architect role in the near future, then having a good grasp of DDL will go a long way. Another reason why you should learn SQL is that almost every modern relational database supports SQL. Although different databases might support different features and implement their own dialect of SQL, the basics of the language remain the same. If you know SQL, you can quickly adapt to MySQL, for example. At present, there are a number of categories of database models predominantly, relational, object-relational, and NoSQL databases. All of these are meant for different purposes. Relational databases often adhere to SQL. Object-relational databases (ORDs) are also similar to relational databases. NoSQL, which stands for "not only SQL," is an alternative to traditional relational databases useful for working with large sets of distributed data. They provide benefits such as availability, schema-free, and horizontal scaling, but also have limitations such as performance, data retrieval constraints, and learning time. For beginners, it is advisable to first start with experimenting on relational databases learning SQL, gradually transitioning to NoSQL DBMS. #3 Know how to Extract, Transform, Load various data types and sources A database programmer should have a good working knowledge of ETL (Extract, Transform Load) programming. ETL developers basically extract data from different databases, transform it and then load the data into the Data Warehouse system. A Data Warehouse provides a common data repository that is essential for business needs. A database programmer should know how to tune existing packages, tables, and queries for faster ETL processing. They should conduct unit tests before applying any change to the existing ETL process. Since ETL takes data from different data sources (SQL Server, CSV, and flat files), a database developer should have knowledge on how to deal with different data sources. #4 Design and test Database plans Database programmers o perform regular tests to identify ways to solve database usage concerns and malfunctions. As databases are usually found at the lowest level of the software architecture, testing is done in an extremely cautious fashion. This is because changes in the database schema affect many other software components. A database developer should make sure that when changing the database structure, they do not break existing applications and that they are using the new structures properly. You should be proficient in Unit testing your database. Unit tests are typically used to check if small units of code are functioning properly. For databases, unit testing can be difficult. So the easiest way to do all of that is by writing the tests as SQL scripts. You should also know about System Integration Testing which is done on the complete system after the hardware and software modules of that system have been integrated. SIT validates the behavior of the system and ensures that modules in the system are functioning suitably. #5 Secure your Database Data protection and security are essential for the continuity of business. Databases often store sensitive data, such as user information, email addresses, geographical addresses, and payment information. A robust security system to protect your database against any data breach is therefore necessary. While a database architect is responsible for designing and implementing secure design options, a database admin must ensure that the right security and privacy policies are in place and are being observed. However, this does not absolve database programmers from adopting secure coding practices. Database programmers need to ensure that data integrity is maintained over time and is secure from unauthorized changes or theft. They need to especially be careful about Table Permissions i.e who can read and write to what tables. You should be aware of who is allowed to perform the 4 basic operations of INSERT, UPDATE, DELETE and SELECT against which tables. Database programmers should also adopt authentication best practices depending on the infrastructure setup, the application's nature, the user's characteristics, and data sensitivity. If the database server is accessed from the outside world, it is beneficial to encrypt sessions using SSL certificates to avoid packet sniffing. Also, you should secure database servers that trust all localhost connections, as anyone who accesses the localhost can access the database server. #6 Optimize your database performance A database programmer should also be aware of how to optimize their database performance to achieve the best results. At the basic level, they should know how to rewrite SQL queries and maintain indexes. Other aspects of optimizing database performance, include hardware configuration, network settings, and database configuration. Generally speaking, tuning database performance requires knowledge about the system's nature. Once the database server is configured you should calculate the number of transactions per second (TPS) for the database server setup. Once the system is up and running, and you should set up a monitoring system or log analysis, which periodically finds slow queries, the most time-consuming queries, etc. #7 Develop your soft skills Apart from the above technical skills, a database programmer needs to be comfortable communicating with developers, testers and project managers while working on any software project. A keen eye for detail and critical thinking can often spot malfunctions and errors that may otherwise be overlooked. A database programmer should be able to quickly fix issues within the database and streamline the code. They should also possess quick-thinking to prioritize tasks and meet deadlines effectively. Often database programmers would be required to work on documentation and technical user guides so strong writing and technical skills are a must. Get started If you want to get started with becoming a Database programmer, Packt has a range of products. Here are some of the best: PostgreSQL 11 Administration Cookbook Learning PostgreSQL 11 - Third Edition PostgreSQL 11 in 7 days [ Video ] Using MySQL Databases With Python [ Video ] Basic Relational Database Design [ Video ] How to learn data science: from data mining to machine learning How to ace a data science interview 5 barriers to learning and technology training for small software development teams
Read more
  • 0
  • 0
  • 34749

article-image-bitbucket-to-no-longer-support-mercurial-users-must-migrate-to-git-by-may-2020
Fatema Patrawala
21 Aug 2019
6 min read
Save for later

Bitbucket to no longer support Mercurial, users must migrate to Git by May 2020

Fatema Patrawala
21 Aug 2019
6 min read
Yesterday marked an end of an era for Mercurial users, as Bitbucket announced to no longer support Mercurial repositories after May 2020. Bitbucket, owned by Atlassian, is a web-based version control repository hosting service, for source code and development projects. It has used Mercurial since the beginning in 2008 and then Git since October 2011. Now almost after ten years of sharing its journey with Mercurial, the Bitbucket team has decided to remove the Mercurial support from the Bitbucket Cloud and its API. The official announcement reads, “Mercurial features and repositories will be officially removed from Bitbucket and its API on June 1, 2020.” The Bitbucket team also communicated the timeline for the sunsetting of the Mercurial functionality. After February 1, 2020 users will no longer be able to create new Mercurial repositories. And post June 1, 2020 users will not be able to use Mercurial features in Bitbucket or via its API and all Mercurial repositories will be removed. Additionally all current Mercurial functionality in Bitbucket will be available through May 31, 2020. The team said the decision was not an easy one for them and Mercurial held a special place in their heart. But according to a Stack Overflow Developer Survey, almost 90% of developers use Git, while Mercurial is the least popular version control system with only about 3% developer adoption. Apart from this Mercurial usage on Bitbucket saw a steady decline, and the percentage of new Bitbucket users choosing Mercurial fell to less than 1%. Hence they decided on removing the Mercurial repos. How can users migrate and export their Mercurial repos Bitbucket team recommends users to migrate their existing Mercurial repos to Git. They have also extended support for migration, and kept the available options open for discussion in their dedicated Community thread. Users can discuss about conversion tools, migration, tips, and also offer troubleshooting help. If users prefer to continue using the Mercurial system, there are a number of free and paid Mercurial hosting services for them. The Bitbucket team has also created a Git tutorial that covers everything from the basics of creating pull requests to rebasing and Git hooks. Community shows anger and sadness over decision to discontinue Mercurial support There is an outrage among the Mercurial users as they are extremely unhappy and sad with this decision by Bitbucket. They have expressed anger not only on one platform but on multiple forums and community discussions. Users feel that Bitbucket’s decision to stop offering Mercurial support is bad, but the decision to also delete the repos is evil. On Hacker News, users speculated that this decision was influenced by potential to market rather than based on technically superior architecture and ease of use. They feel GitHub has successfully marketed Git and that's how both have become synonymous to the developer community. One of them comments, “It's very sad to see bitbucket dropping mercurial support. Now only Facebook and volunteers are keeping mercurial alive. Sometimes technically better architecture and user interface lose to a non user friendly hard solutions due to inertia of mass adoption. So a lesson in Software development is similar to betamax and VHS, so marketing is still a winner over technically superior architecture and ease of use. GitHub successfully marketed git, so git and GitHub are synonymous for most developers. Now majority of open source projects are reliant on a single proprietary solution Github by Microsoft, for managing code and project. Can understand the difficulty of bitbucket, when Python language itself moved out of mercurial due to the same inertia. Hopefully gitlab can come out with mercurial support to migrate projects using it from bitbucket.” Another user comments that Mercurial support was the only reason for him to use Bitbucket when GitHub is miles ahead of Bitbucket. Now when it stops supporting Mercurial too, Bitbucket will end soon. The comment reads, “Mercurial support was the one reason for me to still use Bitbucket: there is no other Bitbucket feature I can think of that Github doesn't already have, while Github's community is miles ahead since everyone and their dog is already there. More importantly, Bitbucket leaves the migration to you (if I read the article correctly). Once I download my repo and convert it to git, why would I stay with the company that just made me go through an annoying (and often painful) process, when I can migrate to Github with the exact same command? And why isn't there a "migrate this repo to git" button right there? I want to believe that Bitbucket has smart people and that this choice is a good one. But I'm with you there - to me, this definitely looks like Bitbucket will die.” On Reddit, programming folks see this as a big change from Bitbucket as they are the major mercurial hosting provider. And they feel Bitbucket announced this at a pretty short notice and they require more time for migration. Apart from the developer community forums, on Atlassian community blog as well users have expressed displeasure. A team of scientists commented, “Let's get this straight : Bitbucket (offering hosting support for Mercurial projects) was acquired by Atlassian in September 2010. Nine years later Atlassian decides to drop Mercurial support and delete all Mercurial repositories. Atlassian, I hate you :-) The image you have for me is that of a harmful predator. We are a team of scientists working in a university. We don't have computer scientists, we managed to use a version control simple as Mercurial, and it was a hard work to make all scientists in our team to use a version control system (even as simple as Mercurial). We don't have the time nor the energy to switch to another version control system. But we will, forced and obliged. I really don't want to check out Github or something else to migrate our projects there, but we will, forced and obliged.” Atlassian Bitbucket, GitHub, and GitLab take collective steps against the Git ransomware attack Attackers wiped many GitHub, GitLab, and Bitbucket repos with ‘compromised’ valid credentials leaving behind a ransom note BitBucket goes down for over an hour
Read more
  • 0
  • 0
  • 34327

article-image-master-the-art-of-face-swapping-with-opencv-and-python-by-sylwek-brzeczkowski-developer-at-truststamp
Vincy Davis
12 Dec 2019
8 min read
Save for later

Master the art of face swapping with OpenCV and Python by Sylwek Brzęczkowski, developer at TrustStamp

Vincy Davis
12 Dec 2019
8 min read
No discussion on image processing can be complete without talking about OpenCV. Its 2500+ algorithms, extensive documentation and sample code are considered world-class for exploring real-time computer vision. OpenCV supports a wide variety of programming languages such as C++, Python, Java, etc., and is also available on different platforms including Windows, Linux, OS X, Android, and iOS. OpenCV-Python, the Python API for OpenCV is one of the most popular libraries used to solve computer vision problems. It combines the best qualities of OpenCV, C++ API, and the Python language. The OpenCV-Python library uses Numpy, which is a highly optimized library for numerical operations with a MATLAB-style syntax. This makes it easier to integrate the Python API with other libraries that use Numpy such as SciPy and Matplotlib. This is the reason why it is used by many developers to execute different computer vision experiments. Want to know more about OpenCV with Python? [box type="shadow" align="" class="" width=""]If you are interested in developing your computer vision skills, you should definitely master the algorithms in OpenCV 4 and Python explained in our book ‘Mastering OpenCV 4 with Python’ written by Alberto Fernández Villán. This book will help you build complete projects in relation to image processing, motion detection, image segmentation, and many other tasks by exploring the deep learning Python libraries and also by learning the OpenCV deep learning capabilities.[/box] At the PyData Warsaw 2018 conference, Sylwek Brzęczkowski walked through how to implement a face swap using OpenCV and Python. Face swaps are used by apps like Snapchat to dispense various face filters. Brzęczkowski is a Python developer at TrustStamp. Steps to implement face swapping with OpenCV and Python #1 Face detection using histogram of oriented gradients (HOG) Histogram of oriented gradients (HOG) is a feature descriptor that is used to detect objects in computer vision and image processing. Brzęczkowski demonstrated the working of a HOG using square patches which when hovered over an array of images produces a histogram of oriented gradients feature vectors. These feature vectors are then passed to the classifier to generate a result having the highest matching samples. In order to implement face detection using HOG in Python, the image needs to be imported using import OpenCV. Next a frontal face detector object is created for the loaded image detector=dlib.get_frontal_face_detector(). The detector then produces the vector with the detected face. #2 Facial landmark detection aka face alignment Face landmark detection is the process of finding points of interest in an image of a human face. When dlib is used for facial landmark detection, it returns 68 unique fashion landmarks for the whole face. After the first iteration of the algorithm, the value of T equals 0. This value increases linearly such that at the end of the iteration, T gets the value 10. The image evolved at this stage produces the ‘ground truth’, which means that the iteration can stop now. Due to this working, this stage of the process is also called as face alignment. To implement this stage, Brzęczkowski showed how to add a predictor in the Python program with the values shape_predictor_68_face_landmarks.dat such that it produces a model of around 100 megabytes. This process generally takes up a long time as we tend to pick the biggest clearer image for detection. #3 Finding face border using convex hull The convex hull is a set of points defined as the smallest convex polygon, which encloses all of the points in the set. This means that for a given set of points, the convex hull is the subset of these points such that all the given points are inside the subset. To find the face border in an image, we need to change the structure a bit. The structure is first passed to the convex hull function with return points to false, this means that we get an output of indexes. Brzęczkowski then exhibited the face border in the image in blue color using the find_convex_hull.py function. #4 Approximating nonlinear operations with linear operations In a linear filtering of an image, the value of an output pixel is a linear combination of the values of the pixels. Brzęczkowski put forth the example of Affine transformation which is a type of linear mapping method and is used to preserve points, straight lines, and planes. On the other hand, a non-linear filtering produces an output which is not a linear function of its input. He then goes on to unveil both the transitions using his own image. Brzęczkowski then advised users to check the website learnOpenCV.com to learn how to create a nonlinear operation with a linear one. #5 Finding triangles in an image using Delaunay triangulation A Delaunay triangulation subdivides a set of points in a plane into triangles such that the points become vertices of the triangles. This means that this method subdivides the space or the surface into triangles in such a way that if you look at any triangle on the image, it will not have another point inside the triangle. Brzęczkowski then demonstrates how the image developed in the previous stage contained “face points from which you can identify my teeth and then create sub div to the object, insert all these points that I created or all detected.” Next, he deploys Delaunay triangulation to produce a list of two angles. This list is then used to obtain the triangles in the image. Post this step, he uses the delaunay_triangulation.py function to generate these triangles on the images. #6 Blending one face into another To recap, we started from detecting a face using HOG and finding its border using convex hull, followed it by adding mouth points to indicate specific indexes. Next, Delaunay triangulation was implemented to obtain all the triangles on the images. Next, Brzęczkowski begins the blending of images using seamless cloning. A seamless cloning combines the attributes of other cloning methods to create a unique solution to allow “sequence-independent and scarless insertion of one or more fragments of DNA into a plasmid vector.” This cloning method also provides a variety of skin colors to choose from. Brzęczkowski then explains a feature called ‘pass on edit image’ in the Poisson image editing which uses the value of the gradients instead of the identities or the values of the pixels of the image. To implement the same method in OpenCV, he further demonstrates how information like source, destination, source image destination, mask and center (which is the location where the cloned part should be placed) is required to blend the two faces. Brzęczkowski then depicts a  string of illustrations to transform his image with the images of popular artists like Jamie Foxx, Clint Eastwood, and others. #7 Stabilization using optical flow with the Lucas-Kanade method In computer vision, the Lucas-Kanade method is a widely used differential method for optical flow estimation. It assumes that the flow is essentially constant in a local neighborhood of the pixel under consideration, and solves the basic optical flow equations for all the pixels in that neighborhood, by the least-squares criterion. Thus by combining information from several nearby pixels, the Lucas–Kanade method resolves the inherent ambiguity of the optical flow equation. This method is also less sensitive to noises in an image. By using this method to implement the stabilization of the face swapped image, it is assumed that the optical flow is essentially constant in a local neighborhood of the pixel under consideration in human language. This means that “if we have a red point in the center we assume that all the points around, let's say in this example is three on three pixels we assume that all of them have the same optical flow and thanks to that assumption we have nine equations and only two unknowns.” This makes the computation fairly easy to solve. By using this assumption the optical flow works smoothly if we have the previous gray position of the image. This means that for face swapping images using OpenCV, a user needs to have details of the previous points of the image along with the current points of the image. By combining all this information, the actual point becomes a combination of the detected landmark and the predicted landmark. Thus by implementing the Lucas-Kanade method for stabilizing the image, Brzęczkowski implements a non-shaky version of his face-swapped image. Watch Brzęczkowski’s full video to see a step-by-step implementation of a face-swapping task. You can learn advanced applications like facial recognition, target tracking, or augmented reality from our book, ‘Mastering OpenCV 4 with Python’ written by Alberto Fernández Villán. This book will also help you understand the application of artificial intelligence and deep learning techniques using popular Python libraries like TensorFlow and Keras. Getting to know PyMC3, a probabilistic programming framework for Bayesian Analysis in Python How to perform exception handling in Python with ‘try, catch and finally’ Implementing color and shape-based object detection and tracking with OpenCV and CUDA [Tutorial] OpenCV 4.0 releases with experimental Vulcan, G-API module and QR-code detector among others
Read more
  • 0
  • 0
  • 34232

article-image-selecting-statistical-based-features-in-machine-learning-application
Pravin Dhandre
14 Mar 2018
16 min read
Save for later

Selecting Statistical-based Features in Machine Learning application

Pravin Dhandre
14 Mar 2018
16 min read
In today’s tutorial, we will work on one of the methods of executing feature selection, the statistical-based method for interpreting both quantitative and qualitative datasets. Feature selection attempts to reduce the size of the original dataset by subsetting the original features and shortlisting the best ones with the highest predictive power. We may intelligently choose which feature selection method might work best for us, but in reality, a very valid way of working in this domain is to work through examples of each method and measure the performance of the resulting pipeline. To begin, let's take a look at the subclass of feature selection modules that are reliant on statistical tests to select viable features from a dataset. Statistical-based feature selections Statistics provides us with relatively quick and easy methods of interpreting both quantitative and qualitative data. We have used some statistical measures in previous chapters to obtain new knowledge and perspective around our data, specifically in that we recognized mean and standard deviation as metrics that enabled us to calculate z-scores and scale our data. In this tutorial, we will rely on two new concepts to help us with our feature selection: Pearson correlations hypothesis testing Both of these methods are known as univariate methods of feature selection, meaning that they are quick and handy when the problem is to select out single features at a time in order to create a better dataset for our machine learning pipeline. Using Pearson correlation to select features We have actually looked at correlations in this book already, but not in the context of feature selection. We already know that we can invoke a correlation calculation in pandas by calling the following method: credit_card_default.corr() The output of the preceding code produces is the following: As a continuation of the preceding table we have: The Pearson correlation coefficient (which is the default for pandas) measures the linear relationship between columns. The value of the coefficient varies between -1 and +1, where 0 implies no correlation between them. Correlations closer to -1 or +1 imply an extremely strong linear relationship. It is worth noting that Pearson’s correlation generally requires that each column be normally distributed (which we are not assuming). We can also largely ignore this requirement because our dataset is large (over 500 is the threshold). The pandas .corr() method calculates a Pearson correlation coefficient for every column versus every other column. This 24 column by 24 row matrix is very unruly, and in the past, we used heatmaps to try and make the information more digestible: # using seaborn to generate heatmaps import seaborn as sns import matplotlib.style as style # Use a clean stylizatino for our charts and graphs style.use('fivethirtyeight') sns.heatmap(credit_card_default.corr()) The heatmap generated will be as follows: Note that the heatmap function automatically chose the most correlated features to show us. That being said, we are, for the moment, concerned with the features correlations to the response variable. We will assume that the more correlated a feature is to the response, the more useful it will be. Any feature that is not as strongly correlated will not be as useful to us. Correlation coefficients are also used to determine feature interactions and redundancies. A key method of reducing overfitting in machine learning is spotting and removing these redundancies. We will be tackling this problem in our model-based selection methods. Let's isolate the correlations between the features and the response variable, using the following code: # just correlations between every feature and the response credit_card_default.corr()['default payment next month'] LIMIT_BAL -0.153520 SEX -0.039961 EDUCATION 0.028006 MARRIAGE -0.024339 AGE 0.013890 PAY_0 0.324794 PAY_2 0.263551 PAY_3 0.235253 PAY_4 0.216614 PAY_5 0.204149 PAY_6 0.186866 BILL_AMT1 -0.019644 BILL_AMT2 -0.014193 BILL_AMT3 -0.014076 BILL_AMT4 -0.010156 BILL_AMT5 -0.006760 BILL_AMT6 -0.005372 PAY_AMT1 -0.072929 PAY_AMT2 -0.058579 PAY_AMT3 -0.056250 PAY_AMT4 -0.056827 PAY_AMT5 -0.055124 PAY_AMT6 -0.053183 default payment next month 1.000000 We can ignore the final row, as is it is the response variable correlated perfectly to itself. We are looking for features that have correlation coefficient values close to -1 or +1. These are the features that we might assume are going to be useful. Let's use pandas filtering to isolate features that have at least .2 correlation (positive or negative). Let's do this by first defining a pandas mask, which will act as our filter, using the following code: # filter only correlations stronger than .2 in either direction (positive or negative) credit_card_default.corr()['default payment next month'].abs() > .2 LIMIT_BAL False SEX False EDUCATION False MARRIAGE False AGE False PAY_0 True PAY_2 True PAY_3 True PAY_4 True PAY_5 True PAY_6 False BILL_AMT1 False BILL_AMT2 False BILL_AMT3 False BILL_AMT4 False BILL_AMT5 False BILL_AMT6 False PAY_AMT1 False PAY_AMT2 False PAY_AMT3 False PAY_AMT4 False PAY_AMT5 False PAY_AMT6 False default payment next month True Every False in the preceding pandas Series represents a feature that has a correlation value between -.2 and .2 inclusive, while True values correspond to features with preceding correlation values .2 or less than -0.2. Let's plug this mask into our pandas filtering, using the following code: # store the features highly_correlated_features = credit_card_default.columns[credit_card_default.corr()['default payment next month'].abs() > .2] highly_correlated_features Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5', u'default payment next month'], dtype='object') The variable highly_correlated_features is supposed to hold the features of the dataframe that are highly correlated to the response; however, we do have to get rid of the name of the response column, as including that in our machine learning pipeline would be cheating: # drop the response variable highly_correlated_features = highly_correlated_features.drop('default payment next month') highly_correlated_features Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5'], dtype='object') So, now we have five features from our original dataset that are meant to be predictive of the response variable, so let's try it out with the help of the following code: # only include the five highly correlated features X_subsetted = X[highly_correlated_features] get_best_model_and_accuracy(d_tree, tree_params, X_subsetted, y) # barely worse, but about 20x faster to fit the model Best Accuracy: 0.819666666667 Best Parameters: {'max_depth': 3} Average Time to Fit (s): 0.01 Average Time to Score (s): 0.002 Our accuracy is definitely worse than the accuracy to beat, .8203, but also note that the fitting time saw about a 20-fold increase. Our model is able to learn almost as well as with the entire dataset with only five features. Moreover, it is able to learn as much in a much shorter timeframe. Let's bring back our scikit-learn pipelines and include our correlation choosing methodology as a part of our preprocessing phase. To do this, we will have to create a custom transformer that invokes the logic we just went through, as a pipeline-ready class. We will call our class the CustomCorrelationChooser and it will have to implement both a fit and a transform logic, which are: The fit logic will select columns from the features matrix that are higher than a specified threshold The transform logic will subset any future datasets to only include those columns that were deemed important from sklearn.base import TransformerMixin, BaseEstimator class CustomCorrelationChooser(TransformerMixin, BaseEstimator): def __init__(self, response, cols_to_keep=[], threshold=None): # store the response series self.response = response # store the threshold that we wish to keep self.threshold = threshold # initialize a variable that will eventually # hold the names of the features that we wish to keep self.cols_to_keep = cols_to_keep def transform(self, X): # the transform method simply selects the appropiate # columns from the original dataset return X[self.cols_to_keep] def fit(self, X, *_): # create a new dataframe that holds both features and response df = pd.concat([X, self.response], axis=1) # store names of columns that meet correlation threshold self.cols_to_keep = df.columns[df.corr()[df.columns[-1]].abs() > Self.threshold] # only keep columns in X, for example, will remove response Variable self.cols_to_keep = [c for c in self.cols_to_keep if c in X.columns] return self Let's take our new correlation feature selector for a spin, with the help of the following code: # instantiate our new feature selector ccc = CustomCorrelationChooser(threshold=.2, response=y) ccc.fit(X) ccc.cols_to_keep ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5'] Our class has selected the same five columns as we found earlier. Let's test out the transform functionality by calling it on our X matrix, using the following code: ccc.transform(X).head() The preceding code produces the following table as the output: We see that the transform method has eliminated the other columns and kept only the features that met our .2 correlation threshold. Now, let's put it all together in our pipeline, with the help of the following code: # instantiate our feature selector with the response variable set ccc = CustomCorrelationChooser(response=y) # make our new pipeline, including the selector ccc_pipe = Pipeline([('correlation_select', ccc), ('classifier', d_tree)]) # make a copy of the decisino tree pipeline parameters ccc_pipe_params = deepcopy(tree_pipe_params) # update that dictionary with feature selector specific parameters ccc_pipe_params.update({ 'correlation_select__threshold':[0, .1, .2, .3]}) print ccc_pipe_params #{'correlation_select__threshold': [0, 0.1, 0.2, 0.3], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]} # better than original (by a little, and a bit faster on # average overall get_best_model_and_accuracy(ccc_pipe, ccc_pipe_params, X, y) Best Accuracy: 0.8206 Best Parameters: {'correlation_select__threshold': 0.1, 'classifier__max_depth': 5} Average Time to Fit (s): 0.105 Average Time to Score (s): 0.003 Wow! Our first attempt at feature selection and we have already beaten our goal (albeit by a little bit). Our pipeline is showing us that if we threshold at 0.1, we have eliminated noise enough to improve accuracy and also cut down on the fitting time (from .158 seconds without the selector). Let's take a look at which columns our selector decided to keep: # check the threshold of .1 ccc = CustomCorrelationChooser(threshold=0.1, response=y) ccc.fit(X) # check which columns were kept ccc.cols_to_keep ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'] It appears that our selector has decided to keep the five columns that we found, as well as two more, the LIMIT_BAL and the PAY_6 columns. Great! This is the beauty of automated pipeline gridsearching in scikit-learn. It allows our models to do what they do best and intuit things that we could not have on our own. Feature selection using hypothesis testing Hypothesis testing is a methodology in statistics that allows for a bit more complex statistical testing for individual features. Feature selection via hypothesis testing will attempt to select only the best features from a dataset, just as we were doing with our custom correlation chooser, but these tests rely more on formalized statistical methods and are interpreted through what are known as p-values. A hypothesis test is a statistical test that is used to figure out whether we can apply a certain condition for an entire population, given a data sample. The result of a hypothesis test tells us whether we should believe the hypothesis or reject it for an alternative one. Based on sample data from a population, a hypothesis test determines whether or not to reject the null hypothesis. We usually use a p-value (a non-negative decimal with an upper bound of 1, which is based on our significance level) to make this conclusion. In the case of feature selection, the hypothesis we wish to test is along the lines of: True or False: This feature has no relevance to the response variable. We want to test this hypothesis for every feature and decide whether the features hold some significance in the prediction of the response. In a way, this is how we dealt with the correlation logic. We basically said that, if a column's correlation with the response is too weak, then we say that the hypothesis that the feature has no relevance is true. If the correlation coefficient was strong enough, then we can reject the hypothesis that the feature has no relevance in favor of an alternative hypothesis, that the feature does have some relevance. To begin to use this for our data, we will have to bring in two new modules: SelectKBest and f_classif, using the following code: # SelectKBest selects features according to the k highest scores of a given scoring function from sklearn.feature_selection import SelectKBest # This models a statistical test known as ANOVA from sklearn.feature_selection import f_classif # f_classif allows for negative values, not all do # chi2 is a very common classification criteria but only allows for positive values # regression has its own statistical tests SelectKBest is basically just a wrapper that keeps a set amount of features that are the highest ranked according to some criterion. In this case, we will use the p-values of completed hypothesis testings as a ranking. Interpreting the p-value The p-values are a decimals between 0 and 1 that represent the probability that the data given to us occurred by chance under the hypothesis test. Simply put, the lower the p-value, the better the chance that we can reject the null hypothesis. For our purposes, the smaller the p-value, the better the chances that the feature has some relevance to our response variable and we should keep it. The big take away from this is that the f_classif function will perform an ANOVA test (a type of hypothesis test) on each feature on its own (hence the name univariate testing) and assign that feature a p-value. The SelectKBest will rank the features by that p-value (the lower the better) and keep only the best k (a human input) features. Let's try this out in Python. Ranking the p-value Let's begin by instantiating a SelectKBest module. We will manually enter a k value, 5, meaning we wish to keep only the five best features according to the resulting p-values: # keep only the best five features according to p-values of ANOVA test k_best = SelectKBest(f_classif, k=5) We can then fit and transform our X matrix to select the features we want, as we did before with our custom selector: # matrix after selecting the top 5 features k_best.fit_transform(X, y) # 30,000 rows x 5 columns array([[ 2, 2, -1, -1, -2], [-1, 2, 0, 0, 0], [ 0, 0, 0, 0, 0], ..., [ 4, 3, 2, -1, 0], [ 1, -1, 0, 0, 0], [ 0, 0, 0, 0, 0]]) If we want to inspect the p-values directly and see which columns were chosen, we can dive deeper into the select k_best variables: # get the p values of columns k_best.pvalues_ # make a dataframe of features and p-values # sort that dataframe by p-value p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value') # show the top 5 features p_values.head() The preceding code produces the following table as the output: We can see that, once again, our selector is choosing the PAY_X columns as the most important. If we take a look at our p-value column, we will notice that our values are extremely small and close to zero. A common threshold for p-values is 0.05, meaning that anything less than 0.05 may be considered significant, and these columns are extremely significant according to our tests. We can also directly see which columns meet a threshold of 0.05 using the pandas filtering methodology: # features with a low p value p_values[p_values['p_value'] < .05] The preceding code produces the following table as the output: The majority of the columns have a low p-value, but not all. Let's see the columns with a higher p_value, using the following code: # features with a high p value p_values[p_values['p_value'] >= .05] The preceding code produces the following table as the output: These three columns have quite a high p-value. Let's use our SelectKBest in a pipeline to see if we can grid search our way into a better machine learning pipeline, using the following code: k_best = SelectKBest(f_classif) # Make a new pipeline with SelectKBest select_k_pipe = Pipeline([('k_best', k_best), ('classifier', d_tree)]) select_k_best_pipe_params = deepcopy(tree_pipe_params) # the 'all' literally does nothing to subset select_k_best_pipe_params.update({'k_best__k':range(1,23) + ['all']}) print select_k_best_pipe_params # {'k_best__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'all'], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]} # comparable to our results with correlationchooser get_best_model_and_accuracy(select_k_pipe, select_k_best_pipe_params, X, y) Best Accuracy: 0.8206 Best Parameters: {'k_best__k': 7, 'classifier__max_depth': 5} Average Time to Fit (s): 0.102 Average Time to Score (s): 0.002 It seems that our SelectKBest module is getting about the same accuracy as our custom transformer, but it's getting there a bit quicker! Let's see which columns our tests are selecting for us, with the help of the following code: k_best = SelectKBest(f_classif, k=7) # lowest 7 p values match what our custom correlationchooser chose before # ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'] p_values.head(7) The preceding code produces the following table as the output: They appear to be the same columns that were chosen by our other statistical method. It's possible that our statistical method is limited to continually picking these seven columns for us. There are other tests available besides ANOVA, such as Chi2 and others, for regression tasks. They are all included in scikit-learn's documentation. For more info on feature selection through univariate testing, check out the scikit-learn documentation here: http:/​/​scikit-​learn.​org/​stable/​modules/​feature_​selection. html#univariate-​feature-​selection Before we move on to model-based feature selection, it's helpful to do a quick sanity check to ensure that we are on the right track. So far, we have seen two statistical methods for feature selection that gave us the same seven columns for optimal accuracy. But what if we were to take every column except those seven? We should expect a much lower accuracy and worse pipeline overall, right? Let's make sure. The following code helps us to implement sanity checks: # sanity check # If we only the worst columns the_worst_of_X = X[X.columns.drop(['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])] # goes to show, that selecting the wrong features will # hurt us in predictive performance get_best_model_and_accuracy(d_tree, tree_params, the_worst_of_X, y) Best Accuracy: 0.783966666667 Best Parameters: {'max_depth': 5} Average Time to Fit (s): 0.21 Average Time to Score (s): 0.002 Hence, by selecting the columns except those seven, we see not only worse accuracy (almost as bad as the null accuracy), but also slower fitting times on average. We statistically selected features from the dataset for our machine learning pipeline. [box type="note" align="" class="" width=""]This article is an excerpt from a book Feature Engineering Made Easy co-authored by Sinan Ozdemir and Divya Susarla.  Do check out the book to get access to alternative techniques such as the model-based method to achieve optimum results from the machine learning application.[/box]      
Read more
  • 0
  • 0
  • 34086

article-image-jupyter-and-python-scripting
Packt
21 Oct 2016
9 min read
Save for later

Jupyter and Python Scripting

Packt
21 Oct 2016
9 min read
In this article by Dan Toomey, author of the book Learning Jupyter, we will see data access in Jupyter with Python and the effect of pandas on Jupyter. We will also see Python graphics and lastly Python random numbers. (For more resources related to this topic, see here.) Python data access in Jupyter I started a view for pandas using Python Data Access as the name. We will read in a large dataset and compute some standard statistics on the data. We are interested in seeing how we use pandas in Jupyter, how well the script performs, and what information is stored in the metadata (especially if it is a larger dataset). Our script accesses the iris dataset built into one of the Python packages. All we are looking to do is read in a slightly large number of items and calculate some basic operations on the dataset. We are really interested in seeing how much of the data is cached in the PYNB file. The Python code is: # import the datasets package from sklearn import datasets # pull in the iris data iris_dataset = datasets.load_iris() # grab the first two columns of data X = iris_dataset.data[:, :2] # calculate some basic statistics x_count = len(X.flat) x_min = X[:, 0].min() - .5 x_max = X[:, 0].max() + .5 x_mean = X[:, 0].mean() # display our results x_count, x_min, x_max, x_mean I broke these steps into a couple of cells in Jupyter, as shown in the following screenshot: Now, run the cells (using Cell | Run All) and you get this display below. The only difference is the last Out line where our values are displayed. It seemed to take longer to load the library (the first time I ran the script) than to read the data and calculate the statistics. If we look in the PYNB file for this notebook, we see that none of the data is cached in the PYNB file. We simply have code references to the library, our code, and the output from when we last calculated the script: { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(300, 3.7999999999999998, 8.4000000000000004, 5.8433333333333337)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate some basic statisticsn", "x_count = len(X.flat)n", "x_min = X[:, 0].min() - .5n", "x_max = X[:, 0].max() + .5n", "x_mean = X[:, 0].mean()n", "n", "# display our resultsn", "x_count, x_min, x_max, x_mean" ] } Python pandas in Jupyter One of the most widely used features of Python is pandas. pandas are built-in libraries of data analysis packages that can be used freely. In this example, we will develop a Python script that uses pandas to see if there is any effect to using them in Jupyter. I am using the Titanic dataset from http://www.kaggle.com/c/titanic-gettingStarted/download/train.csv. I am sure the same data is available from a variety of sources. Here is our Python script that we want to run in Jupyter: from pandas import * training_set = read_csv('train.csv') training_set.head() male = training_set[training_set.sex == 'male'] female = training_set[training_set.sex =='female'] womens_survival_rate = float(sum(female.survived))/len(female) mens_survival_rate = float(sum(male.survived))/len(male) The result is… we calculate the survival rates of the passengers based on sex. We create a new notebook, enter the script into appropriate cells, include adding displays of calculated data at each point and produce our results. Here is our notebook laid out where we added displays of calculated data at each cell,as shown in the following screenshot: When I ran this script, I had two problems: On Windows, it is common to use backslash ("") to separate parts of a filename. However, this coding uses the backslash as a special character. So, I had to change over to use forward slash ("/") in my CSV file path. I originally had a full path to the CSV in the above code example. The dataset column names are taken directly from the file and are case sensitive. In this case, I was originally using the 'sex' field in my script, but in the CSV file the column is named Sex. Similarly I had to change survived to Survived. The final script and result looks like the following screenshot when we run it: I have used the head() function to display the first few lines of the dataset. It is interesting… the amount of detail that is available for all of the passengers. If you scroll down, you see the results as shown in the following screenshot: We see that 74% of the survivors were women versus just 19% men. I would like to think chivalry is not dead! Curiously the results do not total to 100%. However, like every other dataset I have seen, there is missing and/or inaccurate data present. Python graphics in Jupyter How do Python graphics work in Jupyter? I started another view for this named Python Graphics so as to distinguish the work. If we were to build a sample dataset of baby names and the number of births in a year of that name, we could then plot the data. The Python coding is simple: import pandas import matplotlib %matplotlib inline baby_name = ['Alice','Charles','Diane','Edward'] number_births = [96, 155, 66, 272] dataset = list(zip(baby_name,number_births)) df = pandas.DataFrame(data = dataset, columns=['Name', 'Number']) df['Number'].plot() The steps of the script are as follows: We import the graphics library (and data library) that we need Define our data Convert the data into a format that allows for easy graphical display Plot the data We would expect a resultant graph of the number of births by baby name. Taking the above script and placing it into cells of our Jupyter node, we get something that looks like the following screenshot: I have broken the script into different cells for easier readability. Having different cells also allows you to develop the script easily step by step, where you can display the values computed so far to validate your results. I have done this in most of the cells by displaying the dataset and DataFrame at the bottom of those cells. When we run this script (Cell | Run All), we see the results at each step displayed as the script progresses: And finally we see our plot of the births as shown in the following screenshot. I was curious what metadata was stored for this script. Looking into the IPYNB file, you can see the expected value for the formula cells. The tabular data display of the DataFrame is stored as HTML—convenient: { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>n", "<table border="1" class="dataframe">n", "<thead>n", "<tr style="text-align: right;">n", "<th></th>n", "<th>Name</th>n", "<th>Number</th>n", "</tr>n", "</thead>n", "<tbody>n", "<tr>n", "<th>0</th>n", "<td>Alice</td>n", "<td>96</td>n", "</tr>n", "<tr>n", "<th>1</th>n", "<td>Charles</td>n", "<td>155</td>n", "</tr>n", "<tr>n", "<th>2</th>n", "<td>Diane</td>n", "<td>66</td>n", "</tr>n", "<tr>n", "<th>3</th>n", "<td>Edward</td>n", "<td>272</td>n", "</tr>n", "</tbody>n", "</table>n", "</div>" ], "text/plain": [ " Name Numbern", "0 Alice 96n", "1 Charles 155n", "2 Diane 66n", "3 Edward 272" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], The graphic output cell that is stored like this: { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x47cf8f0>" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "<a few hundred lines of hexcodes> …/wc/B0RRYEH0EQAAAABJRU5ErkJggg==n", "text/plain": [ "<matplotlib.figure.Figure at 0x47d8e30>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot the datan", "df['Number'].plot()n" ] } ], Where the image/png tag contains a large hex digit string representation of the graphical image displayed on screen (I abbreviated the display in the coding shown). So, the actual generated image is stored in the metadata for the page. Python random numbers in Jupyter For many analyses we are interested in calculating repeatable results. However, much of the analysis relies on some random numbers to be used. In Python, you can set the seed for the random number generator to achieve repeatable results with the random_seed() function. In this example, we simulate rolling a pair of dice and looking at the outcome. We would example the average total of the two dice to be 6—the halfway point between the faces. The script we are using is this: import pylab import random random.seed(113) samples = 1000 dice = [] for i in range(samples): total = random.randint(1,6) + random.randint(1,6) dice.append(total) pylab.hist(dice, bins= pylab.arange(1.5,12.6,1.0)) pylab.show() Once we have the script in Jupyter and execute it, we have this result: I had added some more statistics. Not sure if I would have counted on such a high standard deviation. If we increased the number of samples, this would decrease. The resulting graph was opened in a new window, much as it would if you ran this script in another Python development environment. The toolbar at the top of the graphic is extensive, allowing you to manipulate the graphic in many ways. Summary In this article, we walked through simple data access in Jupyter through Python. Then we saw an example of using pandas. We looked at a graphics example. Finally, we looked at an example using random numbers in a Python script. Resources for Article: Further resources on this subject: Python Data Science Up and Running [article] Mining Twitter with Python – Influence and Engagement [article] Unsupervised Learning [article]
Read more
  • 0
  • 0
  • 34017
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-ensemble-methods-optimize-machine-learning-models
Guest Contributor
07 Nov 2017
8 min read
Save for later

Ensemble Methods to Optimize Machine Learning Models

Guest Contributor
07 Nov 2017
8 min read
[box type="info" align="" class="" width=""]We are happy to bring you an elegant guest post on ensemble methods by Benjamin Rogojan, popularly known as The Seattle Data Guy.[/box] How do data scientists improve their algorithm’s accuracy or improve the robustness of a model? A method that is tried and tested is ensemble learning. It is a must know topic if you claim to be a data scientist and/or a machine learning engineer. Especially, if you are planning to go in for a data science/machine learning interview. Essentially, ensemble learning stays true to the meaning of the word ‘ensemble’. Rather than having several people who are singing at different octaves to create one beautiful harmony (each voice filling in the void of the other), ensemble learning uses hundreds to thousands of models of the same algorithm that work together to find the correct classification. Another way to think about ensemble learning is the fable of the blind men and the elephant. Each blind man in the story seeks to identify the elephant in front of them. However, they work separately and come up with their own conclusions about the animal. Had they worked in unison, they might have been able to eventually figure out what they were looking at. Similarly, ensemble learning utilizes the workings of different algorithms and combines them for a successful and optimal classification. Ensemble methods such as Boosting and Bagging have led to an increased robustness of statistical models with decreased variance. Before we begin with explaining the various ensemble methods, let us have a glance at the common bond between them, Bootstrapping. Bootstrap: The common glue Explaining Bootstrapping can occasionally be missed by many data scientists. However, an understanding of bootstrapping is essential as both the ensemble methods, Boosting and Bagging, are based on the concept of bootstrapping. Figure 1: Bootstrapping In machine learning terms, bootstrap method refers to random sampling with replacement. This sample, after replacement, is referred as a resample. This allows the model or algorithm to get a better understanding of the various biases, variances, and features that exist in the resample. Taking a sample of the data allows the resample to contain different characteristics which the sample might have contained. This would, in turn, affect the overall mean, standard deviation, and other descriptive metrics of a data set. Ultimately, leading to the development of more robust models. The above diagram depicts each sample population having different and non-identical pieces. Bootstrapping is also great for small size data sets that may have a tendency to overfit. In fact, we recommended this to one company who was concerned because their data sets were far from “Big Data”. Bootstrapping can be a solution in this case because algorithms that utilize bootstrapping are more robust and can handle new datasets depending on the methodology chosen (boosting or bagging). The bootstrap method can also test the stability of a solution. By using multiple sample data sets and then testing multiple models, it can increase robustness.  In certain cases, one sample data set may have a larger mean than another or a different standard deviation. This might break a model that was overfitted and not tested using data sets with different variations. One of the many reasons bootstrapping has become so common is because of the increase in computing power. This allows multiple permutations to be done with different resamples. Let us now move on to the most prominent ensemble methods: Bagging and Boosting. Ensemble Method 1: Bagging Bagging actually refers to Bootstrap Aggregators. Most papers or posts that explain bagging algorithms are bound to refer to Leo Breiman’s work, a paper published in 1996 called  “Bagging Predictors”. In the paper, Leo describes bagging as: “Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor.” Bagging helps reduce variance from models that are accurate only on the data they were trained on. This problem is also known as overfitting. Overfitting happens when a function fits the data too well. Typically this is because the actual equation is highly complicated to take into account each data point and the outlier. Figure 2: Overfitting Another example of an algorithm that can overfit easily is a decision tree. The models that are developed using decision trees require very simple heuristics. Decision trees are composed of a set of if-else statements done in a specific order. Thus, if the data set is changed to a new data set that might have some bias or difference in the spread of underlying features compared to the previous set, the model will fail to be as accurate as before. This is because the data will not fit the model well. Bagging gets around the overfitting problem by creating its own variance amongst the data. This is done by sampling and replacing data while it tests multiple hypotheses (models). In turn, this reduces the noise by utilizing multiple samples that would most likely be made up of data with various attributes (median, average, etc). Once each model has developed a hypothesis, the models use voting for classification or averaging for regression. This is where the “Aggregating” of the “Bootstrap Aggregating” comes into play. As in the figure shown below, each hypothesis has the same weight as all the others. (When we later discuss boosting, this is one of the places the two methodologies differ.) Figure 3: Bagging Essentially, all these models run at the same time and vote on the hypothesis which is the most accurate. This helps to decrease variance i.e. reduce the overfit. Ensemble Method 2: Boosting Boosting refers to a group of algorithms that utilize weighted averages to make weak learners into stronger learners. Unlike bagging (that has each model run independently and then aggregate the outputs at the end without preference to any model), boosting is all about “teamwork”. Each model that runs dictates what features the next model will focus on. Boosting also requires bootstrapping. However, there is another difference here. Unlike bagging, boosting weights each sample of data. This means some samples will be run more often than others. Why put weights on the samples of data? Figure 4: Boosting When boosting runs each model, it tracks which data samples are the most successful and which are not. The data sets with the most misclassified outputs are given heavier weights. This is because such data sets are considered to have more complexity. Thus, more iterations would be required to properly train the model. During the actual classification stage, boosting tracks the model's error rates to ensure that better models are given better weights. That way, when the “voting” occurs, like in bagging, the models with better outcomes have a stronger pull on the final output. Which of these ensemble methods is right for me? Ensemble methods generally out-perform a single model. This is why many Kaggle winners have utilized ensemble methodologies. Another important ensemble methodology, not discussed here, is stacking. Boosting and bagging are both great techniques to decrease variance. However, they won’t fix every problem, and they themselves have their own issues. There are different reasons why you would use one over the other. Bagging is great for decreasing variance when a model is overfitted. However, boosting is likely to be a better pick of the two methods. This is because it is also great for decreasing bias in an underfit model. On the other hand, boosting is likely to suffer performance issues. This is where experience and subject matter expertise comes in! It may seem easy to jump on the first model that works. However, it is important to analyze the algorithm and all the features it selects. For instance, a decision tree that sets specific leafs shouldn’t be implemented if it can’t be supported with other data points and visuals. It is not just about trying AdaBoost, or Random forests on various datasets. The final algorithm is driven depending on the results an algorithm is getting and the support provided. [author title="About the Author"] Benjamin Rogojan Ben has spent his career focused on healthcare data. He has focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben privately consults on data science and engineering problems both solo as well as with a company called Acheron Analytics. He has experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.[/author]
Read more
  • 0
  • 0
  • 33954

article-image-questions-tensorflow-2-0-tf-prebuilt-binaries-tensorboard-keras-python-support
Sugandha Lahoti
10 Dec 2019
5 min read
Save for later

#AskTensorFlow: Twitterati ask questions on TensorFlow 2.0 - TF prebuilt binaries, Tensorboard, Keras, and Python support

Sugandha Lahoti
10 Dec 2019
5 min read
TensorFlow 2.0 was released recently with tighter integration with Keras, eager execution enabled by default, three times faster training performance, a cleaned-up API, and more.  TensorFlow 2.0 had a major API Cleanup. Many API symbols are removed or renamed for better consistency and clarity. It now enables eager execution by default which effectively means that your TensorFlow code runs like numpy code. Keras has been introduced as the main high-level API to enable developers to easily leverage Keras’ various model-building APIs. TensorFlow 2.0 also has the SavedModel API that allows you to save your trained Machine learning model into a language-neutral format.  In May, Paige Bailey, Product Manager (TensorFlow) and Laurence Moroney,  Developer Advocate at Google sat down to discuss frequently asked questions on TensorFlow 2.0. They talked about TensorFlow prebuilt binaries, the TF 2.0 upgrade script, Tensorflow Datasets, and Python support. Can I ask about any prebuilt binary for the RTX 2080 GPU on Ubuntu 16?  Prebuilt binaries for TensorFlow tend to be associated with a specific driver from Nvidia. If you're taking a look at any of the prebuilt binaries, take a look at what driver or what version of the driver you have supported on that specific card. It's easy for you to go to the driver vendor and download the latest version. But that may not be the one that TensorFlow is built for or the one that it supports. So, just make sure that they actually match each other.  Do my TensorFlow scripts work with TensorFlow 2.0?  Generally, TensorFlow scripts do not work with TensorFlow 2.0. But TensorFlow 2.0 has created an upgrade utility that is automatically downloaded with TensorFlow 2.0. For more information, you can check out this medium blog post that Paige and her colleague Anna created. It shows how you can upgrade script on an end file - any arbitrary Python file or even Jupyter Notebooks. It'll give you an export.txt file that shows you all of the symbol renames, the added keywords, and then some manual changes.  When will TensorFlow be supported in Python 3.7 and hence be accessed in Anaconda 3? TensorFlow has made the commitment that as of January 1, 2020, they no longer support Python 2. They are firmly committed to Python 3 and Python 3 support.  Is it possible to run Tensorboard on colabs? You can run Tensorboard on colabs and do different operations like smoothing, changing some of the values, and using the embedding visualizer directly from your collab notebook in order to understand accuracies and to be able to model performance debugging. You also don't have to specify ports which means you need not remember to have multiple tensor board instances running. Tensorboard automatically selects one that would be a good candidate.  How would you use [TensorFlow’s] feature_columns with Keras? TensorFlow's feature_columns API is quite useful for non-numerical feature processing. Feature columns are a way of getting your data efficiently into Estimators and you can use them in Keras. TensorFlow 2.0 also has a migration guide if you wanted to migrate your models from using Estimators to being more of a TensorFlow 2.0 format with Keras.   What are some simple data sets for testing and comparing different training methods for artificial neural networks? Are there any in TensorFlow 2.0? Although MNIST and Fashion-MNIST are great, TensorFlow 2.0 also has TensorFlow Datasets which provide a collection of datasets ready to use with TensorFlow. It handles downloading and preparing the data and constructing a tf.data. TensorFlow Datasets is compatible with both TensorFlow Eager mode and Graph mode. Also, you can use them with all of your deep learning and machine learning models with just a few lines of code.  What about all the web developers who are new to AI, how does TensorFlow 2.0 help them get started? With TensorFlow 2.0, the web models that you create using saved model can be deployed to TFLite, or TensorFlow.js. The Keras layers are also supported in TensorFlow.js, so it's not just for Python developers but also for JS developers or even R developers.  You can watch Paige and Lawrence answering more questions in this three-part video series available on YouTube. Some of the other  questions asked were: Is there any TensorFlow.js transfer learning example for object detection? Are you going to publish the updated version of TensorFlow from Poets tutorial from Pete Warden implementing TF2.0. TFLite 2.0 and NN-API for faster inference on Android devices equipped with NPU/DSP? Will the frozen graph generated from TF 1.x work on TF 2.0? Which is the preferred format for saving the model GOIU forward saved_model (SM) or hd5? What is the purpose of keeping Estimators and Keras as separate APIs?  If you want to quickly start with building machine learning projects with TensorFlow 2.0, read our book TensorFlow 2.0 Quick Start Guide by Tony Holdroyd. In this book, you will get acquainted with some new practices introduced in TensorFlow 2.0. You will also learn to train your own models for effective prediction, using high-level Keras API.  TensorFlow.js contributor Kai Sasaki on how TensorFlow.js eases web-based machine learning application development Introducing Spleeter, a Tensorflow based python library that extracts voice and sound from any music track. TensorFlow 2.0 released with tighter Keras integration, eager execution enabled by default, and more! Brad Miro talks TensorFlow 2.0 features and how Google is using it internally
Read more
  • 0
  • 0
  • 33856

article-image-15-things-every-bi-professional-should-know-about-tableau
Fatema Patrawala
17 Dec 2019
8 min read
Save for later

15 things every BI professional should know about Tableau

Fatema Patrawala
17 Dec 2019
8 min read
“The art and practice of visualizing data is becoming ever more important in bridging the human-computer gap to mediate analytical insight in a meaningful way.” ―Edd Dumbill Tableau is a powerful data visualization and discovery tool. It is an important part of a data analyst or data scientist’s - skill set, with many organizations specifying it as a key skill in job adverts. In this article, we’ll take a look at few things in Tableau you need to know to successfully make a mark in your business intelligence career. While architecture of traditional BI tools has hardware limitations, Tableau does not have such dependencies and it can function independently and requires minimum hardware support. Traditional tools are based on a complex set of technologies when Tableau is based on Associative Search technology making it intuitive, fast and dynamic. Tableau supports in-memory, multi-thread and multi-core computing and more advanced capabilities while traditional BI tools do not offer such functionalities. Various Tableau products Tableau Desktop is a self service business analytics and data visualization suite that anyone can use. With tableau desktop, you can extract massive data offline from your data warehouse for live up to date data analysis. Tableau Online / Tableau Server is an online hosting platform designed for enterprise users. It lets users working on Tableau publish and share dashboards across organization and teams. Tableau Reader is a free desktop application that enables you to open and view visualizations that are built in Tableau Desktop. Tableau Public is a free Tableau software which you can use to make visualizations but you will need to save your workbook or worksheets in the Tableau Server for anyone else to view them. Different data types in Tableau All fields in a data source have a data type. The data type reflects the kind of information stored in that field, for example integers (410), dates (1/23/2015) and strings (“Wisconsin”). The data type of a field is identified in the Data pane by one of the icons shown below. Data type icons in Tableau Icon Data type Text (string) values Date values Date & Time values Numerical values Boolean values (relational only) for example True/False Geographic values (used with maps) Cluster Group   Source: Tableau website Measures and Dimensions in Tableau Measures contain numeric, quantitative values that you can measure. Measures can be aggregated. When you drag a measure into the view, Tableau applies an aggregation to that measure (by default). Dimensions, on the other hand, contain qualitative values (such as names, dates, or geographical data). You can use dimensions to categorize, segment, and reveal the details in your data. Dimensions affect the level of detail in the view. Ways to connect data in Tableau We can either connect live to your data set or extract data into Tableau. Live: Connecting live to a data set leverages its computational processing and storage. New queries will go to the database and will be reflected as new or updated within the data. Extract: The Extract API allows you to programmatically extract and combine any data sources for use in Tableau. There can be multiple data source connections to different sources in the same workbook. Each connection will show up under the Data tab on the left sidebar. The benefit of Tableau extract over live connection is that extract can be used anywhere without any connection and you can build your own visualization without connecting to database. You can read a complete section on how to extract data in Tableau from this book, Learning Tableau 2019 - Third Edition, written by Joshua Milligan. This book takes you through the foundations of the Tableau 2019 paradigm to the advanced topics.  Joins and Blends in Tableau Joining tables and blending data sources are two different ways to link related data together in Tableau. Joins are performed to link tables of data together on a row-by-row basis. Blends are performed to link together multiple data sources at an aggregate level.  Different filters in Tableau and different use cases in which these filters are more relevant than others In Tableau, filters are used to restrict the data from database. Often, you will want to filter data in Tableau in order to perform an analysis on a subset of data, narrow your focus, or drill into detail. Tableau offers multiple ways to filter data. If you want to limit the scope of your analysis to a subset of data, you can filter the data at the source using one of the following techniques: Data Source Filters are applied before all other filters and are useful when you want to limit your analysis to a subset of data. These filters are applied before any other filters. Extract Filters limit the data that is stored in an extract (.tde or .hyper). Data source filters are often converted into extract filters if they are present when you extract the data. Custom SQL Filters can be accomplished using a live connection with custom SQL, which has a Tableau parameter in the WHERE clause.    Dual axis in Tableau Dual Axis is an excellent phenomenon supported by Tableau that helps users view two scales of two measures in the same graph. Many websites like Indeed.com and other make use of dual axis to show the comparison between two measures and their growth rate in a septic set of years. Dual axis let you compare multiple measures at once, having two independent axis layered on top of one another.  Key components of a Tableau Dashboard Horizontal – Horizontal layout containers allow the designer to group worksheets and dashboard components left to right across your page and edit the height of all elements at once. Vertical – Vertical containers allow the user to group worksheets and dashboard components top to bottom down your page and edit the width of all elements at once. Text – All textual fields. Image Extract  – A Tableau workbook is in XML format. In order to extract images, Tableau applies some codes to extract an image which can be stored in XML. Web [URL ACTION] – A URL action is a hyperlink that points to a Web page, file, or other web-based resource outside of Tableau. You can use URL actions to link to more information about your data that may be hosted outside of your data source. To make the link relevant to your data, you can substitute field values of a selection into the URL as parameters. If you want to learn how to design dashboards in Tableau, this book Learning Tableau 2019, will give you a step by step process for designing dashboards.  Why automate reports in Tableau Once you have automated reporting, you’ll have time to spend on innovative projects. What can be done manually could be performed by automation, delivering the same results in a fraction of the time. Reducing such a time-consuming and repetitive task will make you more productive, and more efficient.  What is story in Tableau? Why would create a story and what are they used for? A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey information. You can create stories to show how facts are connected, provide context, demonstrate how decisions relate to outcomes, or simply make a compelling case. Each individual sheet in a story is called a story point. The primary objective of creating stories in Tableau is to communicate data to a certain audience with an intended result.  How can you create stories in Tableau? There is a feature in Tableau named as Stories that allows you to tell a story using interactive snapshots of dashboards and views. The snapshots become points in a story. This allows you to construct guided narrative or even an entire presentation. Read this chapter, ‘Telling a Data Story with Dashboards’ from this book, Learning Tableau 2019, to create insightful dashboards in Tableau.    How to embed views into Webpages? You can embed interactive Tableau views and dashboards into web pages, blogs, wiki pages, web applications, and intranet portals. Embedded views update as the underlying data changes, or as their workbooks are updated on Tableau Server. Embedded views follow the same licensing and permission restrictions used on Tableau Server. That is, to see a Tableau view that’s embedded in a web page, the person accessing the view must also have an account on Tableau Server. Alternatively, if your organization uses a core-based license on Tableau Server, a Guest account is available. This allows people in your organization to view and interact with Tableau views embedded in web pages without having to sign in to the server. Contact your server or site administrator to find out if the Guest user is enabled for the site you publish to.  What is Tableau Prep? Can we clean messy data with Tableau? Tableau Prep extends the Tableau platform with robust options for cleaning and structuring data for analysis in Tableau. In the same way that Tableau Desktop provides a hands-on, visual experience for visualizing and analyzing data, Tableau Prep provides a hands-on, visual experience for cleaning and shaping data. If you wish to know more about Tableau Prep or how to clean messy data to create powerful data visualizations and unlock intelligent business insights, read this book Learning Tableau 2019, written by Joshua N. Milligan. ‘Tableau Day’ highlights: Augmented Analytics, Tableau Prep Builder and Conductor, and more! Alteryx vs. Tableau: Choosing the right data analytics tool for your business How to do data storytelling well with Tableau [Video]
Read more
  • 0
  • 0
  • 33762

article-image-how-to-perform-audio-video-image-scraping-with-python
Amarabha Banerjee
08 Mar 2018
9 min read
Save for later

How to perform Audio-Video-Image Scraping with Python

Amarabha Banerjee
08 Mar 2018
9 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] A common practice in scraping is the download, storage, and further processing of media content (non-web pages or data files). This media can include images, audio, and video. To store the content locally (or in a service like S3) and to do it correctly, we need to know what is the type of media, and it isn’t enough to trust the file extension in the URL. Hence, we will learn how to download and correctly represent the media type based on information from the web server. Another common task is the generation of thumbnails of images, videos, or even a page of a website. We will examine several techniques of how to generate thumbnails and make website page screenshots. Many times these are used on a new website as thumbnail links to the scraped media which is stored locally. Finally, it is often the need to be able to transcode media, such as converting non-MP4 videos to MP4, or changing the bit-rate or resolution of a video. Another scenario is to extract only the audio from a video file. We won't look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg. It's a simple step from there to also transcode video with ffmpeg. Downloading media content from the web Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content. Getting ready There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of the scenarios in this chapter with downloading and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. Also, the example for this recipe is in the 04/01_download_image.py file. How to do it Here is how we proceed with the recipe: The URLUtility class can download content from a URL. The code in the recipe's file is the following: import const from util.urls import URLUtility util = URLUtility(const.ApodEclipseImage()) print(len(util.data)) When running this you will see the following output:  Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes 171014 The example reads 171014 bytes of data. How it works The URL is defined as a constant const.ApodEclipseImage() in the const module: def ApodEclipseImage(): return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg" The constructor of the URLUtility class has the following implementation: def __init__(self, url, readNow=True): """ Construct the object, parse the URL, and download now if specified""" self._url = url self._response = None self._parsed = urlparse(url) if readNow: self.read() The constructor stores the URL, parses it, and downloads the file with the read() method. The following is the code of the read() method: def read(self): self._response = urllib.request.urlopen(self._url) self._data = self._response.read() This function uses urlopen to get a response object, and then reads the stream and stores it as a property of the object. That data can then be retrieved using the data property: @property def data(self): self.ensure_response() return self._data The code then simply reports on the length of that data, with the value of 171014. There's more This class will be used for other tasks such as determining content types, filename, and extensions for those files. We will examine parsing of URLs for filenames next. Parsing a URL with urllib to get the filename When downloading content from a URL, we often want to save it in a file. Often it is good enough to save the file in a file with a name found in the URL. But the URL consists of a number of fragments, so how can we find the actual filename from the URL, especially where there are often many parameters after the file name? Getting ready We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py. How to do it Execute the recipe's file with your python interpreter. It will run the following code: util = URLUtility(const.ApodEclipseImage()) print(util.filename_without_ext) This results in the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The filename is: BT5643s How it works In the constructor for URLUtility, there is a call to urlib.parse.urlparse. The following demonstrates using the function interactively: >>> parsed = urlparse(const.ApodEclipseImage()) >>> parsed ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='') The ParseResult object contains the various components of the URL. The path element contains the path and the filename. The call to the .filename_without_ext property returns just the filename without the extension: @property def filename_without_ext(self): filename = os.path.splitext(os.path.basename(self._parsed.path))[0] return filename The call to os.path.basename returns only the filename portion of the path (including the extension). os.path.splittext() then separates the filename and the extension, and the function returns the first element of that tuple/list (the filename). There's more It may seem odd that this does not also return the extension as part of the filename. This is because we cannot assume that the content that we received actually matches the implied type from the extension. It is more accurate to determine this using headers returned by the web server. That's our next recipe. Determining the type of content for a URL When performing a GET requests for content from a web server, the web server will return a number of headers, one of which identities the type of the content from the perspective of the web server. In this recipe we learn to use that to determine what the web server considers the type of the content. Getting ready We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py. How to do it We proceed as follows: Execute the script for the recipe. It contains the following code: util = URLUtility(const.ApodEclipseImage()) print("The content type is: " + util.contenttype) With the following result: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The content type is: image/jpeg How it works The .contentype property is implemented as follows: @property def contenttype(self): self.ensure_response() return self._response.headers['content-type'] The .headers property of the _response object is a dictionary-like class of headers. The content-type key will retrieve the content-type specified by the server. This call to the ensure_response() method simply ensures that the .read() function has been executed. There's more The headers in a response contain a wealth of information. If we look more closely at the headers property of the response, we can see the following headers are returned: >>> response = urllib.request.urlopen(const.ApodEclipseImage()) >>> for header in response.headers: print(header) Date Server Last-Modified ETag Accept-Ranges Content-Length Connection Content-Type Strict-Transport-Security And we can see the values for each of these headers. >>> for header in response.headers: print(header + " ==> " + response.headers[header]) Date ==> Tue, 26 Sep 2017 19:31:41 GMT Server ==> WebServer/1.0 Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT ETag ==> "547bb44-29c06-5581275ce2b86" Accept-Ranges ==> bytes Content-Length ==> 171014 Connection ==> close Content-Type ==> image/jpeg Strict-Transport-Security ==> max-age=31536000; includeSubDomains Many of these we will not examine in this book, but for the unfamiliar it is good to know that they exist. Determining the file extension from a content type It is good practice to use the content-type header to determine the type of content, and to determine the extension to use for storing the content as a file. Getting ready We again use the URLUtility object that we created. The recipe's script is 04/04_determine_file_extension_from_contenttype.py):. How to do it Proceed by running the recipe's script. An extension for the media type can be found using the .extension property: util = URLUtility(const.ApodEclipseImage()) print("Filename from content-type: " + util.extension_from_contenttype) print("Filename from url: " + util.extension_from_url) This results in the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes Filename from content-type: .jpg Filename from url: .jpg This reports both the extension determined from the file type, and also from the URL. These can be different, but in this case they are the same. How it works The following is the implementation of the .extension_from_contenttype property: @property def extension_from_contenttype(self): self.ensure_response() map = const.ContentTypeToExtensions() if self.contenttype in map: return map[self.contenttype] return None The first line ensures that we have read the response from the URL. The function then uses a python dictionary, defined in the const module, which contains a dictionary of content types to extension: def ContentTypeToExtensions(): return { "image/jpeg": ".jpg", "image/jpg": ".jpg", "image/png": ".png" } If the content type is in the dictionary, then the corresponding value will be returned. Otherwise, None is returned. Note the corresponding property, .extension_from_url: @property def extension_from_url(self): ext = os.path.splitext(os.path.basename(self._parsed.path))[1] return ext This uses the same technique as the .filename property to parse the URL, but instead returns the [1] element, which represents the extension instead of the base filename. To summarize, we discussed how effectively we can scrap audio, video and image content from the web using Python. If you liked our post, be sure to check out Web Scraping with Python, which gives more information on performing web scraping efficiently with Python.
Read more
  • 0
  • 0
  • 33710
article-image-auto-generate-texts-shakespeare-writing-using-deep-recurrent-neural-networks
Savia Lobo
16 Feb 2018
6 min read
Save for later

How to auto-generate texts from Shakespeare writing using deep recurrent neural networks

Savia Lobo
16 Feb 2018
6 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from a book co-authored by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled as Natural Language Processing with Python Cookbook. This book will give unique recipes to know various aspects of performing Natural Language Processing with NLTK—a leading Python platform for NLP.[/box] Today we will learn to use deep recurrent neural networks (RNN) to predict the next character based on the given length of a sentence. This way of training a model is able to generate automated text continuously, which can imitate the writing style of the original writer with enough training on the number of epochs and so on. Getting ready... The Project Gutenberg eBook of the complete works of William Shakespeare's dataset is used to train the network for automated text generation. Data can be downloaded from http:// www.gutenberg.org/ for the raw file used for training: >>> from  future import print_function >>> import numpy as np >>> import random >>> import sys The following code is used to create a dictionary of characters to indices and vice-versa mapping, which we will be using to convert text into indices at later stages. This is because deep learning models cannot understand English and everything needs to be mapped into indices to train these models: >>> path = 'C:UsersprataDocumentsbook_codes NLP_DL shakespeare_final.txt' >>> text = open(path).read().lower() >>> characters = sorted(list(set(text))) >>> print('corpus length:', len(text)) >>> print('total chars:', len(characters)) >>> char2indices = dict((c, i) for i, c in enumerate(characters)) >>> indices2char = dict((i, c) for i, c in enumerate(characters)) How to do it… Before training the model, various preprocessing steps are involved to make it work. The following are the major steps involved: Preprocessing: Prepare X and Y data from the given entire story text file and converting them into indices vectorized format. Deep learning model training and validation: Train and validate the deep learning model. Text generation: Generate the text with the trained model. How it works... The following lines of code describe the entire modeling process of generating text from Shakespeare's writings. Here we have chosen character length. This needs to be considered as 40 to determine the next best single character, which seems to be very fair to consider. Also, this extraction process jumps by three steps to avoid any overlapping between two consecutive extractions, to create a dataset more fairly: # cut the text in semi-redundant sequences of maxlen characters >>> maxlen = 40 >>> step = 3 >>> sentences = [] >>> next_chars = [] >>> for i in range(0, len(text) - maxlen, step): ... sentences.append(text[i: i + maxlen]) ... next_chars.append(text[i + maxlen]) ... print('nb sequences:', len(sentences)) The following screenshot depicts the total number of sentences considered, 193798, which is enough data for text generation: The next code block is used to convert the data into a vectorized format for feeding into deep learning models, as the models cannot understand anything about text, words, sentences and so on. Initially, total dimensions are created with all zeros in the NumPy array and filled with relevant places with dictionary mappings: # Converting indices into vectorized format >>> X = np.zeros((len(sentences), maxlen, len(characters)), dtype=np.bool) >>> y = np.zeros((len(sentences), len(characters)), dtype=np.bool) >>> for i, sentence in enumerate(sentences): ... for t, char in enumerate(sentence): ... X[i, t, char2indices[char]] = 1 ... y[i, char2indices[next_chars[i]]] = 1 >>> from keras.models import Sequential >>> from keras.layers import Dense, LSTM,Activation,Dropout >>> from keras.optimizers import RMSprop The deep learning model is created with RNN, more specifically Long Short-Term Memory networks with 128 hidden neurons, and the output is in the dimensions of the characters. The number of columns in the array is the number of characters. Finally, the softmax function is used with the RMSprop optimizer. We encourage readers to try with other various parameters to check out how results vary: #Model Building >>> model = Sequential() >>> model.add(LSTM(128, input_shape=(maxlen, len(characters)))) >>> model.add(Dense(len(characters))) >>> model.add(Activation('softmax')) >>> model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01)) >>> print (model.summary()) As mentioned earlier, deep learning models train on number indices to map input to output (given a length of 40 characters, the model will predict the next best character). The following code is used to convert the predicted indices back to the relevant character by determining the maximum index of the character: # Function to convert prediction into index >>> def pred_indices(preds, metric=1.0): ... preds = np.asarray(preds).astype('float64') ... preds = np.log(preds) / metric ... exp_preds = np.exp(preds) ... preds = exp_preds/np.sum(exp_preds) ... probs = np.random.multinomial(1, preds, 1) ... return np.argmax(probs) The model will be trained over 30 iterations with a batch size of 128. And also, the diversity has been changed to see the impact on the predictions: # Train and Evaluate the Model >>> for iteration in range(1, 30): ... print('-' * 40) ... print('Iteration', iteration) ... model.fit(X, y,batch_size=128,epochs=1).. ... start_index = random.randint(0, len(text) - maxlen - 1) ... for diversity in [0.2, 0.7,1.2]: ... print('n----- diversity:', diversity) ... generated = '' ... sentence = text[start_index: start_index + maxlen] ... generated += sentence ... print('----- Generating with seed: "' + sentence + '"') ... sys.stdout.write(generated) ... for i in range(400): ... x = np.zeros((1, maxlen, len(characters))) ... for t, char in enumerate(sentence): ... x[0, t, char2indices[char]] = 1. ... preds = model.predict(x, verbose=0)[0] ... next_index = pred_indices(preds, diversity) ... pred_char = indices2char[next_index] ... generated += pred_char ... sentence = sentence[1:] + pred_char ... sys.stdout.write(pred_char) ... sys.stdout.flush() ... print("nOne combination completed n") The results are shown in the next screenshot to compare the first iteration (Iteration 1) and final iteration (Iteration 29). It is apparent that with enough training, the text generation seems to be much better than with Iteration 1: Text generation after Iteration 29 is shown in this image: Though the text generation seems to be magical, we have generated text using Shakespeare's writings, proving that with the right training and handling, we can imitate any style of writing of a particular writer. If you found this post useful, you may check out this book Natural Language Processing with Python Cookbook to analyze sentence structure and master lexical analysis, syntactic and semantic analysis, pragmatic analysis, and other NLP techniques.  
Read more
  • 0
  • 0
  • 33693

article-image-kriging-interpolation-geostatistics
Guest Contributor
15 Nov 2017
7 min read
Save for later

Using R to implement Kriging - A Spatial Interpolation technique for Geostatistics data

Guest Contributor
15 Nov 2017
7 min read
The Kriging interpolation technique is being increasingly used in geostatistics these days. But how does Kriging work to create a prediction, after all? To start with, Kriging is a method where the distance and direction between the sample data points indicate a spatial correlation. This correlation is then used to explain the different variations in the surface. In cases where the distance and direction give appropriate spatial correlation, Kriging will be able to predict surface variations in the most effective way. As such, we often see Kriging being used in Geology and Soil Sciences. Kriging generates an optimal output surface for prediction which it estimates based on a scattered set with z-values. The procedure involves investigating the z-values’ spatial behavior in an ‘interactive’ manner where advanced statistical relationships are measured (autocorrelation). Mathematically speaking, Kriging is somewhat similar to regression analysis and its whole idea is to predict the unknown value of a function at a given point by calculating the weighted average of all known functional values in the neighborhood. To get the output value for a location, we take the weighted sum of already measured values in the surrounding (all the points that we intend to consider around a specific radius), using a  formula such as the following: In a regression equation, λi would represent the weights of how far the points are from the prediction location. However, in Kriging, λi represent not just the weights of how far the measured points are from prediction location, but also how the measured points are arranged spatially around the prediction location. First, the variograms and covariance functions are generated to create the spatial autocorrelation of data. Then, that data is used to make predictions. Thus, unlike the deterministic interpolation techniques like Inverse Distance Weighted (IDW) and Spline interpolation tools, Kriging goes beyond just estimating a prediction surface. Here, it brings an element of certainty in that prediction surface. That is why experts rate kriging so highly for a strong prediction. Instead of a weather report forecasting a 2 mm rain on a certain Saturday, Kriging also tells you what is the "probability" of a 2 mm rain on that Saturday. We hope you enjoy this simple R tutorial on Kriging by Berry Boessenkool. Geostatistics: Kriging - spatial interpolation between points, using semivariance We will be covering following sections in our tutorial with supported illustrations: Packages read shapefile Variogram Kriging Plotting Kriging: packages install.packages("rgeos") install.packages("sf") install.packages("geoR") library(sf) # for st_read (read shapefiles), # st_centroid, st_area, st_union library(geoR) # as.geodata, variog, variofit, # krige.control, krige.conv, legend.krige ## Warning: package ’sf’ was built under R version 3.4.1 Kriging: read shapefile / few points for demonstration x <- c(1,1,2,2,3,3,3,4,4,5,6,6,6) y <- c(4,7,3,6,2,4,6,2,6,5,1,5,7) z <- c(5,9,2,6,3,5,9,4,8,8,3,6,7) plot(x,y, pch="+", cex=z/4) Kriging: read shapefile II GEODATA <- as.geodata(cbind(x,y,z)) plot(GEODATA) Kriging: Variogram I EMP_VARIOGRAM <- variog(GEODATA) ## variog: computing omnidirectional variogram FIT_VARIOGRAM <- variofit(EMP_VARIOGRAM) ## variofit: covariance model used is matern ## variofit: weights used: npairs ## variofit: minimisation function used: optim ## Warning in variofit(EMP_VARIOGRAM): initial values not provided - running the default search ## variofit: searching for best initial value ... selected values: ## sigmasq phi tausq kappa ## initial.value "9.19" "3.65" "0" "0.5" ## status "est" "est" "est" "fix" ## loss value: 401.578968904954 Kriging: Variogram II plot(EMP_VARIOGRAM) lines(FIT_VARIOGRAM) Kriging: Kriging res <- 0.1 grid <- expand.grid(seq(min(x),max(x),res), seq(min(y),max(y),res)) krico <- krige.control(type.krige="OK", obj.model=FIT_VARIOGRAM) krobj <- krige.conv(GEODATA, locations=grid, krige=krico) ## krige.conv: model with constant mean ## krige.conv: Kriging performed using global neighbourhood # KRigingObjekt Kriging: Plotting I image(krobj, col=rainbow2(100)) legend.krige(col=rainbow2(100), x.leg=c(6.2,6.7), y.leg=c(2,6), vert=T, off=-0.5, values=krobj$predict) contour(krobj, add=T) colPoints(x,y,z, col=rainbow2(100), legend=F) points(x,y) Kriging: Plotting II library("berryFunctions") # scatterpoints by color colPoints(x,y,z, add=F, cex=2, legargs=list(y1=0.8,y2=1)) Kriging: Plotting III colPoints(grid[ ,1], grid[ ,2], krobj$predict, add=F, cex=2, col2=NA, legargs=list(y1=0.8,y2=1)) Time for a real dataset Precipitation from ca 250 gauges in Brandenburg  as Thiessen Polygons with steep gradients at edges: Exercise 41: Kriging Load and plot the shapefile in PrecBrandenburg.zip with sf::st_read. With colPoints in the package berryFunctions, add the precipitation  values at the centroids of the polygons. Calculate the variogram and fit a semivariance curve. Perform kriging on a grid with a useful resolution (keep in mind that computing time rises exponentially  with grid size). Plot the interpolated  values with image or an equivalent (Rclick 4.15) and add contour lines. What went wrong? (if you used the defaults, the result will be dissatisfying.) How can you fix it? Solution for exercise 41.1-2: Kriging Data # Shapefile: p <- sf::st_read("data/PrecBrandenburg/niederschlag.shp", quiet=TRUE) # Plot prep pcol <- colorRampPalette(c("red","yellow","blue"))(50) clss <- berryFunctions::classify(p$P1, breaks=50)$index # Plot par(mar = c(0,0,1.2,0)) plot(p, col=pcol[clss], max.plot=1) # P1: Precipitation # kriging coordinates cent <- sf::st_centroid(p) berryFunctions::colPoints(cent$x, cent$y, p$P1, add=T, cex=0.7, legargs=list(y1=0.8,y2=1), col=pcol) points(cent$x, cent$y, cex=0.7) Solution for exercise 41.3: Variogram library(geoR) # Semivariance: geoprec <- as.geodata(cbind(cent$x,cent$y,p$P1)) vario <- variog(geoprec, max.dist=130000) ## variog: computing omnidirectional variogram fit <-variofit(vario) ## Warning in variofit(vario): initial values not provided - running the default search ## variofit: searching for best initial value ... selected values: ## sigmasq phi tausq kappa ## initial.value "1326.72" "19999.93" "0" "0.5" ## status "est" "est" "est" "fix" ## loss value: 107266266.76371 plot(vario) ; lines(fit) # distance to closest other point: d <- sapply(1:nrow(cent), function(i) min(berryFunctions::distance( cent$x[i], cent$y[i], cent$x[-i], cent$y[-i]))) hist(d/1000, breaks=20, main="distance to closest gauge [km]") mean(d) # 8 km ## [1] 8165.633 Solution for exercise 41.4-5: Kriging # Kriging: res <- 1000 # 1 km, since stations are 8 km apart on average grid <- expand.grid(seq(min(cent$x),max(cent$x),res), seq(min(cent$y),max(cent$y),res)) krico <- krige.control(type.krige="OK", obj.model=fit) krobj <- krige.conv(geoprec, locations=grid, krige=krico) ## krige.conv: model with constant mean ## krige.conv: Kriging performed using global neighbourhood # Set values outside of Brandenburg to NA: grid_sf <- sf::st_as_sf(grid, coords=1:2, crs=sf::st_crs(p)) isinp <- sapply(sf::st_within(grid_sf, p), length) > 0 krobj2 <- krobj krobj2$predict[!isinp] <- NA Solution for exercise 41.5: Kriging Visualization geoR:::image.kriging(krobj2, col=pcol) colPoints(cent$x, cent$y, p$P1, col=pcol, zlab="Prec", cex=0.7, legargs=list(y1=0.1,y2=0.8, x1=0.78, x2=0.87, horiz=F)) plot(p, add=T, col=NA, border=8)#; points(cent$x,cent$y, cex=0.7) [author title="About the author"]Berry started working with R in 2010 during his studies of Geoecology at Potsdam University, Germany. He has since then given a number of R programming workshops and tutorials, including full-week workshops in Kyrgyzstan and Kazachstan. He has left the department for environmental science in summer 2017 to focus more on software development and teaching in the data science industry. Please follow the Github link for detailed explanations on Berry’s R courses. [/author]
Read more
  • 0
  • 0
  • 33391

article-image-chatgpt-for-sql-queries
Chaitanya Yadav
20 Oct 2023
10 min read
Save for later

ChatGPT for SQL Queries

Chaitanya Yadav
20 Oct 2023
10 min read
Our Data Engineering Byte Newsletter gives data engineers and practitioners what they often lack today: clear, real-world insights—where every byte tells a story.Subscribe here to stay ahead in data engineeringIntroductionChatGPT is an efficient language that may be used in a range of tasks, including the creation of SQL queries. In this article, you will get to know how effectively you will be able to use SQL queries by using ChatGPT to optimize and craft them correctly to get perfect results.It is necessary to have sufficient SQL knowledge before you can use ChatGPT for the creation of SQL queries. The language that the databases are communicating with is SQL. This is meant to be used for the production, reading, updating, and deletion of data from databases. SQL is the most specialized language in this domain. It's one of the main components in a lot of existing applications because it deals with structured data that can be retrieved from tables.There are a number of different SQL queries, but some more common ones include the following:SELECT: It will select data from a database.INSERT: It will insert new data into a database.UPDATE: This query will update the existing data in a database.DELETE: This query is used to delete data from a database.Using ChatGPT to write SQL queriesOnce you have a basic understanding of SQL, you can start using ChatGPT to write SQL queries. To do this, you need to provide ChatGPT with a description of the query that you want to write. After that, ChatGPT will generate the SQL code for you.For example, you could just give ChatGPT the query below to write an SQL query to select all of the customers in your database.Select all of the customers in my databaseFollowing that, ChatGPT will provide the SQL code shown below:SELECT * FROM customers;The customer table's entire set of columns will be selected by this query. Additionally, ChatGPT can be used to create more complex SQL statements.How to Use ChatGPT to Describe Your IntentionsNow let’s have a look at some examples where we will ask ChatGPT to generate SQL code by asking it queries from our side.For Example:We'll be creating a sample database for ChatGPT, so we can ask them to set up restaurant databases and two tables.ChatGPT prompt:Create a sample database with two tables: GuestInfo and OrderRecords. The GuestInfo table should have the following columns: guest_id, first_name, last_name, email_address, and contact_number. The OrderRecords table should have the following columns: order_id, guest_id, product_id, quantity_ordered, and order_date.ChatGPT SQL Query Output:We requested that ChatGPT create a database and two tables in this example. After it generated a SQL query. The following SQL code is to be executed on the Management Studio software for SQL Server. As we are able to see the code which we got from ChatGPT successfully got executed in the SSMS Database software.How ChatGPT Can Be Used for Optimizing, Crafting, and Debugging Your QueriesSQL is an efficient tool to manipulate and interrogate data in the database. However, in particular, for very complex datasets it may be difficult to write efficient SQL queries. The ChatGPT Language Model is a robust model to help you with many tasks, such as optimizing SQL queries.Generating SQL queriesThe creation of SQL queries from Natural Language Statements is one of the most common ways that ChatGPT can be used for SQL optimization. Users who don't know SQL may find this helpful, as well as users who want to quickly create the query for a specific task.For example, you could ask for ChatGPT in the following way:Generate an SQL query to select all customers who have placed an order in the last month.ChatGPT would then generate the following query:SELECT * FROM customers WHERE order_date >= CURRENT_DATE - INTERVAL 1 MONTH;Optimizing existing queriesThe optimization of current SQL queries can also be achieved with ChatGPT. You can do this by giving ChatGPT the query that you want improved performance of and it will then suggest improvements to your query.For example, you could ask for ChatGPT in the following way:SELECT * FROM products WHERE product_name LIKE '%shirt%';ChatGPT might suggest the following optimizations:Add an index to the products table on the product_name column.Use a full-text search index on the product_name column.Use a more specific LIKE clause, such as WHERE product_name = 'shirt' if you know that the product name will be an exact match.Crafting queriesBy providing an interface between SQL and Natural Language, ChatGPT will be able to help with the drafting of complicated SQL queries. For users who are not familiar with SQL and need to create a quick query for a specific task, it can be helpful.For Example:Let's say we want to know which customers have placed an order within the last month, and spent more than $100 on it, then write a SQL query. The following query could be generated by using ChatGPT:SELECT * FROM customers WHERE order_date >= CURRENT_DATE - INTERVAL 1 MONTH AND order_total > 100;This query is relatively easy to perform, but ChatGPT can also be used for the creation of more complicated queries. For example, to select all customers who have placed an order in the last month and who have purchased a specific product, we could use ChatGPT to generate a query.SELECT * FROM customers WHERE order_date >= CURRENT_DATE - INTERVAL 1 MONTH AND order_items LIKE '%product_name%';Generating queries for which more than one table is involved can also be done with ChatGPT. For example, to select all customers who have placed an order in the last month and have also purchased a specific product from a specific category, we could use ChatGPT to generate a query.SELECT customers.*FROM customersINNER JOIN orders ON customers.id = orders.customer_idINNER JOIN order_items ON orders.id = order_items.order_idWHERE order_date >= CURRENT_DATE - INTERVAL 1 MONTHAND order_items_product_id = (SELECT id FROM products WHERE product_name = 'product_name')AND product_category_id = (SELECT id FROM product_categories WHERE category_name = 'category_name');The ChatGPT tool is capable of providing assistance with the creation of complex SQL queries. The ChatGPT feature facilitates users' writing efficient and accurate queries by providing an interface to SQL in a natural language.Debugging SQL queriesFor debugging SQL queries, the ChatGPT can also be used. To get started, you can ask ChatGPT to deliver a query that does not return the anticipated results. It will try to figure out why this is happening.For example, you could ask for ChatGPT in the following way:SELECT * FROM customers WHERE country = 'United States';Let's say that more results than expected are returned by this query. If there are multiple rows in a customer table, or the country column isn't being populated correctly for all clients, ChatGPT may suggest that something is wrong.How ChatGPT can help diagnose SQL query errors and suggest potential fixesYou may find that ChatGPT is useful for diagnosing and identifying problems, as well as suggesting possible remedies when you encounter errors or unexpected results in your SQL queries.To illustrate how ChatGPT could help you diagnose and correct SQL queries, we'll go over a hands-on example.Scenario: You'll be working with a database for Internet store transactions. The 'Products' table is where you would like to see the total revenue from a specific product named "Laptop". But you'll get unexpected results while running a SQL query.Your SQL Query:SELECT SUM(price) AS total_revenue FROM Products WHERE product_name = 'Laptop'; Issue: The query is not providing the expected results. You're not sure what went wrong.ChatGPT Assistance:Diagnosing the Issue:You can ask ChatGPT something like, "What could be the issue with my SQL query to calculate the total revenue of 'Laptop' from the Products table?"ChatGPT’s Response:The ChatGPT believes that the problem may arise from a WHERE clause. It suggests that because the names of products may not be distinctive, and there might be a lot of entries called 'Laptops', it is suggested to use ProductID rather than the product name. This query could be modified as follows:SELECT SUM(price) AS total_revenue FROM Products WHERE product_id = (SELECT product_id FROM Products WHERE product_name = 'Laptop');Explanation and Hands-on Practice:The reasoning behind this adjustment is explained by ChatGPT. In order to check if the revised query is likely to lead to an expected overall profit for a 'Laptop' product, you can then try running it.SELECT SUM(price) AS total_revenue FROM Products WHERE product_id = (SELECT product_id FROM Products WHERE product_name = 'Laptop');We have obtained the correct overall revenue from a 'Laptop' product with this query, which has resolved your unanticipated results issue.This hands-on example demonstrates how ChatGPT can help you diagnose and resolve your SQL problems, provide tailored suggestions, explain the solutions to fix them, and guide you through the process of strengthening your SQL skills by using practical applications.ConclusionIn conclusion, this article provides insight into the important role that ChatGPT plays when it comes to generating efficient SQL queries. In view of the key role played by SQL in database management for structured data, which is essential to modern applications, it stressed that there should be a solid knowledge base on SQL so as to effectively use ChatGPT when creating queries. We explored how ChatGPT could help you generate, optimize, and analyze SQL queries by presenting practical examples and use cases.It explains to users how ChatGPT is able to diagnose SQL errors and propose a solution, which in the end can help them solve unforeseen results and improve their ability to use SQL. In today's data-driven world where effective data manipulation is a necessity, ChatGPT becomes an essential ally for those who seek to speed up the SQL query development process, enhance accuracy, and increase productivity. It will open up new possibilities for data professionals and developers, allowing them to interact more effectively with databases.If you want to learn more about SQL, you can read this book: SQL for Data AnalyticsThis book goes beyond basic SQL and teaches you how to analyze real-world data using advanced techniques like joins, window functions, and statistical analysis. It includes hands-on exercises and case studies to help you build practical, job-ready data analytics skills.Author BioChaitanya Yadav is a data analyst, machine learning, and cloud computing expert with a passion for technology and education. He has a proven track record of success in using technology to solve real-world problems and help others to learn and grow. He is skilled in a wide range of technologies, including SQL, Python, data visualization tools like Power BI, and cloud computing platforms like Google Cloud Platform. He is also 22x Multicloud Certified.In addition to his technical skills, he is also a brilliant content creator, blog writer, and book reviewer. He is the Co-founder of a tech community called "CS Infostics" which is dedicated to sharing opportunities to learn and grow in the field of IT.
Read more
  • 0
  • 0
  • 33339
article-image-iclr-2019-highlights-algorithmic-fairness-ai-for-social-good-climate-change-protein-structures-gan-magic-adversarial-ml-and-much-more
Amrata Joshi
09 May 2019
7 min read
Save for later

ICLR 2019 Highlights: Algorithmic fairness, AI for social good, climate change, protein structures, GAN magic, adversarial ML and much more

Amrata Joshi
09 May 2019
7 min read
The ongoing ICLR 2019 (International Conference on Learning Representations) has brought a pack full of surprises and key specimens of innovation. The conference started on Monday, this week and it’s already the last day today! This article covers the highlights of ICLR 2019 and introduces you to the ongoing research carried out by experts in the field of deep learning, data science, computational biology, machine vision, speech recognition, text understanding, robotics and much more. The team behind ICLR 2019, invited papers based on Unsupervised objectives for agents, Curiosity and intrinsic motivation, Few shot reinforcement learning, Model-based planning and exploration, Representation learning for planning, Learning unsupervised goal spaces, Unsupervised skill discovery and Evaluation of unsupervised agents. https://twitter.com/alfcnz/status/1125399067490684928 ICLR 2019, sponsored by Google marks the presence of 200 researchers contributing to and learning from the academic research community by presenting papers and posters. ICLR 2019 Day 1 highlights: Neural network, Algorithmic fairness, AI for social good and much more Algorithmic fairness https://twitter.com/HanieSedghi/status/1125401294880083968 The first day of the conference started with a talk on Highlights of Recent Developments in Algorithmic Fairness by Cynthia Dwork, an American computer scientist at Harvard University. She focused on "group fairness" notions that address the relative treatment of different demographic groups. And she talked on research in the ML community that explores fairness via representations. The investigation of scoring, classifying, ranking, and auditing fairness was also discussed in this talk by Dwork. Generating high fidelity images with Subscale Pixel Networks and Multidimensional Upscaling https://twitter.com/NalKalchbrenner/status/1125455415553208321 Jacob Menick, a senior research engineer at Google, Deep Mind and Nal Kalchbrenner, staff research scientist and co-creator of the Google Brain Amsterdam research lab talked on Generating high fidelity images with Subscale Pixel Networks and Multidimensional Upscaling. They talked about the challenges involved in generating the images and how they address this issue with the help of Subscale Pixel Network (SPN). It is a conditional decoder architecture that helps in generating an image as a sequence of image slices of equal size. They also explained how Multidimensional Upscaling is used to grow an image in both size and depth via intermediate stages corresponding to distinct SPNs. There were in all 10 workshops conducted on the same day based on AI and deep learning covering topics such as, The 2nd Learning from Limited Labeled Data (LLD) Workshop: Representation Learning for Weak Supervision and Beyond Deep Reinforcement Learning Meets Structured Prediction AI for Social Good Debugging Machine Learning Models The first day also witnessed a few interesting talks on neural networks covering topics such as The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, How Powerful are Graph Neural Networks? etc. Overall the first day was quite enriching and informative. ICLR 2019 Day 2 highlights: AI in climate change, Protein structure, adversarial machine learning, CNN models and much more AI’s role in climate change https://twitter.com/natanielruizg/status/1125763990158807040 Tuesday, also the second day of the conference, started with an interesting talk on Can Machine Learning Help to Conduct a Planetary Healthcheck? by Emily Shuckburgh, a Climate scientist and deputy head of the Polar Oceans team at the British Antarctic Survey. She talked about the sophisticated numerical models of the Earth’s systems which have been developed so far based on physics, chemistry and biology. She then highlighted a set of "grand challenge" problems and discussed various ways in which Machine Learning is helping to advance our capacity to address these. Protein structure with a differentiable simulator On the second day of ICLR 2019, Chris Sander, computational biologist, John Ingraham, Adam J Riesselman, and Debora Marks from Harvard University, talked on Learning protein structure with a differentiable simulator. They about the protein folding problem and their aim to bridge the gap between the expressive capacity of energy functions and the practical capabilities of their simulators by using an unrolled Monte Carlo simulation as a model for data. They also composed a neural energy function with a novel and efficient simulator which is based on Langevin dynamics for building an end-to-end-differentiable model of atomic protein structure given amino acid sequence information. They also discussed certain techniques for stabilizing backpropagation and demonstrated the model's capacity to make multimodal predictions. Adversarial Machine Learning https://twitter.com/natanielruizg/status/1125859734744117249 Day 2 was long and had Ian Goodfellow, a machine learning researcher and inventor of GANs, to talk on Adversarial Machine Learning. He talked about supervised learning works and making machine learning private, getting machine learning to work for new tasks and also reducing the dependency on large amounts of labeled data. He then discussed how the adversarial techniques in machine learning are involved in the latest research frontiers. Day 2 covered poster presentation and a few talks on Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset,  Learning to Remember More with Less Memorization, Learning to Remember More with Less Memorization, etc. ICLR 2019 Day 3 highlights: GAN, Autonomous learning and much more Developmental autonomous learning: AI, Cognitive Sciences and Educational Technology https://twitter.com/drew_jaegle/status/1125522499150721025 Day 3 of ICLR 2019 started with Pierre-Yves Oudeyer’s, research director at Inria talk on Developmental Autonomous Learning: AI, Cognitive Sciences and Educational Technology. He presented a research program that focuses on computational modeling of child development and learning mechanisms. He then discussed the several developmental forces that guide exploration in large real-world spaces. He also talked about the models of curiosity-driven autonomous learning that enables machines to sample and explore their own goals and learning strategies. He then explained how these models and techniques can be successfully applied in the domain of educational technologies. Generating knockoffs for feature selection using Generative Adversarial Networks (GAN) Another interesting topic on the third day of ICLR 2019 was Generating knockoffs for feature selection using Generative Adversarial Networks (GAN) by James Jordon from Oxford University, Jinsung Yoon from California University, and Mihaela Schaar Professor at UCLA. The experts talked about the Generative Adversarial Networks framework that helps in generating knockoffs with no assumptions on the feature distribution. They also talked about the model they created which consists of 4 networks, a generator, a discriminator, a stability network and a power network. They further demonstrated the capability of their model to perform feature selection. Followed by few more interesting topics like Deterministic Variational Inference for Robust Bayesian Neural Networks, there were series of poster presentations. ICLR 2019 Day 4 highlights: Neural networks, RNN, neuro-symbolic concepts and much more Learning natural language interfaces with neural models Today’s focus was more on neural models and neuro symbolic concepts. The day started with a talk on Learning natural language interfaces with neural models by Mirella Lapata, a computer scientist. She gave an overview of recent progress on learning natural language interfaces which allow users to interact with various devices and services using everyday language. She also addressed the structured prediction problem of mapping natural language utterances onto machine-interpretable representations. She further outlined the various challenges it poses and described a general modeling framework based on neural networks which tackle these challenges. Ordered neurons: Integrating tree structures into Recurrent Neural Networks https://twitter.com/mmjb86/status/1126272417444311041 The next interesting talk was on Ordered neurons: Integrating tree structures into Recurrent Neural Networks by Professors Yikang Shen, Aaron Courville and Shawn Tan from Montreal University, and, Alessandro Sordoni, a researcher at Microsoft. In this talk, the experts focused on how they proposed a new RNN unit: ON-LSTM, which achieves good performance on four different tasks including language modeling, unsupervised parsing, targeted syntactic evaluation, and logical inference. The last day of ICLR 2019 was exciting and helped the researchers present their innovations and attendees got a chance to interact with the experts. To have a complete overview of each of these sessions, you can head over to ICLR’s Facebook page. Paper in Two minutes: A novel method for resource efficient image classification Google I/O 2019 D1 highlights: smarter display, search feature with AR capabilities, Android Q, linguistically advanced Google lens and more Google I/O 2019: Flutter UI framework now extended for Web, Embedded, and Desktop
Read more
  • 0
  • 0
  • 33272

article-image-why-choose-opencv-over-matlab-for-your-next-computer-vision-project
Vincy Davis
20 Dec 2019
6 min read
Save for later

Why choose OpenCV over MATLAB for your next Computer Vision project

Vincy Davis
20 Dec 2019
6 min read
Scientific Computing relies on executing computer algorithms coded in different programming languages. One such interdisciplinary scientific field is the study of Computer Vision, often abbreviated as CV. Computer Vision is used to develop techniques that can automate tasks like acquiring, processing, analyzing and understanding digital images. It is also utilized for extracting high-dimensional data from the real world to produce symbolic information. In simple words, Computer Vision gives computers the ability to see, understand and process images and videos like humans. The vast advances in hardware, machine learning tools, and frameworks have resulted in the implementation of Computer Vision in various fields like IoT, manufacturing, healthcare, security, etc. Major tech firms like Amazon, Google, Microsoft, and Facebook are investing immensely in the research and development of this field. Out of the many tools and libraries available for Computer Vision nowadays, there are two major tools OpenCV and Matlab that stand out in terms of their speed and efficiency. In this article, we will have a detailed look at both of them. Further Reading [box type="shadow" align="" class="" width=""]To learn how to build interesting image recognition models like setting up license plate recognition using OpenCV, read the book “Computer Vision Projects with OpenCV and Python 3” by author Matthew Rever. The book will also guide you to design and develop production-grade Computer Vision projects by tackling real-world problems.[/box] OpenCV: An open-source multiplatform solution tailored for Computer Vision OpenCV, developed by Intel and now supported by Willow Garage, is released under the BSD 3-Clause license and is free for commercial use. It is one of the most popular computer vision tools aimed at providing a well-optimized, well tested, and open-source (C++)-based implementation for computer vision algorithms. The open-source library has interfaces for multiple languages like C++, Python, and Java and supports Linux, macOS, Windows, iOS, and Android. Many of its functions are implemented on GPU. The first stable release of OpenCV version 1.0 was in the year 2006. The OpenCV community has grown rapidly ever since and with its latest release, OpenCV version 4.1.1, it also brings improvements in the dnn (Deep Neural Networks) module, which is a popular module in the library that implements forward pass (inferencing) with deep networks, which are pre-trained using popular deep learning frameworks.  Some of the features offered by OpenCV include: imread function to read the images in the BGR (Blue-Green-Red) format by default. Easy up and downscaling for resizing an image. Supports various interpolation and downsampling methods like INTER_NEAREST to represent the nearest neighbor interpolation. Supports multiple variations of thresholding like adaptive thresholding, bitwise operations, edge detection, image filtering, image contours, and more. Enables image segmentation (Watershed Algorithm) to classify each pixel in an image to a particular class of background and foreground. Enables multiple feature-matching algorithms, like brute force matching, knn feature matching, among others. With its active community and regular updates for Machine Learning, OpenCV is only going to grow by leaps and bounds in the field of Computer Vision projects.  MATLAB: A licensed quick prototyping tool with OpenCV integration One disadvantage of OpenCV, which makes novice computer vision users tilt towards Matlab is the former's complex nature. OpenCV is comparatively harder to learn due to lack of documentation and error handling codes. Matlab, developed by MathWorks is a proprietary programming language with a multi-paradigm numerical computing environment. It has over 3 million users worldwide and is considered one of the easiest and most productive software for engineers and scientists. It has a very powerful and swift matrix library.  Matlab also works in integration with OpenCV. This enables MATLAB users to explore, analyze, and debug designs that incorporate OpenCV algorithms. The support package of MATLAB includes the data type conversions necessary for MATLAB and OpenCV. MathWorks provided Computer Vision Toolbox renders algorithms, functions, and apps for designing and testing computer vision, 3D vision, and video processing systems. It also allows detection, tracking, feature extraction, and matching of objects. Matlab can also train custom object detectors using deep learning and machine learning algorithms such as YOLO v2, Faster R-CNN, and ACF. Most of the toolbox algorithms in Matlab support C/C++ code generation for integrating with existing code, desktop prototyping, and embedded vision system deployment. However, Matlab does not contain as many functions for computer vision as OpenCV, which has more of its functions implemented on GPU. Another issue with Matlab is that it's not open-source, it’s license is costly and the programs are not portable.  Another important factor which matters a lot in computer vision is the performance of a code, especially when working on real-time video processing.  Which has a faster execution time? OpenCV or Matlab? Along with Computer Vision, other fields also require faster execution while choosing a programming language or library for implementing any function. This factor is analyzed in detail in a paper titled “Matlab vs. OpenCV: A Comparative Study of Different Machine Learning Algorithms”.  The paper provides a very practical comparative study between Matlab and OpenCV using 20 different real datasets. The differentiation is based on the execution time for various machine learning algorithms like Classification and Regression Trees (CART), Naive Bayes, Boosting, Random Forest and K-Nearest Neighbor (KNN). The experiments were run on an Intel core 2 duo P7450 machine, with 3GB RAM, and Ubuntu 11.04 32-bit operating system on Matlab version 7.12.0.635 (R2011a), and OpenCV C++ version 2.1.  The paper states, “To compare the speed of Matlab and OpenCV for a particular machine learning algorithm, we run the algorithm 1000 times and take the average of the execution times. Averaging over 1000 experiments is more than necessary since convergence is reached after a few hundred.” The outcome of all the experiments revealed that though Matlab is a successful scientific computing environment, it is outrun by OpenCV for almost all the experiments when their execution time is considered. The paper points out that this could be due to a combination of a number of dimensionalities, sample size, and the use of training sets. One of the listed machine learning algorithms KNN produced a log time ratio of 0.8 and 0.9 on datasets D16 and D17 respectively.  Clearly, Matlab is great for exploring and fiddling with computer vision concepts as researchers and students at universities that can afford the software. However, when it comes to building production-ready real-world computer vision projects, OpenCV beats Matlab hand down. You can learn about building more Computer Vision projects like human pose estimation using TensorFlow from our book ‘Computer Vision Projects with OpenCV and Python 3’. Master the art of face swapping with OpenCV and Python by Sylwek Brzęczkowski, developer at TrustStamp NVIDIA releases Kaolin, a PyTorch library to accelerate research in 3D computer vision and AI Generating automated image captions using NLP and computer vision [Tutorial] Computer vision is growing quickly. Here’s why. Introducing Intel’s OpenVINO computer vision toolkit for edge computing
Read more
  • 0
  • 0
  • 33178
Modal Close icon
Modal Close icon