How-To Tutorials

article-image-creating-2d-3d-plots-using-matplotlib

22 Mar 2018

10 min read

Creating 2D and 3D plots using Matplotlib

22 Mar 2018

0
0
91754

article-image-how-to-secure-data-in-salesforce-einstein-analytics

Amey Varangaonkar

22 Mar 2018

5 min read

How to secure data in Salesforce Einstein Analytics

Amey Varangaonkar

22 Mar 2018

5 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Learning Einstein Analytics written by Santosh Chitalkar. This book includes techniques to build effective dashboards and Business Intelligence metrics to gain useful insights from data.[/box] Before getting into security in Einstein Analytics, it is important to set up your organization, define user types so that it is available to use. In this article we explore key aspects of security in Einstein Analytics. The following are key points to consider for data security in Salesforce: Salesforce admins can restrict access to data by setting up field-level security and object-level security in Salesforce. These settings prevent data flow from loading sensitive Salesforce data into a dataset. Dataset owners can restrict data access by using row-level security. Analytics supports security predicates, a robust row-level security feature that enables you to model many different types of access control on datasets. Analytics also supports sharing inheritance. Take a look at the following diagram: Salesforce data security In Einstein Analytics, dataflows bring the data to the Analytics Cloud from Salesforce. It is important that Einstein Analytics has all the necessary permissions and access to objects as well as fields. If an object or a field is not accessible to Einstein then the data flow fails and it cannot extract data from Salesforce. So we need to make sure that the required access is given to the integration user and security user. We can configure the permission set for these users. Let’s configure permissions for an integration user by performing the following steps: Switch to classic mode and enter Profiles in the Quick Find / Search… box Select and clone the Analytics Cloud Integration User profile and Analytics Cloud Security User profile for the integration user and security user respectively: Save the cloned profiles and then edit them Set the permission to Read for all objects and fields Save the profile and assign it to users Take a look at the following diagram: Data pulled from Salesforce can be made secure from both sides: Salesforce as well as Einstein Analytics. It is important to understand that Salesforce and Einstein Analytics are two independent databases. So, a user security setting given to Einstein will not affect the data in Salesforce. There are the following ways to secure data pulled from Salesforce: Salesforce Security Einstein Analytics Security Roles and profiles Inheritance security Organization-Wide Defaults (OWD) and record ownership Security predicates Sharing rules Application-level security Sharing mechanism in Einstein All Analytics users start off with Viewer access to the default Shared App that’s available out-of-the-box; administrators can change this default setting to restrict or extend access. All other applications created by individual users are private, by default; the application owner and administrators have Manager access and can extend access to other Users, groups, or roles. The following diagram shows how the sharing mechanism works in Einstein Analytics: Here’s a summary of what users can do with Viewer, Editor, and Manager access: Action / Access level Viewer Editor Manager View dashboards, lenses, and datasets in the application. If the underlying dataset is in a different application than a lens or dashboard, the user must have access to both applications to view the lens or dashboard. Yes Yes Yes See who has access to the application. Yes Yes Yes Save contents of the application to another application that the user has Editor or Manager access to. Yes Yes Yes Save changes to existing dashboards, lenses, and datasets in the application (saving dashboards requires the appropriate permission set license and permission). Yes Yes Change the application’s sharing settings. Yes Rename the application. Yes Delete the application. Yes Confidentiality, integrity, and availability together are referred to as the CIA Triad and it is designed to help organizations decide what security policies to implement within the organization. Salesforce knows that keeping information private and restricting access by unauthorized users is essential for business. By sharing the application, we can share a lens, dashboard, and dataset all together with one click. To share the entire application, do the following: Go to your Einstein Analytics and then to Analytics Studio Click on the APPS tab and then the icon for your application that you want to share, as shown in the following screenshot: 3. Click on Share and it will open a new popup window, as shown in the following screenshot: Using this window, you can share the application with an individual user, a group of users, or a particular role. You can define the access level as Viewer, Editor, or Manager After selecting User, click on the user you wish to add and click on Add Save and then close the popup And that’s it. It’s done. Mass-sharing the application Sometimes, we are required to share the application with a wide audience: There are multiple approaches to mass-sharing the Wave application such as by role or by username In Salesforce classic UI, navigate to Setup|Public Groups | New For example, to share a sales application, label a public group as Analytics_Sales_Group Search and add users to a group by Role, Roles and Subordinates, or by Users (username): 5. Search for the Analytics_Sales public group 6. Add the Viewer option as shown in the following screenshot: 7. Click on Save Protecting data from breaches, theft, or from any unauthorized user is very important. And we saw that Einstein Analytics provides the necessary tools to ensure the data is secure. If you found this excerpt useful and want to know more about securing your analytics in Einstein, make sure to check out this book Learning Einstein Analytics.

0
0
51523

article-image-embed-einstein-dashboards-salesforce-classic

Amey Varangaonkar

21 Mar 2018

5 min read

How to embed Einstein dashboards on Salesforce Classic

Amey Varangaonkar

21 Mar 2018

5 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Learning Einstein Analytics written by Santosh Chitalkar. This book highlights the key techniques and know-how to unlock critical insights from your data using Salesforce Einstein Analytics.[/box] With Einstein Analytics, users have the power to embed their dashboards on various third-party applications and even on their web applications. In this article, we will show how to embed an Einstein dashboard on Salesforce Classic. In order to start embedding the dashboard, let's create a sample dashboard by performing the following steps: Navigate to Analytics Studio | Create | Dashboard. Add three chart widgets on the dashboard. Click on the Chart button in the middle and select the Opportunity dataset. Select Measures as Sum of Amount and select BillingCountry under Group by. Click on Done. Repeat the second step for the second widget, but select Account Source under Group by and make it a donut chart. Repeat the second step for the third widget but select Stage under Group by and make it a funnel chart. Click on Save (s) and enter Embedding Opportunities in the title field, as shown in the following screenshot: Now that we have created a dashboard, let's embed this dashboard in Salesforce Classic. In order to start embedding the dashboard, exit from the Einstein Analytics platform and go to Classic mode. The user can embed the dashboard on the record detail page layout in Salesforce Classic. The user can view the dashboard, drill in, and apply a filter, just like in the Einstein Analytics window. Let's add the dashboard to the account detail page by performing the following steps: Navigate to Setup | Customize | Accounts | Page Layouts as shown in the following screenshot: Click on Edit of Account Layout and it will open a page layout editor which has two parts: a palette on the upper portion of the screen, and the page layout on the lower portion of the screen. The palette contains the user interface elements that you can add to your page layout, such as Fields, Buttons, Links, and Actions, and Related Lists, as shown in the following screenshot: Click on the Wave Analytics Assets option from the palette and you can see all the dashboards on the right-side panel. Drag and drop a section onto the page layout, name it Einstein Dashboard, and click on OK. Drag and drop the dashboard which you wish to add to the record detail page. We are going to add Embedded Opportunities. Click on Save. Go to any accounting record and you should see a new section within the dashboard: Users can easily configure the embedded dashboards by using attributes. To access the dashboard properties, go to edit page layout again, and go to the section where we added the dashboard to the layout. Hover over the dashboard and click on the Tool icon. It will open an Asset Properties window: The Asset Properties window gives the user the option to change the following features: Width (in pixels or %): This feature allows you to adjust the width of the dashboard section. Height (in pixels): This feature allows you to adjust the height of the dashboard section. Show Title: This feature allows you to display or hide the title of the dashboard. Show Sharing Icon: Using this feature, by default, the share icon is disabled. The Show Sharing Icon option gives the user a flexibility to include the share icon on the dashboard. Show Header: This feature allows you to display or hide the header. Hide on error: This feature gives you control over whether the Analytics asset appears if there is an error. Field mapping: Last but not least, field mapping is used to filter the relevant data to the record on the dashboard. To set up the dashboard to show only the data that’s relevant to the record being viewed, use field mapping. Field mapping links data fields in the dashboard to the object’s fields. We are using the Embedded Opportunity dashboard. Let's add field mapping to it. The following is the format for field mapping: { "datasets": { "datasetName":[{ "fields":["Actual Field name from object"], "filter":{"operator": "matches", "values":["$dataset Fieldname"]} }] } Let's add field mapping for account by using the following format: { "datasets": { "Account":[{ "fields":["Name"], "filter":{"operator": "matches", "values":["$Name"]} }] } } If your dashboard uses multiple datasets, then you can use the following format: { "datasets": { "datasetName1":[{ "fields":["Actual Field name from object"], "filter":{"operator": "matches", "values":["$dataset1 Fieldname"]} }], "datasetName2":[{ "fields":["Actual Field name from object"], "filter":{"operator": "matches", "values":["$dataset2 Fieldname"]} }] } Let's add field mapping for account and opportunities: { "datasets": { "Opportunities":[{ "fields":["Account.Name"], "Filter":{"operator": "Matches", "values":["$Name"]} }], "Account":[{ "fields":["Name"], "filter":{"operator": "matches", "values":["$Name"]} }] } } Now that we have added field mapping, save the page layout and go to the actual record. Observe that the dashboard is getting filtered now per record, as shown in the following screenshot: To summarize, we saw it’s fairly easy to embed your custom dashboards in Salesforce. Similarly, you can do so on other platforms such as Lightning, Visualforce pages, and even on your websites and web applications. If you are keen to learn more, you may check out the book Learning Einstein Analytics.

0
0
3847

article-image-write-high-quality-code-python-15-tips-data-scientists-researchers

Aarthi Kumaraswamy

21 Mar 2018

5 min read

How to write high quality code in Python: 15+ tips for data scientists and researchers

Aarthi Kumaraswamy

21 Mar 2018

5 min read

Writing code is easy. Writing high quality code is much harder. Quality is to be understood both in terms of actual code (variable names, comments, docstrings, and so on) and architecture (functions, modules, and classes). In general, coming up with a well-designed code architecture is much more challenging than the implementation itself. In this post, we will give a few tips about how to write high quality code. This is a particularly important topic in academia, as more and more scientists without prior experience in software development need to code. High quality code writing first principles Writing readable code means that other people (or you in a few months or years) will understand it quicker and will be more willing to use it. It also facilitates bug tracking. Modular code is also easier to understand and to reuse. Implementing your program's functionality in independent functions that are organized as a hierarchy of packages and modules is an excellent way of achieving high code quality. It is easier to keep your code loosely coupled when you use functions instead of classes. Spaghetti code is really hard to understand, debug, and reuse. Iterate between bottom-up and top-down approaches while working on a new project. Starting with a bottom-up approach lets you gain experience with the code before you start thinking about the overall architecture of your program. Still, make sure you know where you're going by thinking about how your components will work together. How these high quality code writing first principles translate in Python? Take the time to learn the Python language seriously. Review the list of all modules in the standard library—you may discover that functions you implemented already exist. Learn to write Pythonic code, and do not translate programming idioms from other languages such as Java or C++ to Python. Learn common design patterns; these are general reusable solutions to commonly occurring problems in software engineering. Use assertions throughout your code (the assert keyword) to prevent future bugs (defensive programming). Start writing your code with a bottom-up approach; write independent Python functions that implement focused tasks. Do not hesitate to refactor your code regularly. If your code is becoming too complicated, think about how you can simplify it. Avoid classes when you can. If you can use a function instead of a class, choose the function. A class is only useful when you need to store persistent state between function calls. Make your functions as pure as possible (no side effects). In general, prefer Python native types (lists, tuples, dictionaries, and types from Python's collections module) over custom types (classes). Native types lead to more efficient, readable, and portable code. Choose keyword arguments over positional arguments in your functions. Argument names are easier to remember than argument ordering. They make your functions self-documenting. Name your variables carefully. Names of functions and methods should start with a verb. A variable name should describe what it is. A function name should describe what it does. The importance of naming things well cannot be overstated. Every function should have a docstring describing its purpose, arguments, and return values, as shown in the following example. You can also look at the conventions chosen in popular libraries such as NumPy. The exact convention does not matter, the point is to be consistent within your code. You can use a markup language such as Markdown or reST to do that. Follow (at least partly) Guido van Rossum's Style Guide for Python, also known as Python Enhancement Proposal number 8 (PEP8). It is a long read, but it will help you write well-readable Python code. It covers many little things such as spacing between operators, naming conventions, comments, and docstrings. For instance, you will learn that it is considered a good practice to limit any line of your code to 79 or 99 characters. This way, your code can be correctly displayed in most situations (such as in a command-line interface or on a mobile device) or side by side with another file. Alternatively, you can decide to ignore certain rules. In general, following common guidelines is beneficial on projects involving many developers. You can check your code automatically against most of the style conventions in PEP8 with the pycodestyle Python package. You can also automatically make your code PEP8-compatible with the autopep8 package. Use a tool for static code analysis such as flake8 or Pylint. It lets you find potential errors or low-quality code statically, that is, without running your code. Use blank lines to avoid cluttering your code (see PEP8). You can also demarcate sections in a long Python module with salient comments. A Python module should not contain more than a few hundreds lines of code. Having too many lines of code in a module may be a sign that you need to split it into several modules. Organize important projects (with tens of modules) into subpackages (subdirectories). Take a look at how major Python projects are organized. For example, the code of IPython is well-organized into a hierarchy of subpackages with focused roles. Reading the code itself is also quite instructive. Learn best practices to create and distribute a new Python package. Make sure that you know setuptools, pip, wheels, virtualenv, PyPI, and so on. Also, you are highly encouraged to take a serious look at conda, a powerful and generic packaging system created by Anaconda. Packaging has long been a rapidly evolving topic in Python, so read only the most recent references. You enjoyed an excerpt from Cyrille Rossant’s latest book, IPython Cookbook, Second Edition. This book contains 100+ recipes for high-performance scientific computing and data analysis, from the latest IPython/Jupyter features to the most advanced tricks, to help you write better and faster code. For free recipes from the book, head over to the Ipython Cookbook Github page. If you loved what you saw, support Cyrille’s work by buying a copy of the book today!

0
1
28444

article-image-getting-started-with-python-web-scraping

Amarabha Banerjee

20 Mar 2018

13 min read

Getting started with Python Web Scraping

Amarabha Banerjee

20 Mar 2018

13 min read

0
0
17002

article-image-cambridge-analytica-ethics-data-science

Richard Gall

20 Mar 2018

5 min read

The Cambridge Analytica scandal and ethics in data science

Richard Gall

20 Mar 2018

5 min read

Earlier this month, Stack Overflow published the results of its 2018 developer survey. In it, there was an interesting set of questions around the concept of 'ethical code'. The main takeaway was ultimately that the area remains a gray area. The Cambridge Analytica scandal, however, has given the issue of 'ethical code' a renewed urgency in the last couple of days. The data analytics company are alleged to have not only been involved in votes in the UK and US, but also of harvesting copious amounts of data from Facebook (illegally). For whistleblower Christopher Wylie, the issue of ethical code is particularly pronounced. “I created Steve Bannon’s psychological mindfuck tool” he told Carole Cadwalladr in an interview in the Guardian. Cambridge Analytica: psyops or just market research? Wylie is a data scientist whose experience over the last half a decade or so has been impressive. It’s worth noting however, that Wylie’s career didn’t begin in politics. His academic career was focused primarily on fashion forecasting. That might all seem a little prosaic, but it underlines the fact that data science never happens in a vacuum. Data scientists always operate within a given field. It might be tempting to view the world purely through the prism of impersonal data and cold statistics. To a certain extent you have to if you’re a data scientist or a statistician. But at the very least this can be unhelpful; at worst a potential threat to global democracy. At one point in the interview Wylie remarks that: ...it’s normal for a market research company to amass data on domestic populations. And if you’re working in some country and there’s an auxiliary benefit to a current client with aligned interests, well that’s just a bonus. This is potentially the most frightening thing. Cambridge Analytica’s ostensible role in elections and referenda isn’t actually that remarkable. For all the vested interests and meetings between investors, researchers and entrepreneurs, the scandal is really just the extension of data mining and marketing tactics employed by just about every organization with a digital presence on the planet. Data scientists are always going to be in a difficult position. True, we're not all going to end up working alongside Steve Bannon. But your skills are always being deployed with a very specific end in mind. It’s not always easy to see the effects and impact of your work until later, but it’s still essential for data scientists and analysts to be aware of whose data is being collected and used, how it’s being used and why. Who is responsible for the ethics around data and code? There was another interesting question in the Stack Overflow survey that's relevant to all of this. The survey asked respondents who was ultimately most responsible for code that accomplishes something unethical. 57.5% claimed upper management were responsible, 22.8% said the person who came up with the idea, and 19.7% said it was the responsibility of the developer themselves. Clearly the question is complex. The truth lies somewhere between all three. Management make decisions about what’s required from an organizational perspective, but the engineers themselves are, of course, a part of the wider organizational dynamic. They should be in a position where they are able to communicate any personal misgivings or broader legal issues with the work they are being asked to do. The case of Wylie and Cambridge Analytica is unique, however. But it does highlight that data science can be deployed in ways that are difficult to predict. And without proper channels of escalation and the right degree of transparency it's easy for things to remain secretive, hidden in small meetings, email threads and paper trails. That's another thing that data scientists need to remember. Office politics might be a fact of life, but when you're a data scientist you're sitting on the apex of legal, strategic and political issues. To refuse to be aware of this would be naive. What the Cambridge Analytica story can teach data scientists But there's something else worth noting. This story also illustrates something more about the world in which data scientists are operating. This is a world where traditional infrastructure is being dismantled. This is a world where privatization and outsourcing is viewed as the route towards efficiency and 'value for money'. Whether you think that’s a good or bad thing isn’t really the point here. What’s important is that it makes the way we use data, even the code we write more problematic than ever because it’s not always easy to see how it’s being used. Arguably Wylie was naive. His curiosity and desire to apply his data science skills to intriguing and complex problems led him towards people who knew just how valuable he could be. Wylie has evidently developed greater self-awareness. This is perhaps the main reason why he has come forward with his version of events. But as this saga unfolds it’s worth remembering the value of data scientists in the modern world - for a range of organizations. It’s made the concept of the 'citizen data scientist' take on an even more urgent and literal meaning. Yes data science can help to empower the economy and possibly even toy with democracy. But it can also be used to empower people, improve transparency in politics and business. If anything, the Cambridge Analytica saga proves that data science is a dangerous field - not only the sexiest job of the twenty-first century, but one of the most influential in shaping the kind of world we're going to live in. That's frightening, but it's also pretty exciting.

0
0
49887

article-image-25-datasets-deep-learning-iot

Sugandha Lahoti

20 Mar 2018

8 min read

25 Datasets for Deep Learning in IoT

Sugandha Lahoti

20 Mar 2018

8 min read

Deep Learning is one of the major players for facilitating the analytics and learning in the IoT domain. A really good roundup of the state of deep learning advances for big data and IoT is described in the paper Deep Learning for IoT Big Data and Streaming Analytics: A Survey by Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. In this article, we have attempted to draw inspiration from this research paper to establish the importance of IoT datasets for deep learning applications. The paper also provides a handy list of commonly used datasets suitable for building deep learning applications in IoT, which we have added at the end of the article. IoT and Big Data: The relationship IoT and Big data have a two-way relationship. IoT is the main producer of big data, and as such an important target for big data analytics to improve the processes and services of IoT. However, there is a difference between the two. Large-Scale Streaming data: IoT data is a large-scale streaming data. This is because a large number of IoT devices generate streams of data continuously. Big data, on the other hand, lack real-time processing. Heterogeneity: IoT data is heterogeneous as various IoT data acquisition devices gather different information. Big data devices are generally homogeneous in nature. Time and space correlation: IoT sensor devices are also attached to a specific location, and thus have a location and time-stamp for each of the data items. Big data sensors lack time-stamp resolution. High noise data: IoT data is highly noisy, owing to the tiny pieces of data in IoT applications, which are prone to errors and noise during acquisition and transmission. Big data, in contrast, is generally less noisy. Big data, on the other hand, is classified according to conventional 3V’s, Volume, Velocity, and Variety. As such techniques used for Big data analytics are not sufficient to analyze the kind of data, that is being generated by IoT devices. For instance, autonomous cars need to make fast decisions on driving actions such as lane or speed change. These decisions should be supported by fast analytics with data streaming from multiple sources (e.g., cameras, radars, left/right signals, traffic light etc.). This changes the definition of IoT big data classification to 6V’s. Volume: The quantity of generated data using IoT devices is much more than before and clearly fits this feature. Velocity: Advanced tools and technologies for analytics are needed to efficiently operate the high rate of data production. Variety: Big data may be structured, semi-structured, and unstructured data. The data types produced by IoT include text, audio, video, sensory data and so on. Veracity: Veracity refers to the quality, consistency, and trustworthiness of the data, which in turn leads to accurate analytics. Variability: This property refers to the different rates of data flow. Value: Value is the transformation of big data to useful information and insights that bring competitive advantage to organizations. Despite the recent advancement in DL for big data, there are still significant challenges that need to be addressed to mature this technology. Every 6 characteristics of IoT big data imposes a challenge for DL techniques. One common denominator for all is the lack of availability of IoT big data datasets. IoT datasets and why are they needed Deep learning methods have been promising with state-of-the-art results in several areas, such as signal processing, natural language processing, and image recognition. The trend is going up in IoT verticals as well. IoT datasets play a major role in improving the IoT analytics. Real-world IoT datasets generate more data which in turn improve the accuracy of DL algorithms. However, the lack of availability of large real-world datasets for IoT applications is a major hurdle for incorporating DL models in IoT. The shortage of these datasets acts as a barrier to deployment and acceptance of IoT analytics based on DL since the empirical validation and evaluation of the system should be shown promising in the natural world. The lack of availability is mainly because: Most IoT datasets are available with large organizations who are unwilling to share it so easily. Access to the copyrighted datasets or privacy considerations. These are more common in domains with human data such as healthcare and education. While there is a lot of ground to be covered in terms of making datasets for IoT available, here is a list of commonly used datasets suitable for building deep learning applications in IoT. Dataset Name Domain Provider Notes Address/Link CGIAR dataset Agriculture, Climate CCAFS High-resolution climate datasets for a variety of fields including agricultural http://www.ccafs-climate.org/ Educational Process Mining Education University of Genova Recordings of 115 subjects’ activities through a logging application while learning with an educational simulator http://archive.ics.uci.edu/ml/datasets/Educational+Process+Mining+%28EPM%29%3A+A+Learning+Analytics+Data+Set Commercial Building Energy Dataset Energy, Smart Building IIITD Energy related data set from a commercial building where data is sampled more than once a minute. http://combed.github.io/ Individual household electric power consumption Energy, Smart home EDF R&D, Clamart, France One-minute sampling rate over a period of almost 4 years http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption AMPds dataset Energy, Smart home S. Makonin AMPds contains electricity, water, and natural gas measurements at one minute intervals for 2 years of monitoring http://ampds.org/ UK Domestic Appliance-Level Electricity Energy, Smart Home Kelly and Knottenbelt Power demand from five houses. In each house both the whole-house mains power demand as well as power demand from individual appliances are recorded. http://www.doc.ic.ac.uk/∼dk3810/data/ PhysioBank databases Healthcare PhysioNet Archive of over 80 physiological datasets. https://physionet.org/physiobank/database/ Saarbruecken Voice Database Healthcare Universitat¨ des Saarlandes A collection of voice recordings from more than 2000 persons for pathological voice detection. http://www.stimmdatebank.coli.uni-saarland.de/help_en.php4 T-LESS Industry CMP at Czech Technical University An RGB-D dataset and evaluation methodology for detection and 6D pose estimation of texture-less objects http://cmp.felk.cvut.cz/t-less/ CityPulse Dataset Collection Smart City CityPulse EU FP7 project Road Traffic Data, Pollution Data, Weather, Parking http://iot.ee.surrey.ac.uk:8080/datasets.html Open Data Institute - node Trento Smart City Telecom Italia Weather, Air quality, Electricity, Telecommunication http://theodi.fbk.eu/openbigdata/ Malaga datasets Smart City City of Malaga A broad range of categories such as energy, ITS, weather, Industry, Sport, etc. http://datosabiertos.malaga.eu/dataset Gas sensors for home activity monitoring Smart home Univ. of California San Diego Recordings of 8 gas sensors under three conditions including background, wine and banana presentations. http://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring CASAS datasets for activities of daily living Smart home Washington State University Several public datasets related to Activities of Daily Living (ADL) performance in a two story home, an apartment, and an office settings. http://ailab.wsu.edu/casas/datasets.html ARAS Human Activity Dataset Smart home Bogazici University Human activity recognition datasets collected from two real houses with multiple residents during two months. https://www.cmpe.boun.edu.tr/aras/ MERLSense Data Smart home, building Mitsubishi Electric Research Labs Motion sensor data of residual traces from a network of over 200 sensors for two years, containing over 50 million records. http://www.merl.com/wmd SportVU Sport Stats LLC Video of basketball and soccer games captured from 6 cameras. http://go.stats.com/sportvu RealDisp Sport O. Banos Includes a wide range of physical activities (warm up, cool down and fitness exercises). http://orestibanos.com/datasets.htm Taxi Service Trajectory Transportation Prediction Challenge, ECML PKDD 2015 Trajectories performed by all the 442 taxis running in the city of Porto, in Portugal. http://www.geolink.pt/ecmlpkdd2015-challenge/dataset.html GeoLife GPS Trajectories Transportation Microsoft A GPS trajectory by a sequence of time-stamped points https://www.microsoft.com/en-us/download/details.aspx?id=52367 T-Drive trajectory data Transportation Microsoft Contains a one-week trajectories of 10,357 taxis https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/ Chicago Bus Traces data Transportation M. Doering Bus traces from the Chicago Transport Authority for 18 days with a rate between 20 and 40 seconds. http://www.ibr.cs.tu-bs.de/users/mdoering/bustraces/ Uber trip data Transportation FiveThirtyEight About 20 million Uber pickups in New York City during 12 months. https://github.com/fivethirtyeight/uber-tlc-foil-response Traffic Sign Recognition Transportation K. Lim Three datasets: Korean daytime, Korean nighttime, and German daytime traffic signs based on Vienna traffic rules. https://figshare.com/articles/Traffic_Sign_Recognition_Testsets/4597795 DDD17 Transportation J. Binas End-To-End DAVIS Driving Dataset. http://sensors.ini.uzh.ch/databases.html

0
2
88157

article-image-create-prepare-first-dataset-salesforce-einstein

Amey Varangaonkar

19 Mar 2018

3 min read

How to create and prepare your first dataset in Salesforce Einstein

Amey Varangaonkar

19 Mar 2018

3 min read

[box type="note" align="" class="" width=""]The following extract is taken from the book Learning Einstein Analytics written by Santosh Chitalkar. This book will help you learn Salesforce Einstein analytics, to get insights faster and understand your customer better.[/box] In this article, we see how to start your analytics journey using Salesforce Einstein by taking the first step in the process i.e; by creating and preparing your dataset! A dataset is a set of source data, specially formatted and optimized for interactive exploration. Here are the steps to create a new dataset in Salesforce Einstein: Click on the Create button in the top-right corner and then click on Dataset. You can see the following three options to create datasets: CSV File Salesforce Informatica Rev 2. Select CSV File and click on Continue, as shown in the following screenshot: 3. Select the Account_data.csv file or drag and drop the file. 4. Click on Next. The next screen uploads the user interface to create a single dataset by using the external.csv file: 5. Click on Next to proceed as shown in the following screenshot: 6. Change the dataset name if you want. You can select an application to store the dataset. You can also replace the CSV file from this screen. 7. Click on in the Data Schema File section and select the Replace File option to change the file. You can also download the uploaded .csv file from here as shown in the following screenshot: 8. Click on Next. In the next screen, you can change field attributes such as column name, dimensions, field type, and so on. 9. Click on the Next button and it will start uploading the file in Analytics and queuing it in dataflow. Once done click on the Got it button. 10. Wait for 10-15 minutes (depending on the data, it may take a longer time to create the dataset). 11. Go to Analytics Studio and open the DATASETS tab. You can see the Account_data dataset as shown in the following screenshot: Congrats!!! You have created your first dataset. Let's now update this dataset with the same information but with some additional columns. Updating datasets We need to update the dataset to add new fields, change application settings, remove fields, and so on. Einstein Analytics gives users the flexibility to update the dataset. Here are the steps to update an existing dataset: Create a CSV file to include some new fields and name it Account_Data_Updated. Save the file to a location that you can easily remember. In Salesforce, go to the Analytics Studio home page and find the dataset. Hover over the dataset and click on the button, then click on Edit, as shown in the following screenshot: 4. Salesforce displays the dataset editing screen. Click on the Replace Data button in the top-right corner of the page: 5. Click on the Next button and upload your new CSV file using upload UI. 6. Click on the Next button again to get to the next screen for editing and click on Next again. 7. Click on Replace as shown in the following screenshot: Voila! You’ve successfully updated your dataset. As you can see it’s fairly easy to create and then update the dataset if required, using Einstein without any hassle. If you found this post useful, make sure to check out our book Learning Einstein Analytics for more tips and techniques on using Einstein Analytics effectively to uncover unique insights from your data.

0
1
20600

article-image-perform-crud-operations-on-mongodb-with-php

Amey Varangaonkar

17 Mar 2018

6 min read

Perform CRUD operations on MongoDB with PHP

Amey Varangaonkar

17 Mar 2018

6 min read

0
0
16145

article-image-connecting-data-mongodb-using-pymongo-php

Amey Varangaonkar

16 Mar 2018

6 min read

Connecting your data to MongoDB using PyMongo and PHP

Amey Varangaonkar

16 Mar 2018

6 min read

[box type="note" align="" class="" width=""]The following book excerpt is taken from the title Mastering MongoDB 3.x written by Alex Giamas. This book covers the fundamental as well as advanced tasks related to working with MongoDB.[/box] There are two ways to connect your data to MongoDB. The first is using the driver for your programming language. The second is by using an ODM (Object Document Mapper) layer to map your model objects to MongoDB in a transparent way. In this post, we will cover both methods using two of the most popular languages for web application development: Python, using the official MongoDB low level driver, PyMongo, and PHP, using the official PHP driver for MongoDB Connect using Python Installing PyMongo can be done using pip or easy_install: python -m pip install pymongo python -m easy_install pymongo Then in our class we can connect to a database: >>> from pymongo import MongoClient >>> client = MongoClient() Connecting to a replica set needs a set of seed servers for the client to find out what the primary, secondary, or arbiter nodes in the set are: client = pymongo.MongoClient('mongodb://user:passwd@node1:p1,node2:p2/?replicaSet=rs name') Using the connection string URL we can pass a username/password and replicaSet name all in a single string. Some of the most interesting options for the connection string URL are presented later. Connecting to a shard requires the server host and IP for the mongo router, which is the mongos process. PyMODM ODM Similar to Ruby's Mongoid, PyMODM is an ODM for Python that follows closely on Django's built-in ORM. Installing it can be done via pip: pip install pymodm Then we need to edit settings.py and replace the database engine with a dummy database: DATABASES = { 'default': { 'ENGINE': 'django.db.backends.dummy' } } And add our connection string anywhere in settings.py: from pymodm import connect connect("mongodb://localhost:27017/myDatabase", alias="MyApplication") Here we have to use a connection string that has the following structure: mongodb://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:port N]]][/[database][?options]] Options have to be pairs of name=value with an & between each pair. Some interesting pairs are: Model classes need to inherit from MongoModel. A sample class will look like this: from pymodm import MongoModel, fields class User(MongoModel): email = fields.EmailField(primary_key=True) first_name = fields.CharField() last_name = fields.CharField() This has a User class with first_name, last_name, and email fields where email is the primary field. Inheritance with PyMODM models Handling one-one and one-many relationships in MongoDB can be done using references or embedding. This example shows both ways: references for the model user and embedding for the comment model: from pymodm import EmbeddedMongoModel, MongoModel, fields class Comment(EmbeddedMongoModel): author = fields.ReferenceField(User) content = fields.CharField() class Post(MongoModel): title = fields.CharField() author = fields.ReferenceField(User) revised_on = fields.DateTimeField() content = fields.CharField() comments = fields.EmbeddedDocumentListField(Comment) Connecting with PHP The MongoDB PHP driver was rewritten from scratch two years ago to support the PHP 5, PHP 7, and HHVM architectures. The current architecture is shown in the following diagram: Currently we have official drivers for all three architectures with full support for the underlying functionality. Installation is a two-step process. First we need to install the MongoDB extension. This extension is dependent on the version of PHP (or HHVM) that we have installed and can be done using brew in Mac. For example with PHP 7.0: brew install php70-mongodb Then, using composer (a widely used dependency manager for PHP): composer require mongodb/mongodb Connecting to the database can then be done by using the connection string URI or by passing an array of options. Using the connection string URI we have: $client = new MongoDBClient($uri = 'mongodb://127.0.0.1/', array $uriOptions = [], array $driverOptions = []) For example, to connect to a replica set using SSL authentication: $client = new MongoDBClient('mongodb://myUsername:myPassword@rs1.example.com,rs2.example.com/?ssl=true&replicaSet=myReplicaSet&authSource=admin'); Or we can use the $uriOptions parameter to pass in parameters without using the connection string URL, like this: $client = new MongoDBClient( 'mongodb://rs1.example.com,rs2.example.com/' [ 'username' => 'myUsername', 'password' => 'myPassword', 'ssl' => true, 'replicaSet' => 'myReplicaSet', 'authSource' => 'admin', ], ); The set of $uriOptions and the connection string URL options available are analogous to the ones used for Ruby and Python. Doctrine ODM Laravel is one of the most widely used MVC frameworks for PHP, similar in architecture to Django and Rails from the Python and Ruby worlds respectively. We will follow through configuring our models using a stack of Laravel, Doctrine, and MongoDB. This section assumes that Doctrine is installed and working with Laravel 5.x. Doctrine entities are POPO (Plain Old PHP Objects) that, unlike Eloquent, Laravel's default ORM doesn't need to inherit from the Model class. Doctrine uses the Data Mapper pattern, whereas Eloquent uses Active Record. Skipping the get() set() methods, a simple class would look like: use DoctrineORMMapping AS ORM; use DoctrineCommonCollectionsArrayCollection; /** * @ORMEntity * @ORMTable(name="scientist") */ class Scientist { /** * @ORMId * @ORMGeneratedValue * @ORMColumn(type="integer") */ protected $id; /** * @ORMColumn(type="string") */ protected $firstname; /** * @ORMColumn(type="string") */ protected $lastname; /** * @ORMOneToMany(targetEntity="Theory", mappedBy="scientist", cascade={"persist"}) * @var ArrayCollection|Theory[] */ protected $theories; /** * @param $firstname * @param $lastname */ public function __construct($firstname, $lastname) { $this->firstname = $firstname; $this->lastname = $lastname; $this->theories = new ArrayCollection; } … public function addTheory(Theory $theory) { if(!$this->theories->contains($theory)) { $theory->setScientist($this); $this->theories->add($theory); } } This POPO-based model used annotations to define field types that need to be persisted in MongoDB. For example, @ORMColumn(type="string") defines a field in MongoDB with the string type firstname and lastname as the attribute names, in the respective lines. There is a whole set of annotations available here http://doctrine-orm.readthedocs.io/en/latest/reference/annotations- reference.html . If we want to separate the POPO structure from annotations, we can also define them using YAML or XML instead of inlining them with annotations in our POPO model classes. Inheritance with Doctrine Modeling one-one and one-many relationships can be done via annotations, YAML, or XML. Using annotations, we can define multiple embedded subdocuments within our document: /** @Document */ class User { // … /** @EmbedMany(targetDocument="Phonenumber") */ private $phonenumbers = array(); // … } /** @EmbeddedDocument */ class Phonenumber { // … } Here a User document embeds many PhoneNumbers. @EmbedOne() will embed one subdocument to be used for modeling one-one relationships. Referencing is similar to embedding: /** @Document */ class User { // … /** * @ReferenceMany(targetDocument="Account") */ private $accounts = array(); // … } /** @Document */ class Account { // … } @ReferenceMany() and @ReferenceOne() are used to model one-many and one-one relationships via referencing into a separate collection. We saw that the process of connecting data to MongoDB using Python and PHP is quite similar. We can accordingly define relationships as being embedded or referenced, depending on our design decision. If you found this post useful, check out our book Mastering MongoDB 3.x for more tips and techniques on working with MongoDB efficiently.

0
0
18911

article-image-generate-prime-numbers-ancient-greek-connection

Richard Gall

16 Mar 2018

2 min read

Prime numbers, modern encryption and their ancient Greek connection!

Richard Gall

16 Mar 2018

2 min read

Prime numbers are incredibly important in a number of fields, from computer science to cybersecurity. But they are incredibly mysterious - they are quite literally enigmas. They have been the subject of thousands of years of research and exploration but they have still not been cracked - we are yet to find a formula that will help generate prime numbers easily. Prime numbers are particularly integral to modern encryption - when a file is encrypted, the number used to do so is built using two primes. The only way to decrypt it is to work out the prime factors of that gargantuan number, a task which is almost impossible even with the most extensive computing power currently at our disposal. As well as this, prime numbers are also used as error correcting codes and in mass storage and data transmission. Did you know the Greeks were one of the early champions for our modern encryption systems? Find out how to generate prime numbers manually using a method devised by the Greek mathematician, Eratothenes, in this fun video from the video course by Packt, Fundamental Algorithms in Scala. [embed]https://www.youtube.com/watch?v=cd8v-Jo8obs&t=56s[/embed]

0
0
11644

article-image-troubleshooting-in-sql-server

Sunith Shetty

15 Mar 2018

16 min read

Troubleshooting in SQL Server

Sunith Shetty

15 Mar 2018

16 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book SQL Server 2017 Administrator's Guide written by Marek Chmel and Vladimír Mužný. This book will help you learn to implement and administer successful database solution with SQL Server 2017.[/box] Today, we will perform SQL Server analysis, and also learn ways for efficient performance monitoring and tuning. Performance monitoring and tuning Performance monitoring and tuning is a crucial part of your database administration skill set so as to keep the performance and stability of your server great, and to be able to find and fix the possible issues. The overall system performance can decrease over time; your system may work with more data or even become totally unresponsive. In such cases, you need the skills and tools to find the issue to bring the server back to normal as fast as possible. We can use several tools on the operating system layer and, then, inside the SQL Server to verify the performance and the possible root cause of the issue. The first tool that we can use is the performance monitor, which is available on your Windows Server: Performance monitor can be used to track important SQL Server counters, which can be very helpful in evaluating the SQL Server Performance. To add a counter, simply right-click on the monitoring screen in the Performance monitoring and tuning section and use the Add Counters item. If the SQL Server instance that you're monitoring is a default instance, you will find all the performance objects listed as SQL Server. If your instance is named, then the performance objects will be listed as MSSQL$InstanceName in the list of performance objects. We can split the important counters to watch between the system counters for the whole server and specific SQL Server counters. The list of system counters to watch include the following: Processor(_Total)% Processor Time: This is a counter to display the CPU load. Constant high values should be investigated to verify whether the current load does not exceed the performance limits of your HW or VM server, or if your workload is not running with proper indexes and statistics, and is generating bad query plans. MemoryAvailable MBytes: This counter displays the available memory on the operating system. There should always be enough memory for the operating system. If this counter drops below 64MB, it will send a notification of low memory and the SQL Server will reduce the memory usage. Physical Disk—Avg. Disk sec/Read: This disk counter provides the average latency information for your storage system; be careful if your storage is made of several different disks to monitor the proper storage system. Physical Disk: This indicates the average disk writes per second. Physical Disk: This indicates the average disk reads per second. Physical Disk: This indicates the number of disk writes per second. System—Processor Queue Length: This counter displays the number of threads waiting on a system CPU. If the counter is above 0, this means that there are more requests than the CPU can handle, and if the counter is constantly above 0, this may signal performance issues. Network interface: This indicates the total number of bytes per second. Once you have added all these system counters, you can see the values real time or you can configure a data collection, which will run for a specified selected time and periodically collect the information: With SQL Server-specific counters, we can dig deeper into the CPU, memory, and storage utilization to see what the SQL Server is doing and how the SQL Server is utilizing the subsystems. SQL Server memory monitoring and troubleshooting Important counters to watch for SQL Server memory utilization include counters from the SQL Server: Buffer Manager performance object and from SQL Server:Memory Manager: Important counters to watch for SQL Server memory utilization include counters from the SQL Server: Buffer Manager performance object and from SQL Server:Memory Manager: SQLServer-Buffer Manager—buffer cache hit ratio: This counter displays the ratio of how often the SQL Server can find the proper data in the cache when a query returns such data. If the data is not found in the cache, it has to be read from the disk. The higher the counter, the better the overall performance, since the memory access is usually faster compared to the disk subsystem. SQLServer-Buffer Manager—page life expectancy: This counter can measure how long a page can stay in the memory in seconds. The longer a page can stay in the memory, the less likely it will be for the SQL Server to need to access the disk in order to get the data into the memory again. SQL Server-Memory Manager—total server memory (KB): This is the amount of memory the server has committed using the memory manager. SQL Server-Memory Manager—target server memory (KB): This is the ideal amount of memory the server can consume. On a stable system, the target and total should be equal unless you face a memory pressure. Once the memory is utilized after the warm-up of your server, these two counters should not drop significantly, which would be another indication of system-level memory pressure, where the SQL Server memory manager has to deallocate memory. SQL Server-Memory Manager—memory grants pending: This counter displays the total number of SQL Server processes that are waiting to be granted memory from the memory manager. To check the performance counters, you can also use a T-SQL query, where you can query the sys.dm_os_performance_counters DMV: SELECT [counter_name] as [Counter Name], [cntr_value]/1024 as [Server Memory (MB)] FROM sys.dm_os_performance_counters WHERE [object_name] LIKE '%Memory Manager%' AND [counter_name] IN ('Total Server Memory (KB)', 'Target Server Memory (KB)') This query will return two values—one for target memory and one for total memory. These two should be close to each other on a warmed up system. Another query you can use is to get the information from a DMV named sys.dm_0s_sys_memory: SELECT total_physical_memory_kb/1024/1024 AS [Physical Memory (GB)], available_physical_memory_kb/1024/1024 AS [Available Memory (GB)], system_memory_state_desc AS [System Memory State] FROM sys.dm_os_sys_memory WITH (NOLOCK) OPTION (RECOMPILE) This query will display the available physical memory and the total physical memory of your server with several possible memory states: Available physical memory is high (this is a state you would like to see on your system, indicating there is no lack of memory) Physical memory usage is steady Available physical memory is getting low Available physical memory is low The memory grants can be verified with a T-SQL query: SELECT [object_name] as [Object name] , cntr_value AS [Memory Grants Pending] FROM sys.dm_os_performance_counters WITH (NOLOCK) WHERE [object_name] LIKE N'%Memory Manager%' AND counter_name = N'Memory Grants Pending' OPTION (RECOMPILE); If you face memory issues, there are several steps you can take for improvements: Check and configure your SQL Server max memory usage Add more RAM to your server; the limit for Standard Edition is 128 GB and there is no limit for SQL Server with Enterprise Use Lock Pages in Memory Optimize your queries SQL Server storage monitoring and troubleshooting The important counters to watch for SQL Server storage utilization would include counters from the SQL Server:Access Methods performance object: SQL Server-Access Methods—Full Scans/sec: This counter displays the number of full scans per second, which can be either table or full-index scans SQL Server-Access Methods—Index Searches/sec: This counter displays the number of searches in the index per second SQL Server-Access Methods—Forwarded Records/sec: This counter displays the number of forwarded records per second Monitoring the disk system is crucial, since your disk is used as a storage for the following: Data files Log files tempDB database Page file Backup files To verify the disk latency and IOPS metric of your drives, you can use the Performance monitor, or the T-SQL commands, which will query the sys.dm_os_volume_stats and sys.dm_io_virtual_file_stats DMF. Simple code to start with would be a T-SQL script utilizing the DMF to check the space available within the database files: SELECT f.database_id, f.file_id, volume_mount_point, total_bytes, available_bytes FROM sys.master_files AS f CROSS APPLY sys.dm_os_volume_stats(f.database_id, f.file_id); To check the I/O file stats with the second provided DMF, you can use a T-SQL code for checking the information about tempDB data files: SELECT * FROM sys.dm_io_virtual_file_stats (NULL, NULL) vfs join sys.master_files mf on mf.database_id = vfs.database_id and mf.file_id = vfs.file_id WHERE mf.database_id = 2 and mf.type = 0 To measure the disk performance, we can use a tool named DiskSpeed, which is a replacement for older SQLIO tool, which was used for a long time. DiskSpeed is an external utility, which is not available on the operating system by default. This tool can be downloaded from GitHub or the Technet Gallery at https://github.com/microsoft/diskspd. The following example runs a test for 15 seconds using a single thread to drive 100 percent random 64 KB reads at a depth of 15 overlapped (outstanding) I/Os to a regular file: DiskSpd –d300 -F1 -w0 -r –b64k -o15 d:datafile.dat Troubleshooting wait statistics We can use the whole Wait Statistics approach for a thorough understanding of the SQL Server workload and undertake performance troubleshooting based on the collected data. Wait Statistics are based on the fact that, any time a request has to wait for a resource, the SQL Server tracks this information, and we can use this information for further analysis. When we consider any user process, it can include several threads. A thread is a single unit of execution on SQL Server, where SQLOS controls the thread scheduling instead of relying on the operating system layer. Each processor core has it's own scheduler component responsible for executing such threads. To see the available schedulers in your SQL Server, you can use the following query: SELECT * FROM sys.dm_os_schedulers Such code will return all the schedulers in your SQL Server; some will be displayed as visible online and some as hidden online. The hidden ones are for internal system tasks while the visible ones are used by user tasks running on the SQL Server. There is one more scheduler, which is displayed as Visible Online (DAC). This one is used for dedicated administration connection, which comes in handy when the SQL Server stops responding. To use a dedicated admin connection, you can modify your SSMS connection to use the DAC, or you can use a switch with the sqlcmd.exe utility, to connect to the DAC. To connect to the default instance with DAC on your server, you can use the following command: sqlcmd.exe -E -A Each thread can be in three possible states: running: This indicates that the thread is running on the processor suspended: This indicates that the thread is waiting for a resource on a waiter list runnable: This indicates that the thread is waiting for execution on a runnable queue Each running thread runs until it has to wait for a resource to become available or until it has exhausted the CPU time for a running thread, which is set to 4 ms. This 4 ms time can be visible in the output of the previous query to sys.dm_os_schedulers and is called a quantum. When a thread requires any resource, it is moved away from the processor to a waiter list, where the thread waits for the resource to become available. Once the resource is available, the thread is notified about the resource availability and moves to the bottom of the runnable queue. Any waiting thread can be found via the following code, which will display the waiting threads and the resource they are waiting for: SELECT * FROM sys.dm_os_waiting_tasks The threads then transition between the execution at the CPU, waiter list, and runnable queue. There is a special case when a thread does not need to wait for any resource and has already run for 4 ms on the CPU, then the thread will be moved directly to the runnable queue instead of the waiter list. In the following image, we can see the thread states and the objects where the thread resides: When the thread is waiting on the waiter list, we can talk about a resource wait time. When the thread is waiting on the runnable queue to get on the CPU for execution, we can talk about the signal time. The total wait time is, then, the sum of the signal and resource wait times. You can find the ratio of the signal to resource wait times with the following script: Select signalWaitTimeMs=sum(signal_wait_time_ms) ,'%signal waits' = cast(100.0 * sum(signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2)) ,resourceWaitTimeMs=sum(wait_time_ms - signal_wait_time_ms) ,'%resource waits'= cast(100.0 * sum(wait_time_ms - signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2)) from sys.dm_os_wait_stats When the ratio goes over 30 percent for the signal waits, then there will be a serious CPU pressure and your processor(s) will have a hard time handling all the incoming requests from the threads. The following query can then grab the wait statistics and display the most frequent wait types, which were recorded through the thread executions, or actually during the time the threads were waiting on the waiter list for any particular resource: WITH [Waits] AS (SELECT [wait_type], [wait_time_ms] / 1000.0 AS [WaitS], ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS], [signal_wait_time_ms] / 1000.0 AS [SignalS], [waiting_tasks_count] AS [WaitCount], 100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage], ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum] FROM sys.dm_os_wait_stats WHERE [wait_type] NOT IN ( N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP', N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE', N'CHKPT', N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'CLR_SEMAPHORE', N'DIRTY_PAGE_POLL', N'DISPATCHER_QUEUE_SEMAPHORE', N'EXECSYNC', N'FSAGENT', N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX', N'HADR_CLUSAPI_CALL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION', N'HADR_LOGCAPTURE_WAIT', N'HADR_NOTIFICATION_DEQUEUE', N'HADR_TIMER_TASK', N'HADR_WORK_QUEUE', N'KSOURCE_WAKEUP', N'LAZYWRITER_SLEEP', N'LOGMGR_QUEUE', N'MEMORY_ALLOCATION_EXT', N'ONDEMAND_TASK_QUEUE', N'PREEMPTIVE_XE_GETTARGETSTATE', N'PWAIT_ALL_COMPONENTS_INITIALIZED', N'PWAIT_DIRECTLOGCONSUMER_GETNEXT', N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP', N'QDS_ASYNC_QUEUE', N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP', N'QDS_SHUTDOWN_QUEUE', N'REDO_THREAD_PENDING_WORK', N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE', N'SERVER_IDLE_CHECK', N'SLEEP_BPOOL_FLUSH', N'SLEEP_DBSTARTUP', N'SLEEP_DCOMSTARTUP', N'SLEEP_MASTERDBREADY', N'SLEEP_MASTERMDREADY', N'SLEEP_MASTERUPGRADED', N'SLEEP_MSDBSTARTUP', N'SLEEP_SYSTEMTASK', N'SLEEP_TASK', N'SLEEP_TEMPDBSTARTUP', N'SNI_HTTP_ACCEPT', N'SP_SERVER_DIAGNOSTICS_SLEEP', N'SQLTRACE_BUFFER_FLUSH', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', N'SQLTRACE_WAIT_ENTRIES', N'WAIT_FOR_RESULTS', N'WAITFOR', N'WAITFOR_TASKSHUTDOWN', N'WAIT_XTP_RECOVERY', N'WAIT_XTP_HOST_WAIT', N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG', N'WAIT_XTP_CKPT_CLOSE', N'XE_DISPATCHER_JOIN', N'XE_DISPATCHER_WAIT', N'XE_TIMER_EVENT' ) AND [waiting_tasks_count] > 0 ) SELECT MAX ([W1].[wait_type]) AS [WaitType], CAST (MAX ([W1].[WaitS]) AS DECIMAL (16,2)) AS [Wait_S], CAST (MAX ([W1].[ResourceS]) AS DECIMAL (16,2)) AS [Resource_S], CAST (MAX ([W1].[SignalS]) AS DECIMAL (16,2)) AS [Signal_S], MAX ([W1].[WaitCount]) AS [WaitCount], CAST (MAX ([W1].[Percentage]) AS DECIMAL (5,2)) AS [Percentage], CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgWait_S], CAST ((MAX ([W1].[ResourceS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgRes_S], CAST ((MAX ([W1].[SignalS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgSig_S] FROM [Waits] AS [W1] INNER JOIN [Waits] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum] GROUP BY [W1].[RowNum] HAVING SUM ([W2].[Percentage]) - MAX( [W1].[Percentage] ) < 95 GO This code is available on the whitepaper, published by SQLSkills, named SQL Server Performance Tuning Using Wait Statistics by Erin Stellato and Jonathan Kehayias, which then refers the URL on SQL Skills and uses the full query by Paul Randal available at https://www.sqlskills.com/ blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/. Some of the typical wait stats you can see are: PAGEIOLATCH The PAGEIOLATCH wait type is used when the thread is waiting for a page to be read into the buffer pool from the disk. This wait type comes with two main forms: PAGEIOLATCH_SH: This page will be read from the disk PAGEIOLATCH_EX: This page will be modified You may quickly assume that the storage has to be the problem, but that may not be the case. Like any other wait, they need to be considered in correlation with other wait types and other counters available to correctly find the root cause of the slow SQL Server operations. The page may be read into the buffer pool, because it was previously removed due to memory pressure and is needed again. So, you may also investigate the following: Buffer Manager: Page life expectancy Buffer Manager: Buffer cache hit ratio Also, you need to consider the following as a possible factor to the PAGEIOLATCH wait types: Large scans versus seeks on the indexes Implicit conversions Inundated statistics Missing indexes PAGELATCH This wait type is quite frequently misplaced with PAGEIOLATCH but PAGELATCH is used for pages already present in the memory. The thread waits for the access to such a page again with possible PAGELATCH_SH and PAGELATCH_EX wait types. A pretty common situation with this wait type is a tempDB contention, where you need to analyze what page is being waited for and what type of query is actually waiting for such a resource. As a solution to the tempDB, contention you can do the following: Add more tempDB data files Use traceflags 1118 and 1117 for tempDB on systems older than SQL Server 2016 CXPACKET This wait type is encountered when any thread is running in parallel. The CXPACKET wait type itself does not mean that there is really any problem on the SQL Server. But if such a wait type is accumulated very quickly, it may be a signal of skewed statistics, which require an update or a parallel scan on the table where proper indexes are missing. The option for parallelism is controlled via MAX DOP setting, which can be configured on the following: The server level The database level A query level with a hint We learned about SQL Server analysis with the Wait Statistics troubleshooting methodology and possible DMVs to get more insight on the problems occurring in the SQL Server. To know more about how to successfully create, design, and deploy databases using SQL Server 2017, do checkout the book SQL Server 2017 Administrator's Guide.

0
0
25776

article-image-stephen-hawking-artificial-intelligence-quotes

Richard Gall

15 Mar 2018

3 min read

5 polarizing Quotes from Professor Stephen Hawking on artificial intelligence

Richard Gall

15 Mar 2018

3 min read

Professor Stephen Hawking died today (March 14, 2018) aged 76 at his home in Cambridge, UK. Best known for his theory of cosmology that unified quantum mechanics with Einstein’s General Theory of Relativity, and for his book a Brief History of Time that brought his concepts to a wider general audience, Professor Hawking is quite possibly one of the most important and well-known voices in the scientific world. Among many things, Professor Hawking had a lot to say about artificial intelligence - its dangers, its opportunities and what we should be thinking about, not just as scientists and technologists, but as humans. Over the years, Hawking has remained cautious and consistent in his views on the topic constantly urging AI researchers and machine learning developers to consider the wider implications of their work on society and the human race itself. The machine learning community is quite divided on all the issues Hawking has raised and will probably continue to be so as the field grows faster than it can be fathomed. Here are 5 widely debated things Stephen Hawking said about AI arranged in chronological order - and if you’re going to listen to anyone, you’ve surely got to listen to him? On artificial intelligence ending the human race The development of full artificial intelligence could spell the end of the human race….It would take off on its own, and re-design itself at an ever-increasing rate. Humans, who are limited by slow biological evolution, couldn't compete and would be superseded. From an interview with the BBC, December 2014 On the future of AI research The establishment of shared theoretical frameworks, combined with the availability of data and processing power, has yielded remarkable successes in various component tasks such as speech recognition, image classification, autonomous vehicles, machine translation, legged locomotion, and question-answering systems. As capabilities in these areas and others cross the threshold from laboratory research to economically valuable technologies, a virtuous cycle takes hold whereby even small improvements in performance are worth large sums of money, prompting greater investments in research. There is now a broad consensus that AI research is progressing steadily, and that its impact on society is likely to increase.... Because of the great potential of AI, it is important to research how to reap its benefits while avoiding potential pitfalls. From Research Priorities for Robust and Beneficial Artificial Intelligence, an open letter co-signed by Hawking, January 2015 On AI emulating human intelligence I believe there is no deep difference between what can be achieved by a biological brain and what can be achieved by a computer. It, therefore, follows that computers can, in theory, emulate human intelligence — and exceed it From a speech given by Hawking at the opening of the Leverhulme Centre of the Future of Intelligence, Cambridge, U.K., October 2016 On making artificial intelligence benefit humanity Perhaps we should all stop for a moment and focus not only on making our AI better and more successful but also on the benefit of humanity. Taken from a speech given by Hawking at Web Summit in Lisbon, November 2017 On AI replacing humans The genie is out of the bottle. We need to move forward on artificial intelligence development but we also need to be mindful of its very real dangers. I fear that AI may replace humans altogether. If people design computer viruses, someone will design AI that replicates itself. This will be a new form of life that will outperform humans. From an interview with Wired, November 2017

0
2
72294

article-image-selecting-statistical-based-features-in-machine-learning-application

Pravin Dhandre

14 Mar 2018

16 min read

Selecting Statistical-based Features in Machine Learning application

Pravin Dhandre

14 Mar 2018

16 min read

In today’s tutorial, we will work on one of the methods of executing feature selection, the statistical-based method for interpreting both quantitative and qualitative datasets. Feature selection attempts to reduce the size of the original dataset by subsetting the original features and shortlisting the best ones with the highest predictive power. We may intelligently choose which feature selection method might work best for us, but in reality, a very valid way of working in this domain is to work through examples of each method and measure the performance of the resulting pipeline. To begin, let's take a look at the subclass of feature selection modules that are reliant on statistical tests to select viable features from a dataset. Statistical-based feature selections Statistics provides us with relatively quick and easy methods of interpreting both quantitative and qualitative data. We have used some statistical measures in previous chapters to obtain new knowledge and perspective around our data, specifically in that we recognized mean and standard deviation as metrics that enabled us to calculate z-scores and scale our data. In this tutorial, we will rely on two new concepts to help us with our feature selection: Pearson correlations hypothesis testing Both of these methods are known as univariate methods of feature selection, meaning that they are quick and handy when the problem is to select out single features at a time in order to create a better dataset for our machine learning pipeline. Using Pearson correlation to select features We have actually looked at correlations in this book already, but not in the context of feature selection. We already know that we can invoke a correlation calculation in pandas by calling the following method: credit_card_default.corr() The output of the preceding code produces is the following: As a continuation of the preceding table we have: The Pearson correlation coefficient (which is the default for pandas) measures the linear relationship between columns. The value of the coefficient varies between -1 and +1, where 0 implies no correlation between them. Correlations closer to -1 or +1 imply an extremely strong linear relationship. It is worth noting that Pearson’s correlation generally requires that each column be normally distributed (which we are not assuming). We can also largely ignore this requirement because our dataset is large (over 500 is the threshold). The pandas .corr() method calculates a Pearson correlation coefficient for every column versus every other column. This 24 column by 24 row matrix is very unruly, and in the past, we used heatmaps to try and make the information more digestible: # using seaborn to generate heatmaps import seaborn as sns import matplotlib.style as style # Use a clean stylizatino for our charts and graphs style.use('fivethirtyeight') sns.heatmap(credit_card_default.corr()) The heatmap generated will be as follows: Note that the heatmap function automatically chose the most correlated features to show us. That being said, we are, for the moment, concerned with the features correlations to the response variable. We will assume that the more correlated a feature is to the response, the more useful it will be. Any feature that is not as strongly correlated will not be as useful to us. Correlation coefficients are also used to determine feature interactions and redundancies. A key method of reducing overfitting in machine learning is spotting and removing these redundancies. We will be tackling this problem in our model-based selection methods. Let's isolate the correlations between the features and the response variable, using the following code: # just correlations between every feature and the response credit_card_default.corr()['default payment next month'] LIMIT_BAL -0.153520 SEX -0.039961 EDUCATION 0.028006 MARRIAGE -0.024339 AGE 0.013890 PAY_0 0.324794 PAY_2 0.263551 PAY_3 0.235253 PAY_4 0.216614 PAY_5 0.204149 PAY_6 0.186866 BILL_AMT1 -0.019644 BILL_AMT2 -0.014193 BILL_AMT3 -0.014076 BILL_AMT4 -0.010156 BILL_AMT5 -0.006760 BILL_AMT6 -0.005372 PAY_AMT1 -0.072929 PAY_AMT2 -0.058579 PAY_AMT3 -0.056250 PAY_AMT4 -0.056827 PAY_AMT5 -0.055124 PAY_AMT6 -0.053183 default payment next month 1.000000 We can ignore the final row, as is it is the response variable correlated perfectly to itself. We are looking for features that have correlation coefficient values close to -1 or +1. These are the features that we might assume are going to be useful. Let's use pandas filtering to isolate features that have at least .2 correlation (positive or negative). Let's do this by first defining a pandas mask, which will act as our filter, using the following code: # filter only correlations stronger than .2 in either direction (positive or negative) credit_card_default.corr()['default payment next month'].abs() > .2 LIMIT_BAL False SEX False EDUCATION False MARRIAGE False AGE False PAY_0 True PAY_2 True PAY_3 True PAY_4 True PAY_5 True PAY_6 False BILL_AMT1 False BILL_AMT2 False BILL_AMT3 False BILL_AMT4 False BILL_AMT5 False BILL_AMT6 False PAY_AMT1 False PAY_AMT2 False PAY_AMT3 False PAY_AMT4 False PAY_AMT5 False PAY_AMT6 False default payment next month True Every False in the preceding pandas Series represents a feature that has a correlation value between -.2 and .2 inclusive, while True values correspond to features with preceding correlation values .2 or less than -0.2. Let's plug this mask into our pandas filtering, using the following code: # store the features highly_correlated_features = credit_card_default.columns[credit_card_default.corr()['default payment next month'].abs() > .2] highly_correlated_features Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5', u'default payment next month'], dtype='object') The variable highly_correlated_features is supposed to hold the features of the dataframe that are highly correlated to the response; however, we do have to get rid of the name of the response column, as including that in our machine learning pipeline would be cheating: # drop the response variable highly_correlated_features = highly_correlated_features.drop('default payment next month') highly_correlated_features Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5'], dtype='object') So, now we have five features from our original dataset that are meant to be predictive of the response variable, so let's try it out with the help of the following code: # only include the five highly correlated features X_subsetted = X[highly_correlated_features] get_best_model_and_accuracy(d_tree, tree_params, X_subsetted, y) # barely worse, but about 20x faster to fit the model Best Accuracy: 0.819666666667 Best Parameters: {'max_depth': 3} Average Time to Fit (s): 0.01 Average Time to Score (s): 0.002 Our accuracy is definitely worse than the accuracy to beat, .8203, but also note that the fitting time saw about a 20-fold increase. Our model is able to learn almost as well as with the entire dataset with only five features. Moreover, it is able to learn as much in a much shorter timeframe. Let's bring back our scikit-learn pipelines and include our correlation choosing methodology as a part of our preprocessing phase. To do this, we will have to create a custom transformer that invokes the logic we just went through, as a pipeline-ready class. We will call our class the CustomCorrelationChooser and it will have to implement both a fit and a transform logic, which are: The fit logic will select columns from the features matrix that are higher than a specified threshold The transform logic will subset any future datasets to only include those columns that were deemed important from sklearn.base import TransformerMixin, BaseEstimator class CustomCorrelationChooser(TransformerMixin, BaseEstimator): def __init__(self, response, cols_to_keep=[], threshold=None): # store the response series self.response = response # store the threshold that we wish to keep self.threshold = threshold # initialize a variable that will eventually # hold the names of the features that we wish to keep self.cols_to_keep = cols_to_keep def transform(self, X): # the transform method simply selects the appropiate # columns from the original dataset return X[self.cols_to_keep] def fit(self, X, *_): # create a new dataframe that holds both features and response df = pd.concat([X, self.response], axis=1) # store names of columns that meet correlation threshold self.cols_to_keep = df.columns[df.corr()[df.columns[-1]].abs() > Self.threshold] # only keep columns in X, for example, will remove response Variable self.cols_to_keep = [c for c in self.cols_to_keep if c in X.columns] return self Let's take our new correlation feature selector for a spin, with the help of the following code: # instantiate our new feature selector ccc = CustomCorrelationChooser(threshold=.2, response=y) ccc.fit(X) ccc.cols_to_keep ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5'] Our class has selected the same five columns as we found earlier. Let's test out the transform functionality by calling it on our X matrix, using the following code: ccc.transform(X).head() The preceding code produces the following table as the output: We see that the transform method has eliminated the other columns and kept only the features that met our .2 correlation threshold. Now, let's put it all together in our pipeline, with the help of the following code: # instantiate our feature selector with the response variable set ccc = CustomCorrelationChooser(response=y) # make our new pipeline, including the selector ccc_pipe = Pipeline([('correlation_select', ccc), ('classifier', d_tree)]) # make a copy of the decisino tree pipeline parameters ccc_pipe_params = deepcopy(tree_pipe_params) # update that dictionary with feature selector specific parameters ccc_pipe_params.update({ 'correlation_select__threshold':[0, .1, .2, .3]}) print ccc_pipe_params #{'correlation_select__threshold': [0, 0.1, 0.2, 0.3], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]} # better than original (by a little, and a bit faster on # average overall get_best_model_and_accuracy(ccc_pipe, ccc_pipe_params, X, y) Best Accuracy: 0.8206 Best Parameters: {'correlation_select__threshold': 0.1, 'classifier__max_depth': 5} Average Time to Fit (s): 0.105 Average Time to Score (s): 0.003 Wow! Our first attempt at feature selection and we have already beaten our goal (albeit by a little bit). Our pipeline is showing us that if we threshold at 0.1, we have eliminated noise enough to improve accuracy and also cut down on the fitting time (from .158 seconds without the selector). Let's take a look at which columns our selector decided to keep: # check the threshold of .1 ccc = CustomCorrelationChooser(threshold=0.1, response=y) ccc.fit(X) # check which columns were kept ccc.cols_to_keep ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'] It appears that our selector has decided to keep the five columns that we found, as well as two more, the LIMIT_BAL and the PAY_6 columns. Great! This is the beauty of automated pipeline gridsearching in scikit-learn. It allows our models to do what they do best and intuit things that we could not have on our own. Feature selection using hypothesis testing Hypothesis testing is a methodology in statistics that allows for a bit more complex statistical testing for individual features. Feature selection via hypothesis testing will attempt to select only the best features from a dataset, just as we were doing with our custom correlation chooser, but these tests rely more on formalized statistical methods and are interpreted through what are known as p-values. A hypothesis test is a statistical test that is used to figure out whether we can apply a certain condition for an entire population, given a data sample. The result of a hypothesis test tells us whether we should believe the hypothesis or reject it for an alternative one. Based on sample data from a population, a hypothesis test determines whether or not to reject the null hypothesis. We usually use a p-value (a non-negative decimal with an upper bound of 1, which is based on our significance level) to make this conclusion. In the case of feature selection, the hypothesis we wish to test is along the lines of: True or False: This feature has no relevance to the response variable. We want to test this hypothesis for every feature and decide whether the features hold some significance in the prediction of the response. In a way, this is how we dealt with the correlation logic. We basically said that, if a column's correlation with the response is too weak, then we say that the hypothesis that the feature has no relevance is true. If the correlation coefficient was strong enough, then we can reject the hypothesis that the feature has no relevance in favor of an alternative hypothesis, that the feature does have some relevance. To begin to use this for our data, we will have to bring in two new modules: SelectKBest and f_classif, using the following code: # SelectKBest selects features according to the k highest scores of a given scoring function from sklearn.feature_selection import SelectKBest # This models a statistical test known as ANOVA from sklearn.feature_selection import f_classif # f_classif allows for negative values, not all do # chi2 is a very common classification criteria but only allows for positive values # regression has its own statistical tests SelectKBest is basically just a wrapper that keeps a set amount of features that are the highest ranked according to some criterion. In this case, we will use the p-values of completed hypothesis testings as a ranking. Interpreting the p-value The p-values are a decimals between 0 and 1 that represent the probability that the data given to us occurred by chance under the hypothesis test. Simply put, the lower the p-value, the better the chance that we can reject the null hypothesis. For our purposes, the smaller the p-value, the better the chances that the feature has some relevance to our response variable and we should keep it. The big take away from this is that the f_classif function will perform an ANOVA test (a type of hypothesis test) on each feature on its own (hence the name univariate testing) and assign that feature a p-value. The SelectKBest will rank the features by that p-value (the lower the better) and keep only the best k (a human input) features. Let's try this out in Python. Ranking the p-value Let's begin by instantiating a SelectKBest module. We will manually enter a k value, 5, meaning we wish to keep only the five best features according to the resulting p-values: # keep only the best five features according to p-values of ANOVA test k_best = SelectKBest(f_classif, k=5) We can then fit and transform our X matrix to select the features we want, as we did before with our custom selector: # matrix after selecting the top 5 features k_best.fit_transform(X, y) # 30,000 rows x 5 columns array([[ 2, 2, -1, -1, -2], [-1, 2, 0, 0, 0], [ 0, 0, 0, 0, 0], ..., [ 4, 3, 2, -1, 0], [ 1, -1, 0, 0, 0], [ 0, 0, 0, 0, 0]]) If we want to inspect the p-values directly and see which columns were chosen, we can dive deeper into the select k_best variables: # get the p values of columns k_best.pvalues_ # make a dataframe of features and p-values # sort that dataframe by p-value p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value') # show the top 5 features p_values.head() The preceding code produces the following table as the output: We can see that, once again, our selector is choosing the PAY_X columns as the most important. If we take a look at our p-value column, we will notice that our values are extremely small and close to zero. A common threshold for p-values is 0.05, meaning that anything less than 0.05 may be considered significant, and these columns are extremely significant according to our tests. We can also directly see which columns meet a threshold of 0.05 using the pandas filtering methodology: # features with a low p value p_values[p_values['p_value'] < .05] The preceding code produces the following table as the output: The majority of the columns have a low p-value, but not all. Let's see the columns with a higher p_value, using the following code: # features with a high p value p_values[p_values['p_value'] >= .05] The preceding code produces the following table as the output: These three columns have quite a high p-value. Let's use our SelectKBest in a pipeline to see if we can grid search our way into a better machine learning pipeline, using the following code: k_best = SelectKBest(f_classif) # Make a new pipeline with SelectKBest select_k_pipe = Pipeline([('k_best', k_best), ('classifier', d_tree)]) select_k_best_pipe_params = deepcopy(tree_pipe_params) # the 'all' literally does nothing to subset select_k_best_pipe_params.update({'k_best__k':range(1,23) + ['all']}) print select_k_best_pipe_params # {'k_best__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'all'], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]} # comparable to our results with correlationchooser get_best_model_and_accuracy(select_k_pipe, select_k_best_pipe_params, X, y) Best Accuracy: 0.8206 Best Parameters: {'k_best__k': 7, 'classifier__max_depth': 5} Average Time to Fit (s): 0.102 Average Time to Score (s): 0.002 It seems that our SelectKBest module is getting about the same accuracy as our custom transformer, but it's getting there a bit quicker! Let's see which columns our tests are selecting for us, with the help of the following code: k_best = SelectKBest(f_classif, k=7) # lowest 7 p values match what our custom correlationchooser chose before # ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'] p_values.head(7) The preceding code produces the following table as the output: They appear to be the same columns that were chosen by our other statistical method. It's possible that our statistical method is limited to continually picking these seven columns for us. There are other tests available besides ANOVA, such as Chi2 and others, for regression tasks. They are all included in scikit-learn's documentation. For more info on feature selection through univariate testing, check out the scikit-learn documentation here: http://scikit-learn.org/stable/modules/feature_selection. html#univariate-feature-selection Before we move on to model-based feature selection, it's helpful to do a quick sanity check to ensure that we are on the right track. So far, we have seen two statistical methods for feature selection that gave us the same seven columns for optimal accuracy. But what if we were to take every column except those seven? We should expect a much lower accuracy and worse pipeline overall, right? Let's make sure. The following code helps us to implement sanity checks: # sanity check # If we only the worst columns the_worst_of_X = X[X.columns.drop(['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])] # goes to show, that selecting the wrong features will # hurt us in predictive performance get_best_model_and_accuracy(d_tree, tree_params, the_worst_of_X, y) Best Accuracy: 0.783966666667 Best Parameters: {'max_depth': 5} Average Time to Fit (s): 0.21 Average Time to Score (s): 0.002 Hence, by selecting the columns except those seven, we see not only worse accuracy (almost as bad as the null accuracy), but also slower fitting times on average. We statistically selected features from the dataset for our machine learning pipeline. [box type="note" align="" class="" width=""]This article is an excerpt from a book Feature Engineering Made Easy co-authored by Sinan Ozdemir and Divya Susarla. Do check out the book to get access to alternative techniques such as the model-based method to achieve optimum results from the machine learning application.[/box]

0
0
34086

article-image-stack-overflow-developer-survey-2018-quick-overview

Amey Varangaonkar

14 Mar 2018

4 min read

Stack Overflow Developer Survey 2018: A Quick Overview

Amey Varangaonkar

14 Mar 2018

4 min read

Stack Overflow recently published their annual developer survey in which over 100,000 developers and professionals participated. The survey shed light on some very interesting insights - from the developers’ preferred language for programming, to the development platform they hate the most. As the survey is quite detailed and comprehensive, we thought why not present the most important takeaways and findings for you to go through very quickly? If you are short of time and want to scan through the results of the survey quickly, read on.. Developer Profile Young developers form the majority: Half the developer population falls in the age group of 25-34 years while almost all respondents (90%) fall within the 18 - 44 year age group. Limited professional coding experience: Majority of the developers have been coding from the last 2 to 10 years. That said, almost half of the respondents have a professional coding experience of less than 5 years. Continuous learning is key to surviving as a developer: Almost 75% of the developers have a bachelor’s degree, or higher. In addition, almost 90% of the respondents say they have learnt a new language, framework or a tool without taking any formal course, but with the help of the official documentation and/or Stack Overflow Back-end developers form the majority: Among the top developer roles, more than half the developers identify themselves as back-end developers, while the percentage of data scientists and analysts is quite low. About 20% of the respondents identify themselves as mobile developers Working full-time: More than 75% of the developers responded that they work a full-time job. Close to 10% are freelancers, or self-employed. Popularly used languages and frameworks The Javascript family continue their reign: For the sixth year running, JavaScript has continued to be the most popular programming language, and is the choice of language for more than 70% of the respondents. In terms of frameworks, Node.js and Angular continue to be the most popular choice of the developers. Desktop development ain’t dead yet: When it comes to the platforms, developers prefer Linux and Windows Desktop or Server for their development work. Cloud platforms have not gained that much adoption, as yet, but there is a slow but steady rise. What about Data Science? Machine Learning and DevOps rule the roost: Machine Learning and DevOps are two trends which are trending highly due to the vast applications and research that is being done on these fronts. Tensorflow rises, Hadoop falls: About 75% of the respondents love the Tensorflow framework, and say they would love to continue using it for their machine learning/deep learning tasks. Hadoop’s popularity seems to be going down, on the other hand, as other Big Data frameworks like Apache Spark gain more traction and popularity. Python - the next big programming language: Popular data science languages like R and Python are on the rise in terms of popularity. Python, which surpassed PHP last year, has surpassed C# this year, indicating its continuing rise in popularity. Python based Frameworks like Tensorflow and pyTorch are gaining a lot of adoption. Learn F# for more moolah: Languages like F#, Clojure and Rust are associated with high global salaries, with median salaries above $70,000. The likes of R and Python are associated with median salaries of up to $57,000. PostgreSQL growing rapidly, Redis most loved database: MySQL and SQL Server are the two most widely used databases as per the survey, while the usage of PostgreSQL has surpassed that of the traditionally popular databases like MongoDB and Redis. In terms of popularity, Redis is the most loved database while the developers dread (read looking to switch from) databases like IBM DB2 and Oracle. Job-hunt for data scientists: Approximately 18% of the 76,000+ respondents who are actively looking for jobs are data scientists or work as academicians and researchers. AI more exciting than worrying: Close to 75% of the 69,000+ respondents are excited about the future possibilities with AI than worried about the dangers posed by AI. Some of the major concerns include AI making important business decisions. The big surprise was that most developers find automation of jobs as the most exciting part of a future enabled by AI. So that’s it then! What do you think about the Stack Overflow Developer survey results? Do you agree with the developers’ responses? We would love to know your thoughts. In the coming days, watch out for more fine grained analysis of the Stack Overflow survey data.

0
0
20439

Creating 2D and 3D plots using Matplotlib

How to secure data in Salesforce Einstein Analytics

How to embed Einstein dashboards on Salesforce Classic

How to write high quality code in Python: 15+ tips for data scientists and researchers

Getting started with Python Web Scraping

The Cambridge Analytica scandal and ethics in data science

25 Datasets for Deep Learning in IoT

How to create and prepare your first dataset in Salesforce Einstein

Perform CRUD operations on MongoDB with PHP

Connecting your data to MongoDB using PyMongo and PHP

Trending Topics

Prime numbers, modern encryption and their ancient Greek connection!

Troubleshooting in SQL Server

5 polarizing Quotes from Professor Stephen Hawking on artificial intelligence

Selecting Statistical-based Features in Machine Learning application

Stack Overflow Developer Survey 2018: A Quick Overview

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access