
How-To Tutorials - Data

1210 Articles

How to embed Einstein dashboards on Salesforce Classic

Amey Varangaonkar
21 Mar 2018
5 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Learning Einstein Analytics written by Santosh Chitalkar. This book highlights the key techniques and know-how to unlock critical insights from your data using Salesforce Einstein Analytics.[/box] With Einstein Analytics, users have the power to embed their dashboards on various third-party applications and even on their web applications. In this article, we will show how to embed an Einstein dashboard on Salesforce Classic. In order to start embedding the dashboard, let's create a sample dashboard by performing the following steps: Navigate to Analytics Studio | Create | Dashboard. Add three chart widgets on the dashboard. Click on the Chart button in the middle and select the Opportunity dataset. Select Measures as Sum of Amount and select BillingCountry under Group by. Click on Done. Repeat the second step for the second widget, but select Account Source under Group by and make it a donut chart. Repeat the second step for the third widget but select Stage under Group by and make it a funnel chart. Click on Save (s) and enter Embedding Opportunities in the title field, as shown in the following screenshot: Now that we have created a dashboard, let's embed this dashboard in Salesforce Classic. In order to start embedding the dashboard, exit from the Einstein Analytics platform and go to Classic mode. The user can embed the dashboard on the record detail page layout in Salesforce Classic. The user can view the dashboard, drill in, and apply a filter, just like in the Einstein Analytics window. Let's add the dashboard to the account detail page by performing the following steps: Navigate to Setup | Customize | Accounts | Page Layouts as shown in the following screenshot: Click on Edit of Account Layout and it will open a page layout editor which has two parts: a palette on the upper portion of the screen, and the page layout on the lower portion of the screen. The palette contains the user interface elements that you can add to your page layout, such as Fields, Buttons, Links, and Actions, and Related Lists, as shown in the following screenshot: Click on the Wave Analytics Assets option from the palette and you can see all the dashboards on the right-side panel. Drag and drop a section onto the page layout, name it Einstein Dashboard, and click on OK. Drag and drop the dashboard which you wish to add to the record detail page. We are going to add Embedded Opportunities. Click on Save. Go to any accounting record and you should see a new section within the dashboard: Users can easily configure the embedded dashboards by using attributes. To access the dashboard properties, go to edit page layout again, and go to the section where we added the dashboard to the layout. Hover over the dashboard and click on the Tool icon. It will open an Asset Properties window: The Asset Properties window gives the user the option to change the following features: Width (in pixels or %): This feature allows you to adjust the width of the dashboard section. Height (in pixels): This feature allows you to adjust the height of the dashboard section. Show Title: This feature allows you to display or hide the title of the dashboard. Show Sharing Icon: Using this feature, by default, the share icon is disabled. The Show Sharing Icon option gives the user a flexibility to include the share icon on the dashboard. Show Header: This feature allows you to display or hide the header. 
- Hide on error: Gives you control over whether the Analytics asset appears if there is an error.
- Field mapping: Last but not least, field mapping is used to filter the dashboard to the data that is relevant to the record being viewed. Field mapping links data fields in the dashboard to the object's fields.

We are using the Embedding Opportunities dashboard, so let's add field mapping to it. The general format for field mapping is:

{
  "datasets": {
    "datasetName": [{
      "fields": ["Actual field name from the object"],
      "filter": {"operator": "matches", "values": ["$dataset field name"]}
    }]
  }
}

Let's add field mapping for Account by using the following format:

{
  "datasets": {
    "Account": [{
      "fields": ["Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }]
  }
}

If your dashboard uses multiple datasets, you can use the following format:

{
  "datasets": {
    "datasetName1": [{
      "fields": ["Actual field name from the object"],
      "filter": {"operator": "matches", "values": ["$dataset1 field name"]}
    }],
    "datasetName2": [{
      "fields": ["Actual field name from the object"],
      "filter": {"operator": "matches", "values": ["$dataset2 field name"]}
    }]
  }
}

Let's add field mapping for Account and Opportunities:

{
  "datasets": {
    "Opportunities": [{
      "fields": ["Account.Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }],
    "Account": [{
      "fields": ["Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }]
  }
}

Now that we have added field mapping, save the page layout and go to an actual record. Observe that the dashboard is now filtered per record, as shown in the following screenshot:

To summarize, we saw that it is fairly easy to embed your custom dashboards in Salesforce. You can do the same on other platforms such as Lightning and Visualforce pages, and even on your own websites and web applications. If you are keen to learn more, you may check out the book Learning Einstein Analytics.
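If you prefer to generate the field-mapping JSON rather than hand-edit it, one small convenience (not part of the book excerpt) is to build it as a Python dictionary and serialize it, which keeps the braces balanced. The dataset and field names below are simply the examples used above.

```python
# Build the field-mapping JSON for the embedded dashboard programmatically.
# Dataset and field names are the examples from this article; replace them
# with the ones from your own org.
import json

mapping = {
    "datasets": {
        "Opportunities": [{
            "fields": ["Account.Name"],
            "filter": {"operator": "matches", "values": ["$Name"]},
        }],
        "Account": [{
            "fields": ["Name"],
            "filter": {"operator": "matches", "values": ["$Name"]},
        }],
    }
}

print(json.dumps(mapping, indent=2))
```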

25 Datasets for Deep Learning in IoT

Sugandha Lahoti
20 Mar 2018
8 min read
Deep learning is one of the major enablers of analytics and learning in the IoT domain. A good roundup of the state of deep learning advances for big data and IoT is given in the paper Deep Learning for IoT Big Data and Streaming Analytics: A Survey by Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. In this article, we draw inspiration from this research paper to establish the importance of IoT datasets for deep learning applications. The paper also provides a handy list of commonly used datasets suitable for building deep learning applications in IoT, which we have added at the end of the article.

IoT and Big Data: The relationship

IoT and big data have a two-way relationship. IoT is the main producer of big data, and as such an important target for big data analytics that aims to improve the processes and services of IoT. However, there are differences between the two:

- Large-scale streaming data: IoT data is large-scale streaming data, because a large number of IoT devices generate streams of data continuously. Big data, on the other hand, often lacks real-time processing.
- Heterogeneity: IoT data is heterogeneous, as various IoT data acquisition devices gather different information. Big data devices are generally homogeneous in nature.
- Time and space correlation: IoT sensor devices are attached to a specific location, so each data item has a location and a timestamp. Big data sensors generally lack this timestamp resolution.
- High-noise data: IoT data is highly noisy, owing to the tiny pieces of data in IoT applications that are prone to errors and noise during acquisition and transmission. Big data, in contrast, is generally less noisy.

Big data is classified according to the conventional three Vs: Volume, Velocity, and Variety. As such, the techniques used for big data analytics are not sufficient to analyze the kind of data being generated by IoT devices. For instance, autonomous cars need to make fast decisions on driving actions such as lane or speed changes. These decisions should be supported by fast analytics on data streaming from multiple sources (for example, cameras, radars, left/right signals, and traffic lights). This extends the classification of IoT big data to six Vs:

- Volume: The quantity of data generated by IoT devices is much greater than before and clearly fits this feature.
- Velocity: Advanced tools and technologies for analytics are needed to operate efficiently at the high rate of data production.
- Variety: Big data may be structured, semi-structured, or unstructured. The data types produced by IoT include text, audio, video, sensory data, and so on.
- Veracity: Veracity refers to the quality, consistency, and trustworthiness of the data, which in turn leads to accurate analytics.
- Variability: This property refers to the different rates of data flow.
- Value: Value is the transformation of big data into useful information and insights that bring competitive advantage to organizations.

Despite recent advancements in DL for big data, there are still significant challenges that need to be addressed to mature this technology. Each of the six characteristics of IoT big data imposes a challenge on DL techniques. One common denominator for all of them is the lack of availability of large IoT datasets.
IoT datasets and why they are needed

Deep learning methods have shown promising, state-of-the-art results in several areas, such as signal processing, natural language processing, and image recognition, and the trend is going up in IoT verticals as well. IoT datasets play a major role in improving IoT analytics: real-world IoT datasets provide more data, which in turn improves the accuracy of DL algorithms. However, the lack of availability of large real-world datasets for IoT applications is a major hurdle for incorporating DL models in IoT. The shortage of these datasets acts as a barrier to the deployment and acceptance of IoT analytics based on DL, since the empirical validation and evaluation of such systems needs to prove itself in the real world. The lack of availability is mainly because:

- Most IoT datasets are held by large organizations, which are unwilling to share them easily.
- Access is limited by copyright or privacy considerations. These are more common in domains with human data, such as healthcare and education.

While there is a lot of ground to be covered in terms of making IoT datasets available, here is a list of commonly used datasets suitable for building deep learning applications in IoT.

Dataset Name | Domain | Provider | Notes | Address/Link
CGIAR dataset | Agriculture, Climate | CCAFS | High-resolution climate datasets for a variety of fields, including agriculture | http://www.ccafs-climate.org/
Educational Process Mining | Education | University of Genova | Recordings of 115 subjects' activities through a logging application while learning with an educational simulator | http://archive.ics.uci.edu/ml/datasets/Educational+Process+Mining+%28EPM%29%3A+A+Learning+Analytics+Data+Set
Commercial Building Energy Dataset | Energy, Smart Building | IIITD | Energy-related dataset from a commercial building where data is sampled more than once a minute | http://combed.github.io/
Individual household electric power consumption | Energy, Smart home | EDF R&D, Clamart, France | One-minute sampling rate over a period of almost 4 years | http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption
AMPds dataset | Energy, Smart home | S. Makonin | Electricity, water, and natural gas measurements at one-minute intervals for 2 years of monitoring | http://ampds.org/
UK Domestic Appliance-Level Electricity | Energy, Smart home | Kelly and Knottenbelt | Power demand from five houses; in each house both the whole-house mains power demand and the power demand from individual appliances are recorded | http://www.doc.ic.ac.uk/~dk3810/data/
PhysioBank databases | Healthcare | PhysioNet | Archive of over 80 physiological datasets | https://physionet.org/physiobank/database/
Saarbruecken Voice Database | Healthcare | Universität des Saarlandes | A collection of voice recordings from more than 2000 persons for pathological voice detection | http://www.stimmdatebank.coli.uni-saarland.de/help_en.php4
T-LESS | Industry | CMP at Czech Technical University | An RGB-D dataset and evaluation methodology for detection and 6D pose estimation of texture-less objects | http://cmp.felk.cvut.cz/t-less/
CityPulse Dataset Collection | Smart City | CityPulse EU FP7 project | Road traffic data, pollution data, weather, parking | http://iot.ee.surrey.ac.uk:8080/datasets.html
Open Data Institute - node Trento | Smart City | Telecom Italia | Weather, air quality, electricity, telecommunication | http://theodi.fbk.eu/openbigdata/
Malaga datasets | Smart City | City of Malaga | A broad range of categories such as energy, ITS, weather, industry, sport, and so on | http://datosabiertos.malaga.eu/dataset
Gas sensors for home activity monitoring | Smart home | University of California San Diego | Recordings of 8 gas sensors under three conditions, including background, wine, and banana presentations | http://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring
CASAS datasets for activities of daily living | Smart home | Washington State University | Several public datasets related to Activities of Daily Living (ADL) performance in a two-story home, an apartment, and an office setting | http://ailab.wsu.edu/casas/datasets.html
ARAS Human Activity Dataset | Smart home | Bogazici University | Human activity recognition datasets collected from two real houses with multiple residents during two months | https://www.cmpe.boun.edu.tr/aras/
MERLSense Data | Smart home, building | Mitsubishi Electric Research Labs | Motion sensor data of residual traces from a network of over 200 sensors for two years, containing over 50 million records | http://www.merl.com/wmd
SportVU | Sport | Stats LLC | Video of basketball and soccer games captured from 6 cameras | http://go.stats.com/sportvu
RealDisp | Sport | O. Banos | Includes a wide range of physical activities (warm up, cool down, and fitness exercises) | http://orestibanos.com/datasets.htm
Taxi Service Trajectory | Transportation | Prediction Challenge, ECML PKDD 2015 | Trajectories performed by all 442 taxis running in the city of Porto, Portugal | http://www.geolink.pt/ecmlpkdd2015-challenge/dataset.html
GeoLife GPS Trajectories | Transportation | Microsoft | GPS trajectories represented as sequences of time-stamped points | https://www.microsoft.com/en-us/download/details.aspx?id=52367
T-Drive trajectory data | Transportation | Microsoft | Contains one-week trajectories of 10,357 taxis | https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/
Chicago Bus Traces data | Transportation | M. Doering | Bus traces from the Chicago Transit Authority for 18 days, sampled at a rate between 20 and 40 seconds | http://www.ibr.cs.tu-bs.de/users/mdoering/bustraces/
Uber trip data | Transportation | FiveThirtyEight | About 20 million Uber pickups in New York City during 12 months | https://github.com/fivethirtyeight/uber-tlc-foil-response
Traffic Sign Recognition | Transportation | K. Lim | Three datasets: Korean daytime, Korean nighttime, and German daytime traffic signs based on Vienna traffic rules | https://figshare.com/articles/Traffic_Sign_Recognition_Testsets/4597795
DDD17 | Transportation | J. Binas | End-To-End DAVIS Driving Dataset | http://sensors.ini.uzh.ch/databases.html

The Cambridge Analytica scandal and ethics in data science

Richard Gall
20 Mar 2018
5 min read
Earlier this month, Stack Overflow published the results of its 2018 developer survey. In it, there was an interesting set of questions around the concept of 'ethical code'. The main takeaway was that the issue remains a gray area. The Cambridge Analytica scandal, however, has given the question of 'ethical code' a renewed urgency in the last couple of days. The data analytics company is alleged not only to have been involved in votes in the UK and US, but also to have illegally harvested copious amounts of data from Facebook. For whistleblower Christopher Wylie, the issue of ethical code is particularly pronounced. “I created Steve Bannon’s psychological mindfuck tool,” he told Carole Cadwalladr in an interview in the Guardian.

Cambridge Analytica: psyops or just market research?

Wylie is a data scientist whose experience over the last half decade or so has been impressive. It's worth noting, however, that Wylie's career didn't begin in politics. His academic career was focused primarily on fashion forecasting. That might all seem a little prosaic, but it underlines the fact that data science never happens in a vacuum. Data scientists always operate within a given field. It might be tempting to view the world purely through the prism of impersonal data and cold statistics. To a certain extent you have to if you're a data scientist or a statistician. But at the very least this can be unhelpful; at worst, it's a potential threat to global democracy. At one point in the interview Wylie remarks that:

...it's normal for a market research company to amass data on domestic populations. And if you're working in some country and there's an auxiliary benefit to a current client with aligned interests, well that's just a bonus.

This is potentially the most frightening thing. Cambridge Analytica's ostensible role in elections and referenda isn't actually that remarkable. For all the vested interests and meetings between investors, researchers, and entrepreneurs, the scandal is really just an extension of the data mining and marketing tactics employed by just about every organization with a digital presence on the planet.

Data scientists are always going to be in a difficult position. True, we're not all going to end up working alongside Steve Bannon. But your skills are always being deployed with a very specific end in mind. It's not always easy to see the effects and impact of your work until later, but it's still essential for data scientists and analysts to be aware of whose data is being collected and used, how it's being used, and why.

Who is responsible for the ethics around data and code?

There was another interesting question in the Stack Overflow survey that's relevant to all of this. The survey asked respondents who was ultimately most responsible for code that accomplishes something unethical: 57.5% said upper management, 22.8% said the person who came up with the idea, and 19.7% said the developers themselves. Clearly the question is complex; the truth lies somewhere between all three. Management makes decisions about what's required from an organizational perspective, but the engineers themselves are, of course, a part of the wider organizational dynamic. They should be in a position where they are able to communicate any personal misgivings or broader legal issues with the work they are being asked to do. The case of Wylie and Cambridge Analytica is unique, however.
But it does highlight that data science can be deployed in ways that are difficult to predict. And without proper channels of escalation and the right degree of transparency, it's easy for things to remain secretive, hidden in small meetings, email threads, and paper trails. That's another thing data scientists need to remember: office politics might be a fact of life, but when you're a data scientist you're sitting at the apex of legal, strategic, and political issues. To refuse to be aware of this would be naive.

What the Cambridge Analytica story can teach data scientists

There's something else worth noting. This story also illustrates something more about the world in which data scientists are operating. This is a world where traditional infrastructure is being dismantled, and where privatization and outsourcing are viewed as the route towards efficiency and 'value for money'. Whether you think that's a good or bad thing isn't really the point here. What's important is that it makes the way we use data, and even the code we write, more problematic than ever, because it's not always easy to see how it's being used.

Arguably, Wylie was naive. His curiosity and desire to apply his data science skills to intriguing and complex problems led him towards people who knew just how valuable he could be. Wylie has evidently developed greater self-awareness, which is perhaps the main reason why he has come forward with his version of events. But as this saga unfolds, it's worth remembering the value of data scientists in the modern world, for a whole range of organizations. It has made the concept of the 'citizen data scientist' take on an even more urgent and literal meaning. Yes, data science can help to empower the economy and possibly even toy with democracy. But it can also be used to empower people and improve transparency in politics and business. If anything, the Cambridge Analytica saga proves that data science is a dangerous field: not only the sexiest job of the twenty-first century, but one of the most influential in shaping the kind of world we're going to live in. That's frightening, but it's also pretty exciting.

Getting started with Python Web Scraping

Amarabha Banerjee
20 Mar 2018
13 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] The amount of data available on the web is consistently growing both in quantity and in form. Businesses require this data to make decisions, particularly with the explosive growth of machine learning tools which require large amounts of data for training. Much of this data is available via Application Programming Interfaces, but at the same time a lot of valuable data is still only available through the process of web scraping. Python is the choice of programing language for many who build systems to perform scraping. It is an easy to use programming language with a rich ecosystem of tools for other tasks. In this article, we will focus on the fundamentals of setting up a scraping environment and perform basic requests for data with several tools of trade. Setting up a Python development environment If you have not used Python before, it is important to have a working development  environment. The recipes in this book will be all in Python and be a mix of interactive examples, but primarily implemented as scripts to be interpreted by the Python interpreter. This recipe will show you how to set up an isolated development environment with virtualenv and manage project dependencies with pip . We also get the code for the book and install it into the Python virtual environment. Getting ready We will exclusively be using Python 3.x, and specifically in my case 3.6.1. While Mac and Linux normally have Python version 2 installed, and Windows systems do not. So it is likely that in any case that Python 3 will need to be installed. You can find references for Python installers at www.python.org. You can check Python's version with python --version pip comes installed with Python 3.x, so we will omit instructions on its installation. Additionally, all command line examples in this book are run on a Mac. For Linux users the commands should be identical. On Windows, there are alternate commands (like dir instead of ls), but these alternatives will not be covered. How to do it We will be installing a number of packages with pip. These packages are installed into a Python environment. There often can be version conflicts with other packages, so a good practice for following along with the recipes in the book will be to create a new virtual Python environment where the packages we will use will be ensured to work properly. Virtual Python environments are managed with the virtualenv tool. This can be installed with the following command: ~ $ pip install virtualenv Collecting virtualenv Using cached virtualenv-15.1.0-py2.py3-none-any.whl Installing collected packages: virtualenv Successfully installed virtualenv-15.1.0 Now we can use virtualenv. But before that let's briefly look at pip. This command installs Python packages from PyPI, a package repository with literally 10's of thousands of packages. We just saw using the install subcommand to pip, which ensures a package is installed. We can also see all currently installed packages with pip list: ~ $ pip list alabaster (0.7.9) amqp (1.4.9) anaconda-client (1.6.0) anaconda-navigator (1.5.3) anaconda-project (0.4.1) aniso8601 (1.3.0) Packages can also be uninstalled using pip uninstall followed by the package name. I'll leave it to you to give it a try. Now back to virtualenv. 
Using virtualenv is very simple. Let's use it to create an environment and install the code from GitHub. Let's walk through the steps:

1. Create a directory to represent the project and enter the directory:

~ $ mkdir pywscb
~ $ cd pywscb

2. Initialize a virtual environment folder named env:

pywscb $ virtualenv env
Using base prefix '/Users/michaelheydt/anaconda'
New python executable in /Users/michaelheydt/pywscb/env/bin/python
copying /Users/michaelheydt/anaconda/bin/python => /Users/michaelheydt/pywscb/env/bin/python
copying /Users/michaelheydt/anaconda/bin/../lib/libpython3.6m.dylib => /Users/michaelheydt/pywscb/env/lib/libpython3.6m.dylib
Installing setuptools, pip, wheel...done.

3. This creates an env folder. Let's take a look at what was installed:

pywscb $ ls -la env
total 8
drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 .
drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 ..
drwxr-xr-x 16 michaelheydt staff 544 Jan 18 15:38 bin
drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 include
drwxr-xr-x 4 michaelheydt staff 136 Jan 18 15:38 lib
-rw-r--r-- 1 michaelheydt staff 60 Jan 18 15:38 pip-selfcheck.json

4. Next we activate the virtual environment. This command uses the content in the env folder to configure Python; after this, all Python activities are relative to this virtual environment:

pywscb $ source env/bin/activate
(env) pywscb $

5. We can check that python is indeed using this virtual environment with the following command:

(env) pywscb $ which python
/Users/michaelheydt/pywscb/env/bin/python

6. With our virtual environment created, let's clone the book's sample code and take a look at its structure:

(env) pywscb $ git clone https://github.com/PacktBooks/PythonWebScrapingCookbook.git
Cloning into 'PythonWebScrapingCookbook'...
remote: Counting objects: 420, done.
remote: Compressing objects: 100% (316/316), done.
remote: Total 420 (delta 164), reused 344 (delta 88), pack-reused 0
Receiving objects: 100% (420/420), 1.15 MiB | 250.00 KiB/s, done.
Resolving deltas: 100% (164/164), done.
Checking connectivity... done.

This created a PythonWebScrapingCookbook directory:

(env) pywscb $ ls -l
total 0
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 PythonWebScrapingCookbook
drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 env

7. Let's change into it and examine the content:

(env) PythonWebScrapingCookbook $ ls -l
total 0
drwxr-xr-x 15 michaelheydt staff 510 Jan 18 16:21 py
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 www

There are two directories. Most of the Python code is in the py directory. www contains some web content that we will use from time to time with a local web server. Let's look at the contents of the py directory:

(env) py $ ls -l
total 0
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 01
drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 03
drwxr-xr-x 21 michaelheydt staff 714 Jan 18 16:21 04
drwxr-xr-x 10 michaelheydt staff 340 Jan 18 16:21 05
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 06
drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 07
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 08
drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 09
drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 10
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 11
drwxr-xr-x 8 michaelheydt staff 272 Jan 18 16:21 modules

Code for each chapter is in the numbered folder matching the chapter (there is no code for chapter 2, as it is all interactive Python). Note that there is a modules folder. Some of the recipes throughout the book use code in those modules.
Make sure that your Python path points to this folder. On Mac and Linux you can set this in your .bash_profile file (and in the Environment Variables dialog on Windows):

export PYTHONPATH="/users/michaelheydt/dropbox/packt/books/pywebscrcookbook/code/py/modules"

The contents in each folder generally follow a numbering scheme matching the sequence of the recipes in the chapter. The following is the contents of the chapter 6 folder:

(env) py $ ls -la 06
total 96
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 .
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:26 ..
-rw-r--r-- 1 michaelheydt staff 902 Jan 18 16:21 01_scrapy_retry.py
-rw-r--r-- 1 michaelheydt staff 656 Jan 18 16:21 02_scrapy_redirects.py
-rw-r--r-- 1 michaelheydt staff 1129 Jan 18 16:21 03_scrapy_pagination.py
-rw-r--r-- 1 michaelheydt staff 488 Jan 18 16:21 04_press_and_wait.py
-rw-r--r-- 1 michaelheydt staff 580 Jan 18 16:21 05_allowed_domains.py
-rw-r--r-- 1 michaelheydt staff 826 Jan 18 16:21 06_scrapy_continuous.py
-rw-r--r-- 1 michaelheydt staff 704 Jan 18 16:21 07_scrape_continuous_twitter.py
-rw-r--r-- 1 michaelheydt staff 1409 Jan 18 16:21 08_limit_depth.py
-rw-r--r-- 1 michaelheydt staff 526 Jan 18 16:21 09_limit_length.py
-rw-r--r-- 1 michaelheydt staff 1537 Jan 18 16:21 10_forms_auth.py
-rw-r--r-- 1 michaelheydt staff 597 Jan 18 16:21 11_file_cache.py
-rw-r--r-- 1 michaelheydt staff 1279 Jan 18 16:21 12_parse_differently_based_on_rules.py

In the recipes I'll state that we'll be using the script in <chapter directory>/<recipe filename>. Now, just to be complete, if you want to get out of the Python virtual environment, you can exit using the following command:

(env) py $ deactivate
py $

And checking which python, we can see it has switched back:

py $ which python
/Users/michaelheydt/anaconda/bin/python

Scraping Python.org with Requests and Beautiful Soup

In this recipe we will install Requests and Beautiful Soup and scrape some content from www.python.org. We'll install both of the libraries and get some basic familiarity with them. We'll come back to them both in subsequent chapters and dive deeper into each.

Getting ready

In this recipe, we will scrape the upcoming Python events from https://www.python.org/events/python-events/. The following is an example of the Python.org events page (it changes frequently, so your experience will differ):

We will need to ensure that Requests and Beautiful Soup are installed. We can do that with the following:

pywscb $ pip install requests
Downloading/unpacking requests
Downloading requests-2.18.4-py2.py3-none-any.whl (88kB): 88kB downloaded
Downloading/unpacking certifi>=2017.4.17 (from requests)
Downloading certifi-2018.1.18-py2.py3-none-any.whl (151kB): 151kB downloaded
Downloading/unpacking idna>=2.5,<2.7 (from requests)
Downloading idna-2.6-py2.py3-none-any.whl (56kB): 56kB downloaded
Downloading/unpacking chardet>=3.0.2,<3.1.0 (from requests)
Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB): 133kB downloaded
Downloading/unpacking urllib3>=1.21.1,<1.23 (from requests)
Downloading urllib3-1.22-py2.py3-none-any.whl (132kB): 132kB downloaded
Installing collected packages: requests, certifi, idna, chardet, urllib3
Successfully installed requests certifi idna chardet urllib3
Cleaning up...

pywscb $ pip install bs4
Downloading/unpacking bs4
Downloading bs4-0.0.1.tar.gz
Running setup.py (path:/Users/michaelheydt/pywscb/env/build/bs4/setup.py) egg_info for package bs4

How to do it

Now let's go and learn to scrape a couple of events.
For this recipe we will start by using interactive Python. Start it with the ipython command:

$ ipython
Python 3.6.1 |Anaconda custom (x86_64)| (default, Mar 22 2017, 19:25:17)
Type "copyright", "credits" or "license" for more information.
IPython 5.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]:

Next we import Requests:

In [1]: import requests

We now use Requests to make a GET HTTP request for the URL https://www.python.org/events/python-events/:

In [2]: url = 'https://www.python.org/events/python-events/'
In [3]: req = requests.get(url)

That downloaded the page content, which is stored in our requests object req. We can retrieve the content using the .text property. This prints the first 200 characters:

In [4]: req.text[:200]
Out[4]: '<!doctype html>\n<!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9"> <![endif]-->\n<!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9"> <![endif]-->\n<!--[if IE 8]> <h'

We now have the raw HTML of the page. We can now use Beautiful Soup to parse the HTML and retrieve the event data. First import Beautiful Soup:

In [5]: from bs4 import BeautifulSoup

Now we create a BeautifulSoup object and pass it the HTML:

In [6]: soup = BeautifulSoup(req.text, 'lxml')

Now we tell Beautiful Soup to find the main <ul> tag for the recent events, and then to get all the <li> tags below it:

In [7]: events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')

And finally we can loop through each of the <li> elements, extracting the event details, and print each to the console:

In [13]: for event in events:
    ...:     event_details = dict()
    ...:     event_details['name'] = event.find('h3').find("a").text
    ...:     event_details['location'] = event.find('span', {'class': 'event-location'}).text
    ...:     event_details['time'] = event.find('time').text
    ...:     print(event_details)
    ...:
{'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan. 2018'}
{'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan. 2018'}
{'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb. 2018'}
{'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb. 2018'}
{'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb. 2018'}
{'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb. 2018'}

This entire example is available in the 01/01_events_with_requests.py script file. The following is its content, which pulls together all of what we just did step by step:

import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')

You can run this using the following command from the terminal:

$ python 01_events_with_requests.py
{'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan. 2018'}
{'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan. 2018'}
{'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb. 2018'}
{'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb. 2018'}
{'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb. 2018'}
{'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb. 2018'}

How it works

We will dive into the details of both Requests and Beautiful Soup in the next chapter, but for now let's just summarize a few key points about how this works. The following are the important points about Requests:

- Requests is used to execute HTTP requests. We used it to make a GET request of the URL for the events page.
- The Requests object holds the results of the request. This is not only the page content, but also many other items about the result, such as HTTP status codes and headers.
- Requests is used only to get the page; it does not do any parsing.

We use Beautiful Soup to do the parsing of the HTML and also to find content within the HTML. To understand how this worked, the content of the page has the following HTML to start the Upcoming Events section:

We used the power of Beautiful Soup to:

- Find the <ul> element representing the section, which is found by looking for a <ul> with a class attribute that has a value of list-recent-events.
- From that object, we find all the <li> elements.

Each of these <li> tags represents a different event. We iterate over each of them, making a dictionary from the event data found in child HTML tags:

- The name is extracted from the <a> tag that is a child of the <h3> tag.
- The location is the text content of the <span> with a class of event-location.
- The time is extracted from the <time> tag.

To summarize, we saw how to set up a Python environment for effective data scraping from the web, and also explored ways to use Beautiful Soup to perform preliminary data scraping for ethical purposes. If you liked this post, be sure to check out Web Scraping with Python, which contains many more useful recipes for scraping data with Python.
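As a small extension of the recipe (not from the book), the same scrape can be made a little more defensive and its output written to a file. This is a sketch that assumes the page still uses the list-recent-events markup, and it uses the standard library's html.parser so no extra parser needs to be installed.

```python
# A defensive variation on the recipe above (not from the book): tolerate a
# missing events list and write the results to a JSON file instead of printing.
import json

import requests
from bs4 import BeautifulSoup


def scrape_events(url, out_path="events.json"):
    req = requests.get(url, timeout=30)
    req.raise_for_status()                      # fail loudly on HTTP errors
    soup = BeautifulSoup(req.text, "html.parser")
    listing = soup.find("ul", {"class": "list-recent-events"})
    if listing is None:                         # page markup may have changed
        raise RuntimeError("Could not find the events list on the page")
    events = []
    for item in listing.find_all("li"):
        events.append({
            "name": item.find("h3").find("a").text,
            "location": item.find("span", {"class": "event-location"}).text,
            "time": item.find("time").text,
        })
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(events, fh, ensure_ascii=False, indent=2)
    return events


if __name__ == "__main__":
    print(scrape_events("https://www.python.org/events/python-events/"))
```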

How to create and prepare your first dataset in Salesforce Einstein

Amey Varangaonkar
19 Mar 2018
3 min read
[box type="note" align="" class="" width=""]The following extract is taken from the book Learning Einstein Analytics written by Santosh Chitalkar. This book will help you learn Salesforce Einstein analytics, to get insights faster and understand your customer better.[/box] In this article, we see how to start your analytics journey using Salesforce Einstein by taking the first step in the process i.e; by creating and preparing your dataset! A dataset is a set of source data, specially formatted and optimized for interactive exploration. Here are the steps to create a new dataset in Salesforce Einstein: Click on the Create button in the top-right corner and then click on Dataset. You can see the following three options to create datasets: CSV File Salesforce Informatica Rev 2. Select CSV File and click on Continue, as shown in the following screenshot: 3. Select the Account_data.csv file or drag and drop the file. 4. Click on Next. The next screen uploads the user interface to create a single dataset by using the external.csv file: 5. Click on Next to proceed as shown in the following screenshot: 6. Change the dataset name if you want. You can select an application to store the dataset. You can also replace the CSV file from this screen. 7. Click on in the Data Schema File section and select the Replace File option to change the file. You can also download the uploaded .csv file from here as shown in the following screenshot: 8. Click on Next. In the next screen, you can change field attributes such as column name, dimensions, field type, and so on. 9. Click on the Next button and it will start uploading the file in Analytics and queuing it in dataflow. Once done click on the Got it button. 10. Wait for 10-15 minutes (depending on the data, it may take a longer time to create the dataset). 11. Go to Analytics Studio and open the DATASETS tab. You can see the Account_data dataset as shown in the following screenshot: Congrats!!! You have created your first dataset. Let's now update this dataset with the same information but with some additional columns. Updating datasets We need to update the dataset to add new fields, change application settings, remove fields, and so on. Einstein Analytics gives users the flexibility to update the dataset. Here are the steps to update an existing dataset: Create a CSV file to include some new fields and name it Account_Data_Updated. Save the file to a location that you can easily remember. In Salesforce, go to the Analytics Studio home page and find the dataset. Hover over the dataset and click on the button, then click on Edit, as shown in the following screenshot: 4. Salesforce displays the dataset editing screen. Click on the Replace Data button in the top-right corner of the page: 5. Click on the Next button and upload your new CSV file using upload UI. 6. Click on the Next button again to get to the next screen for editing and click on Next again. 7. Click on Replace as shown in the following screenshot: Voila! You’ve successfully updated your dataset. As you can see it’s fairly easy to create and then update the dataset if required, using Einstein without any hassle. If you found this post useful, make sure to check out our book Learning Einstein Analytics for more tips and techniques on using Einstein Analytics effectively to uncover unique insights from your data.        

Perform CRUD operations on MongoDB with PHP

Amey Varangaonkar
17 Mar 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book Mastering MongoDB 3.x authored by Alex Giamas. This book covers the key concepts, and tips & tricks needed to build fault-tolerant applications in MongoDB. It gives you the power to become a true expert when it comes to the world’s most popular NoSQL database.[/box] In today’s tutorial, we will cover the CRUD (Create, Read, Update and Delete) operations using the popular PHP language with the official MongoDB driver. Create and delete operations To perform the create and delete operations, run the following code: $document = array( "isbn" => "401", "name" => "MongoDB and PHP" ); $result = $collection->insertOne($document); var_dump($result); This is the output: MongoDBInsertOneResult Object ( [writeResult:MongoDBInsertOneResult:private] => MongoDBDriverWriteResult Object ( [nInserted] => 1 [nMatched] => 0 [nModified] => 0 [nRemoved] => 0 [nUpserted] => 0 [upsertedIds] => Array ( ) [writeErrors] => Array ( ) [writeConcernError] => [writeConcern] => MongoDBDriverWriteConcern Object ( ) ) [insertedId:MongoDBInsertOneResult:private] => MongoDBBSONObjectID Object ( [oid] => 5941ac50aabac9d16f6da142 ) [isAcknowledged:MongoDBInsertOneResult:private] => 1 ) The rather lengthy output contains all the information that we may need. We can get the ObjectId of the document inserted; the number of inserted, matched, modified, removed, and upserted documents by fields prefixed with n; and information about writeError or writeConcernError. There are also convenience methods in the $result object if we want to get the Information: $result->getInsertedCount(): To get the number of inserted objects $result->getInsertedId(): To get the ObjectId of the inserted document We can also use the ->insertMany() method to insert many documents at once, like this: $documentAlpha = array( "isbn" => "402", "name" => "MongoDB and PHP, 2nd Edition" ); $documentBeta = array( "isbn" => "403", "name" => "MongoDB and PHP, revisited" ); $result = $collection->insertMany([$documentAlpha, $documentBeta]); print_r($result); The result is: ( [writeResult:MongoDBInsertManyResult:private] => MongoDBDriverWriteResult Object ( [nInserted] => 2 [nMatched] => 0 [nModified] => 0 [nRemoved] => 0 [nUpserted] => 0 [upsertedIds] => Array ( ) [writeErrors] => Array ( ) [writeConcernError] => [writeConcern] => MongoDBDriverWriteConcern Object ( ) ) [insertedIds:MongoDBInsertManyResult:private] => Array ( [0] => MongoDBBSONObjectID Object ( [oid] => 5941ae85aabac9d1d16c63a2 ) [1] => MongoDBBSONObjectID Object ( [oid] => 5941ae85aabac9d1d16c63a3 ) ) [isAcknowledged:MongoDBInsertManyResult:private] => 1 ) Again, $result->getInsertedCount() will return 2, whereas $result->getInsertedIds() will return an array with the two newly created ObjectIds: array(2) { [0]=> object(MongoDBBSONObjectID)#13 (1) { ["oid"]=> string(24) "5941ae85aabac9d1d16c63a2" } [1]=> object(MongoDBBSONObjectID)#14 (1) { ["oid"]=> string(24) "5941ae85aabac9d1d16c63a3" } } Deleting documents is similar to inserting but with the deleteOne() and deleteMany() methods; an example of deleteMany() is shown here: $deleteQuery = array( "isbn" => "401"); $deleteResult = $collection->deleteMany($deleteQuery); print_r($result); print($deleteResult->getDeletedCount()); Here is the output: MongoDBDeleteResult Object ( [writeResult:MongoDBDeleteResult:private] => MongoDBDriverWriteResult Object ( [nInserted] => 0 [nMatched] => 0 [nModified] => 0 [nRemoved] => 2 [nUpserted] => 0 [upsertedIds] => Array ( ) 
[writeErrors] => Array ( ) [writeConcernError] => [writeConcern] => MongoDB\Driver\WriteConcern Object ( ) ) [isAcknowledged:MongoDB\DeleteResult:private] => 1 ) 2

In this example, we used ->getDeletedCount() to get the number of affected documents, which is printed out in the last line of the output.

Bulk write

The new PHP driver supports the bulk write interface to minimize network calls to MongoDB:

$manager = new MongoDB\Driver\Manager('mongodb://localhost:27017');
$bulk = new MongoDB\Driver\BulkWrite(array("ordered" => true));
$bulk->insert(array( "isbn" => "401", "name" => "MongoDB and PHP" ));
$bulk->insert(array( "isbn" => "402", "name" => "MongoDB and PHP, 2nd Edition" ));
$bulk->update(array("isbn" => "402"), array('$set' => array("price" => 15)));
$bulk->insert(array( "isbn" => "403", "name" => "MongoDB and PHP, revisited" ));
$result = $manager->executeBulkWrite('mongo_book.books', $bulk);
print_r($result);

The result is:

MongoDB\Driver\WriteResult Object ( [nInserted] => 3 [nMatched] => 1 [nModified] => 1 [nRemoved] => 0 [nUpserted] => 0 [upsertedIds] => Array ( ) [writeErrors] => Array ( ) [writeConcernError] => [writeConcern] => MongoDB\Driver\WriteConcern Object ( ) )

In the preceding example, we executed two inserts, one update, and a third insert in an ordered fashion. The WriteResult object contains a total of three inserted documents and one modified document. The main difference compared to simple create/delete queries is that executeBulkWrite() is a method of the MongoDB\Driver\Manager class, which we instantiate on the first line.

Read operation

Querying the interface is similar to inserting and deleting, with the findOne() and find() methods used to retrieve the first result or all results of a query:

$document = $collection->findOne( array("isbn" => "101") );
$cursor = $collection->find( array( "name" => new MongoDB\BSON\Regex("mongo", "i") ) );

In the second example, we are using a regular expression to search for a key name with the value mongo (case-insensitive). Embedded documents can be queried using the . notation, as with the other languages that we examined earlier in this chapter:

$cursor = $collection->find( array('meta.price' => 50) );

We do this to query for an embedded document price inside the meta key field. Similarly to Ruby and Python, in PHP we can query using comparison operators, like this:

$cursor = $collection->find( array( 'price' => array('$gte'=> 60) ) );

Querying with multiple key-value pairs is an implicit AND, whereas queries using $or, $in, $nin, or AND ($and) combined with $or can be achieved with nested queries:

$cursor = $collection->find( array( '$or' => array( array("price" => array( '$gte' => 60)), array("price" => array( '$lte' => 20)) )));

This finds documents that have price >= 60 OR price <= 20.

Update operation

Updating documents has a similar interface, with the ->updateOne() or ->updateMany() method. The first parameter is the query used to find documents and the second one will update our documents. We can use any of the update operators explained at the end of this chapter to update in place, or specify a new document to completely replace the document in the query:

$result = $collection->updateOne( array( "isbn" => "401"), array( '$set' => array( "price" => 39 ) ) );

We can use single quotes or double quotes for key names, but if we have special operators starting with $, we need to use single quotes. We can use array( "key" => "value" ) or ["key" => "value"]. We prefer the more explicit array() notation in this book.
The ->getMatchedCount() and ->getModifiedCount() methods will return the number of documents matched in the query part or the number modified by the query. If the new value is the same as the existing value of a document, it will not be counted as modified.

We saw that it is fairly easy and advantageous to use PHP as a language and tool for performing efficient CRUD operations in MongoDB to handle data efficiently. If you are interested in getting more information on how to effectively handle data using MongoDB, you may check out the book Mastering MongoDB 3.x.
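As a side note for readers following the Python-focused articles on this page, the PHP bulk write above has a close PyMongo counterpart. The following is a sketch, not part of the book excerpt, and it assumes a local mongod on the default port.

```python
# Roughly the same ordered bulk write as the PHP example, using PyMongo.
# Assumes a local MongoDB instance on the default port.
from pymongo import MongoClient, InsertOne, UpdateOne

client = MongoClient("mongodb://localhost:27017/")
books = client["mongo_book"]["books"]

result = books.bulk_write(
    [
        InsertOne({"isbn": "401", "name": "MongoDB and PHP"}),
        InsertOne({"isbn": "402", "name": "MongoDB and PHP, 2nd Edition"}),
        UpdateOne({"isbn": "402"}, {"$set": {"price": 15}}),
        InsertOne({"isbn": "403", "name": "MongoDB and PHP, revisited"}),
    ],
    ordered=True,
)
# Mirrors nInserted / nMatched / nModified from the PHP WriteResult output.
print(result.inserted_count, result.matched_count, result.modified_count)
```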

Prime numbers, modern encryption and their ancient Greek connection!

Richard Gall
16 Mar 2018
2 min read
Prime numbers are incredibly important in a number of fields, from computer science to cybersecurity. But they are also incredibly mysterious; they are quite literally enigmas. They have been the subject of thousands of years of research and exploration, but they have still not been cracked: we are yet to find a formula that will generate prime numbers easily.

Prime numbers are particularly integral to modern encryption. When a file is encrypted, the number used to do so is built using two primes. The only way to decrypt it is to work out the prime factors of that gargantuan number, a task which is almost impossible even with the most extensive computing power currently at our disposal. As well as this, prime numbers are also used in error-correcting codes and in mass storage and data transmission.

Did you know the Greeks were among the early champions of our modern encryption systems? Find out how to generate prime numbers manually using a method devised by the Greek mathematician Eratosthenes in this fun video from the Packt video course Fundamental Algorithms in Scala: https://www.youtube.com/watch?v=cd8v-Jo8obs&t=56s
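If you would like to try the idea in code before (or after) watching, here is a minimal Python sketch of Eratosthenes' sieve; the course itself works in Scala, so this is just an illustration of the same method.

```python
# Sieve of Eratosthenes: repeatedly cross out the multiples of each prime
# that is still standing, leaving only the primes up to the chosen limit.
def primes_up_to(limit):
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]                 # 0 and 1 are not prime
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            # cross out every multiple of n, starting at n*n
            is_prime[n * n :: n] = [False] * len(is_prime[n * n :: n])
    return [n for n, prime in enumerate(is_prime) if prime]


print(primes_up_to(50))
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
```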

Connecting your data to MongoDB using PyMongo and PHP

Amey Varangaonkar
16 Mar 2018
6 min read
[box type="note" align="" class="" width=""]The following book excerpt is taken from the title Mastering MongoDB 3.x written by Alex Giamas. This book covers the fundamental as well as advanced tasks related to working with MongoDB.[/box] There are two ways to connect your data to MongoDB. The first is using the driver for your programming language. The second is by using an ODM (Object Document Mapper) layer to map your model objects to MongoDB in a transparent way. In this post, we will cover both methods using two of the most popular languages for web application development: Python, using the official MongoDB low level driver, PyMongo, and PHP, using the official PHP driver for MongoDB Connect using Python Installing PyMongo can be done using pip or easy_install: python -m pip install pymongo python -m easy_install pymongo Then in our class we can connect to a database: >>> from pymongo import MongoClient >>> client = MongoClient() Connecting to a replica set needs a set of seed servers for the client to find out what the primary, secondary, or arbiter nodes in the set are: client = pymongo.MongoClient('mongodb://user:passwd@node1:p1,node2:p2/?replicaSet=rs name') Using the connection string URL we can pass a username/password and replicaSet name all in a single string. Some of the most interesting options for the connection string URL are presented later. Connecting to a shard requires the server host and IP for the mongo router, which is the mongos process. PyMODM ODM Similar to Ruby's Mongoid, PyMODM is an ODM for Python that follows closely on Django's built-in ORM. Installing it can be done via pip: pip install pymodm Then we need to edit settings.py and replace the database engine with a dummy database: DATABASES = { 'default': { 'ENGINE': 'django.db.backends.dummy' } } And add our connection string anywhere in settings.py: from pymodm import connect connect("mongodb://localhost:27017/myDatabase", alias="MyApplication") Here we have to use a connection string that has the following structure: mongodb://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:port N]]][/[database][?options]] Options have to be pairs of name=value with an &amp; between each pair. Some interesting pairs are: Model classes need to inherit from MongoModel. A sample class will look like this: from pymodm import MongoModel, fields class User(MongoModel): email = fields.EmailField(primary_key=True) first_name = fields.CharField() last_name = fields.CharField() This has a User class with first_name, last_name, and email fields where email is the primary field. Inheritance with PyMODM models Handling one-one and one-many relationships in MongoDB can be done using references or embedding. This example shows both ways: references for the model user and embedding for the comment model: from pymodm import EmbeddedMongoModel, MongoModel, fields class Comment(EmbeddedMongoModel): author = fields.ReferenceField(User) content = fields.CharField() class Post(MongoModel): title = fields.CharField() author = fields.ReferenceField(User) revised_on = fields.DateTimeField() content = fields.CharField() comments = fields.EmbeddedDocumentListField(Comment) Connecting with PHP The MongoDB PHP driver was rewritten from scratch two years ago to support the PHP 5, PHP 7, and HHVM architectures. The current architecture is shown in the following diagram: Currently we have official drivers for all three architectures with full support for the underlying functionality. Installation is a two-step process. 
First we need to install the MongoDB extension. This extension is dependent on the version of PHP (or HHVM) that we have installed, and can be installed using brew on a Mac. For example, with PHP 7.0:

brew install php70-mongodb

Then, using composer (a widely used dependency manager for PHP):

composer require mongodb/mongodb

Connecting to the database can then be done by using the connection string URI or by passing an array of options. Using the connection string URI we have:

$client = new MongoDB\Client($uri = 'mongodb://127.0.0.1/', array $uriOptions = [], array $driverOptions = [])

For example, to connect to a replica set using SSL authentication:

$client = new MongoDB\Client('mongodb://myUsername:myPassword@rs1.example.com,rs2.example.com/?ssl=true&replicaSet=myReplicaSet&authSource=admin');

Or we can use the $uriOptions parameter to pass in parameters without using the connection string URL, like this:

$client = new MongoDB\Client(
    'mongodb://rs1.example.com,rs2.example.com/',
    [
        'username' => 'myUsername',
        'password' => 'myPassword',
        'ssl' => true,
        'replicaSet' => 'myReplicaSet',
        'authSource' => 'admin',
    ]
);

The set of $uriOptions and the connection string URL options available are analogous to the ones used for Ruby and Python.

Doctrine ODM

Laravel is one of the most widely used MVC frameworks for PHP, similar in architecture to Django and Rails from the Python and Ruby worlds respectively. We will follow through configuring our models using a stack of Laravel, Doctrine, and MongoDB. This section assumes that Doctrine is installed and working with Laravel 5.x.

Doctrine entities are POPOs (Plain Old PHP Objects) that, unlike entities in Eloquent, Laravel's default ORM, don't need to inherit from the Model class. Doctrine uses the Data Mapper pattern, whereas Eloquent uses Active Record. Skipping the get()/set() methods, a simple class would look like:

use Doctrine\ORM\Mapping AS ORM;
use Doctrine\Common\Collections\ArrayCollection;

/**
 * @ORM\Entity
 * @ORM\Table(name="scientist")
 */
class Scientist
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    protected $id;
    /**
     * @ORM\Column(type="string")
     */
    protected $firstname;
    /**
     * @ORM\Column(type="string")
     */
    protected $lastname;
    /**
     * @ORM\OneToMany(targetEntity="Theory", mappedBy="scientist", cascade={"persist"})
     * @var ArrayCollection|Theory[]
     */
    protected $theories;

    /**
     * @param $firstname
     * @param $lastname
     */
    public function __construct($firstname, $lastname)
    {
        $this->firstname = $firstname;
        $this->lastname = $lastname;
        $this->theories = new ArrayCollection;
    }

    …

    public function addTheory(Theory $theory)
    {
        if (!$this->theories->contains($theory)) {
            $theory->setScientist($this);
            $this->theories->add($theory);
        }
    }
}

This POPO-based model uses annotations to define the field types that need to be persisted in MongoDB. For example, @ORM\Column(type="string") defines a string field in MongoDB, with firstname and lastname as the attribute names in the respective lines. There is a whole set of annotations available at http://doctrine-orm.readthedocs.io/en/latest/reference/annotations-reference.html. If we want to separate the POPO structure from annotations, we can also define them using YAML or XML instead of inlining them with annotations in our POPO model classes.

Inheritance with Doctrine

Modeling one-one and one-many relationships can be done via annotations, YAML, or XML.
Using annotations, we can define multiple embedded subdocuments within our document: /** @Document */ class User { // … /** @EmbedMany(targetDocument="Phonenumber") */ private $phonenumbers = array(); // … } /** @EmbeddedDocument */ class Phonenumber { // … } Here a User document embeds many PhoneNumbers. @EmbedOne() will embed one subdocument to be used for modeling one-one relationships. Referencing is similar to embedding: /** @Document */ class User { // … /** * @ReferenceMany(targetDocument="Account") */ private $accounts = array(); // … } /** @Document */ class Account { // … } @ReferenceMany() and @ReferenceOne() are used to model one-many and one-one relationships via referencing into a separate collection. We saw that the process of connecting data to MongoDB using Python and PHP is quite similar. We can accordingly define relationships as being embedded or referenced, depending on our design decision. If you found this post useful, check out our book Mastering MongoDB 3.x for more tips and techniques on working with MongoDB efficiently.  
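As a quick appendix to the Python part of this post, a minimal PyMongo round trip against a local mongod might look like the following sketch; the database name, collection name, and sample document are made up for illustration:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['blog']          # hypothetical database name
posts = db['posts']          # hypothetical collection name

# insert a document and read it back by its generated _id
post_id = posts.insert_one({'title': 'Hello MongoDB',
                            'tags': ['python', 'pymongo']}).inserted_id
print(posts.find_one({'_id': post_id}))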

Troubleshooting in SQL Server

Sunith Shetty
15 Mar 2018
16 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book SQL Server 2017 Administrator's Guide written by Marek Chmel and Vladimír Mužný. This book will help you learn to implement and administer successful database solution with SQL Server 2017.[/box] Today, we will perform SQL Server analysis, and also learn ways for efficient performance monitoring and tuning. Performance monitoring and tuning Performance monitoring and tuning is a crucial part of your database administration skill set so as to keep the performance and stability of your server great, and to be able to find and fix the possible issues. The overall system performance can decrease over time; your system may work with more data or even become totally unresponsive. In such cases, you need the skills and tools to find the issue to bring the server back to normal as fast as possible. We can use several tools on the operating system layer and, then, inside the SQL Server to verify the performance and the possible root cause of the issue. The first tool that we can use is the performance monitor, which is available on your Windows Server: Performance monitor can be used to track important SQL Server counters, which can be very helpful in evaluating the SQL Server Performance. To add a counter, simply right-click on the monitoring screen in the Performance monitoring and tuning section and use the Add Counters item. If the SQL Server instance that you're monitoring is a default instance, you will find all the performance objects listed as SQL Server. If your instance is named, then the performance objects will be listed as MSSQL$InstanceName in the list of performance objects. We can split the important counters to watch between the system counters for the whole server and specific SQL Server counters. The list of system counters to watch include the following: Processor(_Total)% Processor Time: This is a counter to display the CPU load. Constant high values should be investigated to verify whether the current load does not exceed the performance limits of your HW or VM server, or if your workload is not running with proper indexes and statistics, and is generating bad query plans. MemoryAvailable MBytes: This counter displays the available memory on the operating system. There should always be enough memory for the operating system. If this counter drops below 64MB, it will send a notification of low memory and the SQL Server will reduce the memory usage. Physical Disk—Avg. Disk sec/Read: This disk counter provides the average latency information for your storage system; be careful if your storage is made of several different disks to monitor the proper storage system. Physical Disk: This indicates the average disk writes per second. Physical Disk: This indicates the average disk reads per second. Physical Disk: This indicates the number of disk writes per second. System—Processor Queue Length: This counter displays the number of threads waiting on a system CPU. If the counter is above 0, this means that there are more requests than the CPU can handle, and if the counter is constantly above 0, this may signal performance issues. Network interface: This indicates the total number of bytes per second. 
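If you just want a quick scripted look at the same system-level signals (CPU load and available memory) outside of Performance Monitor, a small Python check with the psutil package is one option. This is only a sketch to complement the counters above, not a replacement for a proper Performance Monitor data collection:

import psutil

# CPU utilization sampled over one second (roughly the % Processor Time counter)
print('CPU %:', psutil.cpu_percent(interval=1))

# available physical memory in MB (roughly the Available MBytes counter)
print('Available MB:', psutil.virtual_memory().available // (1024 * 1024))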
Once you have added all these system counters, you can see the values in real time, or you can configure a data collection, which will run for a specified time and periodically collect the information. With SQL Server-specific counters, we can dig deeper into the CPU, memory, and storage utilization to see what the SQL Server is doing and how it is utilizing the subsystems.

SQL Server memory monitoring and troubleshooting

Important counters to watch for SQL Server memory utilization include counters from the SQL Server:Buffer Manager performance object and from SQL Server:Memory Manager:

SQLServer-Buffer Manager—buffer cache hit ratio: This counter displays the ratio of how often the SQL Server can find the proper data in the cache when a query returns such data. If the data is not found in the cache, it has to be read from the disk. The higher the counter, the better the overall performance, since memory access is usually faster compared to the disk subsystem.
SQLServer-Buffer Manager—page life expectancy: This counter measures how long a page can stay in the memory, in seconds. The longer a page can stay in the memory, the less likely it will be for the SQL Server to need to access the disk in order to get the data into the memory again.
SQL Server-Memory Manager—total server memory (KB): This is the amount of memory the server has committed using the memory manager.
SQL Server-Memory Manager—target server memory (KB): This is the ideal amount of memory the server can consume. On a stable system, the target and total should be equal unless you face memory pressure. Once the memory is utilized after the warm-up of your server, these two counters should not drop significantly, which would be another indication of system-level memory pressure, where the SQL Server memory manager has to deallocate memory.
SQL Server-Memory Manager—memory grants pending: This counter displays the total number of SQL Server processes that are waiting to be granted memory from the memory manager.

To check the performance counters, you can also use a T-SQL query against the sys.dm_os_performance_counters DMV:

SELECT [counter_name] as [Counter Name], [cntr_value]/1024 as [Server Memory (MB)]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%'
 AND [counter_name] IN ('Total Server Memory (KB)', 'Target Server Memory (KB)')

This query will return two values—one for target memory and one for total memory. These two should be close to each other on a warmed-up system.
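If you prefer to capture these memory counters from a monitoring script rather than from SSMS, the same DMV query can be issued from Python with pyodbc. This is only a sketch; the connection string is an assumption and has to be adapted to your server, driver version, and authentication method:

import pyodbc

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=localhost;DATABASE=master;Trusted_Connection=yes;'
)
sql = """
SELECT [counter_name], [cntr_value]/1024 AS [Server Memory (MB)]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%'
  AND [counter_name] IN ('Total Server Memory (KB)', 'Target Server Memory (KB)')
"""
# counter_name is a padded character column, hence the strip()
for counter_name, value_mb in conn.cursor().execute(sql).fetchall():
    print(counter_name.strip(), value_mb)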
Another query you can use gets its information from a DMV named sys.dm_os_sys_memory:

SELECT total_physical_memory_kb/1024/1024 AS [Physical Memory (GB)],
   available_physical_memory_kb/1024/1024 AS [Available Memory (GB)],
   system_memory_state_desc AS [System Memory State]
FROM sys.dm_os_sys_memory WITH (NOLOCK)
OPTION (RECOMPILE)

This query will display the available physical memory and the total physical memory of your server, along with one of several possible memory states:

Available physical memory is high (this is a state you would like to see on your system, indicating there is no lack of memory)
Physical memory usage is steady
Available physical memory is getting low
Available physical memory is low

The memory grants can be verified with a T-SQL query:

SELECT [object_name] as [Object name], cntr_value AS [Memory Grants Pending]
FROM sys.dm_os_performance_counters WITH (NOLOCK)
WHERE  [object_name] LIKE N'%Memory Manager%'
 AND counter_name = N'Memory Grants Pending'
OPTION (RECOMPILE);

If you face memory issues, there are several steps you can take to improve the situation:

Check and configure your SQL Server max memory usage
Add more RAM to your server; the limit for Standard Edition is 128 GB and there is no limit for Enterprise Edition
Use Lock Pages in Memory
Optimize your queries

SQL Server storage monitoring and troubleshooting

The important counters to watch for SQL Server storage utilization include counters from the SQL Server:Access Methods performance object:

SQL Server-Access Methods—Full Scans/sec: This counter displays the number of full scans per second, which can be either table or full-index scans
SQL Server-Access Methods—Index Searches/sec: This counter displays the number of searches in the index per second
SQL Server-Access Methods—Forwarded Records/sec: This counter displays the number of forwarded records per second

Monitoring the disk system is crucial, since your disk is used as storage for the following:

Data files
Log files
tempDB database
Page file
Backup files

To verify the disk latency and IOPS metrics of your drives, you can use the Performance monitor, or T-SQL commands that query the sys.dm_os_volume_stats and sys.dm_io_virtual_file_stats DMFs. Simple code to start with would be a T-SQL script utilizing the first DMF to check the space available within the database files:

SELECT f.database_id, f.file_id, volume_mount_point, total_bytes, available_bytes
FROM sys.master_files AS f
CROSS APPLY sys.dm_os_volume_stats(f.database_id, f.file_id);

To check the I/O file stats with the second DMF, you can use a T-SQL script that returns the information about the tempDB data files:

SELECT * FROM sys.dm_io_virtual_file_stats (NULL, NULL) vfs
join sys.master_files mf on mf.database_id = vfs.database_id and mf.file_id = vfs.file_id
WHERE mf.database_id = 2 and mf.type = 0

To measure the disk performance, we can use a tool named DiskSpd, the replacement for the older SQLIO tool, which was used for a long time. DiskSpd is an external utility, which is not available on the operating system by default. This tool can be downloaded from GitHub or the Technet Gallery at https://github.com/microsoft/diskspd.
The following example runs a test for 15 seconds using a single thread to drive 100 percent random 64 KB reads at a depth of 15 overlapped (outstanding) I/Os to a regular file: DiskSpd –d300 -F1 -w0 -r –b64k -o15 d:datafile.dat Troubleshooting wait statistics We can use the whole Wait Statistics approach for a thorough understanding of the SQL Server workload and undertake performance troubleshooting based on the collected data. Wait Statistics are based on the fact that, any time a request has to wait for a resource, the SQL Server tracks this information, and we can use this information for further analysis. When we consider any user process, it can include several threads. A thread is a single unit of execution on SQL Server, where SQLOS controls the thread scheduling instead of relying on the operating system layer. Each processor core has it's own scheduler component responsible for executing such threads. To see the available schedulers in your SQL Server, you can use the following query: SELECT * FROM sys.dm_os_schedulers Such code will return all the schedulers in your SQL Server; some will be displayed as visible online and some as hidden online. The hidden ones are for internal system tasks while the visible ones are used by user tasks running on the SQL Server. There is one more scheduler, which is displayed as Visible Online (DAC). This one is used for dedicated administration connection, which comes in handy when the SQL Server stops responding. To use a dedicated admin connection, you can modify your SSMS connection to use the DAC, or you can use a switch with the sqlcmd.exe utility, to connect to the DAC. To connect to the default instance with DAC on your server, you can use the following command: sqlcmd.exe -E -A Each thread can be in three possible states: running: This indicates that the thread is running on the processor suspended: This indicates that the thread is waiting for a resource on a waiter list runnable:  This indicates that the thread is waiting for execution on a runnable queue Each running thread runs until it has to wait for a resource to become available or until it has exhausted the CPU time for a running thread, which is set to 4 ms. This 4 ms time can be visible in the output of the previous query to sys.dm_os_schedulers and is called a quantum. When a thread requires any resource, it is moved away from the processor to a waiter list, where the thread waits for the resource to become available. Once the resource is available, the thread is notified about the resource availability and moves to the bottom of the runnable queue. Any waiting thread can be found via the following code, which will display the waiting threads and the resource they are waiting for: SELECT * FROM sys.dm_os_waiting_tasks The threads then transition between the execution at the CPU, waiter list, and runnable queue. There is a special case when a thread does not need to wait for any resource and has already run for 4 ms on the CPU, then the thread will be moved directly to the runnable queue instead of the waiter list. In the following image, we can see the thread states and the objects where the thread resides: When the thread is waiting on the waiter list, we can talk about a resource wait time. When the thread is waiting on the runnable queue to get on the CPU for execution, we can talk about the signal time. The total wait time is, then, the sum of the signal and resource wait times. 
You can find the ratio of the signal to resource wait times with the following script: Select signalWaitTimeMs=sum(signal_wait_time_ms)  ,'%signal waits' = cast(100.0 * sum(signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2))  ,resourceWaitTimeMs=sum(wait_time_ms - signal_wait_time_ms)  ,'%resource waits'= cast(100.0 * sum(wait_time_ms - signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2)) from sys.dm_os_wait_stats When the ratio goes over 30 percent for the signal waits, then there will be a serious CPU pressure and your processor(s) will have a hard time handling all the incoming requests from the threads. The following query can then grab the wait statistics and display the most frequent wait types, which were recorded through the thread executions, or actually during the time the threads were waiting on the waiter list for any particular resource: WITH [Waits] AS  (SELECT    [wait_type],   [wait_time_ms] / 1000.0 AS [WaitS], ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS], [signal_wait_time_ms] / 1000.0 AS [SignalS], [waiting_tasks_count] AS [WaitCount], 100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage], ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum] FROM sys.dm_os_wait_stats WHERE [wait_type] NOT IN ( N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP', N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE', N'CHKPT', N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'CLR_SEMAPHORE', N'DIRTY_PAGE_POLL', N'DISPATCHER_QUEUE_SEMAPHORE', N'EXECSYNC', N'FSAGENT', N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX', N'HADR_CLUSAPI_CALL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION', N'HADR_LOGCAPTURE_WAIT', N'HADR_NOTIFICATION_DEQUEUE', N'HADR_TIMER_TASK', N'HADR_WORK_QUEUE', N'KSOURCE_WAKEUP', N'LAZYWRITER_SLEEP', N'LOGMGR_QUEUE', N'MEMORY_ALLOCATION_EXT', N'ONDEMAND_TASK_QUEUE', N'PREEMPTIVE_XE_GETTARGETSTATE', N'PWAIT_ALL_COMPONENTS_INITIALIZED', N'PWAIT_DIRECTLOGCONSUMER_GETNEXT', N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP', N'QDS_ASYNC_QUEUE', N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP', N'QDS_SHUTDOWN_QUEUE', N'REDO_THREAD_PENDING_WORK', N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE', N'SERVER_IDLE_CHECK', N'SLEEP_BPOOL_FLUSH', N'SLEEP_DBSTARTUP', N'SLEEP_DCOMSTARTUP', N'SLEEP_MASTERDBREADY', N'SLEEP_MASTERMDREADY', N'SLEEP_MASTERUPGRADED', N'SLEEP_MSDBSTARTUP', N'SLEEP_SYSTEMTASK', N'SLEEP_TASK', N'SLEEP_TEMPDBSTARTUP', N'SNI_HTTP_ACCEPT', N'SP_SERVER_DIAGNOSTICS_SLEEP', N'SQLTRACE_BUFFER_FLUSH', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', N'SQLTRACE_WAIT_ENTRIES', N'WAIT_FOR_RESULTS', N'WAITFOR', N'WAITFOR_TASKSHUTDOWN', N'WAIT_XTP_RECOVERY', N'WAIT_XTP_HOST_WAIT', N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG', N'WAIT_XTP_CKPT_CLOSE', N'XE_DISPATCHER_JOIN', N'XE_DISPATCHER_WAIT', N'XE_TIMER_EVENT' ) AND [waiting_tasks_count] > 0 ) SELECT MAX ([W1].[wait_type]) AS [WaitType], CAST (MAX ([W1].[WaitS]) AS DECIMAL (16,2)) AS [Wait_S], CAST (MAX ([W1].[ResourceS]) AS DECIMAL (16,2)) AS [Resource_S], CAST (MAX ([W1].[SignalS]) AS DECIMAL (16,2)) AS [Signal_S], MAX ([W1].[WaitCount]) AS [WaitCount], CAST (MAX ([W1].[Percentage]) AS DECIMAL (5,2)) AS [Percentage], CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgWait_S], CAST ((MAX ([W1].[ResourceS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgRes_S], CAST ((MAX ([W1].[SignalS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgSig_S] FROM [Waits] AS [W1] INNER JOIN [Waits] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum] GROUP BY [W1].[RowNum] HAVING SUM 
([W2].[Percentage]) - MAX( [W1].[Percentage] ) < 95
GO

This code is available in the whitepaper published by SQLskills, SQL Server Performance Tuning Using Wait Statistics by Erin Stellato and Jonathan Kehayias, which in turn refers to the full query by Paul Randal available at https://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/. Some of the typical wait stats you can see are:

PAGEIOLATCH

The PAGEIOLATCH wait type is used when the thread is waiting for a page to be read into the buffer pool from the disk. This wait type comes with two main forms:

PAGEIOLATCH_SH: This page will be read from the disk
PAGEIOLATCH_EX: This page will be modified

You may quickly assume that the storage has to be the problem, but that may not be the case. Like any other wait, these need to be considered in correlation with other wait types and other available counters to correctly find the root cause of the slow SQL Server operations. The page may be read into the buffer pool because it was previously removed due to memory pressure and is needed again. So, you may also investigate the following:

Buffer Manager: Page life expectancy
Buffer Manager: Buffer cache hit ratio

Also, you need to consider the following as possible factors in the PAGEIOLATCH wait types:

Large scans versus seeks on the indexes
Implicit conversions
Outdated statistics
Missing indexes

PAGELATCH

This wait type is quite frequently confused with PAGEIOLATCH, but PAGELATCH is used for pages already present in the memory. The thread waits for access to such a page again with possible PAGELATCH_SH and PAGELATCH_EX wait types. A pretty common situation with this wait type is tempDB contention, where you need to analyze what page is being waited for and what type of query is actually waiting for such a resource. As a solution to the tempDB contention, you can do the following:

Add more tempDB data files
Use traceflags 1118 and 1117 for tempDB on systems older than SQL Server 2016

CXPACKET

This wait type is encountered when any thread is running in parallel. The CXPACKET wait type itself does not mean that there is really any problem on the SQL Server. But if such a wait type accumulates very quickly, it may be a signal of skewed statistics, which require an update, or of a parallel scan on a table where proper indexes are missing. The option for parallelism is controlled via the MAX DOP setting, which can be configured on the following:

The server level
The database level
A query level with a hint

We learned about SQL Server analysis with the Wait Statistics troubleshooting methodology and possible DMVs to get more insight on the problems occurring in the SQL Server. To know more about how to successfully create, design, and deploy databases using SQL Server 2017, do check out the book SQL Server 2017 Administrator's Guide.

5 polarizing Quotes from Professor Stephen Hawking on artificial intelligence

Richard Gall
15 Mar 2018
3 min read
Professor Stephen Hawking died today (March 14, 2018) aged 76 at his home in Cambridge, UK. Best known for his theory of cosmology that unified quantum mechanics with Einstein’s General Theory of Relativity, and for his book a Brief History of Time that brought his concepts to a wider general audience, Professor Hawking is quite possibly one of the most important and well-known voices in the scientific world. Among many things, Professor Hawking had a lot to say about artificial intelligence - its dangers, its opportunities and what we should be thinking about, not just as scientists and technologists, but as humans. Over the years, Hawking has remained cautious and consistent in his views on the topic constantly urging AI researchers and machine learning developers to consider the wider implications of their work on society and the human race itself.  The machine learning community is quite divided on all the issues Hawking has raised and will probably continue to be so as the field grows faster than it can be fathomed. Here are 5 widely debated things Stephen Hawking said about AI arranged in chronological order - and if you’re going to listen to anyone, you’ve surely got to listen to him?   On artificial intelligence ending the human race The development of full artificial intelligence could spell the end of the human race….It would take off on its own, and re-design itself at an ever-increasing rate. Humans, who are limited by slow biological evolution, couldn't compete and would be superseded. From an interview with the BBC, December 2014 On the future of AI research The establishment of shared theoretical frameworks, combined with the availability of data and processing power, has yielded remarkable successes in various component tasks such as speech recognition, image classification, autonomous vehicles, machine translation, legged locomotion, and question-answering systems. As capabilities in these areas and others cross the threshold from laboratory research to economically valuable technologies, a virtuous cycle takes hold whereby even small improvements in performance are worth large sums of money, prompting greater investments in research. There is now a broad consensus that AI research is progressing steadily, and that its impact on society is likely to increase.... Because of the great potential of AI, it is important to research how to reap its benefits while avoiding potential pitfalls. From Research Priorities for Robust and Beneficial Artificial Intelligence, an open letter co-signed by Hawking, January 2015 On AI emulating human intelligence I believe there is no deep difference between what can be achieved by a biological brain and what can be achieved by a computer. It, therefore, follows that computers can, in theory, emulate human intelligence — and exceed it From a speech given by Hawking at the opening of the Leverhulme Centre of the Future of Intelligence, Cambridge, U.K., October 2016 On making artificial intelligence benefit humanity Perhaps we should all stop for a moment and focus not only on making our AI better and more successful but also on the benefit of humanity. Taken from a speech given by Hawking at Web Summit in Lisbon, November 2017 On AI replacing humans The genie is out of the bottle. We need to move forward on artificial intelligence development but we also need to be mindful of its very real dangers. I fear that AI may replace humans altogether. If people design computer viruses, someone will design AI that replicates itself. 
This will be a new form of life that will outperform humans. From an interview with Wired, November 2017

Stack Overflow Developer Survey 2018: A Quick Overview

Amey Varangaonkar
14 Mar 2018
4 min read
Stack Overflow recently published their annual developer survey in which over 100,000 developers and professionals participated. The survey shed light on some very interesting insights - from the developers’ preferred language for programming, to the development platform they hate the most. As the survey is quite detailed and comprehensive, we thought why not present the most important takeaways and findings for you to go through very quickly? If you are short of time and want to scan through the results of the survey quickly, read on.. Developer Profile Young developers form the majority: Half the developer population falls in the age group of 25-34 years while almost all respondents (90%) fall within the 18 - 44 year age group. Limited professional coding experience: Majority of the developers have been coding from the last 2 to 10 years. That said, almost half of the respondents have a professional coding experience of less than 5 years. Continuous learning is key to surviving as a developer: Almost 75% of the developers have a bachelor’s degree, or higher. In addition, almost 90% of the respondents say they have learnt a new language, framework or a tool without taking any formal course, but with the help of the official documentation and/or Stack Overflow Back-end developers form the majority: Among the top developer roles, more than half the developers identify themselves as back-end developers, while the percentage of data scientists and analysts is quite low. About 20% of the respondents identify themselves as mobile developers Working full-time: More than 75% of the developers responded that they work a full-time job. Close to 10% are freelancers, or self-employed. Popularly used languages and frameworks The Javascript family continue their reign: For the sixth year running, JavaScript has continued to be the most popular programming language, and is the choice of language for more than 70% of the respondents. In terms of frameworks, Node.js and Angular continue to be the most popular choice of the developers. Desktop development ain’t dead yet: When it comes to the platforms, developers prefer Linux and Windows Desktop or Server for their development work. Cloud platforms have not gained that much adoption, as yet, but there is a slow but steady rise. What about Data Science? Machine Learning and DevOps rule the roost: Machine Learning and DevOps are two trends which are trending highly due to the vast applications and research that is being done on these fronts. Tensorflow rises, Hadoop falls: About 75% of the respondents love the Tensorflow framework, and say they would love to continue using it for their machine learning/deep learning tasks. Hadoop’s popularity seems to be going down, on the other hand, as other Big Data frameworks like Apache Spark gain more traction and popularity. Python - the next big programming language: Popular data science languages like R and Python are on the rise in terms of popularity. Python, which surpassed PHP last year, has surpassed C# this year, indicating its continuing rise in popularity. Python based Frameworks like Tensorflow and pyTorch are gaining a lot of adoption. Learn F# for more moolah: Languages like F#, Clojure and Rust are associated with high global salaries, with median salaries above $70,000. The likes of R and Python are associated with median salaries of up to $57,000. 
PostgreSQL growing rapidly, Redis most loved database: MySQL and SQL Server are the two most widely used databases as per the survey, while the usage of PostgreSQL has surpassed that of the traditionally popular databases like MongoDB and Redis. In terms of popularity, Redis is the most loved database while the developers dread (read looking to switch from) databases like IBM DB2 and Oracle. Job-hunt for data scientists: Approximately 18% of the 76,000+ respondents who are actively looking for jobs are data scientists or work as academicians and researchers. AI more exciting than worrying: Close to 75% of the 69,000+ respondents are excited about the future possibilities with AI than worried about the dangers posed by AI. Some of the major concerns include AI making important business decisions. The big surprise was that most developers find automation of jobs as the most exciting part of a future enabled by AI. So that’s it then! What do you think about the Stack Overflow Developer survey results? Do you agree with the developers’ responses? We would love to know your thoughts. In the coming days, watch out for more fine grained analysis of the Stack Overflow survey data.

Selecting Statistical-based Features in Machine Learning application

Pravin Dhandre
14 Mar 2018
16 min read
In today’s tutorial, we will work on one of the methods of executing feature selection, the statistical-based method for interpreting both quantitative and qualitative datasets. Feature selection attempts to reduce the size of the original dataset by subsetting the original features and shortlisting the best ones with the highest predictive power. We may intelligently choose which feature selection method might work best for us, but in reality, a very valid way of working in this domain is to work through examples of each method and measure the performance of the resulting pipeline. To begin, let's take a look at the subclass of feature selection modules that are reliant on statistical tests to select viable features from a dataset. Statistical-based feature selections Statistics provides us with relatively quick and easy methods of interpreting both quantitative and qualitative data. We have used some statistical measures in previous chapters to obtain new knowledge and perspective around our data, specifically in that we recognized mean and standard deviation as metrics that enabled us to calculate z-scores and scale our data. In this tutorial, we will rely on two new concepts to help us with our feature selection: Pearson correlations hypothesis testing Both of these methods are known as univariate methods of feature selection, meaning that they are quick and handy when the problem is to select out single features at a time in order to create a better dataset for our machine learning pipeline. Using Pearson correlation to select features We have actually looked at correlations in this book already, but not in the context of feature selection. We already know that we can invoke a correlation calculation in pandas by calling the following method: credit_card_default.corr() The output of the preceding code produces is the following: As a continuation of the preceding table we have: The Pearson correlation coefficient (which is the default for pandas) measures the linear relationship between columns. The value of the coefficient varies between -1 and +1, where 0 implies no correlation between them. Correlations closer to -1 or +1 imply an extremely strong linear relationship. It is worth noting that Pearson’s correlation generally requires that each column be normally distributed (which we are not assuming). We can also largely ignore this requirement because our dataset is large (over 500 is the threshold). The pandas .corr() method calculates a Pearson correlation coefficient for every column versus every other column. This 24 column by 24 row matrix is very unruly, and in the past, we used heatmaps to try and make the information more digestible: # using seaborn to generate heatmaps import seaborn as sns import matplotlib.style as style # Use a clean stylizatino for our charts and graphs style.use('fivethirtyeight') sns.heatmap(credit_card_default.corr()) The heatmap generated will be as follows: Note that the heatmap function automatically chose the most correlated features to show us. That being said, we are, for the moment, concerned with the features correlations to the response variable. We will assume that the more correlated a feature is to the response, the more useful it will be. Any feature that is not as strongly correlated will not be as useful to us. Correlation coefficients are also used to determine feature interactions and redundancies. A key method of reducing overfitting in machine learning is spotting and removing these redundancies. 
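The same correlation matrix can also be used to spot that kind of redundancy between the features themselves. The following is a small sketch (assuming numpy is imported as np, as in the rest of the examples); the 0.8 cutoff is an arbitrary choice for illustration:

# feature-to-feature correlations, keeping only the upper triangle
# so each pair is counted once and the diagonal is ignored
corr_matrix = credit_card_default.corr().abs()
upper_triangle = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# pairs of features that are very strongly correlated with each other
redundant_pairs = upper_triangle.stack()
print(redundant_pairs[redundant_pairs > .8].sort_values(ascending=False))

Any pair that shows up here carries largely duplicated information, and dropping one member of the pair is a cheap way to shrink the feature space without losing much signal.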
We will be tackling this problem in our model-based selection methods. Let's isolate the correlations between the features and the response variable, using the following code:

# just correlations between every feature and the response
credit_card_default.corr()['default payment next month']

LIMIT_BAL                     -0.153520
SEX                           -0.039961
EDUCATION                      0.028006
MARRIAGE                      -0.024339
AGE                            0.013890
PAY_0                          0.324794
PAY_2                          0.263551
PAY_3                          0.235253
PAY_4                          0.216614
PAY_5                          0.204149
PAY_6                          0.186866
BILL_AMT1                     -0.019644
BILL_AMT2                     -0.014193
BILL_AMT3                     -0.014076
BILL_AMT4                     -0.010156
BILL_AMT5                     -0.006760
BILL_AMT6                     -0.005372
PAY_AMT1                      -0.072929
PAY_AMT2                      -0.058579
PAY_AMT3                      -0.056250
PAY_AMT4                      -0.056827
PAY_AMT5                      -0.055124
PAY_AMT6                      -0.053183
default payment next month     1.000000

We can ignore the final row, as it is the response variable correlated perfectly to itself. We are looking for features that have correlation coefficient values close to -1 or +1. These are the features that we might assume are going to be useful. Let's use pandas filtering to isolate features that have at least .2 correlation (positive or negative). Let's do this by first defining a pandas mask, which will act as our filter, using the following code:

# filter only correlations stronger than .2 in either direction (positive or negative)
credit_card_default.corr()['default payment next month'].abs() > .2

LIMIT_BAL                     False
SEX                           False
EDUCATION                     False
MARRIAGE                      False
AGE                           False
PAY_0                         True
PAY_2                         True
PAY_3                         True
PAY_4                         True
PAY_5                         True
PAY_6                         False
BILL_AMT1                     False
BILL_AMT2                     False
BILL_AMT3                     False
BILL_AMT4                     False
BILL_AMT5                     False
BILL_AMT6                     False
PAY_AMT1                      False
PAY_AMT2                      False
PAY_AMT3                      False
PAY_AMT4                      False
PAY_AMT5                      False
PAY_AMT6                      False
default payment next month    True

Every False in the preceding pandas Series represents a feature that has a correlation value between -.2 and .2 inclusive, while True values correspond to features whose correlation is greater than .2 or less than -.2. Let's plug this mask into our pandas filtering, using the following code:

# store the features
highly_correlated_features = credit_card_default.columns[credit_card_default.corr()['default payment next month'].abs() > .2]
highly_correlated_features

Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5', u'default payment next month'], dtype='object')

The variable highly_correlated_features is supposed to hold the features of the dataframe that are highly correlated to the response; however, we do have to get rid of the name of the response column, as including that in our machine learning pipeline would be cheating:

# drop the response variable
highly_correlated_features = highly_correlated_features.drop('default payment next month')
highly_correlated_features

Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5'], dtype='object')

So, now we have five features from our original dataset that are meant to be predictive of the response variable, so let's try them out with the help of the following code:

# only include the five highly correlated features
X_subsetted = X[highly_correlated_features]
get_best_model_and_accuracy(d_tree, tree_params, X_subsetted, y)

# barely worse, but about 20x faster to fit the model
Best Accuracy: 0.819666666667
Best Parameters: {'max_depth': 3}
Average Time to Fit (s): 0.01
Average Time to Score (s): 0.002

Our accuracy is definitely worse than the accuracy to beat, .8203, but also note that the fitting time saw about a 20-fold decrease. Our model is able to learn almost as well as with the entire dataset with only five features.
Moreover, it is able to learn as much in a much shorter timeframe. Let's bring back our scikit-learn pipelines and include our correlation choosing methodology as a part of our preprocessing phase. To do this, we will have to create a custom transformer that invokes the logic we just went through, as a pipeline-ready class. We will call our class the CustomCorrelationChooser and it will have to implement both a fit and a transform logic, which are:

The fit logic will select columns from the features matrix that are higher than a specified threshold
The transform logic will subset any future datasets to only include those columns that were deemed important

from sklearn.base import TransformerMixin, BaseEstimator

class CustomCorrelationChooser(TransformerMixin, BaseEstimator):
    def __init__(self, response, cols_to_keep=[], threshold=None):
        # store the response series
        self.response = response
        # store the threshold that we wish to keep
        self.threshold = threshold
        # initialize a variable that will eventually
        # hold the names of the features that we wish to keep
        self.cols_to_keep = cols_to_keep

    def transform(self, X):
        # the transform method simply selects the appropriate
        # columns from the original dataset
        return X[self.cols_to_keep]

    def fit(self, X, *_):
        # create a new dataframe that holds both features and response
        df = pd.concat([X, self.response], axis=1)
        # store names of columns that meet the correlation threshold
        self.cols_to_keep = df.columns[df.corr()[df.columns[-1]].abs() > self.threshold]
        # only keep columns in X; this will, for example, remove the response variable
        self.cols_to_keep = [c for c in self.cols_to_keep if c in X.columns]
        return self

Let's take our new correlation feature selector for a spin, with the help of the following code:

# instantiate our new feature selector
ccc = CustomCorrelationChooser(threshold=.2, response=y)
ccc.fit(X)
ccc.cols_to_keep

['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5']

Our class has selected the same five columns as we found earlier. Let's test out the transform functionality by calling it on our X matrix, using the following code:

ccc.transform(X).head()

The preceding code produces the following table as the output:

We see that the transform method has eliminated the other columns and kept only the features that met our .2 correlation threshold. Now, let's put it all together in our pipeline, with the help of the following code:

# instantiate our feature selector with the response variable set
ccc = CustomCorrelationChooser(response=y)

# make our new pipeline, including the selector
ccc_pipe = Pipeline([('correlation_select', ccc), ('classifier', d_tree)])

# make a copy of the decision tree pipeline parameters
ccc_pipe_params = deepcopy(tree_pipe_params)

# update that dictionary with feature selector specific parameters
ccc_pipe_params.update({'correlation_select__threshold': [0, .1, .2, .3]})

print ccc_pipe_params
# {'correlation_select__threshold': [0, 0.1, 0.2, 0.3], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}

# better than the original (by a little), and a bit faster on average overall
get_best_model_and_accuracy(ccc_pipe, ccc_pipe_params, X, y)

Best Accuracy: 0.8206
Best Parameters: {'correlation_select__threshold': 0.1, 'classifier__max_depth': 5}
Average Time to Fit (s): 0.105
Average Time to Score (s): 0.003

Wow! Our first attempt at feature selection and we have already beaten our goal (albeit by a little bit).
Our pipeline is showing us that if we threshold at 0.1, we have eliminated noise enough to improve accuracy and also cut down on the fitting time (from .158 seconds without the selector). Let's take a look at which columns our selector decided to keep: # check the threshold of .1 ccc = CustomCorrelationChooser(threshold=0.1, response=y) ccc.fit(X) # check which columns were kept ccc.cols_to_keep ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'] It appears that our selector has decided to keep the five columns that we found, as well as two more, the LIMIT_BAL and the PAY_6 columns. Great! This is the beauty of automated pipeline gridsearching in scikit-learn. It allows our models to do what they do best and intuit things that we could not have on our own. Feature selection using hypothesis testing Hypothesis testing is a methodology in statistics that allows for a bit more complex statistical testing for individual features. Feature selection via hypothesis testing will attempt to select only the best features from a dataset, just as we were doing with our custom correlation chooser, but these tests rely more on formalized statistical methods and are interpreted through what are known as p-values. A hypothesis test is a statistical test that is used to figure out whether we can apply a certain condition for an entire population, given a data sample. The result of a hypothesis test tells us whether we should believe the hypothesis or reject it for an alternative one. Based on sample data from a population, a hypothesis test determines whether or not to reject the null hypothesis. We usually use a p-value (a non-negative decimal with an upper bound of 1, which is based on our significance level) to make this conclusion. In the case of feature selection, the hypothesis we wish to test is along the lines of: True or False: This feature has no relevance to the response variable. We want to test this hypothesis for every feature and decide whether the features hold some significance in the prediction of the response. In a way, this is how we dealt with the correlation logic. We basically said that, if a column's correlation with the response is too weak, then we say that the hypothesis that the feature has no relevance is true. If the correlation coefficient was strong enough, then we can reject the hypothesis that the feature has no relevance in favor of an alternative hypothesis, that the feature does have some relevance. To begin to use this for our data, we will have to bring in two new modules: SelectKBest and f_classif, using the following code: # SelectKBest selects features according to the k highest scores of a given scoring function from sklearn.feature_selection import SelectKBest # This models a statistical test known as ANOVA from sklearn.feature_selection import f_classif # f_classif allows for negative values, not all do # chi2 is a very common classification criteria but only allows for positive values # regression has its own statistical tests SelectKBest is basically just a wrapper that keeps a set amount of features that are the highest ranked according to some criterion. In this case, we will use the p-values of completed hypothesis testings as a ranking. Interpreting the p-value The p-values are a decimals between 0 and 1 that represent the probability that the data given to us occurred by chance under the hypothesis test. Simply put, the lower the p-value, the better the chance that we can reject the null hypothesis. 
For our purposes, the smaller the p-value, the better the chances that the feature has some relevance to our response variable and we should keep it. The big take away from this is that the f_classif function will perform an ANOVA test (a type of hypothesis test) on each feature on its own (hence the name univariate testing) and assign that feature a p-value. The SelectKBest will rank the features by that p-value (the lower the better) and keep only the best k (a human input) features. Let's try this out in Python. Ranking the p-value Let's begin by instantiating a SelectKBest module. We will manually enter a k value, 5, meaning we wish to keep only the five best features according to the resulting p-values: # keep only the best five features according to p-values of ANOVA test k_best = SelectKBest(f_classif, k=5) We can then fit and transform our X matrix to select the features we want, as we did before with our custom selector: # matrix after selecting the top 5 features k_best.fit_transform(X, y) # 30,000 rows x 5 columns array([[ 2, 2, -1, -1, -2], [-1, 2, 0, 0, 0], [ 0, 0, 0, 0, 0], ..., [ 4, 3, 2, -1, 0], [ 1, -1, 0, 0, 0], [ 0, 0, 0, 0, 0]]) If we want to inspect the p-values directly and see which columns were chosen, we can dive deeper into the select k_best variables: # get the p values of columns k_best.pvalues_ # make a dataframe of features and p-values # sort that dataframe by p-value p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value') # show the top 5 features p_values.head() The preceding code produces the following table as the output: We can see that, once again, our selector is choosing the PAY_X columns as the most important. If we take a look at our p-value column, we will notice that our values are extremely small and close to zero. A common threshold for p-values is 0.05, meaning that anything less than 0.05 may be considered significant, and these columns are extremely significant according to our tests. We can also directly see which columns meet a threshold of 0.05 using the pandas filtering methodology: # features with a low p value p_values[p_values['p_value'] < .05] The preceding code produces the following table as the output: The majority of the columns have a low p-value, but not all. Let's see the columns with a higher p_value, using the following code: # features with a high p value p_values[p_values['p_value'] >= .05] The preceding code produces the following table as the output: These three columns have quite a high p-value. 
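If you want to look at the raw ANOVA output that sits behind these rankings, f_classif can also be called directly; it returns one F-statistic and one p-value per feature, the same numbers SelectKBest works from behind the scenes. A small sketch:

# run the ANOVA test manually on every feature
f_scores, p_vals = f_classif(X, y)

anova_results = pd.DataFrame({'column': X.columns,
                              'f_score': f_scores,
                              'p_value': p_vals}).sort_values('p_value')
print(anova_results.head())

The ordering you get here should match the p_values dataframe built above, with the F-statistic giving an additional sense of how strongly each feature separates the two classes.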
Let's use our SelectKBest in a pipeline to see if we can grid search our way into a better machine learning pipeline, using the following code:

k_best = SelectKBest(f_classif)

# Make a new pipeline with SelectKBest
select_k_pipe = Pipeline([('k_best', k_best), ('classifier', d_tree)])

select_k_best_pipe_params = deepcopy(tree_pipe_params)

# the 'all' option literally does nothing to subset
select_k_best_pipe_params.update({'k_best__k': range(1,23) + ['all']})

print select_k_best_pipe_params
# {'k_best__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'all'], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}

# comparable to our results with the correlation chooser
get_best_model_and_accuracy(select_k_pipe, select_k_best_pipe_params, X, y)

Best Accuracy: 0.8206
Best Parameters: {'k_best__k': 7, 'classifier__max_depth': 5}
Average Time to Fit (s): 0.102
Average Time to Score (s): 0.002

It seems that our SelectKBest module is getting about the same accuracy as our custom transformer, but it's getting there a bit quicker! Let's see which columns our tests are selecting for us, with the help of the following code:

k_best = SelectKBest(f_classif, k=7)

# lowest 7 p-values match what our custom correlation chooser chose before
# ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
p_values.head(7)

The preceding code produces the following table as the output:

They appear to be the same columns that were chosen by our other statistical method. It's possible that our statistical method is limited to continually picking these seven columns for us. There are other tests available besides ANOVA, such as Chi2, as well as tests meant for regression tasks. They are all included in scikit-learn's documentation. For more info on feature selection through univariate testing, check out the scikit-learn documentation here: http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

Before we move on to model-based feature selection, it's helpful to do a quick sanity check to ensure that we are on the right track. So far, we have seen two statistical methods for feature selection that gave us the same seven columns for optimal accuracy. But what if we were to take every column except those seven? We should expect a much lower accuracy and a worse pipeline overall, right? Let's make sure. The following code helps us to implement the sanity check:

# sanity check
# if we use only the worst columns
the_worst_of_X = X[X.columns.drop(['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])]

# goes to show that selecting the wrong features will
# hurt us in predictive performance
get_best_model_and_accuracy(d_tree, tree_params, the_worst_of_X, y)

Best Accuracy: 0.783966666667
Best Parameters: {'max_depth': 5}
Average Time to Fit (s): 0.21
Average Time to Score (s): 0.002

Hence, by selecting all the columns except those seven, we see not only worse accuracy (almost as bad as the null accuracy), but also slower fitting times on average. We statistically selected features from the dataset for our machine learning pipeline. [box type="note" align="" class="" width=""]This article is an excerpt from the book Feature Engineering Made Easy co-authored by Sinan Ozdemir and Divya Susarla. Do check out the book to get access to alternative techniques such as the model-based method to achieve optimum results from the machine learning application.[/box]

Getting started with Q-learning using TensorFlow

Savia Lobo
14 Mar 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. This book will help you master advanced concepts of deep learning such as transfer learning, reinforcement learning, generative models and more, using TensorFlow and Keras.[/box] In this tutorial, we will learn about Q-learning and how to implement it using deep reinforcement learning. Q-Learning is a model-free method of finding the optimal policy that can maximize the reward of an agent. During initial gameplay, the agent learns a Q value for each pair of (state, action), also known as the exploration strategy. Once the Q values are learned, then the optimal policy will be to select an action with the largest Q-value in every state, also known as the exploitation strategy. The learning algorithm may end in locally optimal solutions, hence we keep using the exploration policy by setting an exploration_rate parameter. The Q-Learning algorithm is as follows: initialize  Q(shape=[#s,#a])  to  random  values  or  zeroes Repeat  (for  each  episode) observe  current  state  s Repeat select  an  action  a  (apply  explore  or  exploit  strategy) observe  state  s_next  as  a  result  of  action  a update  the  Q-Table  using  bellman's  equation set  current  state  s  =  s_next until  the  episode  ends  or  a  max  reward  /  max  steps  condition  is  reached Until  a  number  of  episodes  or  a  condition  is  reached (such  as  max  consecutive  wins) Q(s, a) in the preceding algorithm represents the Q function. The values of this function are used for selecting the action instead of the rewards, thus this function represents the reward or discounted rewards. The values for the Q-function are updated using the values of the Q function in the future state. The well- known bellman equation captures this update: This basically means that at time step t, in state s, for action a, the maximum future reward (Q) is equal to the reward from the current state plus the max future reward from the next state. Q(s,a) can be implemented as a Q-Table or as a neural network known as a Q-Network. In both cases, the task of the Q-Table or the Q-Network is to provide the best possible action based on the Q value of the given input. The Q-Table-based approach generally becomes intractable as the Q-Table becomes large, thus making neural networks the best candidate for approximating the Q-function through Q-Network. Let us look at both of these approaches in action. Initializing and discretizing for Q-Learning The observations returned by the pole-cart environment involves the state of the environment. The state of pole-cart is represented by continuous values that we need to discretize. If we discretize these values into small state-space, then the agent gets trained faster, but with the caveat of risking the convergence to the optimal policy. 
We use the following helper functions to discretize the state-space of the pole-cart environment:

# discretize the value to a state space
def discretize(val, bounds, n_states):
    discrete_val = 0
    if val <= bounds[0]:
        discrete_val = 0
    elif val >= bounds[1]:
        discrete_val = n_states - 1
    else:
        discrete_val = int(round((n_states - 1) *
                                 ((val - bounds[0]) / (bounds[1] - bounds[0]))))
    return discrete_val

def discretize_state(vals, s_bounds, n_s):
    discrete_vals = []
    for i in range(len(n_s)):
        discrete_vals.append(discretize(vals[i], s_bounds[i], n_s[i]))
    return np.array(discrete_vals, dtype=np.int)

We discretize the space into 10 units for each of the observation dimensions. You may want to try out different discretization spaces. After the discretization, we find the upper and lower bounds of the observations, and change the bounds of velocity and angular velocity to be between -1 and +1, instead of -Inf and +Inf. The code is as follows:

env = gym.make('CartPole-v0')
n_a = env.action_space.n
# number of discrete states for each observation dimension
n_s = np.array([10, 10, 10, 10])   # position, velocity, angle, angular velocity
s_bounds = np.array(list(zip(env.observation_space.low, env.observation_space.high)))
# the velocity and angular velocity bounds are
# too high so we bound between -1, +1
s_bounds[1] = (-1.0, 1.0)
s_bounds[3] = (-1.0, 1.0)

Q-Learning with Q-Table

Since our discretized space has the dimensions [10,10,10,10], our Q-Table has the dimensions [10,10,10,10,2]:

# create a Q-Table of shape (10,10,10,10, 2) representing S X A -> R
q_table = np.zeros(shape=np.append(n_s, n_a))

We define a Q-Table policy that exploits or explores based on the explore_rate:

def policy_q_table(state, env):
    # Exploration strategy - Select a random action
    if np.random.random() < explore_rate:
        action = env.action_space.sample()
    # Exploitation strategy - Select the action with the highest q
    else:
        action = np.argmax(q_table[tuple(state)])
    return action

Define the episode() function that runs a single episode as follows:

Start with initializing the variables and the first state:

obs = env.reset()
state_prev = discretize_state(obs, s_bounds, n_s)
episode_reward = 0
done = False
t = 0

Select the action and observe the next state:

action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_new = discretize_state(obs, s_bounds, n_s)

Update the Q-Table:

best_q = np.amax(q_table[tuple(state_new)])
bellman_q = reward + discount_rate * best_q
indices = tuple(np.append(state_prev, action))
q_table[indices] += learning_rate * (bellman_q - q_table[indices])

Set the next state as the previous state and add the reward to the episode's rewards:

state_prev = state_new
episode_reward += reward

The experiment() function calls the episode function and accumulates the rewards for reporting. You may want to modify the function to check for consecutive wins and other logic specific to your play or games:

# collect observations and rewards for each episode
def experiment(env, policy, n_episodes, r_max=0, t_max=0):
    rewards = np.empty(shape=[n_episodes])
    for i in range(n_episodes):
        val = episode(env, policy, r_max, t_max)
        rewards[i] = val
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

Now, all we have to do is define the parameters, such as learning_rate, discount_rate, and explore_rate, and run the experiment() function as follows:

learning_rate = 0.8
discount_rate = 0.9
explore_rate = 0.2
n_episodes = 1000
experiment(env, policy_q_table, n_episodes)

For 1000 episodes, the Q-Table-based policy's maximum reward is 180 based on our simple implementation:

Policy:policy_q_table, Min reward:8.0, Max reward:180.0, Average reward:17.592

Our implementation of the algorithm is very simple to explain. However, you can modify the code to set the explore rate high initially and then decay it as the time-steps pass. Similarly, you can also implement decay logic for the learning and discount rates. Let us see if we can get a higher reward with fewer episodes as our Q function learns faster.

Q-Learning with Q-Network or Deep Q Network (DQN)

In the DQN, we replace the Q-Table with a neural network (Q-Network) that will learn to respond with the optimal action as we train it continuously with the explored states and their Q-Values. Thus, for training the network we need a place to store the game memory:

Implement the game memory using a deque of size 1000:

memory = deque(maxlen=1000)

Next, build a simple hidden layer neural network model, q_nn:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(8, input_dim=4, activation='relu'))
model.add(Dense(2, activation='linear'))
model.compile(loss='mse', optimizer='adam')
model.summary()
q_nn = model

The Q-Network looks like this:

Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 8)                 40
dense_2 (Dense)              (None, 2)                 18
=================================================================
Total params: 58
Trainable params: 58
Non-trainable params: 0

The episode() function that executes one episode of the game incorporates the following changes for the Q-Network-based algorithm:

After generating the next state, add the states, action, and rewards to the game memory:

action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_next = discretize_state(obs, s_bounds, n_s)

# add the state_prev, action, reward, state_next, done to memory
memory.append([state_prev, action, reward, state_next, done])

Generate and update the q_values with the maximum future rewards using the Bellman function:

states = np.array([x[0] for x in memory])
states_next = np.array([np.zeros(4) if x[4] else x[3] for x in memory])
q_values = q_nn.predict(states)
q_values_next = q_nn.predict(states_next)

for i in range(len(memory)):
    state_prev, action, reward, state_next, done = memory[i]
    if done:
        q_values[i, action] = reward
    else:
        best_q = np.amax(q_values_next[i])
        bellman_q = reward + discount_rate * best_q
        q_values[i, action] = bellman_q

Train the q_nn with the states and the q_values we received from memory:

q_nn.fit(states, q_values, epochs=1, batch_size=50, verbose=0)

The process of saving gameplay in memory and using it to train the model is also known as memory replay in deep reinforcement learning literature.
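The policy_q_nn function used in the next step is not shown in this excerpt. A minimal sketch of what such a policy might look like, mirroring policy_q_table but asking the Q-Network for its value estimates, is given below; the book's own implementation may differ in detail:

def policy_q_nn(state, env):
    # Exploration strategy - select a random action
    if np.random.random() < explore_rate:
        action = env.action_space.sample()
    # Exploitation strategy - select the action with the highest predicted Q value
    else:
        q_values = q_nn.predict(np.array([state]))   # shape (1, number of actions)
        action = np.argmax(q_values[0])
    return action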
Let us run our DQN-based gameplay as follows:

learning_rate = 0.8
discount_rate = 0.9
explore_rate = 0.2
n_episodes = 100
experiment(env, policy_q_nn, n_episodes)

We get a max reward of 150 that you can improve upon with hyper-parameter tuning, network tuning, and by using rate decay for the discount rate and explore rate:

Policy:policy_q_nn, Min reward:8.0, Max reward:150.0, Average reward:41.27

To summarize, we calculated and trained the model at every step. One can change the code to discard the memory replay and retrain the model for the episodes that return smaller rewards. However, implement this option with caution, as it may slow down your learning because initial gameplay generates smaller rewards more often.

Do check out the book Mastering TensorFlow 1.x to explore advanced features of TensorFlow 1.x and gain insight into TensorFlow Core, Keras, TF Estimators, TFLearn, TF Slim, Pretty Tensor, and Sonnet.

Feature Improvement: Identifying missing values using EDA (Exploratory Data Analysis) technique

Pravin Dhandre
13 Mar 2018
9 min read
Today, we will work towards developing a better sense of our data by identifying missing values in a dataset, using the Exploratory Data Analysis (EDA) technique and Python packages.

Identifying missing values in data

Learning to identify missing values gives us a better understanding of how to work with real-world data. Often, data has missing values for a variety of reasons; for example, with survey data, some observations may not have been recorded. It is important to analyze our data and get a sense of what the missing values are, so we can decide how we want to handle them for our machine learning. To start, let's dive into a dataset: the Pima Indian Diabetes Prediction dataset. This dataset is available on the UCI Machine Learning Repository at: https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes

From the main website, we can learn a few things about this publicly available dataset. We have nine columns and 768 instances (rows). The dataset is primarily used for predicting the onset of diabetes within five years in females of Pima Indian heritage over the age of 21, given medical details about their bodies. The dataset is meant to correspond to a binary (2-class) classification machine learning problem, namely, the answer to the question: will this person develop diabetes within five years? The column names are provided as follows (in order):

Number of times pregnant
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Diastolic blood pressure (mm Hg)
Triceps skinfold thickness (mm)
2-Hour serum insulin measurement (mu U/ml)
Body mass index (weight in kg/(height in m)^2)
Diabetes pedigree function
Age (years)
Class variable (zero or one)

The goal is to predict the final column, the class variable, which indicates whether the patient developed diabetes, using the other eight features as inputs to a machine learning function. There are two very important reasons we will be working with this dataset:

We will have to work with missing values
All of the features we will be working with will be quantitative

The first reason is the most relevant for now, because the point of this chapter is to deal with missing values. As for choosing to work only with quantitative data, this is the case only for this chapter: we do not yet have the tools to deal with missing values in categorical columns. In the next chapter, when we talk about feature construction, we will deal with that procedure.

The exploratory data analysis (EDA)

To identify our missing values we will begin with an EDA of our dataset. We will be using some useful Python packages, pandas and numpy, to store our data and make some simple calculations, as well as some popular visualization tools to see what the distribution of our data looks like. Let's begin and dive into some code. First, we will do some imports:

# import packages we need for exploratory data analysis (EDA)
import pandas as pd  # to store tabular data
import numpy as np  # to do some math
import matplotlib.pyplot as plt  # a popular data visualization tool
import seaborn as sns  # another popular data visualization tool
%matplotlib inline
plt.style.use('fivethirtyeight')  # a popular data visualization theme

We will import our tabular data through a CSV, as follows:

# load in our dataset using pandas
pima = pd.read_csv('../data/pima.data')
pima.head()

The head method allows us to see the first few rows in our dataset.
The output is as follows:

Something's not right here; there are no column names. The CSV must not have the column names built into the file. No matter, we can use the data source's website to fill them in, as shown in the following code:

pima_column_names = ['times_pregnant', 'plasma_glucose_concentration',
                     'diastolic_blood_pressure', 'triceps_thickness',
                     'serum_insulin', 'bmi', 'pedigree_function', 'age',
                     'onset_diabetes']
pima = pd.read_csv('../data/pima.data', names=pima_column_names)
pima.head()

Now, using the head method again, we can see our columns with the appropriate headers. The output of the preceding code is as follows:

Much better. Now we can use the column names to do some basic stats, selecting, and visualizations. Let's first get our null accuracy as follows:

pima['onset_diabetes'].value_counts(normalize=True)
# get null accuracy, 65% did not develop diabetes

0    0.651042
1    0.348958
Name: onset_diabetes, dtype: float64

If our eventual goal is to exploit patterns in our data in order to predict the onset of diabetes, let us try to visualize some of the differences between those that developed diabetes and those that did not. Our hope is that the histogram will reveal some sort of pattern, or obvious difference in values, between the classes of prediction:

# get a histogram of the plasma_glucose_concentration column for
# both classes
col = 'plasma_glucose_concentration'
plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5, label='non-diabetes')
plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5, label='diabetes')
plt.legend(loc='upper right')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title('Histogram of {}'.format(col))
plt.show()

The output of the preceding code is as follows:

It seems that this histogram is showing us a pretty big difference in plasma_glucose_concentration between the two prediction classes. Let's show the same histogram style for multiple columns as follows:

for col in ['bmi', 'diastolic_blood_pressure', 'plasma_glucose_concentration']:
    plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5, label='non-diabetes')
    plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5, label='diabetes')
    plt.legend(loc='upper right')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title('Histogram of {}'.format(col))
    plt.show()

The output of the preceding code will give us the following three histograms. The first one shows us the distributions of bmi for the two class variables (non-diabetes and diabetes):

The next histogram again shows us contrastingly different distributions of a feature across our two class variables. This time we are looking at diastolic_blood_pressure:

The final graph shows the plasma_glucose_concentration differences between our two class variables:

We can definitely see some major differences simply by looking at just a few histograms. For example, there seems to be a large jump in plasma_glucose_concentration for those who will eventually develop diabetes. To solidify this, perhaps we can visualize a linear correlation matrix in an attempt to quantify the relationship between these variables. We will use the visualization tool seaborn, which we imported at the beginning of this chapter, for our correlation matrix as follows:

# look at the heatmap of the correlation matrix of our dataset
sns.heatmap(pima.corr())
# plasma_glucose_concentration definitely seems to be an interesting feature here

Following is the correlation matrix of our dataset.
This is showing us the correlation amongst the different columns in our Pima dataset. The output is as follows:

This correlation matrix is showing a strong correlation between plasma_glucose_concentration and onset_diabetes. Let's take a further look at the numerical correlations for the onset_diabetes column, with the following code:

pima.corr()['onset_diabetes']  # numerical correlation matrix
# plasma_glucose_concentration definitely seems to be an interesting feature here

times_pregnant                  0.221898
plasma_glucose_concentration    0.466581
diastolic_blood_pressure        0.065068
triceps_thickness               0.074752
serum_insulin                   0.130548
bmi                             0.292695
pedigree_function               0.173844
age                             0.238356
onset_diabetes                  1.000000
Name: onset_diabetes, dtype: float64

We will explore the power of correlation later, in Chapter 4, Feature Construction, but for now we are using exploratory data analysis (EDA) to hint at the fact that the plasma_glucose_concentration column will be an important factor in our prediction of the onset of diabetes.

Moving on to more important matters at hand, let's see if we are missing any values in our dataset by invoking the built-in isnull() method of the pandas DataFrame:

pima.isnull().sum()

times_pregnant                  0
plasma_glucose_concentration    0
diastolic_blood_pressure        0
triceps_thickness               0
serum_insulin                   0
bmi                             0
pedigree_function               0
age                             0
onset_diabetes                  0
dtype: int64

Great! We don't have any missing values. Let's go on to do some more EDA, first using the shape attribute to see the number of rows and columns we are working with:

pima.shape  # (# rows, # cols)
(768, 9)

This confirms we have 9 columns (including our response variable) and 768 data observations (rows). Now, let's take a peek at the percentage of patients who developed diabetes, using the following code:

pima['onset_diabetes'].value_counts(normalize=True)
# get null accuracy, 65% did not develop diabetes

0    0.651042
1    0.348958
Name: onset_diabetes, dtype: float64

This shows us that 65% of the patients did not develop diabetes, while about 35% did. We can use a nifty built-in method of a pandas DataFrame called describe to look at some basic descriptive statistics:

pima.describe()  # get some basic descriptive statistics

We get the output as follows:

This quickly shows us some basic stats such as the mean, standard deviation, and some different percentile measurements of our data. But notice that the minimum value of the bmi column is 0. That is medically impossible; there must be a reason for this. Perhaps the number zero has been encoded as a missing value instead of the None value or a missing cell. Upon closer inspection, we see that the value 0 appears as a minimum value for the following columns:

times_pregnant
plasma_glucose_concentration
diastolic_blood_pressure
triceps_thickness
serum_insulin
bmi
onset_diabetes

Because zero is a class for onset_diabetes and 0 is actually a viable number for times_pregnant, we may conclude that the number 0 is encoding missing values for:

plasma_glucose_concentration
diastolic_blood_pressure
triceps_thickness
serum_insulin
bmi

So, we actually do have missing values! It was obviously not luck that we happened upon the zeros as missing values; we knew it beforehand. As a data scientist, you must be ever vigilant and make sure that you know as much about the dataset as possible in order to find missing values encoded as other symbols. Be sure to read any and all documentation that comes with open datasets in case it mentions any missing values. A quick way to double-check this in code is sketched below.
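As a quick check along these lines, the following sketch (not part of the original excerpt) counts how many zeros appear in each of the suspect columns and, if desired, re-encodes them as NaN so that pandas treats them as missing; the column names match those defined earlier:

# columns where 0 is suspected to encode a missing value
cols_with_hidden_missing = ['plasma_glucose_concentration',
                            'diastolic_blood_pressure',
                            'triceps_thickness',
                            'serum_insulin',
                            'bmi']

# count the zero entries per column
print((pima[cols_with_hidden_missing] == 0).sum())

# optionally re-encode those zeros as NaN so that isnull() can see them
pima[cols_with_hidden_missing] = pima[cols_with_hidden_missing].replace(0, np.nan)
print(pima.isnull().sum())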
If no documentation is available, some common values used instead of missing values are:

0 (for numerical values)
unknown or Unknown (for categorical variables)
? (for categorical variables)

To summarize, we have five columns where missing values are encoded as the symbol 0 rather than being left empty.

[box type="note" align="" class="" width=""]You just read an excerpt from the book Feature Engineering Made Easy co-authored by Sinan Ozdemir and Divya Susarla. To learn more about missing values and manipulating features, do check out Feature Engineering Made Easy and develop expert proficiency in Feature Selection, Learning, and Optimization.[/box]

How to build a cartpole game using OpenAI Gym

Savia Lobo
10 Mar 2018
11 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn advanced features of TensorFlow 1.x, such as distributed TensorFlow with TF Clusters, deploying production models with TensorFlow Serving, and more.[/box]

Today, we will help you understand OpenAI Gym and how to apply the basics of OpenAI Gym to a cartpole game.

OpenAI Gym 101

OpenAI Gym is a Python-based toolkit for the research and development of reinforcement learning algorithms. OpenAI Gym provides more than 700 open-source, contributed environments at the time of writing. With OpenAI Gym, you can also create your own environment. The biggest advantage is that OpenAI Gym provides a unified interface for working with these environments, and takes care of running the simulation while you focus on the reinforcement learning algorithms.

Note: The research paper describing OpenAI Gym is available here: http://arxiv.org/abs/1606.01540

You can install OpenAI Gym using the following command:

pip3 install gym

Note: If the above command does not work, then you can find further help with installation at the following link: https://github.com/openai/gym#installation

1. Let us print the number of available environments in OpenAI Gym:

all_env = list(gym.envs.registry.all())
print('Total Environments in Gym version {} : {}'
      .format(gym.__version__, len(all_env)))

Total Environments in Gym version 0.9.4 : 777

2. Let us print the list of all environments:

for e in list(all_env):
    print(e)

The partial list from the output is as follows:

EnvSpec(Carnival-ramNoFrameskip-v0)
EnvSpec(EnduroDeterministic-v0)
EnvSpec(FrostbiteNoFrameskip-v4)
EnvSpec(Taxi-v2)
EnvSpec(Pooyan-ram-v0)
EnvSpec(Solaris-ram-v4)
EnvSpec(Breakout-ramDeterministic-v0)
EnvSpec(Kangaroo-ram-v4)
EnvSpec(StarGunner-ram-v4)
EnvSpec(Enduro-ramNoFrameskip-v4)
EnvSpec(DemonAttack-ramDeterministic-v0)
EnvSpec(TimePilot-ramNoFrameskip-v0)
EnvSpec(Amidar-v4)

Each environment, represented by an env object, has a standardized interface, for example:

An env object can be created with the gym.make(<game-id-string>) function by passing the game's id string.

Each env object contains the following main functions:
The step() function takes an action object as an argument and returns four objects:
  observation: An object implemented by the environment, representing the observation of the environment.
  reward: A signed float value indicating the gain (or loss) from the previous action.
  done: A Boolean value representing if the scenario is finished.
  info: A Python dictionary object representing the diagnostic information.
The render() function creates a visual representation of the environment.
The reset() function resets the environment to the original state.

Each env object comes with well-defined actions and observations, represented by action_space and observation_space.

One of the most popular games in the gym to learn reinforcement learning is CartPole. In this game, a pole attached to a cart has to be balanced so that it doesn't fall. The game ends if either the pole tilts by more than 15 degrees or the cart moves by more than 2.4 units from the center. The home page of OpenAI.com emphasizes the game in these words: The small size and simplicity of this environment make it possible to run very quick experiments, which is essential when learning the basics.

The game has only four observations and two actions. The actions are to move the cart by applying a force of +1 or -1.
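As a quick illustration of these attributes (not shown in the excerpt), you can inspect the action and observation spaces of CartPole directly; the exact string representations may vary slightly with the Gym version:

import gym

env = gym.make('CartPole-v0')
print(env.action_space)            # e.g. Discrete(2): push left or push right
print(env.observation_space)       # e.g. Box(4,): a 4-dimensional continuous space
print(env.observation_space.low)   # lower bound of each observation dimension
print(env.observation_space.high)  # upper bound of each observation dimension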
The observations are the position of the cart, the velocity of the cart, the angle of the pole, and the rotation rate of the pole. However, knowledge of the semantics of the observations is not necessary to learn to maximize the rewards of the game. Now let us load a popular game environment, CartPole-v0, and play it with stochastic control:

1. Create the env object with the standard make function:

env = gym.make('CartPole-v0')

2. The number of episodes is the number of game plays. We shall set it to one, for now, indicating that we just want to play the game once. Since every episode is stochastic, in actual production runs you will run over several episodes and calculate the average values of the rewards. Additionally, we can initialize an array to store the visualization of the environment at every timestep:

n_episodes = 1
env_vis = []

3. Run two nested loops: an external loop for the number of episodes and an internal loop for the number of timesteps you would like to simulate. You can either keep running the internal loop until the scenario is done or set the number of steps to a higher value. At the beginning of every episode, reset the environment using env.reset(). At the beginning of every timestep, capture the visualization using env.render():

for i_episode in range(n_episodes):
    observation = env.reset()
    for t in range(100):
        env_vis.append(env.render(mode='rgb_array'))
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished at t{}".format(t+1))
            break

4. Render the environment using the helper function:

env_render(env_vis)

5. The code for the helper function is as follows:

# assumes matplotlib.pyplot as plt, a matplotlib animation module imported as anm,
# and IPython/notebook animation display helpers are available
def env_render(env_vis):
    plt.figure()
    plot = plt.imshow(env_vis[0])
    plt.axis('off')
    def animate(i):
        plot.set_data(env_vis[i])
    anim = anm.FuncAnimation(plt.gcf(),
                             animate,
                             frames=len(env_vis),
                             interval=20,
                             repeat=True,
                             repeat_delay=20)
    display(display_animation(anim, default_mode='loop'))

We get the following output when we run this example:

[-0.00666995 -0.03699492 -0.00972623  0.00287713]
[-0.00740985  0.15826516 -0.00966868 -0.29285861]
[-0.00424454 -0.03671761 -0.01552586 -0.00324067]
[-0.0049789  -0.2316135  -0.01559067  0.28450351]
[-0.00961117 -0.42650966 -0.0099006   0.57222875]
[-0.01814136 -0.23125029  0.00154398  0.27644332]
[-0.02276636 -0.0361504   0.00707284 -0.01575223]
[-0.02348937  0.1588694   0.0067578  -0.30619523]
[-0.02031198 -0.03634819  0.00063389 -0.01138875]
[-0.02103895  0.15876466  0.00040612 -0.3038716 ]
[-0.01786366  0.35388083 -0.00567131 -0.59642642]
[-0.01078604  0.54908168 -0.01759984 -0.89089036]
[ 1.95594914e-04  7.44437934e-01 -3.54176495e-02 -1.18905344e+00]
[ 0.01508435  0.54979251 -0.05919872 -0.90767902]
[ 0.0260802   0.35551978 -0.0773523  -0.63417465]
[ 0.0331906   0.55163065 -0.09003579 -0.95018025]
[ 0.04422321  0.74784161 -0.1090394  -1.26973934]
[ 0.05918004  0.55426764 -0.13443418 -1.01309691]
[ 0.0702654   0.36117014 -0.15469612 -0.76546874]
[ 0.0774888   0.16847818 -0.1700055  -0.52518186]
[ 0.08085836  0.3655333  -0.18050913 -0.86624457]
[ 0.08816903  0.56259197 -0.19783403 -1.20981195]
Episode finished at t22

It took 22 time-steps for the pole to become unbalanced. At every run, we get a different time-step value because we pick the action stochastically by using env.action_space.sample().
Since the game results in a loss so quickly, randomly picking an action and applying it is probably not the best strategy. There are many algorithms for finding solutions that keep the pole straight for a larger number of time-steps, such as Hill Climbing, Random Search, and Policy Gradient.

Note: Some of the algorithms for solving the Cartpole game are available at the following links:
https://openai.com/requests-for-research/#cartpole
http://kvfrans.com/simple-algoritms-for-solving-cartpole/
https://github.com/kvfrans/openai-cartpole

Applying simple policies to a cartpole game

So far, we have randomly picked an action and applied it. Now let us apply some logic to picking the action instead of random chance. The third observation refers to the angle. If the angle is greater than zero, the pole is tilting right, so we move the cart to the right (1). Otherwise, we move the cart to the left (0). Let us look at an example:

1. We define two policy functions as follows:

def policy_logic(env, obs):
    return 1 if obs[2] > 0 else 0

def policy_random(env, obs):
    return env.action_space.sample()

2. Next, we define an experiment function that will run for a specific number of episodes; each episode runs until the game is lost, namely when done is True. We use rewards_max to indicate when to break out of the loop as we do not wish to run the experiment forever:

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = policy(env, obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            if episode_reward > rewards_max:
                break
        rewards[i] = episode_reward
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__, np.min(rewards), np.max(rewards),
                  np.mean(rewards)))

3. We run the experiment 100 times, or until the rewards exceed rewards_max, which is set to 10,000:

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)

We can see that the logically selected actions do better than the randomly selected ones, but not that much better:

Policy:policy_random, Min reward:9.0, Max reward:63.0, Average reward:20.26
Policy:policy_logic, Min reward:24.0, Max reward:66.0, Average reward:42.81

Now let us modify the process of selecting the action further, so that it is based on parameters. The parameters will be multiplied by the observations and the action (0 or 1) will be chosen based on the sign of the multiplication result. Let us modify the random search method so that we initialize the parameters randomly. The code looks as follows:

def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max):
    obs = env.reset()
    done = False
    episode_reward = 0
    if policy.__name__ in ['policy_random']:
        theta = np.random.rand(4) * 2 - 1
    else:
        theta = None
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max)
        # print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__, np.min(rewards), np.max(rewards),
                  np.mean(rewards)))

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)

We can see that random search does improve the results:

Policy:policy_random, Min reward:8.0, Max reward:200.0, Average reward:40.04
Policy:policy_logic, Min reward:25.0, Max reward:62.0, Average reward:43.03

With the random search, we have improved our results to get the max reward of 200. On average, the rewards for random search are lower because random search tries various bad parameters that bring the overall results down. However, we can select the best parameters from all the runs and then, in production, use the best parameters. Let us modify the code to train the parameters first:

def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max, theta):
    obs = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def train(policy, n_episodes, rewards_max):
    env = gym.make('CartPole-v0')
    theta_best = np.empty(shape=[4])
    reward_best = 0
    for i in range(n_episodes):
        if policy.__name__ in ['policy_random']:
            theta = np.random.rand(4) * 2 - 1
        else:
            theta = None
        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
    return reward_best, theta_best

def experiment(policy, n_episodes, rewards_max, theta=None):
    rewards = np.empty(shape=[n_episodes])
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max, theta)
        # print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__, np.min(rewards), np.max(rewards),
                  np.mean(rewards)))
We train for 100 episodes and then use the best parameters to run the experiment for the random search policy:

n_episodes = 100
rewards_max = 10000
reward, theta = train(policy_random, n_episodes, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta, reward))
experiment(policy_random, n_episodes, rewards_max, theta)
experiment(policy_logic, n_episodes, rewards_max)

We find that the trained parameters give us the best result of 200:

trained theta: [-0.14779543  0.93269603  0.70896423  0.84632461], rewards: 200.0
Policy:policy_random, Min reward:200.0, Max reward:200.0, Average reward:200.0
Policy:policy_logic, Min reward:24.0, Max reward:63.0, Average reward:41.94

We may optimize the training code to continue training until we reach a maximum reward. To summarize, we learnt the basics of OpenAI Gym and also applied them to a cartpole game to produce the relevant output.

If you found this post useful, do check out this book Mastering TensorFlow 1.x to build, scale, and deploy deep neural network models using star libraries in Python.