
How-To Tutorials

Getting started with Q-learning using TensorFlow

Savia Lobo
14 Mar 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. This book will help you master advanced concepts of deep learning such as transfer learning, reinforcement learning, generative models and more, using TensorFlow and Keras.[/box] In this tutorial, we will learn about Q-learning and how to implement it using deep reinforcement learning. Q-Learning is a model-free method of finding the optimal policy that can maximize the reward of an agent. During initial gameplay, the agent learns a Q value for each pair of (state, action), also known as the exploration strategy. Once the Q values are learned, then the optimal policy will be to select an action with the largest Q-value in every state, also known as the exploitation strategy. The learning algorithm may end in locally optimal solutions, hence we keep using the exploration policy by setting an exploration_rate parameter. The Q-Learning algorithm is as follows: initialize  Q(shape=[#s,#a])  to  random  values  or  zeroes Repeat  (for  each  episode) observe  current  state  s Repeat select  an  action  a  (apply  explore  or  exploit  strategy) observe  state  s_next  as  a  result  of  action  a update  the  Q-Table  using  bellman's  equation set  current  state  s  =  s_next until  the  episode  ends  or  a  max  reward  /  max  steps  condition  is  reached Until  a  number  of  episodes  or  a  condition  is  reached (such  as  max  consecutive  wins) Q(s, a) in the preceding algorithm represents the Q function. The values of this function are used for selecting the action instead of the rewards, thus this function represents the reward or discounted rewards. The values for the Q-function are updated using the values of the Q function in the future state. The well- known bellman equation captures this update: This basically means that at time step t, in state s, for action a, the maximum future reward (Q) is equal to the reward from the current state plus the max future reward from the next state. Q(s,a) can be implemented as a Q-Table or as a neural network known as a Q-Network. In both cases, the task of the Q-Table or the Q-Network is to provide the best possible action based on the Q value of the given input. The Q-Table-based approach generally becomes intractable as the Q-Table becomes large, thus making neural networks the best candidate for approximating the Q-function through Q-Network. Let us look at both of these approaches in action. Initializing and discretizing for Q-Learning The observations returned by the pole-cart environment involves the state of the environment. The state of pole-cart is represented by continuous values that we need to discretize. If we discretize these values into small state-space, then the agent gets trained faster, but with the caveat of risking the convergence to the optimal policy. 
We use the following helper functions to discretize the state-space of the pole-cart environment:

```python
# discretize a continuous value into one of n_states bins
def discretize(val, bounds, n_states):
    discrete_val = 0
    if val <= bounds[0]:
        discrete_val = 0
    elif val >= bounds[1]:
        discrete_val = n_states - 1
    else:
        discrete_val = int(round((n_states - 1) *
                                 ((val - bounds[0]) /
                                  (bounds[1] - bounds[0]))))
    return discrete_val

def discretize_state(vals, s_bounds, n_s):
    discrete_vals = []
    for i in range(len(n_s)):
        discrete_vals.append(discretize(vals[i], s_bounds[i], n_s[i]))
    return np.array(discrete_vals, dtype=np.int)
```

We discretize the space into 10 units for each of the observation dimensions. You may want to try out different discretization spaces. After the discretization, we find the upper and lower bounds of the observations, and change the bounds of velocity and angular velocity to be between -1 and +1, instead of -Inf and +Inf. The code is as follows:

```python
env = gym.make('CartPole-v0')
n_a = env.action_space.n
# number of discrete states for each observation dimension
n_s = np.array([10, 10, 10, 10])  # position, velocity, angle, angular velocity
s_bounds = np.array(list(zip(env.observation_space.low,
                             env.observation_space.high)))
# the velocity and angular velocity bounds are
# too high, so we bound them between -1 and +1
s_bounds[1] = (-1.0, 1.0)
s_bounds[3] = (-1.0, 1.0)
```

Q-Learning with Q-Table

Since our discretized space has the dimensions [10,10,10,10], our Q-Table has the dimensions [10,10,10,10,2]:

```python
# create a Q-Table of shape (10,10,10,10,2) representing S X A -> R
q_table = np.zeros(shape=np.append(n_s, n_a))
```

We define a Q-Table policy that exploits or explores based on the explore_rate:

```python
def policy_q_table(state, env):
    # Exploration strategy - select a random action
    if np.random.random() < explore_rate:
        action = env.action_space.sample()
    # Exploitation strategy - select the action with the highest Q-value
    else:
        action = np.argmax(q_table[tuple(state)])
    return action
```

Define the episode() function that runs a single episode as follows:

1. Start by initializing the variables and the first state:

```python
obs = env.reset()
state_prev = discretize_state(obs, s_bounds, n_s)

episode_reward = 0
done = False
t = 0
```

2. Select the action and observe the next state:

```python
action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_new = discretize_state(obs, s_bounds, n_s)
```

3. Update the Q-Table:

```python
best_q = np.amax(q_table[tuple(state_new)])
bellman_q = reward + discount_rate * best_q
indices = tuple(np.append(state_prev, action))
q_table[indices] += learning_rate * (bellman_q - q_table[indices])
```

4. Set the next state as the previous state and add the reward to the episode's rewards:

```python
state_prev = state_new
episode_reward += reward
```

The experiment() function calls the episode() function and accumulates the rewards for reporting. You may want to modify the function to check for consecutive wins and other logic specific to your play or games:

```python
# collect observations and rewards for each episode
def experiment(env, policy, n_episodes, r_max=0, t_max=0):
    rewards = np.empty(shape=[n_episodes])
    for i in range(n_episodes):
        val = episode(env, policy, r_max, t_max)
        rewards[i] = val
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))
```
Now, all we have to do is define the parameters, such as learning_rate, discount_rate, and explore_rate, and run the experiment() function as follows:

```python
learning_rate = 0.8
discount_rate = 0.9
explore_rate = 0.2
n_episodes = 1000
experiment(env, policy_q_table, n_episodes)
```

For 1000 episodes, the Q-Table-based policy's maximum reward is 180 with our simple implementation:

```
Policy:policy_q_table, Min reward:8.0, Max reward:180.0, Average reward:17.592
```

Our implementation of the algorithm is deliberately simple. However, you can modify the code to set the explore rate high initially and then decay it as the time-steps pass. Similarly, you can implement decay logic for the learning and discount rates. Let us see if we can get a higher reward with fewer episodes, as our Q-function learns faster.

Q-Learning with Q-Network or Deep Q-Network (DQN)

In the DQN, we replace the Q-Table with a neural network (Q-Network) that learns to respond with the optimal action as we train it continuously on the explored states and their Q-values. Thus, for training the network, we need a place to store the game memory:

1. Implement the game memory using a deque of size 1000:

```python
memory = deque(maxlen=1000)
```

2. Next, build a simple neural network model with one hidden layer, q_nn:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(8, input_dim=4, activation='relu'))
model.add(Dense(2, activation='linear'))
model.compile(loss='mse', optimizer='adam')
model.summary()

q_nn = model
```

The Q-Network looks like this:

```
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 8)                 40
dense_2 (Dense)              (None, 2)                 18
=================================================================
Total params: 58
Trainable params: 58
Non-trainable params: 0
```

The episode() function that executes one episode of the game incorporates the following changes for the Q-Network-based algorithm:

1. After generating the next state, add the states, action, and rewards to the game memory:

```python
action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_next = discretize_state(obs, s_bounds, n_s)

# add the state_prev, action, reward, state_next, done to memory
memory.append([state_prev, action, reward, state_next, done])
```

2. Generate and update the q_values with the maximum future rewards using the Bellman function:

```python
states = np.array([x[0] for x in memory])
states_next = np.array([np.zeros(4) if x[4] else x[3] for x in memory])

q_values = q_nn.predict(states)
q_values_next = q_nn.predict(states_next)

for i in range(len(memory)):
    state_prev, action, reward, state_next, done = memory[i]
    if done:
        q_values[i, action] = reward
    else:
        best_q = np.amax(q_values_next[i])
        bellman_q = reward + discount_rate * best_q
        q_values[i, action] = bellman_q
```

3. Train the q_nn with the states and the q_values we received from memory:

```python
q_nn.fit(states, q_values, epochs=1, batch_size=50, verbose=0)
```

The process of saving gameplay in memory and using it to train the model is also known as memory replay in the deep reinforcement learning literature.
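The excerpt above never shows the policy function that the next step passes to experiment(). A minimal epsilon-greedy policy over the network's predictions might look like the following sketch; the name policy_q_nn matches the call below, but the body is our assumption rather than the book's exact code:

```python
# a sketch of an epsilon-greedy policy over the Q-Network (our assumption,
# not the book's original implementation)
def policy_q_nn(state, env):
    # explore: pick a random action with probability explore_rate
    if np.random.random() < explore_rate:
        action = env.action_space.sample()
    # exploit: pick the action with the highest predicted Q-value
    else:
        action = np.argmax(q_nn.predict(np.array([state]))[0])
    return action
```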
Let us run our DQN-based gameplay as follows:

```python
learning_rate = 0.8
discount_rate = 0.9
explore_rate = 0.2
n_episodes = 100
experiment(env, policy_q_nn, n_episodes)
```

We get a maximum reward of 150, which you can improve upon with hyperparameter tuning, network tuning, and by using rate decay for the discount and explore rates:

```
Policy:policy_q_nn, Min reward:8.0, Max reward:150.0, Average reward:41.27
```

To summarize, we calculated and trained the model at every step. One could change the code to discard the memory replay and retrain the model only on the episodes that return smaller rewards. However, implement this option with caution, as it may slow down your learning, since initial gameplay generates smaller rewards more often.

Do check out the book Mastering TensorFlow 1.x to explore advanced features of TensorFlow 1.x and gain insight into TensorFlow Core, Keras, TF Estimators, TFLearn, TF Slim, Pretty Tensor, and Sonnet.
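As a closing aside, the rate decay suggested above can be sketched as a simple exponential schedule; the decay factor and floor below are our own assumptions to tune per game:

```python
# a sketch of per-episode explore-rate decay (hypothetical constants)
explore_rate_max = 1.0
explore_rate_min = 0.01
explore_decay = 0.995

explore_rate = explore_rate_max
for i_episode in range(n_episodes):
    # an episode would run here, using the current explore_rate
    explore_rate = max(explore_rate_min, explore_rate * explore_decay)
```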

Feature Improvement: Identifying missing values using EDA (Exploratory Data Analysis) technique

Pravin Dhandre
13 Mar 2018
9 min read
Today, we will work towards developing a better sense of data by identifying missing values in a dataset using the Exploratory Data Analysis (EDA) technique and Python packages.

Identifying missing values in data

Our first method of identifying missing values is meant to give us a better understanding of how to work with real-world data. Often, data can have missing values for a variety of reasons; for example, with survey data, some observations may not have been recorded. It is important for us to analyze our data and get a sense of what the missing values are, so we can decide how we want to handle them for our machine learning.

To start, let's dive into a dataset: the Pima Indian Diabetes Prediction dataset. This dataset is available on the UCI Machine Learning Repository at: https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes

From the main website, we can learn a few things about this publicly available dataset. We have nine columns and 768 instances (rows). The dataset is primarily used for predicting the onset of diabetes within five years in females of Pima Indian heritage over the age of 21, given medical details about their bodies. The dataset is meant to correspond to a binary (2-class) classification machine learning problem, namely, the answer to the question: will this person develop diabetes within five years?

The column names are provided as follows (in order):

- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skinfold thickness (mm)
- 2-hour serum insulin measurement (mu U/ml)
- Body mass index (weight in kg / (height in m)^2)
- Diabetes pedigree function
- Age (years)
- Class variable (zero or one)

The goal of the dataset is to predict the final column, the class variable, which indicates whether the patient developed diabetes, using the other eight features as inputs to a machine learning function.

There are two very important reasons we will be working with this dataset:

- We will have to work with missing values
- All of the features we will be working with will be quantitative

The first point makes more sense for now as a reason, because the point of this chapter is to deal with missing values. As far as only choosing to work with quantitative data goes, this will only be the case for this chapter. We do not yet have enough tools to deal with missing values in categorical columns. In the next chapter, when we talk about feature construction, we will deal with that procedure.

The exploratory data analysis (EDA)

To identify our missing values, we will begin with an EDA of our dataset. We will be using some useful Python packages, pandas and numpy, to store our data and make some simple calculations, as well as some popular visualization tools to see what the distribution of our data looks like. Let's begin and dive into some code. First, we will do some imports:

```python
# import packages we need for exploratory data analysis (EDA)
import pandas as pd               # to store tabular data
import numpy as np                # to do some math
import matplotlib.pyplot as plt   # a popular data visualization tool
import seaborn as sns             # another popular data visualization tool
%matplotlib inline
plt.style.use('fivethirtyeight')  # a popular data visualization theme
```

We will import our tabular data through a CSV, as follows:

```python
# load in our dataset using pandas
pima = pd.read_csv('../data/pima.data')
pima.head()
```

The head method allows us to see the first few rows in our dataset.
The output shows that something's not right here: there are no column names. The CSV must not have the column names built into the file. No matter, we can use the data source's website to fill this in, as shown in the following code:

```python
pima_column_names = ['times_pregnant', 'plasma_glucose_concentration',
                     'diastolic_blood_pressure', 'triceps_thickness',
                     'serum_insulin', 'bmi', 'pedigree_function',
                     'age', 'onset_diabetes']

pima = pd.read_csv('../data/pima.data', names=pima_column_names)
pima.head()
```

Now, using the head method again, we can see our columns with the appropriate headers. Much better; now we can use the column names to do some basic stats, selecting, and visualizations. Let's first get our null accuracy, as follows:

```python
pima['onset_diabetes'].value_counts(normalize=True)
# get null accuracy, 65% did not develop diabetes
```

```
0    0.651042
1    0.348958
Name: onset_diabetes, dtype: float64
```

If our eventual goal is to exploit patterns in our data in order to predict the onset of diabetes, let us try to visualize some of the differences between those who developed diabetes and those who did not. Our hope is that the histogram will reveal some sort of pattern, or an obvious difference in values, between the classes of prediction:

```python
# get a histogram of the plasma_glucose_concentration column for
# both classes
col = 'plasma_glucose_concentration'
plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5, label='non-diabetes')
plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5, label='diabetes')
plt.legend(loc='upper right')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title('Histogram of {}'.format(col))
plt.show()
```

It seems that this histogram shows a pretty big difference in plasma_glucose_concentration between the two prediction classes. Let's show the same histogram style for multiple columns, as follows:

```python
for col in ['bmi', 'diastolic_blood_pressure', 'plasma_glucose_concentration']:
    plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5, label='non-diabetes')
    plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5, label='diabetes')
    plt.legend(loc='upper right')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title('Histogram of {}'.format(col))
    plt.show()
```

The preceding code produces three histograms. The first shows the distributions of bmi for the two class variables (non-diabetes and diabetes). The next histogram again shows contrastingly different distributions of a feature across our two class variables; this time we are looking at diastolic_blood_pressure. The final graph shows the plasma_glucose_concentration differences between our two class variables.

We can definitely see some major differences simply by looking at just a few histograms. For example, there seems to be a large jump in plasma_glucose_concentration for those who will eventually develop diabetes. To solidify this, perhaps we can visualize a linear correlation matrix in an attempt to quantify the relationship between these variables. We will use the visualization tool seaborn, which we imported at the beginning of this chapter, for our correlation matrix, as follows:

```python
# look at the heatmap of the correlation matrix of our dataset
sns.heatmap(pima.corr())
# plasma_glucose_concentration definitely seems to be an interesting feature here
```

The result is the correlation matrix of our dataset.
This shows us the correlation amongst the different columns in our Pima dataset. The correlation matrix shows a strong correlation between plasma_glucose_concentration and onset_diabetes. Let's take a further look at the numerical correlations for the onset_diabetes column, with the following code:

```python
pima.corr()['onset_diabetes']  # numerical correlation matrix
# plasma_glucose_concentration definitely seems to be an interesting feature here
```

```
times_pregnant                  0.221898
plasma_glucose_concentration    0.466581
diastolic_blood_pressure        0.065068
triceps_thickness               0.074752
serum_insulin                   0.130548
bmi                             0.292695
pedigree_function               0.173844
age                             0.238356
onset_diabetes                  1.000000
Name: onset_diabetes, dtype: float64
```

We will explore the powers of correlation later, in Chapter 4, Feature Construction, but for now we are using exploratory data analysis (EDA) to hint at the fact that the plasma_glucose_concentration column will be an important factor in our prediction of the onset of diabetes.

Moving on to more important matters at hand, let's see if we are missing any values in our dataset by invoking the built-in isnull() method of the pandas DataFrame:

```python
pima.isnull().sum()
```

```
times_pregnant                  0
plasma_glucose_concentration    0
diastolic_blood_pressure        0
triceps_thickness               0
serum_insulin                   0
bmi                             0
pedigree_function               0
age                             0
onset_diabetes                  0
dtype: int64
```

Great! We don't have any missing values. Let's go on to do some more EDA, first using the shape attribute to see the number of rows and columns we are working with:

```python
pima.shape  # (# rows, # cols)
```

```
(768, 9)
```

This confirms that we have 9 columns (including our response variable) and 768 data observations (rows). Now, let's take a peek at the percentage of patients who developed diabetes, using the following code:

```python
pima['onset_diabetes'].value_counts(normalize=True)
# get null accuracy, 65% did not develop diabetes
```

```
0    0.651042
1    0.348958
Name: onset_diabetes, dtype: float64
```

This shows us that 65% of the patients did not develop diabetes, while about 35% did. We can use a nifty built-in method of a pandas DataFrame called describe to look at some basic descriptive statistics:

```python
pima.describe()  # get some basic descriptive statistics
```

This quickly shows us some basic stats such as the mean, standard deviation, and various percentile measurements of our data. But notice that the minimum value of the bmi column is 0. That is medically impossible; there must be a reason for this. Perhaps the number zero has been encoded as a missing value instead of the None value or a missing cell. Upon closer inspection, we see that the value 0 appears as a minimum value for the following columns:

- times_pregnant
- plasma_glucose_concentration
- diastolic_blood_pressure
- triceps_thickness
- serum_insulin
- bmi
- onset_diabetes

Because zero is a class for onset_diabetes and 0 is actually a viable number for times_pregnant, we may conclude that the number 0 is encoding missing values for:

- plasma_glucose_concentration
- diastolic_blood_pressure
- triceps_thickness
- serum_insulin
- bmi

So, we actually do have missing values! It was obviously not luck that we happened upon the zeros as missing values; we knew it beforehand. As a data scientist, you must be ever vigilant and make sure that you know as much about the dataset as possible in order to find missing values encoded as other symbols. Be sure to read any and all documentation that comes with open datasets, in case it mentions any missing values.
If no documentation is available, some common values used instead of missing values are:

- 0 (for numerical values)
- unknown or Unknown (for categorical variables)
- ? (for categorical variables)

To summarize, we have five columns where missing values have been encoded with a sentinel symbol, in this case the number 0.

[box type="note" align="" class="" width=""]You just read an excerpt from the book Feature Engineering Made Easy, co-authored by Sinan Ozdemir and Divya Susarla. To learn more about missing values and manipulating features, do check out Feature Engineering Made Easy and develop expert proficiency in Feature Selection, Learning, and Optimization.[/box]
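As a practical follow-up to the zero-encoding discovery above (a sketch, not part of the excerpt): a common pandas idiom is to replace the sentinel zeros with NaN so that isnull() can count them. The column list comes from the analysis we just did:

```python
# convert sentinel zeros to proper NaN in the suspect columns
columns_with_zero_as_missing = [
    'plasma_glucose_concentration', 'diastolic_blood_pressure',
    'triceps_thickness', 'serum_insulin', 'bmi'
]
pima[columns_with_zero_as_missing] = (
    pima[columns_with_zero_as_missing].replace(0, np.nan)
)
pima.isnull().sum()  # the encoded zeros now show up as missing values
```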

How to build a cartpole game using OpenAI Gym

Savia Lobo
10 Mar 2018
11 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn advanced features of TensorFlow1.x, such as distributed TensorFlow with TF Clusters, deploy production models with TensorFlow Serving, and more. [/box] Today, we will help you understand OpenAI Gym and how to apply the basics of OpenAI Gym onto a cartpole game. OpenAI Gym 101 OpenAI Gym is a Python-based toolkit for the research and development of reinforcement learning algorithms. OpenAI Gym provides more than 700 opensource contributed environments at the time of writing. With OpenAI, you can also create your own environment. The biggest advantage is that OpenAI provides a unified interface for working with these environments, and takes care of running the simulation while you focus on the reinforcement learning algorithms. Note : The research paper describing OpenAI Gym is available here: http://arxiv.org/abs/1606.01540 You can install OpenAI Gym using the following command: pip3  install  gym Note: If the above command does not work, then you can find further help with installation at the following link: https://github.com/openai/ gym#installation  Let us print the number of available environments in OpenAI Gym: all_env  =  list(gym.envs.registry.all()) print('Total  Environments  in  Gym  version  {}  :  {}' .format(gym.     version     ,len(all_env))) Total  Environments  in  Gym  version  0.9.4  :  777  Let us print the list of all environments: for  e  in  list(all_env): print(e) The partial list from the output is as follows: EnvSpec(Carnival-ramNoFrameskip-v0) EnvSpec(EnduroDeterministic-v0) EnvSpec(FrostbiteNoFrameskip-v4) EnvSpec(Taxi-v2) EnvSpec(Pooyan-ram-v0) EnvSpec(Solaris-ram-v4) EnvSpec(Breakout-ramDeterministic-v0) EnvSpec(Kangaroo-ram-v4) EnvSpec(StarGunner-ram-v4) EnvSpec(Enduro-ramNoFrameskip-v4) EnvSpec(DemonAttack-ramDeterministic-v0) EnvSpec(TimePilot-ramNoFrameskip-v0) EnvSpec(Amidar-v4) Each environment, represented by the env object, has a standardized interface, for example: An env object can be created with the env.make(<game-id-string>) function by passing the id string. Each env object contains the following main functions: The step() function takes an action object as an argument and returns four objects: observation: An object implemented by the environment, representing the observation of the environment. reward: A signed float value indicating the gain (or loss) from the previous action. done: A Boolean value representing if the scenario is finished. info: A Python dictionary object representing the diagnostic information. The render() function creates a visual representation of the environment. The reset() function resets the environment to the original state. Each env object comes with well-defined actions and observations, represented by action_space and observation_space. One of the most popular games in the gym to learn reinforcement learning is CartPole. In this game, a pole attached to a cart has to be balanced so that it doesn't fall. The game ends if either the pole tilts by more than 15 degrees or the cart moves by more than 2.4 units from the center. The home page of OpenAI.com emphasizes the game in these words: The small size and simplicity of this environment make it possible to run very quick experiments, which is essential when learning the basics. The game has only four observations and two actions. The actions are to move a cart by applying a force of +1 or -1. 
The observations are the position of the cart, the velocity of the cart, the angle of the pole, and the rotation rate of the pole. However, knowledge of the semantics of the observations is not necessary to learn to maximize the rewards of the game.

Now let us load a popular game environment, CartPole-v0, and play it with stochastic control:

1. Create the env object with the standard make function:

```python
env = gym.make('CartPole-v0')
```

2. The number of episodes is the number of game plays. We shall set it to one for now, indicating that we just want to play the game once. Since every episode is stochastic, in actual production runs you will run over several episodes and calculate the average values of the rewards. Additionally, we can initialize an array to store the visualization of the environment at every timestep:

```python
n_episodes = 1
env_vis = []
```

3. Run two nested loops: an external loop for the number of episodes and an internal loop for the number of timesteps you would like to simulate. You can either keep running the internal loop until the scenario is done or set the number of steps to a higher value. At the beginning of every episode, reset the environment using env.reset(). At the beginning of every timestep, capture the visualization using env.render():

```python
for i_episode in range(n_episodes):
    observation = env.reset()
    for t in range(100):
        env_vis.append(env.render(mode='rgb_array'))
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished at t{}".format(t + 1))
            break
```

4. Render the environment using the helper function:

```python
env_render(env_vis)
```

5. The code for the helper function is as follows:

```python
# relies on matplotlib.animation (imported as anm) and IPython display
# helpers not shown in this excerpt
def env_render(env_vis):
    plt.figure()
    plot = plt.imshow(env_vis[0])
    plt.axis('off')

    def animate(i):
        plot.set_data(env_vis[i])

    anim = anm.FuncAnimation(plt.gcf(),
                             animate,
                             frames=len(env_vis),
                             interval=20,
                             repeat=True,
                             repeat_delay=20)
    display(display_animation(anim, default_mode='loop'))
```

We get the following output when we run this example:

```
[-0.00666995 -0.03699492 -0.00972623  0.00287713]
[-0.00740985  0.15826516 -0.00966868 -0.29285861]
[-0.00424454 -0.03671761 -0.01552586 -0.00324067]
[-0.0049789  -0.2316135  -0.01559067  0.28450351]
[-0.00961117 -0.42650966 -0.0099006   0.57222875]
[-0.01814136 -0.23125029  0.00154398  0.27644332]
[-0.02276636 -0.0361504   0.00707284 -0.01575223]
[-0.02348937  0.1588694   0.0067578  -0.30619523]
[-0.02031198 -0.03634819  0.00063389 -0.01138875]
[-0.02103895  0.15876466  0.00040612 -0.3038716 ]
[-0.01786366  0.35388083 -0.00567131 -0.59642642]
[-0.01078604  0.54908168 -0.01759984 -0.89089036]
[ 1.95594914e-04  7.44437934e-01 -3.54176495e-02 -1.18905344e+00]
[ 0.01508435  0.54979251 -0.05919872 -0.90767902]
[ 0.0260802   0.35551978 -0.0773523  -0.63417465]
[ 0.0331906   0.55163065 -0.09003579 -0.95018025]
[ 0.04422321  0.74784161 -0.1090394  -1.26973934]
[ 0.05918004  0.55426764 -0.13443418 -1.01309691]
[ 0.0702654   0.36117014 -0.15469612 -0.76546874]
[ 0.0774888   0.16847818 -0.1700055  -0.52518186]
[ 0.08085836  0.3655333  -0.18050913 -0.86624457]
[ 0.08816903  0.56259197 -0.19783403 -1.20981195]
Episode finished at t22
```

It took 22 time-steps for the pole to become unbalanced. At every run, we get a different time-step value because we picked the action stochastically by using env.action_space.sample().
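A small aside, not from the book: if you want repeatable runs while experimenting, environments in Gym versions of this era can be seeded; the exact mechanics depend on your Gym version:

```python
# seeding for reproducibility (behavior varies by Gym version)
env = gym.make('CartPole-v0')
env.seed(42)        # fixes the environment's internal randomness
np.random.seed(42)  # older Gym versions sample actions via numpy's global RNG
obs = env.reset()
```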
Since the game results in a loss so quickly, randomly picking an action and applying it is probably not the best strategy. There are many algorithms for keeping the pole straight for a longer number of time-steps that you can use, such as Hill Climbing, Random Search, and Policy Gradients.

Note: Some of the algorithms for solving the Cartpole game are available at the following links:
https://openai.com/requests-for-research/#cartpole
http://kvfrans.com/simple-algoritms-for-solving-cartpole/
https://github.com/kvfrans/openai-cartpole

Applying simple policies to a cartpole game

So far, we have randomly picked an action and applied it. Now let us apply some logic to picking the action instead of random chance. The third observation refers to the angle. If the angle is greater than zero, the pole is tilting right, so we move the cart to the right (1). Otherwise, we move the cart to the left (0). Let us look at an example:

We define two policy functions as follows:

```python
def policy_logic(env, obs):
    return 1 if obs[2] > 0 else 0

def policy_random(env, obs):
    return env.action_space.sample()
```

Next, we define an experiment function that will run for a specific number of episodes; each episode runs until the game is lost, namely when done is True. We use rewards_max to indicate when to break out of the loop, as we do not wish to run the experiment forever:

```python
def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = policy(env, obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            if episode_reward > rewards_max:
                break
        rewards[i] = episode_reward
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))
```

We run the experiment for 100 episodes, capping each episode's reward at rewards_max, which is set to 10,000:

```python
n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)
```

We can see that the logically selected actions do better than the randomly selected ones, but not by much:

```
Policy:policy_random, Min reward:9.0, Max reward:63.0, Average reward:20.26
Policy:policy_logic, Min reward:24.0, Max reward:66.0, Average reward:42.81
```

Now let us modify the process of selecting the action further, to be based on parameters. The parameters will be multiplied by the observations, and the action will be chosen based on whether the multiplication result is zero or one. Let us modify the random search method so that we initialize the parameters randomly. The code looks as follows:

```python
def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max):
    obs = env.reset()
    done = False
    episode_reward = 0
    if policy.__name__ in ['policy_random']:
        theta = np.random.rand(4) * 2 - 1
    else:
        theta = None
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward
```
```python
def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max)
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)
```

We can see that random search does improve the results:

```
Policy:policy_random, Min reward:8.0, Max reward:200.0, Average reward:40.04
Policy:policy_logic, Min reward:25.0, Max reward:62.0, Average reward:43.03
```

With random search, we have improved our results to get the maximum reward of 200. On average, the rewards for random search are lower, because random search tries various bad parameters that bring the overall results down. However, we can select the best parameters from all the runs and then, in production, use the best parameters. Let us modify the code to train the parameters first:

```python
def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max, theta):
    obs = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def train(policy, n_episodes, rewards_max):
    env = gym.make('CartPole-v0')
    theta_best = np.empty(shape=[4])
    reward_best = 0
    for i in range(n_episodes):
        if policy.__name__ in ['policy_random']:
            theta = np.random.rand(4) * 2 - 1
        else:
            theta = None
        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
    return reward_best, theta_best

def experiment(policy, n_episodes, rewards_max, theta=None):
    rewards = np.empty(shape=[n_episodes])
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max, theta)
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))
```
We train for 100 episodes and then use the best parameters to run the experiment for the random search policy:

```python
n_episodes = 100
rewards_max = 10000
reward, theta = train(policy_random, n_episodes, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta, reward))
experiment(policy_random, n_episodes, rewards_max, theta)
experiment(policy_logic, n_episodes, rewards_max)
```

We find that the trained parameters give us the best result of 200:

```
trained theta: [-0.14779543  0.93269603  0.70896423  0.84632461], rewards: 200.0
Policy:policy_random, Min reward:200.0, Max reward:200.0, Average reward:200.0
Policy:policy_logic, Min reward:24.0, Max reward:63.0, Average reward:41.94
```

We may optimize the training code to continue training until we reach a maximum reward. To summarize, we learnt the basics of OpenAI Gym and applied them to a cartpole game for relevant output.

If you found this post useful, do check out the book Mastering TensorFlow 1.x to build, scale, and deploy deep neural network models using star libraries in Python.
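The "continue training until we reach a maximum reward" idea from the summary could be sketched like this, reusing the train() function above; the batch size and target are our own assumptions:

```python
# a sketch: train in small batches of episodes until the best observed
# reward reaches CartPole-v0's cap of 200
target_reward = 200
reward_best, theta_best = 0, None
while reward_best < target_reward:
    reward, theta = train(policy_random, 10, 10000)
    if reward > reward_best:
        reward_best, theta_best = reward, theta
print('best theta: {}, best reward: {}'.format(theta_best, reward_best))
```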

How to perform data partitioning in PostgreSQL 10

Sugandha Lahoti
09 Mar 2018
11 min read
Partitioning refers to splitting: logically, it means breaking one large table into smaller physical pieces. PostgreSQL supports basic table partitioning. It can store up to 32 TB of data inside a single table using the default 8k blocks. In fact, if we compile PostgreSQL with 32k blocks, we can even put up to 128 TB into a single table. However, large tables like these are not necessarily convenient, and it makes sense to partition tables to make processing easier and, in some cases, a bit faster. With PostgreSQL 10.0, partitioning data has improved and offers significantly easier handling to end users.

In this article, we will cover both the classic way to partition data and the new features available in PostgreSQL 10.0 for data partitioning.

Creating partitions

First, we will learn the old method to partition data. Before digging deeper into the advantages of partitioning, I want to show how partitions can be created. The entire thing starts with a parent table:

```sql
test=# CREATE TABLE t_data (id serial, t date, payload text);
CREATE TABLE
```

In this example, the parent table has three columns. The date column will be used for partitioning, but more on that a bit later. Now that the parent table is in place, the child tables can be created. This is how it works:

```sql
test=# CREATE TABLE t_data_2016 () INHERITS (t_data);
CREATE TABLE
test=# \d t_data_2016
                         Table "public.t_data_2016"
 Column  |  Type   | Modifiers
---------+---------+-----------------------------------------------------
 id      | integer | not null default nextval('t_data_id_seq'::regclass)
 t       | date    |
 payload | text    |
Inherits: t_data
```

The table is called t_data_2016 and inherits from t_data. The () means that no extra columns are added to the child table. As you can see, inheritance means that all columns from the parent are available in the child table. Also note that the id column will inherit the sequence from the parent, so that all children can share the very same numbering.

Let's create more tables:

```sql
test=# CREATE TABLE t_data_2015 () INHERITS (t_data);
CREATE TABLE
test=# CREATE TABLE t_data_2014 () INHERITS (t_data);
CREATE TABLE
```

So far, all the tables are identical and just inherit from the parent. However, there is more: child tables can actually have more columns than their parents. Adding more fields is simple:

```sql
test=# CREATE TABLE t_data_2013 (special text) INHERITS (t_data);
CREATE TABLE
```

In this case, a special column has been added. It has no impact on the parent; it just enriches the children and makes them capable of holding more data.

After creating a handful of tables, a row can be added:

```sql
test=# INSERT INTO t_data_2015 (t, payload) VALUES ('2015-05-04', 'some data');
INSERT 0 1
```

The most important thing now is that the parent table can be used to find all the data in the child tables:

```sql
test=# SELECT * FROM t_data;
 id |     t      |  payload
----+------------+-----------
  1 | 2015-05-04 | some data
(1 row)
```

Querying the parent allows you to gain access to everything below the parent in a simple and efficient manner.
To understand how PostgreSQL does partitioning, it makes sense to take a look at the plan:

```sql
test=# EXPLAIN SELECT * FROM t_data;
                           QUERY PLAN
-----------------------------------------------------------------
 Append (cost=0.00..84.10 rows=4411 width=40)
   -> Seq Scan on t_data (cost=0.00..0.00 rows=1 width=40)
   -> Seq Scan on t_data_2016 (cost=0.00..22.00 rows=1200 width=40)
   -> Seq Scan on t_data_2015 (cost=0.00..22.00 rows=1200 width=40)
   -> Seq Scan on t_data_2014 (cost=0.00..22.00 rows=1200 width=40)
   -> Seq Scan on t_data_2013 (cost=0.00..18.10 rows=810 width=40)
(6 rows)
```

Actually, the process is quite simple. PostgreSQL will simply unify all tables and show us all the content from all the tables inside and below the partition we are looking at. Note that all tables are independent and are just connected logically through the system catalog.

Applying table constraints

What happens if filters are applied?

```sql
test=# EXPLAIN SELECT * FROM t_data WHERE t = '2016-01-04';
                           QUERY PLAN
-----------------------------------------------------------------
 Append (cost=0.00..95.12 rows=23 width=40)
   -> Seq Scan on t_data (cost=0.00..0.00 rows=1 width=40)
        Filter: (t = '2016-01-04'::date)
   -> Seq Scan on t_data_2016 (cost=0.00..25.00 rows=6 width=40)
        Filter: (t = '2016-01-04'::date)
   -> Seq Scan on t_data_2015 (cost=0.00..25.00 rows=6 width=40)
        Filter: (t = '2016-01-04'::date)
   -> Seq Scan on t_data_2014 (cost=0.00..25.00 rows=6 width=40)
        Filter: (t = '2016-01-04'::date)
   -> Seq Scan on t_data_2013 (cost=0.00..20.12 rows=4 width=40)
        Filter: (t = '2016-01-04'::date)
(11 rows)
```

PostgreSQL will apply the filter to all the partitions in the structure. It does not know that the table name is somehow related to the content of the tables. To the database, names are just names and have nothing to do with what you are looking for. This makes sense, of course, as there is no mathematical justification for doing anything else.

The point now is: how can we teach the database that the 2016 table only contains 2016 data, the 2015 table only contains 2015 data, and so on? Table constraints are here to do exactly that. They teach PostgreSQL about the content of those tables and therefore allow the planner to make smarter decisions than before. The feature is called constraint exclusion and helps to dramatically speed up queries in many cases.

The following listing shows how table constraints can be created:

```sql
test=# ALTER TABLE t_data_2013 ADD CHECK (t < '2014-01-01');
ALTER TABLE
test=# ALTER TABLE t_data_2014 ADD CHECK (t >= '2014-01-01' AND t < '2015-01-01');
ALTER TABLE
test=# ALTER TABLE t_data_2015 ADD CHECK (t >= '2015-01-01' AND t < '2016-01-01');
ALTER TABLE
test=# ALTER TABLE t_data_2016 ADD CHECK (t >= '2016-01-01' AND t < '2017-01-01');
ALTER TABLE
```

For each table, a CHECK constraint can be added. PostgreSQL will only create the constraint if all the data in those tables is perfectly correct and every single row satisfies the constraint. In contrast to MySQL, constraints in PostgreSQL are taken seriously and honored under any circumstances. In PostgreSQL, those constraints can overlap; this is not forbidden and can make sense in some cases. However, it is usually better to have non-overlapping constraints, because PostgreSQL then has the option to prune more tables.
Here is what happens after adding those table constraints:

```sql
test=# EXPLAIN SELECT * FROM t_data WHERE t = '2016-01-04';
                           QUERY PLAN
-----------------------------------------------------------------
 Append (cost=0.00..25.00 rows=7 width=40)
   -> Seq Scan on t_data (cost=0.00..0.00 rows=1 width=40)
        Filter: (t = '2016-01-04'::date)
   -> Seq Scan on t_data_2016 (cost=0.00..25.00 rows=6 width=40)
        Filter: (t = '2016-01-04'::date)
(5 rows)
```

The planner is able to remove many of the tables from the query and keep only those which potentially contain the data. The query can greatly benefit from a shorter and more efficient plan. In particular, if those tables are really large, removing them can boost speed considerably.

Modifying inherited structures

Once in a while, data structures have to be modified. The ALTER TABLE clause is here to do exactly that. The question is: how can partitioned tables be modified? Basically, all you have to do is tackle the parent table and add or remove columns. PostgreSQL will automatically propagate those changes to the child tables and ensure that the changes are made to all the relations, as follows:

```sql
test=# ALTER TABLE t_data ADD COLUMN x int;
ALTER TABLE
test=# \d t_data_2016
                         Table "public.t_data_2016"
 Column  |  Type   | Modifiers
---------+---------+-----------------------------------------------------
 id      | integer | not null default nextval('t_data_id_seq'::regclass)
 t       | date    |
 payload | text    |
 x       | integer |
Check constraints:
    "t_data_2016_t_check" CHECK (t >= '2016-01-01'::date AND t < '2017-01-01'::date)
Inherits: t_data
```

As you can see, the column is added to the parent and automatically shows up in the child table. Note that this works for columns, and so on. Indexes are a totally different story. In an inherited structure, every table has to be indexed separately. If you add an index to the parent table, it will only be present on the parent; it won't be deployed on the child tables. Indexing all those columns in all those tables is your task, and PostgreSQL is not going to make those decisions for you. Of course, this can be seen as a feature or as a limitation. On the upside, you could say that PostgreSQL gives you all the flexibility to index things separately and therefore potentially more efficiently. However, people might also argue that deploying all those indexes one by one is a lot more work.

Moving tables in and out of partitioned structures

Suppose you have an inherited structure. Data is partitioned by date, and you want to provide the most recent years to the end user. At some point, you might want to remove some data from the scope of the user without actually touching it; you might want to put the data into some sort of archive, for example. PostgreSQL provides a simple means to achieve exactly that. First, a new parent can be created:

```sql
test=# CREATE TABLE t_history (LIKE t_data);
CREATE TABLE
```

The LIKE keyword allows you to create a table which has exactly the same layout as the t_data table. If you have forgotten which columns the t_data table actually has, this might come in handy, as it saves you a lot of work. It is also possible to include indexes, constraints, and defaults. Then, the table can be moved away from the old parent table and put below the new one. Here is how it works:

```sql
test=# ALTER TABLE t_data_2013 NO INHERIT t_data;
ALTER TABLE
test=# ALTER TABLE t_data_2013 INHERIT t_history;
ALTER TABLE
```

The entire process can of course be done in a single transaction to ensure that the operation stays atomic.
Cleaning up data

One advantage of partitioned tables is the ability to clean up data quickly. Suppose we want to delete an entire year. If the data is partitioned accordingly, a simple DROP TABLE clause can do the job:

```sql
test=# DROP TABLE t_data_2014;
DROP TABLE
```

As you can see, dropping a child table is easy. But what about the parent table? There are dependent objects, and therefore PostgreSQL naturally errors out to make sure that nothing unexpected happens:

```sql
test=# DROP TABLE t_data;
ERROR:  cannot drop table t_data because other objects depend on it
DETAIL: default for table t_data_2013 column id depends on sequence t_data_id_seq
        table t_data_2016 depends on table t_data
        table t_data_2015 depends on table t_data
HINT:  Use DROP ... CASCADE to drop the dependent objects too.
```

The DROP TABLE clause warns us that there are dependent objects and refuses to drop the tables. The CASCADE clause is needed to force PostgreSQL to actually remove those objects, along with the parent table:

```sql
test=# DROP TABLE t_data CASCADE;
NOTICE: drop cascades to 3 other objects
DETAIL: drop cascades to default for table t_data_2013 column id
        drop cascades to table t_data_2016
        drop cascades to table t_data_2015
DROP TABLE
```

Understanding PostgreSQL 10.0 partitioning

For many years, the PostgreSQL community has been working on built-in partitioning. Finally, PostgreSQL 10.0 offers the first implementation of in-core partitioning, which will be covered in this chapter. For now, the partitioning functionality is still pretty basic. However, a lot of infrastructure for future improvements is already in place.

To show you how partitioning works, I have compiled a simple example featuring range partitioning:

```sql
CREATE TABLE data (
    payload integer
) PARTITION BY RANGE (payload);

CREATE TABLE negatives PARTITION OF data
    FOR VALUES FROM (MINVALUE) TO (0);
CREATE TABLE positives PARTITION OF data
    FOR VALUES FROM (0) TO (MAXVALUE);
```

In this example, one partition will hold all negative values while the other one will take care of the positive values. While creating the parent table, you simply specify how you want to partition the data. In PostgreSQL 10.0, there is range partitioning and list partitioning. Support for hash partitioning and the like might be available as soon as PostgreSQL 11.0.

Once the parent table has been created, it is time to create the partitions. To do that, the PARTITION OF clause has been added. At this point, there are still some limitations. The most important one is that a tuple (= a row) cannot move from one partition to another, for example:

```sql
UPDATE data SET payload = -10 WHERE payload = 5;
```

If there were rows satisfying this condition, PostgreSQL would simply error out and refuse to change the value. However, with a good design, it is a bad idea to change the partitioning key anyway. Also, keep in mind that you have to think about indexing each partition.

We learnt both the old way of partitioning data and the new partitioning features introduced in PostgreSQL 10.0.

[box type="note" align="" class="" width=""]You read an excerpt from the book Mastering PostgreSQL 10, written by Hans-Jürgen Schönig. To learn about query optimization, stored procedures, and other techniques in PostgreSQL 10.0, you may check out the book Mastering PostgreSQL 10.[/box]

Learning the Salesforce Analytics Query Language (SAQL)

Amey Varangaonkar
09 Mar 2018
6 min read
Salesforce Einstein offers its own query language for retrieving your data from various sources, called the Salesforce Analytics Query Language (SAQL). The lenses and dashboards in Einstein use SAQL behind the scenes to manipulate data for meaningful visualizations. In this article, we see how to use the Salesforce Analytics Query Language effectively.

Using SAQL

There are three ways to use SAQL in Einstein Analytics:

- Creating steps/lenses: We can use SAQL while creating a lens or step. It is the easiest way of using SAQL. While creating a step, Einstein Analytics provides the flexibility of switching between modes such as Chart Mode, Table Mode, and SAQL Mode. In this chapter, we will use this method for SAQL.
- Analytics REST API: Using this API, the user can access the datasets, lenses, dashboards, and so on. This is a programmatic approach in which you send queries to the Einstein Analytics platform. Einstein Analytics uses the OAuth 2.0 protocol to securely access the platform data. The OAuth protocol is a way of securely authenticating the user without asking them for credentials. The first step in using the Analytics REST API to access Analytics is to authenticate the user using OAuth 2.0.
- Using Dashboard JSON: We can use SAQL while editing the Dashboard JSON. We have already seen the Dashboard JSON in previous chapters. To access the Dashboard JSON, you can open the dashboard in edit mode and press Ctrl + E.

The simplest way of using SAQL is while creating a step or lens; a user can switch between the modes here. To use SAQL for a lens, perform the following steps:

1. Navigate to Analytics Studio | DATASETS and select any dataset. We are going to select Opportunity here. Click on it and it will open a window to create a lens.
2. Switch to SAQL Mode by clicking on the icon in the top-right corner, as shown in the following screenshot:

In SAQL, a query is made up of multiple statements. In the first statement, the query loads the input data from the dataset; subsequent statements operate on it and finally return the result. The user can use the Run Query button to see the results and errors after changing or adding statements. Errors appear at the bottom of the Query editor.

SAQL is made up of statements that take the input dataset, and we build our logic on that. We can add filters, groups, orders, and so on, to this dataset to get the desired output. There are certain ordering rules that need to be followed while creating these statements, and those rules are as follows:

- There can be only one offset in the foreach statement
- The limit statement must come after offset
- The offset statement must come after filter and order
- The order and filter statements can be swapped, as there is no rule governing their relative order

In SAQL, we can perform all the usual mathematical calculations and comparisons. SAQL also supports arithmetic operators, comparison operators, string operators, and logical operators.

Using foreach in SAQL

The foreach statement applies a set of expressions to every row, which is called projection. The foreach statement is mandatory for getting the output of a query. The following is the syntax of the foreach statement:

```
q = foreach q generate expression as 'expression name';
```

Let's look at one example of using the foreach statement:

1. Go to Analytics Studio | DATASETS and select any dataset. We are going to select Opportunity here. Click on it and it will open a window to create a lens.
2. Switch to SAQL Mode by clicking on the icon in the top-right corner.
3. In the Query editor, you will see the following code:

```
q = load "opportunity";
q = group q by all;
q = foreach q generate count() as 'count';
q = limit q 2000;
```

You can see the result of this query just below the Query editor.

4. Now replace the third statement with the following statement:

```
q = foreach q generate sum('Amount') as 'Sum Amount';
```

5. Click on the Run Query button and observe the result.

Using grouping in SAQL

The user can group records with the same value into one group by using group statements. Use the following syntax:

```
q = group rows by fieldName
```

Let's see how to use grouping in SAQL by performing the following steps:

1. Replace the second and third statements with the following statements:

```
q = group q by 'StageName';
q = foreach q generate 'StageName' as 'StageName', sum('Amount') as 'Sum Amount';
```

2. Click on the Run Query button and view the result.

Using filters in SAQL

Filters in SAQL behave just like the where clause in SOQL and SQL, filtering the data as per the condition or clause. In Einstein Analytics, a filter selects the rows from the dataset that satisfy the condition added. The syntax for a filter is as follows:

```
q = filter q by fieldName 'Operator' value
```

Click on Run Query and view the result.

Using functions in SAQL

The beauty of a function is its reusability. Once a function is created, it can be used multiple times. In SAQL, we can use different types of functions, such as string functions, math functions, aggregate functions, windowing functions, and so on. These functions are predefined and can be reused as often as needed. Let's use the math function power. The syntax is power(m, n); the function returns the value of m raised to the nth power. Replace the fourth statement with the following statement:

```
q = foreach q generate 'StageName' as 'StageName', power(sum('Amount'), 1/2) as 'Amount Squareroot', sum('Amount') as 'Sum Amount';
```

Click on the Run Query button.

We saw how to apply different kinds of case-specific functions in Salesforce Einstein to play with data in order to get the desired outcome.

[box type="note" align="" class="" width=""]The above excerpt is taken from the book Learning Einstein Analytics, written by Santosh Chitalkar. It covers techniques to set up and create apps, lenses, and dashboards using Salesforce Einstein Analytics for effective business insights. If you want to know more about these techniques, check out the book Learning Einstein Analytics.[/box]
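For completeness, here is a rough Python sketch of the Analytics REST API route mentioned at the start of this article. The /wave/query endpoint, API version, and payload shape are assumptions to verify against the official Analytics REST API documentation; note also that via the REST API the load statement typically references a dataset ID and version rather than a plain name:

```python
# a hedged sketch: running a SAQL query through the Analytics REST API;
# endpoint path, API version, and dataset reference are assumptions
import requests

instance_url = 'https://yourInstance.salesforce.com'  # hypothetical org URL
access_token = '<OAUTH2_ACCESS_TOKEN>'                # obtained via OAuth 2.0

saql = ('q = load "<datasetId>/<versionId>"; '
        "q = group q by 'StageName'; "
        "q = foreach q generate 'StageName' as 'StageName', "
        "sum('Amount') as 'Sum Amount';")

response = requests.post(
    instance_url + '/services/data/v42.0/wave/query',
    headers={'Authorization': 'Bearer ' + access_token,
             'Content-Type': 'application/json'},
    json={'query': saql},
)
print(response.json())
```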

Spam Filtering - Natural Language Processing Approach
Packt
08 Mar 2018
16 min read

In this article by Jalaj Thanaki, the author of the book Python Natural Language Processing, we discuss how to develop a natural language processing (NLP) application. In this article, we will be developing a spam filter. In order to develop the spam filter we will be using a supervised machine learning (ML) algorithm named logistic regression. You can also use decision trees, Naive Bayes, or support vector machines (SVM). To make this happen, the following steps will be covered: Understand the logistic regression ML algorithm Data collection and exploration Split the dataset into a training dataset and a testing dataset Understanding the logistic regression ML algorithm Let's understand the logistic regression algorithm first. For this classification algorithm, I will give you an intuition for how logistic regression works, and we will see some basic mathematics related to it. Then we will see the spam filtering application. First, we consider binary classes like spam or not-spam, good or bad, win or lose, 0 or 1, and so on, for understanding the algorithm and its application. Suppose I want to classify emails into spam and non-spam (ham) categories, so spam and non-spam are the discrete output labels, or target concepts, here. Our goal is to predict whether a new email is spam or not-spam. Not-spam is also known as ham. In order to build this NLP application we are going to use logistic regression. Let's step back for a while and understand the technicalities of the algorithm first. Here I'm stating the facts related to the mathematics of this algorithm in a very simple manner, so everyone can understand the logic. The general approach for understanding this algorithm is as follows. If you know some part of ML then you can connect the dots, and if you are new to ML then don't worry, because we are going to understand every part, described as follows: We define our hypothesis function, which helps us generate our target output or target concept We define the cost function, or error function, and we choose the error function in such a way that we can derive its partial derivative easily, so we can calculate gradient descent easily Over time we try to minimize the error, so we can generate more accurate labels and classify data accurately In statistics, logistic regression is also called logit regression or the logit model. This algorithm is mostly used as a binary class classifier, which means there should be two different classes into which you want to classify the data. The binary logistic model is used to estimate the probability of a binary response, and it generates the response based on one or more predictor or independent variables, or features. By the way, this algorithm's basic mathematical concepts are used in deep learning (DL) as well. First I want to explain why this algorithm is called logistic regression. The reason is that the algorithm uses the logistic function, or sigmoid function, and that is why it is called logistic regression. Logistic function and sigmoid function are synonyms of each other. We use the sigmoid function as the hypothesis function, and this function belongs to the hypothesis class. Now, what do we mean by a hypothesis function? Well, as we have seen earlier, a machine has to learn the mapping between data attributes and given labels in such a way that it can predict the label for new data. This can be achieved by the machine if it learns this mapping using a mathematical function.
So the mathematical function is called the hypothesis function, which the machine will use to classify the data and predict the labels or target concept. Here, as I said, we want to build a binary classifier, so our label is either spam or ham. Mathematically, I can assign 0 for ham or not-spam and 1 for spam, or vice versa, as per your choice. These mathematically assigned labels are our dependent variables. Now we need our output labels to be either zero or one. Mathematically, we can say that the label is y and y ∈ {0, 1}. So we need to choose the kind of hypothesis function that converts our output value to either zero or one, and the logistic function, or sigmoid function, does exactly that; this is the main reason why logistic regression uses the sigmoid function as the hypothesis function. Logistic or Sigmoid Function Let me provide you with the mathematical equation for the logistic or sigmoid function (Figure 1: Logistic or sigmoid function): g(z) = 1 / (1 + e^(-z)) You can see the plot showing g(z). Here, g(z) = Φ(z). Refer to Figure 2 (Graph of sigmoid or logistic function). From the graph you can see the following facts: If the value of z is greater than or equal to zero, then the logistic function gives an output value of at least 0.5, which we map to the label one. If the value of z is less than zero, then the logistic function generates an output below 0.5, which we map to the label zero. This thresholding condition for the logistic function is summarized in Figure 3 (Logistic function mathematical property). Because of the preceding mathematical property, we can use this function to perform binary classification. Now it's time to show how this sigmoid function is represented as the hypothesis function. Refer to Figure 4 (Hypothesis function for logistic regression). If we take the preceding equation and substitute the value of z with θTx, then the equation given in Figure 1 gets converted to the following (Figure 5: Actual hypothesis function after mathematical manipulation): hθ(x) = 1 / (1 + e^(-θTx)) Here hθ(x) is the hypothesis function, θT is the transpose of the matrix of parameters for the features or independent variables, and x stands for all independent variables, or all possible feature sets. In order to generate the hypothesis equation, we replace the z value of the logistic function with θTx. By using the hypothesis equation the machine actually tries to learn the mapping between input variables, or input features, and output labels. Let's talk a bit about the interpretation of this hypothesis function. For logistic regression, can you think of the best way to predict the class label? We can predict the target class label by using the concept of probability. We need to generate the probability for both classes, and whichever class has the higher probability is assigned to that particular instance of features. So in binary classification, the value of y, the target class, is either zero or one. If you are familiar with probability, you can represent the probability equation as given in Figure 6 (Interpretation of hypothesis function using probabilistic representation). For those not familiar with probability, P(y=1|x;θ) can be read like this: the probability of y = 1, given x, parameterized by θ. In simple language, the hypothesis function generates the probability value for target output 1, where we give the feature matrix x and some parameter θ. This is an intuitive concept, so for a while, you can keep all of this in mind.
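To make the hypothesis function concrete, here is a minimal NumPy sketch of the sigmoid and the resulting hypothesis. The parameter values and feature vector below are made-up illustrations, not from the book:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x): the probability that y = 1 given x
    return sigmoid(np.dot(theta, x))

theta = np.array([0.5, -0.25])  # hypothetical learned parameters
x = np.array([1.0, 2.0])        # a feature vector (first entry is the bias term)
p = hypothesis(theta, x)
print(p, "-> spam" if p >= 0.5 else "-> ham")

Note how the 0.5 threshold implements exactly the z >= 0 property discussed above.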
Later on, I will give you the reason why we need to generate probabilities, as well as show how we can generate probability values for each of the classes. Here we complete the first step of the general approach to understanding logistic regression. Cost or Error function for logistic regression First, let's understand what a cost function, or error function, is. Cost function, loss function, and error function are all the same thing. In ML it is a very important concept, so here we look at the definition of the cost function and the purpose of defining it. The cost function is the function we use to check how accurately our ML classifier performs. In our training dataset we have data and we have labels. When we use the hypothesis function and generate the output, we need to check how near we are to the actual prediction. If we predict the actual output label, then the difference between our hypothesis function output and the actual label is zero or minimal, and if our hypothesis function output and the actual label are not the same, then we have a big difference between them. Suppose the actual label of an email is spam, which is 1; if our hypothesis function also generates the result 1, then the difference between the actual target value and the predicted output value is zero, and therefore the error in prediction is also zero. If our predicted output is 1 and the actual output is zero, then we have the maximum error between our actual target concept and the prediction. So it is important for us to have minimum error in our prediction. This is the very basic concept of the error function. We will get into the mathematics in a few minutes. There are several types of error functions available, such as r-squared error, sum of squared error, and so on. As per the ML algorithm and the hypothesis function, the error function also changes. Now you will want to know what the error function for logistic regression will be, and, since I have put θ in our hypothesis function, you will also want to know what θ is and how to approach choosing its value. Here I will give all the answers. Let me give you some background on what we used to do in linear regression, as it will help you to understand logistic regression. In linear regression we generally use the sum of squared errors, or residuals, as the cost function. Just to give you background on the sum of squared error: in linear regression we try to generate the line of best fit for our dataset. As in the example stated earlier, given height, I want to predict weight. We first draw a line and measure the distance from each of the data points to the line. We square these distances, sum them, and try to minimize this error function. Refer to Figure 7 (Sum of squared error representation for reference). You can see the distance of each data point from the line, denoted using a red line; we take these distances, square them, and sum them. We use this error function in linear regression, and we generate its partial derivatives with respect to the slope of the line m and with respect to the intercept b. Every time, we calculate the error and update the values of m and b, so we can generate the line of best fit. The process of updating m and b is called gradient descent. By using gradient descent we update m and b in such a way that our error function reaches its minimum error value, so we can generate the line of best fit.
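As an illustration of that update loop, here is a minimal gradient descent sketch for linear regression on made-up height/weight numbers. The data, learning rate, and iteration count are all our own assumptions:

import numpy as np

# toy data: heights in cm (x) and weights in kg (y) -- made-up numbers
x = np.array([150.0, 160.0, 170.0, 180.0])
y = np.array([50.0, 56.0, 63.0, 70.0])

m, b = 0.0, 0.0
alpha = 0.00001  # learning rate
for _ in range(10000):
    error = (m * x + b) - y
    # partial derivatives of the mean squared error with respect to m and b
    m -= alpha * (2.0 / len(x)) * np.dot(error, x)
    b -= alpha * (2.0 / len(x)) * np.sum(error)
print("slope m:", m, "intercept b:", b)

Each pass moves m and b a small step against the gradient, which is exactly the direction that reduces the squared error.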
Gradient descent gives us the direction in which we need to move the line so we can generate the line of best fit. You can find a detailed example in Chapter 9, Deep Learning for NLU and NLG Problems. So, by defining the error function and generating partial derivatives, we can apply the gradient descent algorithm, which helps us minimize our error or cost function. Now back to the main question: which error function can we use for logistic regression? Do you think we can use the same sum of squared error function for logistic regression as well? If you know functions and calculus very well, then your answer is probably no. That is the correct answer. Let me explain this for those who aren't familiar with functions and calculus. This is important, so be careful. In linear regression our hypothesis function is linear, so it is very easy for us to calculate the sum of squared errors, but here we are using the sigmoid function, which is a non-linear function. If you apply the same function that we used in linear regression, it will not turn out well, because if you take the sigmoid function, put it into the sum of squared error function, and try to visualize all possible values, then you will get a non-convex curve. Refer to Figure 8 (Non-convex and convex functions; image credit: http://www.yuthon.com/images/non-convex_and_convex_function.png). In machine learning we mostly use functions that produce a convex curve, because then we can use the gradient descent algorithm to minimize the error function and reach the global minimum with certainty. As you saw in Figure 8, a non-convex curve has many local minima, so reaching the global minimum is very challenging and very time consuming, because you would then need to apply second-order or nth-order optimization; with a convex curve you can reach the global minimum with certainty, and quickly as well. So if we plug our sigmoid function into the sum of squared error, we get a non-convex function, and therefore we are not going to use the error function we used in linear regression. Instead, we need to define a different cost function that is convex, so we can apply gradient descent and find the global minimum. Here we use the statistical concept called likelihood. To derive the likelihood function, we use the probability equation given in Figure 6 and consider all the data points in the training set; this gives the likelihood function shown in Figure 9 (Likelihood function for logistic regression; image credit: http://cs229.stanford.edu/notes/cs229-notes1.pdf). Now, in order to simplify the derivative process, we convert the likelihood function into a monotonically increasing function by taking the natural logarithm of the likelihood function; this is called the log likelihood. This log likelihood is our cost function for logistic regression. See the equation given in Figure 10 (Cost function for logistic regression). To gain some intuition about the given cost function, we will plot it and understand what benefit it provides us. Here, on the x axis, we have our hypothesis function. Our hypothesis function's range is 0 to 1, so we have these two points on the x axis. Start with the first case, where y = 1.
You can see the generated curve on the top right-hand side of Figure 11 (Logistic function cost function graphs). If you look at any log function plot and then flip that curve (because here we have a negative sign), you get the same curve as plotted in Figure 11. You can see the log graph as well as the flipped graph in Figure 12 (comparing the log(x) and -log(x) graphs for a better understanding of the cost function; image credit: http://www.sosmath.com/algebra/logs/log4/log42/log422/gl30.gif). Here we are interested in the values 0 and 1, so we consider the part of the graph depicted in Figure 11. This cost function has some interesting and useful properties. If the predicted, or candidate, label is the same as the actual target label, then the cost is zero. You can put it like this: if y = 1 and the hypothesis function predicts hθ(x) = 1, then the cost is 0; but if hθ(x) tends towards 0, then the cost function blows up to ∞. For y = 0, you can see the graph on the top left-hand side of Figure 11. This case has the same kinds of advantages and properties that we saw earlier: the cost goes to ∞ when the actual value is 0 but the hypothesis function predicts 1, and if the hypothesis function predicts 0 and the actual target is also 0, then cost = 0. As I told you earlier, here is the reason we choose this cost function: it makes our optimization easy, as we use the maximum log likelihood, and the function has a convex curve that helps us run gradient descent. In order to apply gradient descent, we need to generate the partial derivative with respect to θ, which yields the equation given in Figure 13 (Partial derivative for performing gradient descent; image credit: http://2.bp.blogspot.com). This equation is used for updating the parameter values of θ, and α here defines the learning rate. This is the parameter that controls how fast or slow your algorithm learns, or trains. If you set the learning rate too high, then the algorithm overshoots and cannot learn, and if you set it too low, then it takes a lot of time to train. So you need to choose the learning rate wisely. Now let's start building the spam filtering application. Data loading and exploration To build the spam filtering application we need a dataset. Here we are using a small dataset, which is simple and straightforward. This dataset has two attributes. The first attribute is the label and the second attribute is the text content of the email. Let's discuss the first attribute a bit more. The presence of the label makes this a tagged dataset. The label indicates whether the email content belongs to the spam category or the ham category. Let's jump into the practical part. Here we are using numpy, pandas, and scikit-learn as dependency libraries. Let's explore our dataset first. We read the dataset using the pandas library. I have also checked how many total data records we have, along with basic details of the dataset. Once we load the data, we check its first ten records, and then we replace the spam and ham categories with numbers.
As we have seen, a machine can understand only numerical formats, so here all ham labels are converted into 0 and all spam labels are converted into 1 (Figure 14: Code snippet for converting labels into numerical format). Split dataset into training dataset and testing dataset In this part we divide our dataset into two parts: one part is called the training set and the other part is called the testing set (Figure 15: Code snippet for dividing the dataset into a training dataset and a testing dataset). We divide the dataset into two parts because we perform training using the training dataset; once our ML algorithm is trained on that dataset, it generates an ML model. After that, we feed the testing dataset into the generated ML model, and the model produces predictions. Based on those predictions, we evaluate our ML model.
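These steps can be sketched end to end as follows. This is a minimal illustration, not the book's exact code: the file name spam.csv, the column names, and the use of CountVectorizer to turn raw text into numeric features are all our own assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# load the tagged dataset (hypothetical file and column names)
df = pd.read_csv("spam.csv", names=["label", "text"])
print(df.head(10))                                    # inspect the first ten records
df["label"] = df["label"].map({"ham": 0, "spam": 1})  # ham -> 0, spam -> 1

# split into a training dataset and a testing dataset
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42)

# turn the email text into numeric features
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# train logistic regression and evaluate on the held-out set
model = LogisticRegression()
model.fit(X_train_vec, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test_vec)))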

Learning Dependency Injection (DI)
Packt
08 Mar 2018
15 min read

In this article by Sherwin John Calleja Tragura, author of the book Spring 5.0 Cookbook, we will learn about the implementation of the Spring container using XML and JavaConfig, and also about managing beans in an XML-based container. In this article, you will learn how to: Implement the Spring container using XML Implement the Spring container using JavaConfig Manage the beans in an XML-based container Implementing the Spring container using XML Let us begin with the creation of the Spring Web Project using the Maven plugin of our STS Eclipse 8.3. This web project will implement our first Spring 5.0 container using the XML-based technique. This is the most conventional, but robust, way of creating the Spring container. The container is where the objects are created, managed, wired together with their dependencies, and monitored from their initialization up to their destruction. This recipe will mainly highlight how to create an XML-based Spring container. Getting ready Create a Maven project ready for development using STS Eclipse 8.3. Be sure to have installed the correct JRE. Let us name the project ch02-xml. How to do it… After creating the project, certain Maven errors will be encountered. Fix the Maven issues of our ch02-xml project in order to use the XML-based Spring 5.0 container by performing the following steps: Open pom.xml of the project and add the following properties, which contain the Spring build version and Servlet container to utilize: <properties> <spring.version>5.0.0.BUILD-SNAPSHOT</spring.version> <servlet.api.version>3.1.0</servlet.api.version> </properties> Add the following Spring 5 dependencies inside pom.xml. These dependencies are essential in providing us with the interfaces and classes to build our Spring container: <dependencies> <dependency> <groupId>org.springframework</groupId> <artifactId>spring-context</artifactId> <version>${spring.version}</version> </dependency> <dependency> <groupId>org.springframework</groupId> <artifactId>spring-core</artifactId> <version>${spring.version}</version> </dependency> <dependency> <groupId>org.springframework</groupId> <artifactId>spring-beans</artifactId> <version>${spring.version}</version> </dependency> </dependencies> It is required to add the following repositories, from which the Spring 5.0 dependencies in Step 2 will be downloaded: <repositories> <repository> <id>spring-snapshots</id> <name>Spring Snapshots</name> <url>https://repo.spring.io/libs-snapshot</url> <snapshots> <enabled>true</enabled> </snapshots> </repository> </repositories> Then add the Maven plugin for deployment, but be sure that web.xml is recognized as the deployment descriptor. This can be done by enabling <failOnMissingWebXml> or just deleting the <configuration> tag, as follows: <plugin> <artifactId>maven-war-plugin</artifactId> <version>2.3</version> </plugin> Follow the Tomcat Maven plugin for deployment, as explained in Chapter 1. After the Maven configuration details, check if there is a WEB-INF folder inside src/main/webapp. If there is none, create one. This is mandatory for this project since we will be using a deployment descriptor (or web.xml). Inside the WEB-INF folder, create a deployment descriptor or drop a web.xml template into the src/main/webapp/WEB-INF directory. Then, create an XML-based Spring container named ch02-beans.xml inside the ch02-xml/src/main/java/ directory.
The configuration file must contain the following namespaces and tags: <?xml version="1.0" encoding="UTF-8"?> <beans xmlns="http://www.springframework.org/schema/beans" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:context="http://www.springframework.org/schema/context" xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd"> </beans> You can generate this file using the STS Eclipse Wizard (Ctrl-N), under the Spring module's Spring Bean Configuration File option. Save all the files. Clean and build the Maven project. Do not deploy yet, because this is just a standalone project at the moment. How it works… This project just imported three major Spring 5.0 libraries, namely Spring-Core, Spring-Beans, and Spring-Context, because the major classes and interfaces for creating the container are found in these libraries. This shows that Spring, unlike other frameworks, does not need an entire load of libraries just to set up the initial platform. Spring can be perceived as a huge enterprise framework nowadays, but internally it is still lightweight. The basic container that manages objects in Spring is provided by the org.springframework.beans.factory.BeanFactory interface and can only be found in the Spring-Beans module. Once additional features are needed, such as message resource handling, AOP capabilities, application-specific contexts, and listener implementation, the sub-interface of BeanFactory, namely the org.springframework.context.ApplicationContext interface, is used instead. This ApplicationContext, found in the Spring-Context module, is the one that provides an enterprise-specific container for all applications, because it encompasses a larger scope of Spring components than the BeanFactory interface. The container created, ch02-beans.xml, an ApplicationContext, is an XML-based configuration that contains XSD schemas from the three main libraries imported. These schemas have tag libraries and bean properties which are essential in managing the whole framework. But beware of runtime errors once libraries are removed from the dependencies, because using these tags is equivalent to using the libraries per se. The final Spring Maven project directory structure must look like this: Implementing the Spring container using JavaConfig Another option for implementing the Spring 5.0 container is through the use of Spring JavaConfig. This is a technique that uses pure Java classes to configure the framework's container. This technique eliminates the use of bulky and tedious XML metadata and also provides a type-safe and refactoring-free approach to configuring entities or collections of objects in the container. This recipe will showcase how to create the container using JavaConfig in a web.xml-less approach. Getting ready Create another Maven project and name the project ch02-jc. This STS Eclipse project will use a pure Java class approach, including for its deployment descriptor. How to do it… To get rid of the usual Maven bugs, immediately open the pom.xml of ch02-jc and add <properties>, <dependencies>, and <repositories> equivalent to what was added in the Implementing the Spring container using XML recipe. Next, get rid of web.xml. Since the Servlet 3.0 specification was implemented, servlet containers can now support projects without web.xml. This is done by implementing the handler interface called org.springframework.web.WebApplicationInitializer to programmatically configure the ServletContext.
Create a SpringWebinitializer class and override its onStartup() method, without any implementation yet: public class SpringWebinitializer implements WebApplicationInitializer { @Override public void onStartup(ServletContext container) throws ServletException { } } The lines in Step 2 will generate some errors until you add the following Maven dependency: <dependency> <groupId>org.springframework</groupId> <artifactId>spring-web</artifactId> <version>${spring.version}</version> </dependency> In pom.xml, disable <failOnMissingWebXml>. After the Maven details, create a class named BeanConfig, the ApplicationContext definition, bearing the annotation @Configuration at the top of it. The class must be inside the org.packt.starter.ioc.context package and must be an empty class at the moment: @Configuration public class BeanConfig { } Save all the files, and clean and build the Maven project. How it works… The Maven project ch02-jc makes use of both JavaConfig and ServletContainerInitializer, meaning there will be no XML configuration from the servlet to the Spring 5.0 containers. The BeanConfig class is the ApplicationContext of the project; it bears the annotation @Configuration, indicating that the class is used by JavaConfig as a source of bean definitions. This is handy compared with creating an XML-based configuration with lots of metadata. On the other hand, ch02-jc implemented org.springframework.web.WebApplicationInitializer, which is a handler for org.springframework.web.SpringServletContainerInitializer, the framework's implementation class for the servlet's ServletContainerInitializer. SpringServletContainerInitializer delegates to each WebApplicationInitializer during the execution of its onStartup(ServletContext), allowing the programmatic registration of filters, servlets, and listeners provided by the ServletContext. Eventually, the servlet container will acknowledge the status reported by SpringServletContainerInitializer, thus eliminating the use of web.xml. On Maven's side, the deployment plugin must be notified that the project will not use web.xml. This is done by setting <failOnMissingWebXml> to false inside its <configuration> tag. The final Spring Web Project directory structure must look like the following structure: Managing the beans in an XML-based container Frameworks become popular because of the principles behind the architectures they are built on. Each framework is built from different design patterns that manage the creation and behavior of the objects it manages. This recipe will detail how Spring 5.0 manages the objects of applications and how it shares a set of methods and functions across the platform. Getting ready The two Maven projects previously created will be utilized to illustrate how Spring 5.0 loads objects into heap memory. We will also be utilizing the ApplicationContext rather than the BeanFactory container, in preparation for the coming recipes involving more Spring components. How to do it… With our ch02-xml, let us demonstrate how Spring loads objects using the XML-based ApplicationContext container: Create a package layer, org.packt.starter.ioc.model, for our model classes. Our model classes will be typical Plain Old Java Objects (POJO), for which the Spring 5.0 architecture is known.
Inside the newly created package, create the classes Employee and Department, which contain the following blueprints: public class Employee { private String firstName; private String lastName; private Date birthdate; private Integer age; private Double salary; private String position; private Department dept; public Employee(){ System.out.println(" an employee is created."); } public Employee(String firstName, String lastName, Date birthdate, Integer age, Double salary, String position, Department dept) { this.firstName = firstName; this.lastName = lastName; this.birthdate = birthdate; this.age = age; this.salary = salary; this.position = position; this.dept = dept; System.out.println(" an employee is created."); } // getters and setters } public class Department { private Integer deptNo; private String deptName; public Department() { System.out.println("a department is created."); } // getters and setters } Afterwards, open the ApplicationContext ch02-beans.xml. Register, using the <bean> tag, our first set of Employee and Department objects as follows: <bean id="empRec1" class="org.packt.starter.ioc.model.Employee" /> <bean id="dept1" class="org.packt.starter.ioc.model.Department" /> The beans in Step 3 contain private instance variables that have zero and null default values. To update them, our classes have mutators, or setter methods, that can be used to avoid the NullPointerException that always happens when we immediately use empty objects. In Spring, calling these setters is tantamount to injecting data into the <bean>, similar to how the following objects are created: <bean id="empRec2" class="org.packt.starter.ioc.model.Employee"> <property name="firstName"><value>Juan</value></property> <property name="lastName"><value>Luna</value></property> <property name="age"><value>70</value></property> <property name="birthdate"><value>October 28, 1945</value></property> <property name="position"> <value>historian</value></property> <property name="salary"><value>150000</value></property> <property name="dept"><ref bean="dept2"/></property> </bean> <bean id="dept2" class="org.packt.starter.ioc.model.Department"> <property name="deptNo"><value>13456</value></property> <property name="deptName"> <value>History Department</value></property> </bean> A <property> tag is equivalent to a setter definition accepting an actual value or an object reference. The name attribute defines the name of the setter, minus the set prefix, converted to camel-case notation. The value attribute, or the <value> tag, pertains to supported Spring-type values (for example, int, double, float, Boolean, String). The ref attribute, or <ref>, provides a reference to another loaded <bean> in the container. Another way of writing the bean object empRec2 is through the use of the ref and value attributes, as follows: <bean id="empRec3" class="org.packt.starter.ioc.model.Employee"> <property name="firstName" value="Jose"/> <property name="lastName" value="Rizal"/> <property name="age" value="101"/> <property name="birthdate" value="June 19, 1950"/> <property name="position" value="scriber"/> <property name="salary" value="90000"/> <property name="dept" ref="dept3"/> </bean> <bean id="dept3" class="org.packt.starter.ioc.model.Department"> <property name="deptNo" value="56748"/> <property name="deptName" value="Communication Department" /> </bean> Another way of updating the private instance variables of the model objects is to make use of the constructors.
Actual Spring data and object references can be injected into the bean through constructor metadata: <bean id="empRec5" class="org.packt.starter.ioc.model.Employee"> <constructor-arg><value>Poly</value></constructor-arg> <constructor-arg><value>Mabini</value></constructor-arg> <constructor-arg><value> August 10, 1948</value></constructor-arg> <constructor-arg><value>67</value></constructor-arg> <constructor-arg><value>45000</value></constructor-arg> <constructor-arg><value>Linguist</value></constructor-arg> <constructor-arg><ref bean="dept3"></ref></constructor-arg> </bean> After all the modifications, save ch02-beans.xml. Create a TestBeans class inside the src/test/java directory. This class will load the XML configuration resource into the ApplicationContext container through org.springframework.context.support.ClassPathXmlApplicationContext and fetch all the objects created through its getBean() method. public class TestBeans { public static void main(String args[]){ ApplicationContext context = new ClassPathXmlApplicationContext("ch02-beans.xml"); System.out.println("application context loaded."); System.out.println("****The empRec1 bean****"); Employee empRec1 = (Employee) context.getBean("empRec1"); System.out.println("****The empRec2*****"); Employee empRec2 = (Employee) context.getBean("empRec2"); Department dept2 = empRec2.getDept(); System.out.println("First Name: " + empRec2.getFirstName()); System.out.println("Last Name: " + empRec2.getLastName()); System.out.println("Birthdate: " + empRec2.getBirthdate()); System.out.println("Salary: " + empRec2.getSalary()); System.out.println("Dept. Name: " + dept2.getDeptName()); System.out.println("****The empRec5 bean****"); Employee empRec5 = context.getBean("empRec5", Employee.class); Department dept3 = empRec5.getDept(); System.out.println("First Name: " + empRec5.getFirstName()); System.out.println("Last Name: " + empRec5.getLastName()); System.out.println("Dept. Name: " + dept3.getDeptName()); } } The expected output after running the main() thread will be: an employee is created. an employee is created. a department is created. an employee is created. a department is created. an employee is created. a department is created. application context loaded. *********The empRec1 bean *************** *********The empRec2 bean *************** First Name: Juan Last Name: Luna Birthdate: Sun Oct 28 00:00:00 CST 1945 Salary: 150000.0 Dept. Name: History Department *********The empRec5 bean *************** First Name: Poly Last Name: Mabini Dept. Name: Communication Department How it works… The principle behind creating <bean> objects in the container is called the Inversion of Control design pattern. In order to use the objects, their dependencies, and also their behavior, these must be placed within the framework per se. After registering them in the container, Spring takes care of their instantiation and their availability to other objects. Developers can just "fetch" them if they want to include them in their software modules, as shown in the following diagram: The IoC design pattern can be seen as synonymous with the Hollywood Principle ("Don't call us, we'll call you!"), a popular principle in most object-oriented frameworks. The framework does not care whether the developer needs the objects or not, because the lifespan of the objects lies in the framework's rules.
In the case of setting new values or updating the values of an object's private variables, IoC has an implementation that can be used for "injecting" new actual values or object references into the beans, popularly known as the Dependency Injection (DI) design pattern. This principle exposes all the bean's properties to the public through its setter methods or its constructors. Injecting Spring values and object references into setter method signatures using the <property> tag, without knowing the implementation, is called the Method Injection type of DI. On the other hand, if we create the bean with initialized values injected through its constructor using <constructor-arg>, it is known as Constructor Injection. To create the ApplicationContext container, we need to instantiate ClassPathXmlApplicationContext or FileSystemXmlApplicationContext, depending on the location of the XML definition file. Since the file is found in ch02-xml/src/main/java/, the ClassPathXmlApplicationContext implementation is the best option. This shows that the ApplicationContext is an object too, bearing all that XML metadata. It has several overloaded getBean() methods used to fetch the objects loaded into it. Summary In this article we went over how to create an XML-based Spring container, how to create the container using JavaConfig in a web.xml-less approach, and how Spring 5.0 manages the objects of applications and shares a set of methods and functions across the platform.
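As a closing illustration, here is a sketch of how the empty BeanConfig class from the JavaConfig recipe could register the same kind of beans without any XML. This is our own example, not code from the book; the setter calls simply assume the standard getters and setters mentioned for the model classes:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.packt.starter.ioc.model.Department;
import org.packt.starter.ioc.model.Employee;

@Configuration
public class BeanConfig {

    @Bean
    public Department dept2() {
        // the Java equivalent of the <bean id="dept2"> XML definition
        Department dept = new Department();
        dept.setDeptNo(13456);
        dept.setDeptName("History Department");
        return dept;
    }

    @Bean
    public Employee empRec2() {
        // setter (method) injection expressed in plain Java
        Employee emp = new Employee();
        emp.setFirstName("Juan");
        emp.setLastName("Luna");
        emp.setDept(dept2());
        return emp;
    }
}

Such a configuration would then be loaded with new AnnotationConfigApplicationContext(BeanConfig.class), after which getBean() works exactly as in the XML-based TestBeans example.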

4 common challenges in Web Scraping and how to handle them
Amarabha Banerjee
08 Mar 2018
13 min read

[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] In this article, we will explore primary challenges of Web Scraping and how to get away with it easily. Developing a reliable scraper is never easy, there are so many what ifs that we need to take into account. What if the website goes down? What if the response returns unexpected data? What if your IP is throttled or blocked? What if authentication is required? While we can never predict and cover all what ifs, we will discuss some common traps, challenges, and workarounds. Note that several of the recipes require access to a website that I have provided as a Docker container. They require more logic than the simple, static site we used in earlier chapters. Therefore, you will need to pull and run a Docker container using the following Docker commands: docker pull mheydt/pywebscrapecookbook docker run -p 5001:5001 pywebscrapecookbook Retrying failed page downloads Failed page requests can be easily handled by Scrapy using retry middleware. When installed, Scrapy will attempt retries when receiving the following HTTP error codes: [500, 502, 503, 504, 408] The process can be further configured using the following parameters: RETRY_ENABLED (True/False - default is True) RETRY_TIMES (# of times to retry on any errors - default is 2) RETRY_HTTP_CODES (a list of HTTP error codes which should be retried - default is [500, 502, 503, 504, 408]) How to do it The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries. The script file contains the following configuration for Scrapy: process = CrawlerProcess({ 'LOG_LEVEL': 'DEBUG', 'DOWNLOADER_MIDDLEWARES': { "scrapy.downloadermiddlewares.retry.RetryMiddleware": 500 }, 'RETRY_ENABLED': True, 'RETRY_TIMES': 3 }) process.crawl(Spider) process.start() How it works Scrapy will pick up the configuration for retries as specified when the spider is run. When encountering errors, Scrapy will retry up to three times before giving up. Supporting page redirects Page redirects in Scrapy are handled using redirect middleware, which is enabled by default. The process can be further configured using the following parameters: REDIRECT_ENABLED: (True/False - default is True) REDIRECT_MAX_TIMES: (The maximum number of redirections to follow for any single request - default is 20) How to do it The script in 06/02_scrapy_redirects.py demonstrates how to configure Scrapy to handle redirects. This configures a maximum of two redirects for any page. Running the script reads the NASA sitemap and crawls that content. This contains a large number of redirects, many of which are redirects from HTTP to HTTPS versions of URLs. There will be a lot of output, but here are a few lines demonstrating the output: Parsing: <200 https://www.nasa.gov/content/earth-expeditions-above/> ['http://www.nasa.gov/content/earth-expeditions-above', 'https://www.nasa.gov/content/earth-expeditions-above'] This particular URL was processed after one redirection, from an HTTP to an HTTPS version of the URL. The list defines all of the URLs that were involved in the redirection. You will also be able to see where redirection exceeded the specified level (2) in the output pages. 
The following is one example: 2017-10-22 17:55:00 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://www.nasa.gov/topics/journeytomars/news/index.html>: max redirections reached How it works The spider is defined as the following: class Spider(scrapy.spiders.SitemapSpider): name = 'spider' sitemap_urls = ['https://www.nasa.gov/sitemap.xml'] def parse(self, response): print("Parsing: ", response) print (response.request.meta.get('redirect_urls')) This is identical to our previous NASA sitemap based crawler, with the addition of one line printing the redirect_urls. In any call to parse, this metadata will contain all redirects that occurred to get to this page. The crawling process is configured with the following code: process = CrawlerProcess({ 'LOG_LEVEL': 'DEBUG', 'DOWNLOADER_MIDDLEWARES': { "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 500 }, 'REDIRECT_ENABLED': True, 'REDIRECT_MAX_TIMES': 2 }) Redirect is enabled by default, but this sets the maximum number of redirects to 2 instead of the default of 20. Waiting for content to be available in Selenium A common problem with dynamic web pages is that even after the whole page has loaded, and hence the get() method in Selenium has returned, there still may be content that we need to access later, as there are outstanding Ajax requests from the page that are still pending completion. An example of this is needing to click a button, but the button not being enabled until all data has been loaded asynchronously into the page. Take the following page as an example: http://the-internet.herokuapp.com/dynamic_loading/2. This page finishes loading very quickly and presents us with a Start button: When pressing the button, we are presented with a progress bar for five seconds: And when this is completed, we are presented with Hello World! Now suppose we want to scrape this page to get the content that is exposed only after the button is pressed and after the wait. How do we do this? How to do it We can do this using Selenium. We will use two features of Selenium. The first is the ability to click on page elements. The second is the ability to wait until an element with a specific ID is available on the page. First, we get the button and click it. The button's HTML is the following: <div id='start'> <button>Start</button> </div> When the button is pressed and the load completes, the following HTML is added to the document: <div id='finish'> <h4>Hello World!</h4> </div> We will use the Selenium driver to find the Start button, click it, and then wait until a div with an ID of 'finish' is available. Then we get that element and return the text in the enclosed <h4> tag. You can try this by running 06/03_press_and_wait.py. Its output will be the following: clicked Hello World! Now let's see how it worked. How it works Let us break down the explanation: We start by importing the required items from Selenium: from selenium import webdriver from selenium.webdriver.support import ui Now we load the driver and the page: driver = webdriver.PhantomJS() driver.get("http://the-internet.herokuapp.com/dynamic_loading/2") With the page loaded, we can retrieve the button: button = driver.find_element_by_xpath("//*/div[@id='start']/button") And then we can click the button: button.click() print("clicked") Next we create a WebDriverWait object: wait = ui.WebDriverWait(driver, 10) With this object, we can request that Selenium's UI wait for certain events. This also sets a maximum wait of 10 seconds.
Now using this, we can wait until we meet a criterion: that an element is identifiable using the following XPath: wait.until(lambda driver: driver.find_element_by_xpath("//*/div[@id='finish']")) When this completes, we can retrieve the h4 element and get its enclosed text: finish_element = driver.find_element_by_xpath("//*/div[@id='finish']/h4") print(finish_element.text) Limiting crawling to a single domain We can inform Scrapy to limit the crawl to only pages within a specified set of domains. This is an important task, as links can point to anywhere on the web, and we often want to control where crawls end up going. Scrapy makes this very easy to do. All that needs to be done is setting the allowed_domains field of your scraper class. How to do it The code for this example is 06/04_allowed_domains.py. You can run the script with your Python interpreter. It will execute and generate a ton of output, but if you keep an eye on it, you will see that it only processes pages on nasa.gov. How it works The code is the same as previous NASA site crawlers except that we include allowed_domains=['nasa.gov']: class Spider(scrapy.spiders.SitemapSpider): name = 'spider' sitemap_urls = ['https://www.nasa.gov/sitemap.xml'] allowed_domains=['nasa.gov'] def parse(self, response): print("Parsing: ", response) The NASA site is fairly consistent with staying within its root domain, but there are occasional links to other sites, such as content on boeing.com. This code will prevent moving to those external sites. Processing infinitely scrolling pages Many websites have replaced "previous/next" pagination buttons with an infinite scrolling mechanism. These websites use this technique to load more data when the user has reached the bottom of the page. Because of this, strategies for crawling by following the "next page" link fall apart. While this would seem to be a case for using browser automation to simulate the scrolling, it's actually quite easy to figure out the web pages' Ajax requests and use those for crawling instead of the actual page. Let's look at spidyquotes.herokuapp.com/scroll as an example. Getting ready Open http://spidyquotes.herokuapp.com/scroll in your browser. This page will load additional content when you scroll to the bottom of the page: Screenshot of the quotes to scrape Once the page is open, go into your developer tools and select the network panel. Then, scroll to the bottom of the page. You will see new content in the network panel: When we click on one of the links, we can see the following JSON: { "has_next": true, "page": 2, "quotes": [{ "author": { "goodreads_link": "/author/show/82952.Marilyn_Monroe", "name": "Marilyn Monroe", "slug": "Marilyn-Monroe" }, "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"], "text": "“This life is what you make it...." }, { "author": { "goodreads_link": "/author/show/1077326.J_K_Rowling", "name": "J.K. Rowling", "slug": "J-K-Rowling" }, "tags": ["courage", "friends"], "text": "“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”" }, This is great, because all we need to do is continually generate requests to /api/quotes?page=x, increasing x as long as the has_next tag exists in the reply document. If there are no more pages, then this tag will not be in the document. How to do it The 06/05_scrapy_continuous.py file contains a Scrapy agent, which crawls this set of pages.
Run it with your Python interpreter and you will see output similar to the following (the following shows multiple excerpts from the output): <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> 2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> {'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'Sisters']} 2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> {'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']} 2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> {'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'Understand']} When this gets to page 10 it will stop, as it will see that there is no next page flag set in the content. How it works Let's walk through the spider to see how this works. The spider starts with the following definition of the start URL: class Spider(scrapy.Spider): name = 'spidyquotes' quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes' start_urls = [quotes_base_url] download_delay = 1.5 The parse method then prints the response and also parses the JSON into the data variable: def parse(self, response): print(response) data = json.loads(response.body) Then it loops through all the items in the quotes element of the JSON objects. For each item, it yields a new Scrapy item back to the Scrapy engine: for item in data.get('quotes', []): yield { 'text': item.get('text'), 'author': item.get('author', {}).get('name'), 'tags': item.get('tags'), } It then checks to see if the data JSON variable has a 'has_next' property, and if so it gets the next page and yields a new request back to Scrapy to parse the next page: if data['has_next']: next_page = data['page'] + 1 yield scrapy.Request(self.quotes_base_url + "?page=%s" % next_page) There's more... It is also possible to process infinite scrolling pages using Selenium.
The following code is in 06/06_scrape_continuous_twitter.py: from selenium import webdriver import time driver = webdriver.PhantomJS() print("Starting") driver.get("https://twitter.com") scroll_pause_time = 1.5 # Get scroll height last_height = driver.execute_script("return document.body.scrollHeight") while True: print(last_height) # Scroll down to bottom driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") # Wait to load page time.sleep(scroll_pause_time) # Calculate new scroll height and compare with last scroll height new_height = driver.execute_script("return document.body.scrollHeight") print(new_height, last_height) if new_height == last_height: break last_height = new_height The output would be similar to the following: Starting 4882 8139 4882 8139 11630 8139 11630 15055 11630 15055 15055 15055 Process finished with exit code 0 This code starts by loading the page from Twitter. The call to .get() will return when the page is fully loaded. The scrollHeight is then retrieved, and the program scrolls to that height and waits for a moment for the new content to load. The scrollHeight of the browser is retrieved again, and if it is different from last_height, the code loops and continues processing. If it is the same as last_height, no new content has loaded, and you can then continue on and retrieve the HTML for the completed page. We have discussed the common challenges faced in performing Web Scraping using Python and got to know how to work around them. If you liked this post, be sure to check out Web Scraping with Python, which consists of useful recipes to work with Python and perform efficient web scraping.
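One more illustration before moving on: the JSON-API paging technique from the infinite scrolling recipe can also be exercised without Scrapy, using plain requests. This sketch is our own, not from the book:

import requests

base_url = "http://spidyquotes.herokuapp.com/api/quotes"
page = 1
while True:
    # the API returns JSON with a 'quotes' list and a 'has_next' flag
    data = requests.get(base_url, params={"page": page}).json()
    for item in data.get("quotes", []):
        print(item.get("author", {}).get("name"), "-", item.get("text", "")[:60])
    if not data.get("has_next"):
        break  # no more pages to fetch
    page += 1

The loop stops exactly where the Scrapy spider does: when the has_next flag disappears from the reply.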

How to perform Audio-Video-Image Scraping with Python
Amarabha Banerjee
08 Mar 2018
9 min read

[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] A common practice in scraping is the download, storage, and further processing of media content (non-web pages or data files). This media can include images, audio, and video. To store the content locally (or in a service like S3) and to do it correctly, we need to know what is the type of media, and it isn’t enough to trust the file extension in the URL. Hence, we will learn how to download and correctly represent the media type based on information from the web server. Another common task is the generation of thumbnails of images, videos, or even a page of a website. We will examine several techniques of how to generate thumbnails and make website page screenshots. Many times these are used on a new website as thumbnail links to the scraped media which is stored locally. Finally, it is often the need to be able to transcode media, such as converting non-MP4 videos to MP4, or changing the bit-rate or resolution of a video. Another scenario is to extract only the audio from a video file. We won't look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg. It's a simple step from there to also transcode video with ffmpeg. Downloading media content from the web Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content. Getting ready There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of the scenarios in this chapter with downloading and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. Also, the example for this recipe is in the 04/01_download_image.py file. How to do it Here is how we proceed with the recipe: The URLUtility class can download content from a URL. The code in the recipe's file is the following: import const from util.urls import URLUtility util = URLUtility(const.ApodEclipseImage()) print(len(util.data)) When running this you will see the following output:  Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes 171014 The example reads 171014 bytes of data. How it works The URL is defined as a constant const.ApodEclipseImage() in the const module: def ApodEclipseImage(): return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg" The constructor of the URLUtility class has the following implementation: def __init__(self, url, readNow=True): """ Construct the object, parse the URL, and download now if specified""" self._url = url self._response = None self._parsed = urlparse(url) if readNow: self.read() The constructor stores the URL, parses it, and downloads the file with the read() method. The following is the code of the read() method: def read(self): self._response = urllib.request.urlopen(self._url) self._data = self._response.read() This function uses urlopen to get a response object, and then reads the stream and stores it as a property of the object. That data can then be retrieved using the data property: @property def data(self): self.ensure_response() return self._data The code then simply reports on the length of that data, with the value of 171014. 
There's more This class will be used for other tasks, such as determining content types, filenames, and extensions for those files. We will examine parsing of URLs for filenames next. Parsing a URL with urllib to get the filename When downloading content from a URL, we often want to save it in a file, and often it is good enough to save the file under a name found in the URL. But the URL consists of a number of fragments, so how can we find the actual filename from the URL, especially when there are often many parameters after the file name? Getting ready We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py. How to do it Execute the recipe's file with your python interpreter. It will run the following code: util = URLUtility(const.ApodEclipseImage()) print(util.filename_without_ext) This results in the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The filename is: BT5643s How it works In the constructor for URLUtility, there is a call to urlib.parse.urlparse. The following demonstrates using the function interactively: >>> parsed = urlparse(const.ApodEclipseImage()) >>> parsed ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='') The ParseResult object contains the various components of the URL. The path element contains the path and the filename. The call to the .filename_without_ext property returns just the filename without the extension: @property def filename_without_ext(self): filename = os.path.splitext(os.path.basename(self._parsed.path))[0] return filename The call to os.path.basename returns only the filename portion of the path (including the extension). os.path.splitext() then separates the filename and the extension, and the function returns the first element of that tuple/list (the filename). There's more It may seem odd that this does not also return the extension as part of the filename. This is because we cannot assume that the content that we received actually matches the type implied by the extension. It is more accurate to determine this using headers returned by the web server. That's our next recipe. Determining the type of content for a URL When performing a GET request for content from a web server, the web server will return a number of headers, one of which identifies the type of the content from the perspective of the web server. In this recipe we learn to use that to determine what the web server considers the type of the content. Getting ready We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py. How to do it We proceed as follows: Execute the script for the recipe. It contains the following code: util = URLUtility(const.ApodEclipseImage()) print("The content type is: " + util.contenttype) With the following result: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The content type is: image/jpeg How it works The .contenttype property is implemented as follows: @property def contenttype(self): self.ensure_response() return self._response.headers['content-type'] The .headers property of the _response object is a dictionary-like class of headers. The content-type key will retrieve the content-type specified by the server. This call to the ensure_response() method simply ensures that the .read() function has been executed. There's more The headers in a response contain a wealth of information.
Determining the type of content for a URL

When performing a GET request for content, the web server will return a number of headers, one of which identifies the type of the content from the web server's perspective. In this recipe we learn to use that header to determine what the web server considers the type of the content to be.

Getting ready

We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py.

How to do it

We proceed as follows:

Execute the script for the recipe. It contains the following code:

util = URLUtility(const.ApodEclipseImage())
print("The content type is: " + util.contenttype)

With the following result:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The content type is: image/jpeg

How it works

The .contenttype property is implemented as follows:

@property
def contenttype(self):
    self.ensure_response()
    return self._response.headers['content-type']

The .headers property of the _response object is a dictionary-like class of headers. The content-type key will retrieve the content-type specified by the server. The call to the ensure_response() method simply ensures that the .read() function has been executed.

There's more

The headers in a response contain a wealth of information. If we look more closely at the headers property of the response, we can see the following headers are returned:

>>> response = urllib.request.urlopen(const.ApodEclipseImage())
>>> for header in response.headers: print(header)
Date
Server
Last-Modified
ETag
Accept-Ranges
Content-Length
Connection
Content-Type
Strict-Transport-Security

And we can see the values for each of these headers:

>>> for header in response.headers: print(header + " ==> " + response.headers[header])
Date ==> Tue, 26 Sep 2017 19:31:41 GMT
Server ==> WebServer/1.0
Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT
ETag ==> "547bb44-29c06-5581275ce2b86"
Accept-Ranges ==> bytes
Content-Length ==> 171014
Connection ==> close
Content-Type ==> image/jpeg
Strict-Transport-Security ==> max-age=31536000; includeSubDomains

Many of these we will not examine in this book, but for the unfamiliar it is good to know that they exist.

Determining the file extension from a content type

It is good practice to use the content-type header to determine the type of content, and to determine the extension to use when storing the content as a file.

Getting ready

We again use the URLUtility object that we created. The recipe's script is 04/04_determine_file_extension_from_contenttype.py.

How to do it

Proceed by running the recipe's script. An extension for the media type can be found using the .extension_from_contenttype and .extension_from_url properties:

util = URLUtility(const.ApodEclipseImage())
print("Filename from content-type: " + util.extension_from_contenttype)
print("Filename from url: " + util.extension_from_url)

This results in the following output:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
Filename from content-type: .jpg
Filename from url: .jpg

This reports both the extension determined from the content type and the one determined from the URL. These can be different, but in this case they are the same.

How it works

The following is the implementation of the .extension_from_contenttype property:

@property
def extension_from_contenttype(self):
    self.ensure_response()
    map = const.ContentTypeToExtensions()
    if self.contenttype in map:
        return map[self.contenttype]
    return None

The first line ensures that we have read the response from the URL. The property then uses a Python dictionary, defined in the const module, which maps content types to extensions:

def ContentTypeToExtensions():
    return {
        "image/jpeg": ".jpg",
        "image/jpg": ".jpg",
        "image/png": ".png"
    }

If the content type is in the dictionary, then the corresponding value will be returned. Otherwise, None is returned. Note the corresponding property, .extension_from_url:

@property
def extension_from_url(self):
    ext = os.path.splitext(os.path.basename(self._parsed.path))[1]
    return ext

This uses the same technique as the .filename_without_ext property to parse the URL, but instead returns the [1] element, which represents the extension instead of the base filename.

To summarize, we discussed how effectively we can scrape audio, video, and image content from the web using Python. If you liked our post, be sure to check out Web Scraping with Python, which gives more information on performing web scraping efficiently with Python.
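One closing aside on this set of recipes: the standard library's mimetypes module provides a ready-made mapping from content types to extensions, and could stand in for a hand-rolled dictionary like ContentTypeToExtensions(). A minimal sketch (not the book's implementation):

import mimetypes

# guess_extension() may return an alternative extension such as
# '.jpe' for 'image/jpeg' on some Python versions and platforms.
print(mimetypes.guess_extension("image/png"))    # '.png'
print(mimetypes.guess_extension("image/jpeg"))   # '.jpg' or '.jpe'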
Data Exploration using Spark SQL

Packt
08 Mar 2018
9 min read
In this article by Aurobindo Sarkar, the author of the book Learning Spark SQL, we will cover the following points to introduce you to using Spark SQL for exploratory data analysis:

What is Exploratory Data Analysis (EDA)?
Why is EDA important?
Using Spark SQL for basic data analysis
Visualizing data with Apache Zeppelin

Introducing Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other, more sophisticated analysis. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that a more sophisticated model-based analysis is not justified.

Typically, the graphical techniques used in EDA are simple, consisting of plotting the raw data and simple statistics. The focus is on the structures and models revealed by the data or that best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. In most EDA techniques, we use all of the data, without making any underlying assumptions. The analyst builds intuition, or gets a "feel", for the dataset as a result of such exploration. More specifically, the graphical techniques allow us to efficiently select and validate appropriate models, test our assumptions, identify relationships, select estimators, and detect outliers.

EDA involves a lot of trial and error, and several iterations. The best way is to start simple and then build in complexity as you go along. There is a major trade-off in modeling between simple and more accurate models. Simple models may be much easier to interpret and understand, and can get you to 90% accuracy very quickly, versus a more complex model that might take weeks or months to get you an additional 2% improvement. For example, you should plot simple histograms and scatter plots to quickly start developing an intuition for your data.

Using Spark SQL for basic data analysis

Interactively processing and visualizing large data is challenging, as queries can take a long time to execute and the visual interface cannot accommodate as many pixels as there are data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java, and R APIs, and libraries for distributed statistics and machine learning.

For data that fits into a single computer, there are many good tools available, such as R, MATLAB, and others. However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data, then this section will offer some good tools and techniques for data exploration. Here, we will do some basic data exploration exercises to understand a sample dataset.
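The article's own code runs in the Scala Spark shell and appears in figures; as a hedged equivalent, the following PySpark bootstrap (assuming Spark 2.x and a local installation) creates the session used in the sketches that follow:

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for the EDA exercises below.
spark = (SparkSession.builder
         .appName("bank-marketing-eda")
         .getOrCreate())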
We will use a dataset that contains data related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The marketing campaigns were based on phone calls to customers. We use the bank-additional-full.csv file, which contains 41188 records and 20 input fields, ordered by date (from May 2008 to November 2010). As a first step, let's define a schema and read in the CSV file to create a DataFrame. You can use :paste to paste in the initial set of statements in the Spark shell, as shown in the following figure.

After the DataFrame is created, we first verify the number of records. We can also define a case class called Call for our input records, and then create a strongly-typed Dataset. In the next section, we will begin our data exploration by identifying missing data in our dataset.

Identifying missing data

Missing data can occur in datasets due to reasons ranging from negligence to a refusal on the part of respondents to provide a specific data point. However, in all cases missing data is a common occurrence in real-world datasets. Missing data can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies for dealing with it.

Here, we analyze the number of records with missing data fields in our sample dataset. In order to simulate missing data, we will edit our sample dataset by replacing fields containing "unknown" values with empty strings. First, we create a DataFrame/Dataset from our edited file, as shown in the following figure. The following two statements give us a count of rows with certain fields having missing data. Later, we will look at effective ways of dealing with missing data and compute some basic statistics for the sample dataset to improve our understanding of the data.

Computing basic statistics

Computing basic statistics is essential for a good preliminary understanding of our data. First, for convenience, we create a case class and a dataset containing a subset of fields from our original DataFrame. In the following example, we choose some of the numeric fields and the outcome field, that is, the "term deposit subscribed" field. Next, we use describe to quickly compute the count, mean, standard deviation, min, and max values for the numeric columns in our dataset. Further, we use the stat package to compute additional statistics such as covariance and correlation, to create crosstabs, to examine items that occur most frequently in data columns, and to compute quantiles. These computations are shown in the following figure.

Next, we use the typed aggregation functions to summarize our data in order to understand it better. In the following statement, we aggregate the results by whether a term deposit was subscribed, along with the total customers contacted, the average number of calls made per customer, the average duration of the calls, and the average number of previous calls made to such customers. The results are rounded to two decimal points. Executing a similar statement gives the same breakdown by customers' age. After getting a better understanding of our data by computing basic statistics, we shift our focus to identifying outliers in our data.
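Since the original statements appear only as figures, here is a hedged PySpark sketch of the steps just described; the ';' delimiter and column names follow the public UCI bank-marketing CSV, and the file path is an assumption:

# Read the CSV into a DataFrame, inferring the schema for brevity.
df = (spark.read
      .option("header", True)
      .option("sep", ";")
      .option("inferSchema", True)
      .csv("bank-additional-full.csv"))

print(df.count())  # expect 41188 records

# Count rows where the edited 'job' field is an empty string.
print(df.filter(df["job"] == "").count())

# Basic statistics for a few numeric columns.
df.describe("age", "duration", "campaign").show()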
Identifying data outliers

An outlier, or an anomaly, is an observation that deviates significantly from other observations in the dataset. Erroneous outliers are observations that are distorted due to possible errors in the data-collection process. These outliers may exert undue influence on the results of statistical analysis, so they should be identified using reliable detection methods prior to performing data analysis.

Many algorithms find outliers as a side-product of clustering; these techniques define outliers as points that do not lie in clusters. The user has to model the data points using a statistical distribution, and the outliers are identified depending on how they appear in relation to the underlying model. The main problem with these approaches is that, during EDA, the user typically does not have enough knowledge about the underlying data distribution.

EDA, using a modeling and visualizing approach, is a good way of achieving a deeper intuition of our data. Spark MLlib supports a large (and growing) set of distributed machine learning algorithms to make this task simpler. In the following example, we use the k-means clustering algorithm to compute two clusters in our data. Other distributed algorithms useful for EDA include classification, regression, dimensionality reduction, correlation, and hypothesis testing.
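The k-means statements in the original are also shown as a figure; the following is a hedged PySpark sketch of that step using Spark ML, with an assumed choice of numeric feature columns:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Assemble a few numeric columns into a single feature vector.
assembler = VectorAssembler(inputCols=["age", "duration", "campaign"],
                            outputCol="features")
features_df = assembler.transform(df)

# Fit two clusters, as in the article's example.
model = KMeans(k=2, seed=1).fit(features_df)
print(model.clusterCenters())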
Visualizing data with Apache Zeppelin

Typically, we generate many graphs to verify our hunches about the data. A lot of these quick-and-dirty graphs used during EDA are, ultimately, discarded. Exploratory data visualization is critical for data analysis and modeling. However, we often skip exploratory visualization with large data because it is hard. For instance, browsers typically cannot handle millions of data points, so we have to summarize, sample, or model our data before we can effectively visualize it. Traditionally, BI tools provided extensive aggregation and pivoting features to visualize data. However, these tools typically used nightly jobs to summarize large volumes of data. The summarized data was subsequently downloaded and visualized on the practitioner's workstation. Spark can eliminate many of these batch jobs to support interactive data visualization.

Here, we will explore some basic data visualization techniques using Apache Zeppelin. Apache Zeppelin is a web-based tool that supports interactive data analysis and visualization. It supports several language interpreters and comes with built-in Spark integration. Hence, it is quick and easy to get started with exploratory data analysis using Apache Zeppelin. You can download Apache Zeppelin from https://zeppelin.apache.org/. Unzip the package on your hard drive and start Zeppelin using the following command:

Aurobindos-MacBook-Pro-2:zeppelin-0.6.2-bin-all aurobindosarkar$ bin/zeppelin-daemon.sh start

You should see the following message:

Zeppelin start                                             [ OK ]

You should be able to see the Zeppelin home page at http://localhost:8080/. Click on the Create new note link, and specify a path and name for your notebook, as shown in the following figure. In the next step, we paste the same code as at the beginning of this article to create a DataFrame for our sample dataset. We can execute typical DataFrame operations, as shown in the following figure. Next, we create a table from our DataFrame and execute some SQL on it. The results of the SQL statements can be charted by clicking on the appropriate chart type. Here, we create bar charts as an illustrative example of summarizing and visualizing data. We can also plot a scatter plot, and read the coordinate values of each of the points plotted, as shown in the following two figures.

Additionally, we can create a textbox that accepts input values to make the experience interactive. In the following figure we create a textbox that can accept different values for the age parameter, and the bar chart is updated accordingly. Similarly, we can also create dropdown lists where the user can select the appropriate option, and the table of values or chart automatically gets updated.

Summary

In this article, we demonstrated using Spark SQL for exploring datasets, performing basic data quality checks, generating samples and pivot tables, and visualizing data with Apache Zeppelin.
How to set up a Deep Learning System on Amazon Web Services (AWS)

Gebin George
07 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book, Deep Learning Essentials written by Wei Di, Anurag Bhardwaj, and Jianing Wei.  This book covers popular Python libraries such as Tensorflow, Keras, and more, along with tips to train, deploy and optimize deep learning models in the best possible manner.[/box] Today, we will learn two different methods of setting up a deep learning system using Amazon Web Services (AWS). Setup from scratch We will illustrate how to set up a deep learning environment on an AWS EC2 GPU instance g2.2xlarge running Ubuntu Server 16.04 LTS. For this example, we will use a pre-baked Amazon Machine Image (AMI) which already has a number of software packages installed—making it easier to set up an end-end deep learning system. We will use a publicly available AMI Image ami-b03ffedf, which has following pre-installed Packages: CUDA 8.0 Anaconda 4.20 with Python 3.0 Keras / Theano The first step to setting up the system is to set up an AWS account and spin a new EC2 GPU instance using the AWS web console as (http://console.aws.amazon.com/) shown in figure Choose EC2 AMI: 2. We pick a g2.2xlarge instance type from the next page as shown in figure Choose instance type: 3. After adding a 30 GB of storage as shown in figure Choose storage, we now launch a cluster and assign an EC2 key pair that can allow us to ssh and log in to the box using the provided key pair file: 4. Once the EC2 box is launched, next step is to install relevant software packages.To ensure proper GPU utilization, it is important to ensure graphics drivers are installed first. We will upgrade and install NVIDIA drivers as follows: $ sudo add-apt-repository ppa:graphics-drivers/ppa -y $ sudo apt-get update $ sudo apt-get install -y nvidia-375 nvidia-settings While NVIDIA drivers ensure that host GPU can now be utilized by any deep learning application, it does not provide an easy interface to application developers for easy programming on the device. Various different software libraries exist today that help achieve this task reliably. Open Computing Language (OpenCL) and CUDA are more commonly used in industry. In this book, we use CUDA as an application programming interface for accessing NVIDIA graphics drivers. To install CUDA driver, we first SSH into the EC2 instance and download CUDA 8.0 to our $HOME folder and install from there: $ wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-r epo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb $ sudo dpkg -i cuda-repo-ubuntu1604-8-0-local_8.0.44-1_amd64-deb $ sudo apt-get update $ sudo apt-get install -y cuda nvidia-cuda-toolkit Once the installation is finished, you can run the following command to validate the installation: $ nvidia-smi Now your EC2 box is fully configured to be used for a deep learning development. However, for someone who is not very familiar with deep learning implementation details, building a deep learning system from scratch can be a daunting task. To ease this development, a number of advanced deep learning software frameworks exist, such as Keras and Theano. 
Both of these frameworks are based on a Python development environment, hence we first install a Python distribution on the box, such as Anaconda:

$ wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
$ bash Anaconda3-4.2.0-Linux-x86_64.sh

Finally, Keras and Theano are installed using Python's package manager, pip:

$ pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
$ pip install keras

Once the pip installation has completed successfully, the box is fully set up for deep learning development.

Setup using Docker

The previous section describes getting started from scratch, which can be tricky given continuous changes to software packages and changing links on the web. One way to avoid dependence on links is to use container technology like Docker. In this chapter, we will use the official NVIDIA-Docker image that comes pre-packaged with all the necessary packages and deep learning frameworks to get you quickly started with deep learning application development:

$ sudo add-apt-repository ppa:graphics-drivers/ppa -y
$ sudo apt-get update
$ sudo apt-get install -y nvidia-375 nvidia-settings nvidia-modprobe

1. We now install Docker Community Edition as follows:

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
$ sudo apt-key fingerprint 0EBFCD88
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
$ sudo apt-get update
$ sudo apt-get install -y docker-ce

2. We then install NVIDIA-Docker and its plugin:

$ wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
$ sudo dpkg -i /tmp/nvidia-docker_1.0.1-1_amd64.deb && rm /tmp/nvidia-docker_1.0.1-1_amd64.deb

3. To validate whether the installation happened correctly, we use the following command:

$ sudo nvidia-docker run --rm nvidia/cuda nvidia-smi

4. Once it's set up correctly, we can use the official TensorFlow or Theano Docker image:

$ sudo nvidia-docker run -it tensorflow/tensorflow:latest-gpu bash

5. We can run a simple Python program to check that TensorFlow works properly:

import tensorflow as tf

a = tf.constant(5, tf.float32)
b = tf.constant(5, tf.float32)

with tf.Session() as sess:
    # output is 10.0
    output = sess.run(tf.add(a, b))
    print("Output of graph computation is = ", output)

You should see the TensorFlow output on the screen now, as shown in the figure Tensorflow sample output.

We saw how to set up a deep learning system on AWS from scratch and on Docker. If you found our post useful, do check out this book Deep Learning Essentials to optimize deep learning models for better performance output.
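As one final sanity check for the from-scratch path, the short sketch below (not from the book; it assumes the pip installs above succeeded) builds and fits a tiny Keras model to confirm the whole stack imports and runs:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# A minimal two-layer model on random data.
model = Sequential()
model.add(Dense(8, input_dim=4, activation="relu"))
model.add(Dense(1))
model.compile(loss="mse", optimizer="sgd")

x = np.random.rand(16, 4)
y = np.random.rand(16, 1)
model.fit(x, y, epochs=1, verbose=0)  # older Keras 1.x uses nb_epoch instead of epochs
print("Keras is working")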
Working with Forensic Evidence Container Recipes

Packt
07 Mar 2018
13 min read
In this article by Preston Miller and Chapin Bryce, authors of Learning Python for Forensics, we introduce a recipe from our upcoming book, Python Digital Forensics Cookbook. In Python Digital Forensics Cookbook, each chapter is comprised of many scripts, or recipes, falling under specific themes. The "Iterating Through Files" recipe shown here is from our chapter that introduces the Sleuth Kit's Python bindings, pytsk3, and other libraries, to programmatically interact with forensic evidence containers. Specifically, this recipe shows how to access a forensic evidence container and iterate through all of its files to create an active file listing of its contents.

If you are reading this article, it goes without saying that Python is a key tool in DFIR investigations. However, most examiners are not familiar with, or do not take advantage of, the Sleuth Kit's Python bindings. Imagine being able to run your existing scripts against forensic containers without needing to mount them or export loose files. This recipe continues to introduce the library, pytsk3, that will allow us to do just that and take our scripting capabilities to the next level.

In this recipe, we learn how to recurse through the filesystem and create an active file listing. Oftentimes, one of the first questions we, as the forensic examiners, are asked is "What data is on the device?". An active file listing comes in handy here. Creating a file listing of loose files is a very straightforward task in Python. However, this will be slightly more complicated because we are working with a forensic image rather than loose files. This recipe will be a cornerstone for future scripts, as it will allow us to recursively access and process every file in the image. As we continue to introduce new concepts and features from the Sleuth Kit, we will add new functionality to our previous recipes in an iterative process. In a similar way, this recipe will become integral in future recipes to iterate through directories and process files.

Getting started

Refer to the Getting started section in the Opening Acquisitions recipe for information on the build environment and setup details for pytsk3 and pyewf. All other libraries used in this script are present in Python's standard library.

How to do it...

We perform the following steps in this recipe:

1. Import argparse, csv, datetime, os, pytsk3, pyewf, and sys.
2. Identify whether the evidence container is a raw (DD) image or an EWF (E01) container.
3. Access the forensic image using pytsk3.
4. Recurse through all directories in each partition.
5. Store file metadata in a list.
6. Write the active file list to a CSV.

How it works...

This recipe's command-line handler takes three positional arguments: EVIDENCE_FILE, TYPE, and OUTPUT_CSV, which represent the path to the evidence file, the type of evidence file, and the output CSV file, respectively. Similar to the previous recipe, the optional p switch can be supplied to specify a partition type. We use the os.path.dirname() method to extract the desired output directory path for the CSV file and, with the os.makedirs() function, create the necessary output directories if they do not exist.
if __name__ == '__main__':
    # Command-line Argument Parser
    parser = argparse.ArgumentParser()
    parser.add_argument("EVIDENCE_FILE", help="Evidence file path")
    parser.add_argument("TYPE", help="Type of Evidence", choices=("raw", "ewf"))
    parser.add_argument("OUTPUT_CSV", help="Output CSV with lookup results")
    parser.add_argument("-p", help="Partition Type", choices=("DOS", "GPT", "MAC", "SUN"))
    args = parser.parse_args()

    directory = os.path.dirname(args.OUTPUT_CSV)
    if not os.path.exists(directory) and directory != "":
        os.makedirs(directory)

Once we have validated the input evidence file by checking that it exists and is a file, the four arguments are passed to the main() function. If there is an issue with the initial validation of the input, an error is printed to the console before the script exits.

    if os.path.exists(args.EVIDENCE_FILE) and os.path.isfile(args.EVIDENCE_FILE):
        main(args.EVIDENCE_FILE, args.TYPE, args.OUTPUT_CSV, args.p)
    else:
        print("[-] Supplied input file {} does not exist or is not a file".format(args.EVIDENCE_FILE))
        sys.exit(1)

In the main() function, we instantiate the volume variable with None to avoid errors when referencing it later in the script. After printing a status message to the console, we check if the evidence type is an E01 in order to properly process it and create a valid pyewf handle, as demonstrated in more detail in the Opening Acquisitions recipe. Refer to that recipe for more details, as this part of the function is identical. The end result is the creation of the pytsk3 handle, img_info, for the user-supplied evidence file.

def main(image, img_type, output, part_type):
    volume = None
    print "[+] Opening {}".format(image)
    if img_type == "ewf":
        try:
            filenames = pyewf.glob(image)
        except IOError, e:
            print "[-] Invalid EWF format:\n {}".format(e)
            sys.exit(2)
        ewf_handle = pyewf.handle()
        ewf_handle.open(filenames)
        # Open PYTSK3 handle on EWF Image
        img_info = ewf_Img_Info(ewf_handle)
    else:
        img_info = pytsk3.Img_Info(image)

Next, we attempt to access the volume of the image using the pytsk3.Volume_Info() method by supplying it with the image handle. If the partition type argument was supplied, we add its attribute ID as the second argument. If we receive an IOError when attempting to access the volume, we catch the exception as e and print it to the console. Notice, however, that we do not exit the script as we often do when we receive an error. We'll explain why in the next function. Ultimately, we pass the volume, img_info, and output variables to the openFS() method.

    try:
        if part_type is not None:
            attr_id = getattr(pytsk3, "TSK_VS_TYPE_" + part_type)
            volume = pytsk3.Volume_Info(img_info, attr_id)
        else:
            volume = pytsk3.Volume_Info(img_info)
    except IOError, e:
        print "[-] Unable to read partition table:\n {}".format(e)

    openFS(volume, img_info, output)
The openFS() method tries to access the filesystem of the container in one of two ways. If the volume variable is not None, it iterates through each partition and, if the partition meets certain criteria, attempts to open it. If, however, the volume variable is None, it instead tries to directly call the pytsk3.FS_Info() method on the image handle, img. As we saw, this latter method works and gives us filesystem access for logical images, whereas the former works for physical images. Let's look at the differences between these two methods. Regardless of the method, we create a recursed_data list to hold our active file metadata.

In the first instance, where we have a physical image, we iterate through each partition and check that it is greater than 2,048 sectors and does not contain the words "Unallocated", "Extended", or "Primary Table" in its description. For partitions meeting these criteria, we attempt to access the filesystem using the FS_Info() function by supplying the pytsk3 img object and the offset of the partition in bytes. If we are able to access the filesystem, we use the open_dir() method to get the root directory and pass it, along with the partition address ID, the filesystem object, two empty lists, and an empty string, to the recurseFiles() method. These empty lists and string will come into play in recursive calls to this function, as we will see shortly. Once the recurseFiles() method returns, we append the active file metadata to the recursed_data list. We repeat this process for each partition.

def openFS(vol, img, output):
    print "[+] Recursing through files.."
    recursed_data = []
    # Open FS and Recurse
    if vol is not None:
        for part in vol:
            if part.len > 2048 and "Unallocated" not in part.desc and "Extended" not in part.desc and "Primary Table" not in part.desc:
                try:
                    fs = pytsk3.FS_Info(img, offset=part.start * vol.info.block_size)
                except IOError, e:
                    print "[-] Unable to open FS:\n {}".format(e)
                root = fs.open_dir(path="/")
                data = recurseFiles(part.addr, fs, root, [], [], [""])
                recursed_data.append(data)

We employ a similar method in the second instance, where we have a logical image and the volume is None. In this case, we attempt to directly access the filesystem and, if successful, we pass it to the recurseFiles() method and append the returned data to our recursed_data list. Once we have our active file list, we send it and the user-supplied output file path to the csvWriter() method. Let's dive into the recurseFiles() method, which is the meat of this recipe.

    else:
        try:
            fs = pytsk3.FS_Info(img)
        except IOError, e:
            print "[-] Unable to open FS:\n {}".format(e)
        root = fs.open_dir(path="/")
        data = recurseFiles(1, fs, root, [], [], [""])
        recursed_data.append(data)

    csvWriter(recursed_data, output)

The recurseFiles() function is based on an example in the FLS tool (https://github.com/py4n6/pytsk/blob/master/examples/fls.py) and David Cowen's Automating DFIR series tool dfirwizard (https://github.com/dlcowen/dfirwizard/blob/master/dfirwizard-v9.py). To start this function, we append the root directory inode to the dirs list. This list is used later to avoid unending loops. Next, we begin to loop through each object in the root directory and check that it has the attributes we would expect and that its name is not either "." or "..".

def recurseFiles(part, fs, root_dir, dirs, data, parent):
    dirs.append(root_dir.info.fs_file.meta.addr)
    for fs_object in root_dir:
        # Skip ".", ".." or directory entries without a name.
        if not hasattr(fs_object, "info") or not hasattr(fs_object.info, "name") or not hasattr(fs_object.info.name, "name") or fs_object.info.name.name in [".", ".."]:
            continue

If the object passes that test, we extract its name using the info.name.name attribute. Next, we use the parent variable, which was supplied as one of the function's inputs, to manually build the file path for this object; there is no built-in method or attribute to do this automatically for us. We then check if the file is a directory or not and set the f_type variable to the appropriate type. If the object is a file and it has an extension, we extract it and store it in the file_ext variable.
If we encounter an AttributeError when attempting to extract this data, we continue on to the next object.

        try:
            file_name = fs_object.info.name.name
            file_path = "{}/{}".format("/".join(parent), fs_object.info.name.name)
            try:
                if fs_object.info.meta.type == pytsk3.TSK_FS_META_TYPE_DIR:
                    f_type = "DIR"
                    file_ext = ""
                else:
                    f_type = "FILE"
                    if "." in file_name:
                        file_ext = file_name.rsplit(".")[-1].lower()
                    else:
                        file_ext = ""
            except AttributeError:
                continue

We create variables for the object size and timestamps. However, notice that we pass the dates to a convertTime() method. This function exists to convert the UNIX timestamps into a human-readable format. With these attributes extracted, we append them to the data list, using the partition address ID to ensure we keep track of which partition the object is from.

            size = fs_object.info.meta.size
            create = convertTime(fs_object.info.meta.crtime)
            change = convertTime(fs_object.info.meta.ctime)
            modify = convertTime(fs_object.info.meta.mtime)
            data.append(["PARTITION {}".format(part), file_name, file_ext, f_type, create, change, modify, size, file_path])

If the object is a directory, we need to recurse through it to access all of its sub-directories and files. To accomplish this, we append the directory name to the parent list. Then, we create a directory object using the as_directory() method. We use the inode here, which is for all intents and purposes a unique number, and check that the inode is not already in the dirs list. If it were, then we would not process this directory, as it would have already been processed. If the directory does need to be processed, we call the recurseFiles() method on the new sub_directory and pass it the current dirs, data, and parent variables. Once we have processed a given directory, we pop that directory from the parent list. Failing to do this would result in false file path details, as all of the former directories would continue to be referenced in the path unless removed.

Most of this function sits under a large try-except block; we pass on any IOError exceptions generated during this process. Once we have iterated through all of the subdirectories, we return the data list to the openFS() function.

            if f_type == "DIR":
                parent.append(fs_object.info.name.name)
                sub_directory = fs_object.as_directory()
                inode = fs_object.info.meta.addr

                # This ensures that we don't recurse into a directory
                # above the current level and thus avoid circular loops.
                if inode not in dirs:
                    recurseFiles(part, fs, sub_directory, dirs, data, parent)
                parent.pop(-1)

        except IOError:
            pass
    dirs.pop(-1)
    return data

Let's briefly look at the convertTime() function. We've seen this type of function before: if the UNIX timestamp is not 0, we use the datetime.utcfromtimestamp() method to convert the timestamp into a human-readable format.

def convertTime(ts):
    if str(ts) == "0":
        return ""
    return datetime.utcfromtimestamp(ts)
With the active file listing data in hand, we are now ready to write it to a CSV file using the csvWriter() method. If we did find data (that is, the list is not empty), we open the output CSV file, write the headers, and loop through each list in the data variable. We use the writerows() method to write each nested list structure to the CSV file.

def csvWriter(data, output):
    if data == []:
        print "[-] No output results to write"
        sys.exit(3)
    print "[+] Writing output to {}".format(output)
    with open(output, "wb") as csvfile:
        csv_writer = csv.writer(csvfile)
        headers = ["Partition", "File", "File Ext", "File Type", "Create Date", "Modify Date", "Change Date", "Size", "File Path"]
        csv_writer.writerow(headers)
        for result_list in data:
            csv_writer.writerows(result_list)

The screenshot below demonstrates the type of data this recipe extracts from forensic images.

There's more...

For this recipe, there are a number of improvements that could further increase its utility:

Use tqdm, or another library, to create a progress bar that informs the user of the current execution progress.
Learn about the additional metadata values that can be extracted from filesystem objects using pytsk3 and add them to the output CSV file.

Summary

In summary, we have learned how to use pytsk3 to recursively iterate through any filesystem supported by the Sleuth Kit. This forms the basis of how we can use the Sleuth Kit to programmatically process forensic acquisitions. With this recipe, we will now be able to further interact with these files in future recipes.
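As a quick illustration of the first improvement suggested above, tqdm can wrap any iterable to report progress. A minimal standalone sketch (tqdm is a third-party package, installed with pip install tqdm):

from tqdm import tqdm
import time

# In the recipe, you would wrap the partition or fs_object loop;
# the range/sleep here just stands in for per-file processing work.
for item in tqdm(range(100), desc="Processing"):
    time.sleep(0.01)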
Administering ArcGIS Enterprise through the REST administrative directories

Chad Cooper
07 Mar 2018
8 min read
This is a guest post written by Chad Cooper. Chad has worked in the geospatial industry over the last 15 years as a technician, analyst, and developer, serving state and local government, oil and gas, and academia. He is also the author of the title Mastering ArcGIS Enterprise Administration, which aims to help you learn to install, configure, secure, and fully utilize the ArcGIS Enterprise system.

ArcGIS Enterprise is one of the most widely used GIS packages in the world. With the 10.5 release, Portal for ArcGIS became a first-class citizen, living alongside ArcGIS Server and playing a major role in the management and administration of the web GIS. Data Store for ArcGIS allows for local storage of hosted feature services and is also a major player in the ArcGIS Enterprise ecosystem. The ArcGIS Web Adaptor completes ArcGIS Enterprise as the fourth major component. These components are new to most users (Portal and Data Store), and they come with an increased level of configuration, complexity, and administration. Luckily, there are many ways to administer and manage the ArcGIS Enterprise system. In this article, we will look at a few of those methods.

How to access the ArcGIS Server REST Administrator Directory

ArcGIS Server exposes its functionality through web services using REST. With this architecture comes the ArcGIS Server REST Application Programming Interface, or API, which, in addition to exposing ArcGIS Server services, exposes every administrative task that ArcGIS Server supports. In the API, ArcGIS Server administrative tasks are considered resources and are accessed through URLs (which are Uniform Resource Locators, after all). Operations act on these resources and update their information or state. Resources and their operations are hierarchical and standardized and have unique URLs. Like the web, the REST API is stateless, meaning that it does not retain information from one request to another by either the sender or receiver. Each request that is sent is expected to contain all the necessary information to process that request. If it does, the server processes the request and sends back a well-defined response. As it is accessed over the web, the ArcGIS Server REST API can also be invoked from any language that can make a web service call, such as Python.

Accessing the ArcGIS Server Administrator Directory can be done in several ways, depending upon your Web Adaptor configuration. From the ArcGIS Server machine, the Server Administrator Directory can be accessed at https://localhost:6443/arcgis/admin. There is no shortcut to this URL in the Windows Start menu. From another machine on the internal network, the Server Administrator Directory can be accessed by using the fully qualified domain name, or FQDN, instead of localhost, such as https://server.domain.com:6443/arcgis/admin. If, during your Web Adaptor configuration, you chose to Enable administrative access to your site through the Web Adaptor, you will also be able to access the Server Administrator Directory through your Web Adaptor URL, such as https://www.masteringageadmin.com/arcgis/admin. As with Server Manager, you will log in as the primary site administrator (PSA) designated during installation, or with other administrator credentials.

Prior to ArcGIS 10.1, server configuration was held in plain-text configuration files in the configuration store. These files are no longer part of the ArcGIS Server architecture; the ArcGIS Server REST Administrator Directory now exposes these settings.
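As a hedged illustration (not from the book) of scripting this API, the sketch below requests an admin token and lists the site's services; the host name, credentials, and the third-party requests library are all assumptions, and verify=False merely sidesteps the self-signed certificate common on port 6443:

import requests

admin = "https://server.domain.com:6443/arcgis/admin"

# generateToken expects the credentials plus a client type and format.
creds = {"username": "siteadmin", "password": "secret",
         "client": "requestip", "f": "json"}
token = requests.post(admin + "/generateToken", data=creds,
                      verify=False).json()["token"]

# List services in the root folder.
services = requests.get(admin + "/services",
                        params={"f": "json", "token": token},
                        verify=False).json()
print(services.get("services", []))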
How to use the ArcGIS Server REST Administrator Directory

The ArcGIS Server REST Administrator Directory, or "REST Admin" as it will herein be referred to, is a powerful way to manage all aspects of ArcGIS Server administration, as it exposes every administrative task that ArcGIS Server supports. Remember from earlier that the API is organized into resources and operations. Resources are settings within ArcGIS Server, and operations act on those resources to update their information or change their well-defined state, usually through an HTTP GET or POST method. HTTP GET requests data from a resource, while HTTP POST submits data to be processed by a resource. In other words, GET retrieves data, POST inserts or updates data.

An example of a resource is a service. An existing service can have a well-defined state of stopped or started; it must be one or the other. Operations available on the service resource in the REST API include Start Service, Stop Service, Edit Service, and Delete Service. The Start, Stop, and Delete operations change the state of the service (from stopped to started, started to stopped, and either stopped or started to deleted, respectively; technically, if the service is started, it is first stopped before it is deleted). The Edit Service operation changes the information in the resource.

Resources can also have child resources, which can in turn have their own set of operations and child resources. Remember that the API is hierarchical, so, for example, a service resource has the child resource Item Information, which has the Edit Item Information operation. To get to this operation in the REST Admin, we would log in to the REST Admin and go to services | SampleWorldCities.MapServer | iteminfo | edit, which would resemble the following in URL form:

https://www.masteringageadmin.com/arcgis/admin/services/SampleWorldCities.MapServer/iteminfo/edit

In the REST Admin, we could now edit the service Description, Summary, Tags, and Thumbnail. By updating the Item Information in the above example and clicking the Update button, we would be sending an edit HTTP POST operation to the https://www.masteringageadmin.com/arcgis/admin/services/SampleWorldCities.MapServer/iteminfo resource. The ArcGIS Server Manager equivalent for this process would be to go to Services | Manage Services | Edit Service pencil button to the right of the service name | Item Description. Hopefully this gives you a better understanding of how the REST API works and how actions carried out in Server Manager and Server are executed by the API on the backend.
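To make the stop and start operations concrete, here is a continuation of the earlier sketch (same assumptions, reusing the admin URL and token from it) that POSTs to a service's stop and start operations:

# Stop and then restart the sample service via its operations.
svc = admin + "/services/SampleWorldCities.MapServer"

for op in ("stop", "start"):
    result = requests.post(svc + "/" + op,
                           data={"f": "json", "token": token},
                           verify=False).json()
    print(op, result.get("status"))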
Administering Portal for ArcGIS through the Portal REST administrative directory

Just like ArcGIS Server, Portal has a REST backend from which all administrative tasks can be performed. We previously covered how the web interface for ArcGIS Server is a frontend to the ArcGIS Server REST API, and Portal is no different. We also covered services and how REST calls are made to the API.

The Portal Administrative Directory, referred to herein as "Portal Admin", can be accessed from within the internal network (bypassing the Web Adaptor) at a URL such as:

https://<FQDN>:7443/arcgis/portaladmin/

If administrative access is enabled on the Portal Web Adaptor, then we can access Portal Admin outside of our internal network at the Web Adaptor URL, such as:

https://www.your-domain.com/portal/portaladmin/

To log in to Portal Admin as an administrator, enter the Username and Password of an account with administrator privileges at the Portal Administrative Directory Login page and click the Login button. Let us now look at one administrative action that can be performed in the Portal REST Admin.

Portal licensing

Information on current Portal licensing can be viewed by going to Home | System | Licenses. Here, information on the validity and expiration of licensing and on registered members can be viewed. The Import Entitlements operation allows for the import of entitlements for ArcGIS Pro and additional products such as Business Analyst or Insights. For ArcGIS Pro, the operation requires an entitlements file exported out of My Esri. Once the entitlements have been imported, licenses can be assigned to users within Portal. Entitlements can have parts that are effective immediately and parts that become effective on a certain date. These all get imported, with the effective parts available immediately and the non-effective parts placed into a queue that Portal will automatically apply once they become effective. To import entitlements for ArcGIS Pro, do the following:

1. Have your entitlements file ready.
2. In Portal Admin, go to Home | System | Licenses | Import Entitlements.
3. Choose your entitlements file under Choose File.
4. For Application, choose ArcGISPro.
5. For Format, choose JSON or HTML (this is only the response format).
6. Click Import.

Once the entitlements are imported, the licenses can be assigned to users in Portal under My Organization | Manage Licenses.

At its latest release, ArcGIS Enterprise has more components than ever before, resulting in additional setup, configuration, administration, and management requirements. Here, we looked at several ways to access the ArcGIS Server and Portal for ArcGIS REST administrative interfaces. These are a few of the many methods available to interact with your ArcGIS Enterprise system. Check out Mastering ArcGIS Enterprise Administration to learn how to administer ArcGIS Server, Portal, and Data Store through user interfaces, the REST API, and Python scripts.
Implementing matrix operations using SciPy and NumPy

Pravin Dhandre
07 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book includes hands-on recipes for using different components of the SciPy Stack such as NumPy, SciPy, matplotlib, pandas, etc.[/box] In this article, we will discuss how to leverage the power of SciPy and NumPy to perform numerous matrix operations and solve common challenges faced while proceeding with statistical analysis. Matrix operations and functions on two-dimensional arrays Basic matrix operations form the backbone of quite a few statistical analyses—for example, neural networks. In this section, we will be covering some of the most used operations and functions on 2D arrays: Addition Multiplication by scalar Matrix arithmetic Matrix-matrix multiplication Matrix inversion Matrix transposition In the following sections, we will look into the methods of implementing each of them in Python using SciPy/NumPy. How to do it… Let's look at the different methods. Matrix addition In order to understand how matrix addition is done, we will first initialize two arrays: # Initializing an array x = np.array([[1, 1], [2, 2]]) y = np.array([[10, 10], [20, 20]]) Similar to what we saw in a previous chapter, we initialize a 2 x 2 array by using the np.array function. There are two methods by which we can add two arrays. Method 1 A simple addition of the two arrays x and y can be performed as follows: x+y Note that x evaluates to: [[1 1] [2 2]] y evaluates to: [[10 10] [20 20]] The result of x+y would be equal to: [[1+10 1+10] [2+20 2+20]] Finally, this gets evaluated to: [[11 11] [22 22]] Method 2 The same preceding operation can also be performed by using the add function in the numpy package as follows: np.add(x,y) Multiplication by a scalar Matrix multiplication by a scalar can be performed by multiplying the vector with a number. We will perform the same using the following two steps: Initialize a two-dimensional array. Multiply the two-dimensional array with a scalar. We perform the steps, as follows: To initialize a two-dimensional array: x = np.array([[1, 1], [2, 2]]) To multiply the two-dimensional array with the k scalar: k*x For example, if the scalar value k = 2, then the value of k*x translates to: 2*x array([[2, 2], [4, 4]]) Matrix arithmetic Standard arithmetic operators can be performed on top of NumPy arrays too. The operations used most often are: Addition Subtraction Multiplication Division Exponentials The other major arithmetic operations are similar to the addition operation we performed on two matrices in the Matrix addition section earlier: # subtraction x-y array([[ -9, -9], [-18, -18]]) # multiplication x*y array([[10, 10], [40, 40]]) While performing multiplication here, there is an element to element multiplication between the two matrices and not a matrix multiplication (more on matrix multiplication in the next section): # division x/y array([[ 0.1, 0.1], [ 0.1, 0.1]]) # exponential x**y array([[ 1, 1], [1048576, 1048576]], dtype=int32) Matrix-matrix multiplication Matrix to matrix multiplication works in the following way: We have a set of two matrices with the following shape: Matrix A has n rows and m columns and matrix B has m rows and p columns. 
The matrix multiplication of A and B is calculated as follows: the product C = AB is an n x p matrix whose entries are given by C[i][j] = A[i][1]*B[1][j] + A[i][2]*B[2][j] + ... + A[i][m]*B[m][j], that is, the dot product of the ith row of A with the jth column of B.

The matrix operation is performed by using the built-in dot function available in NumPy, as follows:

1. Initialize the arrays:

x = np.array([[1, 1], [2, 2]])
y = np.array([[10, 10], [20, 20]])

2. Perform the matrix multiplication using the dot function in the numpy package:

np.dot(x, y)
array([[30, 30],
       [60, 60]])

The np.dot function does the multiplication in the following way:

array([[1*10 + 1*20, 1*10 + 1*20],
       [2*10 + 2*20, 2*10 + 2*20]])

Whenever matrix multiplication happens, the number of columns in the first matrix should be equal to the number of rows in the second matrix.

Matrix transposition

Matrix transposition is performed by using the transpose function available in the numpy package. The process to generate the transpose of a matrix is as follows:

1. Initialize a matrix:

A = np.array([[1, 2], [3, 4]])

2. Calculate the transpose of the matrix:

A.transpose()
array([[1, 3],
       [2, 4]])

The transpose of a matrix with m rows and n columns is a matrix with n rows and m columns.

Matrix inversion

While we performed most of the basic arithmetic operations on matrices earlier, we have not yet performed any of the specialist functions used within scientific computing and analysis, for example, matrix inversion, ranking of a matrix, and so on. This is where the other functions available within the scipy package shine (over and above the previously discussed functions), in scenarios where more data manipulation is required beyond the standard operations. Matrix inversion can be performed by using the function available in scipy.linalg. The process to perform matrix inversion and its implementation in Python is as follows:

1. Import the relevant package and the classes/functions within it:

from scipy import linalg

2. Initialize a matrix:

A = np.array([[1, 2], [3, 4]])

3. Pass the initialized matrix through the inverse function in the package:

linalg.inv(A)
array([[-2. ,  1. ],
       [ 1.5, -0.5]])

We saw how easily we can implement all the basic matrix operations with Python's scientific library, SciPy. You may check out the book SciPy Recipes to perform advanced computing tasks like Discrete Fourier Transform and K-means with the SciPy stack.
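One closing aside, not from the book: when the real goal of an inversion is to solve a linear system Ax = b, scipy.linalg.solve is generally preferable to computing the inverse explicitly, for both speed and numerical stability:

import numpy as np
from scipy import linalg

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])

# Solve Ax = b directly instead of computing linalg.inv(A).dot(b).
x = linalg.solve(A, b)
print(x)                         # [-4.   4.5]
print(np.allclose(A.dot(x), b))  # True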
Introduction to ASP.NET Core Web API

Packt
07 Mar 2018
13 min read
In this article by Mithun Pattankar and Malendra Hurbuns, the authors of the book Mastering ASP.NET Web API, we will start with a quick recap of MVC. We will be looking at the following topics:

A quick recap of the MVC framework
Why Web APIs were incepted and how they evolved
An introduction to .NET Core
An overview of the ASP.NET Core architecture

Quick recap of the MVC framework

Model-View-Controller (MVC) is a powerful and elegant way of separating concerns within an application, and it applies itself extremely well to web applications. With ASP.NET MVC, it's translated roughly as follows:

Models (M): These are the classes that represent the domain you are interested in. These domain objects often encapsulate data stored in a database as well as code that manipulates the data and enforces domain-specific business logic. With ASP.NET MVC, this is most likely a Data Access Layer of some kind, using a tool like Entity Framework or NHibernate, or classic ADO.NET.
View (V): This is a template to dynamically generate HTML.
Controller (C): This is a special class that manages the relationship between the View and the Model. It responds to user input, talks to the Model, and decides which view to render (if any). In ASP.NET MVC, this class is conventionally denoted by the suffix Controller.

Why Web APIs were incepted and how they evolved

Looking back to the days when the ASP.NET ASMX-based XML web service was widely used for building service-oriented applications, it was the easiest way to create a SOAP-based service that could be used by both .NET and non-.NET applications. It was available only over HTTP. Around 2006, Microsoft released Windows Communication Foundation (WCF). WCF was, and still is, a powerful technology for building SOA-based applications; it was a giant leap in the Microsoft .NET world. WCF was flexible enough to be configured as an HTTP service, a Remoting service, a TCP service, and so on. Using WCF Contracts, we could keep the entire business logic code base the same and expose the service as HTTP-based or non-HTTP-based, via SOAP or otherwise.

Until 2010, ASMX-based XML web services and WCF services were widely used in client-server applications; in fact, everything was running smoothly. But developers in both the .NET and non-.NET communities started to feel the need for a completely new SOA technology for client-server applications. Some of the reasons behind this were as follows:

With applications in production, the amount of data exchanged during communication started to explode, and transferring it over the network was bandwidth-consuming. SOAP, although lightweight to some extent, started to show signs of payload bloat: SOAP packets of a few KB were becoming a few MB of data transfer.
Consuming SOAP services led to huge application sizes because of WSDL and proxy generation. This was even worse when they were used in web applications.
Any change to a SOAP service meant repeating the proxy generation on the consumer side. This wasn't an easy task for any developer.
JavaScript-based web frameworks were being released and gaining ground as a much simpler way of doing web development. Consuming SOAP-based services was not the most optimal approach here.
Hand-held devices like tablets and smartphones were becoming popular. They ran more focused applications and needed a very lightweight, service-oriented approach.
Browser-based Single Page Applications (SPAs) were gaining ground very rapidly, and SOAP-based services were quite heavy for such applications.
Microsoft released REST-based WCF components, which could be configured to respond in JSON or XML, but even then it was WCF, a heavy technology to use.
Applications were no longer just large enterprise services; there was a need for more focused, lightweight services that could be up and running in a few days and were much easier to use.

Any developer who had seen the evolving nature of SOA technologies like ASMX, WCF, or anything SOAP-based felt the need for much lighter, HTTP-based services. HTTP-only, JSON-compatible, POCO-based lightweight services were the need of the hour, and the concept of the Web API started gaining momentum.

What is Web API?

A Web API is a programmatic interface to a system that is accessed via standard HTTP methods and headers. A Web API can be accessed by a variety of HTTP clients, including browsers and mobile devices. For a Web API to be a successful HTTP-based service, it needs a strong web infrastructure for hosting, caching, concurrency, logging, security, and so on. One of the best web infrastructures was none other than ASP.NET. ASP.NET, in the form of either Web Forms or MVC, was widely adopted, so this solid, mature web infrastructure could be extended to serve Web APIs.

Microsoft responded to community needs by creating ASP.NET Web API: a super-simple yet very powerful framework for building HTTP-only, JSON-by-default web services without all the fuss of WCF. ASP.NET Web API can be used to build REST-based services in a matter of minutes, and these services can easily be consumed with any frontend technology. Because it used IIS (mostly) for hosting, caching, concurrency, and other features, it became quite popular. It was launched in 2012 with the basic needs for HTTP-based services, like convention-based routing and HTTP Request and Response messages. Later, Microsoft released the much bigger and better ASP.NET Web API 2 along with ASP.NET MVC 5 in Visual Studio 2013. ASP.NET Web API 2 evolved at a much faster pace with the following features.

Installed via NuGet

Installing Web API 2 was made simpler by using NuGet: either create an empty ASP.NET or MVC project and then run the following command in the NuGet Package Manager Console:

Install-Package Microsoft.AspNet.WebApi

Attribute Routing

The initial release of Web API was based on convention-based routing, meaning we define one or more route templates and work around them. This is simple and without much fuss, as the routing logic sits in a single place and is applied across all controllers. Real-world applications are more complicated, with resources (controllers/actions) having child resources, such as customers having orders, books having authors, and so on. In such cases, convention-based routing is not scalable. Web API 2 introduced the new concept of Attribute Routing, which uses attributes in the programming language to define routes. One straightforward advantage is that the developer has full control over how the URIs for the Web API are formed. Here is a quick snippet of Attribute Routing:

[Route("customers/{customerId}/orders")]
public IEnumerable<Order> GetOrdersByCustomer(int customerId) { ... }

For more understanding on this, read Attribute Routing in ASP.NET Web API 2 (https://www.asp.net/web-api/overview/web-api-routing-and-actions/attribute-routing-in-web-api-2).

OWIN self-host

ASP.NET Web API lives on the ASP.NET framework, which may lead you to think that it can be hosted on IIS only. However, Web API 2 came with a new hosting package:

Microsoft.AspNet.WebApi.OwinSelfHost

With this package, it can be self-hosted outside IIS using OWIN/Katana.
CORS (Cross-Origin Resource Sharing)

If a Web API, whether developed with .NET or non-.NET technologies, is meant to be consumed across different web frameworks and origins, then enabling CORS is a must. A must-read on CORS and ASP.NET Web API 2 is available at https://www.asp.net/web-api/overview/security/enabling-cross-origin-requests-in-web-api. A minimal sketch of enabling CORS follows.
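Here is a minimal sketch of what enabling CORS looks like in Web API 2, assuming the Microsoft.AspNet.WebApi.Cors package is installed. The OrdersController and the http://example-client.com origin are hypothetical.

// A minimal, hypothetical CORS sketch for ASP.NET Web API 2.
// Requires the Microsoft.AspNet.WebApi.Cors NuGet package.
using System.Web.Http;
using System.Web.Http.Cors;

public static class WebApiConfig
{
    public static void Register(HttpConfiguration config)
    {
        // Enable CORS support globally; controllers then opt in with [EnableCors].
        config.EnableCors();
        config.MapHttpAttributeRoutes();
    }
}

// Allow a specific (hypothetical) origin to call this controller.
[EnableCors(origins: "http://example-client.com", headers: "*", methods: "*")]
public class OrdersController : ApiController
{
    // GET api/orders
    public IHttpActionResult Get()
    {
        return Ok(new[] { "order1", "order2" });
    }
}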
IHttpActionResult and Web API OData improvements are a few other notable features that helped Web API 2 evolve into a strong technology for developing HTTP-based services. ASP.NET Web API 2 has become more powerful over the years with C# language improvements such as asynchronous programming using async/await, LINQ, Entity Framework integration, dependency injection with DI frameworks, and so on.

ASP.NET into the Open Source world

Every technology has to evolve with growing needs and advancements in the hardware, network, and software industries, and ASP.NET Web API is no exception. Some of the evolution that ASP.NET Web API needed to undergo, from the perspectives of the developer community, enterprises, and end users, is as follows:

- ASP.NET MVC and Web API are both part of the ASP.NET stack, but their implementations and code bases are different. A unified code base reduces the burden of maintaining them.
- Web APIs are consumed by various clients, such as web applications, native apps, hybrid apps, and desktop applications, built with different technologies (.NET or non-.NET). But how about developing Web APIs in a cross-platform way, without always relying on the Windows OS and the Visual Studio IDE?
- Open sourcing the ASP.NET stack so that it is adopted on a much bigger scale; end users benefit from open source innovations.

We have seen why Web APIs were incepted, how they evolved into a powerful HTTP-based service, and what evolution was still required. With these thoughts, Microsoft made an entry into the world of open source by launching .NET Core and ASP.NET Core 1.0.

What is .NET Core?

.NET Core is a cross-platform, free, and open-source managed software framework similar to the .NET Framework. It consists of CoreCLR, a complete cross-platform runtime implementation of the CLR. .NET Core 1.0 was released on 27 June 2016 along with Visual Studio 2015 Update 3, which enables .NET Core development. In simpler terms, .NET Core applications can be developed, tested, and deployed on cross platforms such as Windows, Linux flavors, and macOS. With .NET Core, we don't really need the Windows OS, or in particular the Visual Studio IDE, to develop ASP.NET web applications, command-line apps, libraries, and UWP apps.

In short, these are the .NET Core components:

- CoreCLR: A virtual machine that manages the execution of .NET programs. CoreCLR means Core Common Language Runtime; it includes the garbage collector, the JIT compiler, the base .NET data types, and many low-level classes.
- CoreFX: The .NET Core foundational libraries, with classes for collections, file systems, the console, XML, async operations, and many others.
- CoreRT: The .NET Core runtime optimized for AOT (ahead-of-time compilation) scenarios, with the accompanying .NET Native compiler toolchain. Its main responsibility is native compilation of code written in any of our favorite .NET programming languages.

.NET Core shares a subset of the original .NET Framework, plus it comes with its own set of APIs that are not part of the .NET Framework. This results in some shared APIs that can be used by both .NET Core and the .NET Framework. A .NET Core application can easily work on the existing .NET Framework, but not vice versa.

.NET Core provides a CLI (command-line interface) as an execution entry point for operating systems, and it provides developer services such as compilation and package management.

The following are interesting points to know about .NET Core:

- .NET Core can be installed cross-platform on Windows, Linux, and macOS. It can be used in device, cloud, and embedded/IoT scenarios.
- The Visual Studio IDE is not mandatory for working with .NET Core, but when working on the Windows OS we can leverage our existing IDE knowledge.
- .NET Core is modular, meaning that instead of assemblies, developers deal with NuGet packages.
- .NET Core relies on its package manager to receive updates, because a cross-platform technology can't rely on Windows Update. To learn .NET Core, we just need a shell, a text editor, and the runtime installed.
- .NET Core comes with flexible deployment: it can be included in your app or installed side by side, user-wide or machine-wide. .NET Core apps can also be self-hosted/run as standalone apps.

.NET Core supports four cross-platform scenarios: ASP.NET Core web apps, command-line apps, libraries, and Universal Windows Platform apps. It does not implement Windows Forms or WPF, which render the standard GUI for desktop software on Windows. At present, only the C# programming language can be used to write .NET Core apps; F# and VB support are on the way. We will primarily focus on ASP.NET Core web apps, which include MVC and Web API; CLI apps and libraries will be covered briefly.

What is ASP.NET Core?

ASP.NET Core is a new open-source and cross-platform framework for building modern cloud-based web applications using .NET. It is completely open source; you can download it from GitHub. Being cross-platform means you can develop ASP.NET Core apps on Linux and macOS, and of course on the Windows OS.

ASP.NET was first released almost 15 years back with the .NET Framework. Since then, it has been adopted by millions of developers for large and small applications alike, and it has evolved with many capabilities. With cross-platform .NET Core, ASP.NET took a huge leap beyond the boundaries of the Windows OS environment for the development and deployment of web applications.

ASP.NET Core overview

(Figure: ASP.NET Core architecture overview)

The ASP.NET Core high-level overview provides the following insights:

- ASP.NET Core runs on both the full .NET Framework and .NET Core.
- ASP.NET Core applications on the full .NET Framework can be developed and deployed only on the Windows OS/Server.
- When using .NET Core, applications can be developed and deployed on the platform of your choice; the Windows, Linux, and macOS logos indicate that you can work with ASP.NET Core on all of them.
- On a non-Windows machine, ASP.NET Core uses the .NET Core libraries to run applications. Obviously you won't have all of the full .NET libraries, but most of them are available.
- Developers working on ASP.NET Core can easily switch to working on any machine; they are not confined to the Visual Studio 2015 IDE.
- ASP.NET Core can run with different versions of .NET Core.

ASP.NET Core has many more foundational improvements apart from being cross-platform; we gain the following advantages from using it:

- Totally modular: ASP.NET Core takes a totally modular approach to application development; every component needed to build an application is well factored into NuGet packages. Only add the required packages through NuGet, to keep the overall application lightweight. ASP.NET Core is no longer based on System.Web.dll.
- Choose your editors and tools: The Visual Studio IDE was used to develop ASP.NET applications on a Windows OS box; now that we have moved beyond the Windows world, we need IDEs, editors, and tools for developing ASP.NET applications on Linux/macOS. Microsoft developed a powerful, lightweight code editor for almost any type of web application, called Visual Studio Code. In fact, ASP.NET Core does not require Visual Studio IDE or Visual Studio Code at all; we can also use code editors such as Sublime or Vim. To work with C# code in these editors, install and use the OmniSharp plugin. OmniSharp is a set of tooling, editor integrations, and libraries that together create an ecosystem giving you a great programming experience no matter what your editor and operating system of choice may be.
- Integration with modern web frameworks: ASP.NET Core has powerful, seamless integration with modern web frameworks such as Angular, Ember, NodeJS, and Bootstrap. Using Bower and NPM, we can work with these modern web frameworks.
- Cloud ready: ASP.NET Core apps are cloud ready thanks to the configuration system; they transition seamlessly from on-premises to the cloud.
- Built-in dependency injection.
- Can be hosted on IIS, self-hosted in your own process, or hosted on nginx.
- New lightweight and modular HTTP request pipeline.
- Unified code base for the web UI and Web APIs. We will see more on this when we explore the anatomy of an ASP.NET Core application.

A minimal sketch of an ASP.NET Core application follows this list.
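To ground these points, here is a minimal sketch of an ASP.NET Core 1.0 application showing the middleware pipeline, the built-in dependency injection container, and Kestrel self-hosting. The fallback message and the overall shape of the app are assumptions for illustration, not the book's own sample.

// A minimal, hypothetical ASP.NET Core 1.0 sketch: Startup plus Kestrel host.
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    // Register services with the built-in dependency injection container.
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddMvc(); // one service registration covers both MVC views and Web APIs
    }

    // Compose the lightweight HTTP request pipeline from middleware.
    public void Configure(IApplicationBuilder app)
    {
        app.UseMvc();

        // Fallback middleware for requests that no MVC route handles.
        app.Run(async context =>
        {
            await context.Response.WriteAsync("Hello from ASP.NET Core!");
        });
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        // Kestrel is the cross-platform web server; IIS is optional here.
        var host = new WebHostBuilder()
            .UseKestrel()
            .UseStartup<Startup>()
            .Build();

        host.Run();
    }
}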
Summary

In this article, we recapped the MVC framework, looked at why Web APIs were incepted and how they evolved into a powerful HTTP-based service, and introduced .NET Core and the ASP.NET Core architecture.