
How-To Tutorials - Data

1210 Articles
Richa Tripathi
01 May 2018
4 min read

Implementing Automation Process with Salesforce CRM

A CRM system must help its users be as productive as possible to justify the investment in it, so any aspect that can be made more efficient is usually worth considering. The Salesforce CRM application aims to be as efficient as possible out of the box; however, there are often organization-specific business processes and rules that need to be implemented, and this is where the power of the Salesforce platform becomes truly apparent.

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Paul Goodey, titled Salesforce CRM Admin Cookbook - Second Edition. This book will enable you to instantly extend and unleash the power of Salesforce CRM and its Lightning Experience framework.[/box]

In this post, we provide a recipe for creating business processes and automating data manipulation to satisfy an organization's unique requirements for business rules and logic.

Deriving year and month values from an Opportunity close date using a formula

To simplify the format of dates for presentation and reporting, we can automatically derive the year and month from a date field that contains month, day, and year. In this recipe, we will display a derived year and month text value for the opportunity close date on the opportunity record detail and edit pages, calculated from the standard date field called CloseDate.

How to do it…

Carry out the following steps to create a formula field that derives year and month values from the opportunity close date for opportunity records:

1. Click on the Setup gear icon in the top right-hand corner of the main Home page.
2. Click on Setup.
3. Navigate to the Opportunity customization setup page as follows: Objects and Fields | Object Manager | Opportunity | Fields & Relationships. Locate the Fields & Relationships section on the right of the page.
4. Click on New. We will be presented with the Step 1. Choose the field type page.
5. Select the Formula option.
6. Click on Next. We will be presented with the Step 2. Choose output type page.
7. Enter Close Date Year Month in the Field Label textbox.
8. Click on the Field Name. When clicking out of the Field Label textbox, the Field Name is automatically filled with the value Close_Date_Year_Month.
9. Set the Formula Return Type as Text.
10. Click on Next. We will be presented with the Step 3. Enter formula page.
11. Paste or enter the following code in the formula editor box:

    TEXT(YEAR(CloseDate)) & " " &
    CASE(MONTH(CloseDate),
      1, "January",
      2, "February",
      3, "March",
      4, "April",
      5, "May",
      6, "June",
      7, "July",
      8, "August",
      9, "September",
      10, "October",
      11, "November",
      12, "December",
      "Error!")

12. Optionally, enter details in the Description field.
13. Optionally, enter details in the Help Text field.
14. In the Blank Field Handling section, select the option Treat blank fields as blanks.
15. Click on Next. We will be presented with the Step 4. Establish field-level security page.
16. Select the profiles to which you want to grant read access to this field via field-level security. The field will be hidden from all profiles if you do not add it to field-level security.
17. Click on Next. We will be presented with the Step 5. Add to page layouts page.
18. Select the page layouts that should include this field. The field will be added as the last field in the first two-column section of these page layouts. The field will not appear on any pages if you do not select a layout.
19. Finally, click on Save.

How it works…

The Opportunity record formula field Close Date Year Month is derived automatically. It shows the year and the month name, and it appears on both the opportunity detail and edit pages.
For example, when the Close Date for an opportunity record is 12/31/2020, the field automatically shows the year and month as 2020 December.

To summarize, we learned how to automate a task in Salesforce CRM by deriving year and month values from an Opportunity close date using a formula. If you enjoyed this post, check out the book Salesforce CRM Admin Cookbook - Second Edition to discover hidden features and hacks that extend standard configuration to provide enhanced functionality and customization in Salesforce CRM.

Salesforce Spring 18 – New features to be excited about in this release!
Learning the Salesforce Analytics Query Language (SAQL)
How to create and prepare your first dataset in Salesforce Einstein
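
The close-date formula from this recipe can be sanity-checked outside Salesforce. The following is a minimal Python sketch that mirrors the formula's logic; the function name close_date_year_month is a hypothetical helper introduced here for illustration and is not part of Salesforce or the book.

```python
from datetime import date

# English month names, indexed 1-12 like the CASE() branches in the formula.
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def close_date_year_month(close_date):
    """Mirror TEXT(YEAR(CloseDate)) & " " & CASE(MONTH(CloseDate), ...)."""
    # Hypothetical helper: a plain Python stand-in for the formula field.
    return f"{close_date.year} {MONTH_NAMES[close_date.month - 1]}"


print(close_date_year_month(date(2020, 12, 31)))  # 2020 December
```

A check like this is handy for preparing expected values before testing the field on real opportunity records.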

Amey Varangaonkar
30 Apr 2018
9 min read

12 most common MySQL errors you should be aware of

[box type="note" align="" class="" width=""]The following excerpt is taken from the book MySQL 8 Administrator’s Guide written by Chintan Mehta, Ankit Bhavsar, Subhash Shah and Hetal Oza. This book provides tips and tricks to tackle problems you might encounter while administering a MySQL solution.[/box]

While using MySQL 8, there can be a few scenarios in which you are not able to access or use MySQL properly. These situations can be very annoying, but they are easily fixable. However, before you look for the solution, you must know the problem! Here are some of the common errors you might come across when using MySQL 8.

1. Access denied

MySQL provides a privilege system that authenticates the user who connects from a host and associates the user with access privileges on a database. The privilege system covers SELECT, INSERT, UPDATE, and DELETE, can identify anonymous users, and can grant privileges for MySQL-specific functions, such as LOAD DATA INFILE, and for administrative operations. The access denied error can have many causes. In many cases, the problem lies with the MySQL accounts that the server permits client programs to use when connecting.

2. Lost connection to MySQL server

The lost connection to MySQL server error can occur because of one of the three likely causes explained in this section.

One potential reason for the error is troublesome network connectivity. Network conditions should be checked if this is a frequent error. If an error message like “Lost connection to MySQL server” appears while querying the database, the cause is most likely a network connection issue.

The connect_timeout system variable defines the number of seconds that the mysqld server waits for a connect packet before it times out the connection attempt.
Infrequently, this error may occur when a client is attempting the initial connection to the server and the connect_timeout value is set to only a few seconds. In this case, the problem can be resolved by increasing the connect_timeout value based on the distance and connection speed. SHOW GLOBAL STATUS LIKE 'Aborted_connects' can be used to determine whether we are experiencing this more frequently. Increasing the connect_timeout value is very likely the solution if the error message contains reading authorization packet.

The problem may also arise when Binary Large OBject (BLOB) values are larger than max_allowed_packet. This can cause a lost connection to the MySQL server error with clients. If the ER_NET_PACKET_TOO_LARGE error is observed, it confirms that the max_allowed_packet value should be increased.

3. Password fails when entered interactively

MySQL clients ask for a password when the client program is invoked with the --password or -p option without the password value. The following is the command:

    > mysql -u user_name -p
    Enter password:

On a few systems, the password works fine when specified in an option file or on the command line, but it does not work when entered interactively at the Enter password: prompt. This occurs because the system-provided library that reads passwords limits password values to a small number of characters (usually eight). It is an issue with the system library and not with MySQL. As a workaround, change the MySQL password to a value that is eight or fewer characters, or store the password in the option file.

4. Host host_name is blocked

If the mysqld server receives too many connection requests from a host that are interrupted in the middle, the following error occurs:

    Host 'host_name' is blocked because of many connection errors.
    Unblock with 'mysqladmin flush-hosts'

The max_connect_errors system variable determines the number of successive interrupted connection requests that are allowed. Once there are max_connect_errors failed requests without a successful connection, mysqld assumes that something is wrong and blocks the host from further connections until the FLUSH HOSTS statement or the mysqladmin flush-hosts command is issued. By default, mysqld blocks a host after 100 connection errors. This can be adjusted by setting the max_connect_errors value at server startup, as follows:

    > mysqld_safe --max_connect_errors=10000

This value can also be set at runtime, as follows:

    mysql> SET GLOBAL max_connect_errors=10000;

If the host_name is blocked error is received for a particular host, first check that there is nothing wrong with TCP/IP connections from that host. Increasing the value of the max_connect_errors variable does not help if the network has problems.

5. Too many connections

This error indicates that all available connections are in use by other clients. The max_connections system variable controls the number of connections to the server. The default value for the maximum number of connections is 151. We can set max_connections to a larger value to support more connections.

The mysqld server process actually allows max_connections + 1 clients to connect. The one additional connection is reserved for accounts with the CONNECTION_ADMIN or the SUPER privilege. If that privilege is granted only to administrators, an administrator who also has the PROCESS privilege can connect to the server using the reserved connection and execute the SHOW PROCESSLIST command to diagnose problems even when the maximum number of client connections is exhausted.

6. Out of memory

If the mysql client program does not have enough memory to store the entire result of the query it issued, it reports the following error:

    mysql: Out of memory at line 42, 'malloc.c'
    mysql: needed 8136 byte (8k), memory in use: 12481367 bytes (12189k)
    ERROR 2008: MySQL client ran out of memory

To fix the problem, we must first check whether the query is correct. Do we expect the query to return so many rows? If not, we should correct the query and execute it again. If the query is correct and needs no correction, we can invoke mysql with the --quick option. With the --quick option, mysql uses the mysql_use_result() C API function to fetch the result set. This function places more load on the server and less load on the client.

7. Packet too large

A communication packet is one of the following:

- A single SQL statement that the MySQL client sends to the MySQL server
- A single row that is sent to the MySQL client from the MySQL server
- A binary log event that is sent from a replication master server to the replication slave

The largest possible packet that can be transmitted to or from a MySQL 8 server or client is 1 GB. The MySQL server or client issues an ER_NET_PACKET_TOO_LARGE error and closes the connection if it receives a packet bigger than max_allowed_packet bytes. The default max_allowed_packet size is 16 MB for the MySQL client program. The following command can be used to set a larger value:

    > mysql --max_allowed_packet=32M

The default value for the MySQL server is 64 MB. There is no harm in setting a larger value for this system variable, as the additional memory is allocated only as needed.

8. The table is full

The table-full error occurs under one of the following conditions:

- The disk is full
- The table has reached the maximum size

The actual maximum table size in a MySQL database is determined by the constraints that the operating system imposes on file sizes.

9. Can't create/write to file

If we get the following error while executing a query, it indicates that MySQL is unable to create a temporary file for the result set in the temporary directory:

    Can't create/write to file 'sqla3fe_0.ism'

A possible workaround for the error is to start the mysqld server with the --tmpdir option. The following is the command:

    > mysqld --tmpdir C:/temp

10. Commands out of sync

If the client functions are called in the wrong order, the commands out of sync error is received. It means that the command cannot be executed in the client code. For example, if we execute mysql_use_result() and try to execute another query before executing mysql_free_result(), this error may occur. It may also happen if we execute two queries that return a result set without calling the mysql_use_result() or mysql_store_result() functions in between.

11. Ignoring user

The following error is received when an account in the user table is found with an invalid password upon mysqld server startup or when the server reloads the grant tables:

    Found wrong password for user 'some_user'@'some_host'; ignoring user

As a result, the account is ignored by the MySQL permission system. To fix the problem, we should assign a new valid password to the account.

12. Table tbl_name doesn't exist

The following errors indicate that a specified table does not exist in the default database:

    Table 'tbl_name' doesn't exist
    Can't find file: 'tbl_name' (errno: 2)

In some cases, the user may be referring to the table incorrectly. This can happen because the MySQL server uses directories and files to store database tables, so, depending on the operating system's file management, database and table names can be case sensitive. Even on file systems that are not case sensitive, such as on Windows, all references to a given table within a query must use the same letter case.
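
The host-blocking rule from error 4 is easy to misread, so here is a toy Python model of it. This is only a sketch of the behavior as described above (successive interrupted connections are counted, a success resets the count, and a flush unblocks); the class HostBlockTracker and its method names are hypothetical, the exact threshold boundary is an assumption, and none of this is MySQL's actual implementation.

```python
class HostBlockTracker:
    """Toy model of the max_connect_errors rule; not MySQL's real code."""

    def __init__(self, max_connect_errors=100):
        self.max_connect_errors = max_connect_errors
        self.errors = {}      # host -> successive interrupted connections
        self.blocked = set()  # hosts that are currently blocked

    def record_interrupted(self, host):
        # Count one more interrupted connection request from this host.
        self.errors[host] = self.errors.get(host, 0) + 1
        # Assumption: the host is blocked once the count reaches the limit.
        if self.errors[host] >= self.max_connect_errors:
            self.blocked.add(host)

    def record_success(self, host):
        # A successful connection resets the count for an unblocked host.
        if host not in self.blocked:
            self.errors[host] = 0

    def is_blocked(self, host):
        return host in self.blocked

    def flush_hosts(self):
        # Models FLUSH HOSTS / mysqladmin flush-hosts: forget everything.
        self.errors.clear()
        self.blocked.clear()


tracker = HostBlockTracker(max_connect_errors=3)
for _ in range(3):
    tracker.record_interrupted("app-server")
print(tracker.is_blocked("app-server"))  # True
tracker.flush_hosts()
print(tracker.is_blocked("app-server"))  # False
```

The model also shows why raising max_connect_errors does not cure a faulty network: the counter simply takes longer to reach the limit.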
In addition to these, you might come across MySQL 8 server errors such as permission issues, or client errors such as problems with NULL values. To learn how to deal with them, check out the book MySQL 8 Administrator’s Guide.

MySQL 8.0 is generally available with added features
Basic Website using Node.js and MySQL database

Gebin George
27 Apr 2018
9 min read

How to Debug an application using Qt Creator

Today, we will learn about debugging an application using Qt Creator. A debugger is a program used to test and debug other programs, for example after a sudden crash during execution or unexpected behavior in the logic of the program. Most of the time (if not always), debuggers are used in the development environment in conjunction with an IDE. In our case, we will learn how to use a debugger with Qt Creator. It is important to note that debuggers are not part of the Qt Framework and, just like compilers, they are usually provided by the operating system SDK.

Qt Creator automatically detects and uses debuggers if they are present on a system. You can check this by opening the Qt Creator Options page via the main menu: Tools, then Options. Select Build & Run from the list on the left side and then switch to the Debuggers tab at the top. You should see one or more autodetected debuggers in the list.

[box type="info" align="" class="" width=""]Windows Users: You should see something similar to the screenshot after this information box. If not, you have not installed any debuggers. You can download and install one using the instructions provided here: https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/ Or, you can search online for the following topic: Debugging Tools for Windows (WinDbg, KD, CDB, NTSD). After the debugger is installed (presumably CDB, the Microsoft Console Debugger, for Microsoft Visual C++ compilers, and GDB for GCC compilers), restart Qt Creator and return to this page. You should now have one or more entries similar to the following. Since we have installed a 32-bit version of the Qt and OpenCV Frameworks, choose the entry with x86 in its name to view its path, type, and other properties.

macOS and Linux Users: There shouldn't be any action needed on your part and, depending on the OS, you'll see a GDB, LLDB, or some other debugger in the entries.[/box]

The screenshot shows the Build & Run tab on the Options page. Depending on the operating system and the installed debugger, yours might look slightly different. Either way, you'll have a debugger, and you need to make sure it is correctly set as the debugger for the Qt Kit you are using. So, make a note of the debugger path and name, switch to the Kits tab, select the Qt Kit you are using, and make sure its debugger is correctly set, as you can see in the following screenshot.

Don't worry about choosing the wrong debugger, or any other option, since you'll be warned with relevant icons beside the Qt Kit icon selected at the top. In the following image, the icon on the left is usually displayed when everything is okay with the Kit, the second one from the left indicates that something is not right, and the one on the right means a critical error. Move your mouse over the icon when it appears to see more information about the actions needed to fix the issue.

[box type="info" align="" class="" width=""]Critical issues with Qt Kits can be caused by many different factors, such as a missing compiler, which makes the kit completely useless until the issue is resolved. An example of a warning in a Qt Kit would be a missing debugger, which does not make the kit useless, but you won't be able to use the debugger with it, so it offers less functionality than a completely configured Qt Kit.[/box]

After the debugger is correctly set, you can start debugging your applications in one of the following ways, both of which end up in the Debugger view of Qt Creator:

- Starting an application in Debugging mode
- Attaching to a running application (or process)

[box type="info" align="" class="" width=""]Note that a debugging process can be started in many ways, such as remotely, by attaching to a process running on a separate machine, and so on. However, the preceding methods will suffice for most cases, especially for the ones relevant to Qt+OpenCV application development and what we learned throughout this book.[/box]

Getting started with the debugging mode

To start an application in the debugging mode, after opening a Qt project, you can use one of the following methods:

- Pressing the F5 button
- Using the Start Debugging button, right below the usual Run button, which has a similar icon but with a small bug on it
- Using the main menu entries in the following order: Debug/Start Debugging/Start Debugging

To attach the debugger to a running application, you can use the main menu entries in the following order: Debug/Start Debugging/Attach to Running Application. This opens the List of Processes window, from which you can choose your application, or any other process you want to debug, by its process ID or executable name. You can also use the Filter field (as seen in the following image) to find your application, since the list of processes will most probably be quite long. After choosing the correct process, make sure to press the Attach to Process button.
No matter which of the preceding methods you use, you will end up in the Qt Creator Debug mode, which is quite similar to the Edit mode but also allows you to do the following, among many other things:

- Add, enable, disable, and view breakpoints in the code (a breakpoint is simply a point or a line in the code at which we want the debugger to pause the process so that we can analyze the status of the program in more detail)
- Interrupt running programs and processes to view and examine the code
- View and examine the function call stack (the call stack is a stack containing the hierarchical list of functions that led to a breakpoint or interrupted state)
- View and examine the variables
- Disassemble the source code (disassembling in this sense means extracting the exact instructions that correspond to the function calls and other C++ code in our program)

You'll notice a performance drop in the application when it is started in debugging mode, because the code is being monitored and traced by the debugger.

Here's a screenshot of the Qt Creator Debug mode, in which all of the capabilities mentioned earlier are visible in a single window.

The area marked with the number 1 in the preceding screenshot is the code editor that you have already used throughout the book and are quite familiar with. Each line of code has a line number; you can click to the left of it to toggle a breakpoint anywhere you want in the code. You can also right-click on the line numbers to set, remove, disable, or enable a breakpoint by selecting Set Breakpoint at Line X, Remove Breakpoint X, Disable Breakpoint X, or Enable Breakpoint X, where X in each of these commands is replaced by the line number. Apart from the code editor, you can also use the area marked with number 4 in the preceding screenshot to add, delete, edit, and further modify breakpoints in the code.
You can also right-click on the toolbar below the code editor that contains the debugger controls to open a menu from which you can add or remove panes that display additional debug and analysis information. We will cover the default debugger view, but make sure to check out each of the options in that menu on your own to familiarize yourself with the debugger even more.

The area marked with number 2 in the preceding screenshot can be used to view the call stack. Whether you interrupt the program by pressing the Interrupt button or choosing Debug/Interrupt from the menu while it is running, set a breakpoint and stop the program at a specific line of code, or malfunctioning code causes the program to fall into a trap and pause the process (since crashes and exceptions are caught by the debugger), you can always view the hierarchy of function calls that led to the interrupted state, and analyze them further, in area 2 of the preceding Qt Creator screenshot.

Finally, you can use the third area in the previous screenshot to view the local and global variables of the program at the interrupted location in the code. You can see the contents of the variables, whether they are standard data types, such as integers and floats, or structures and classes, and you can expand them further to test and analyze any possible issues in your code.

Using a debugger efficiently can mean hours of difference in testing and solving the issues in your code. In terms of practical usage, there is really no other way but to use the debugger as much as you can and develop habits of your own, while also making note of the good practices and tricks you find along the way and the ones we just went through. If you are interested, you can also read online about other possible methods of debugging, such as remote debugging, debugging using crash dump files (on Windows), and more.
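
To make the idea of a call stack concrete, here is a small language-neutral illustration written in Python rather than C++. It is only an analogy for what the call stack pane displays, not a Qt Creator feature; the function names outermost, middle, and innermost are made up for this sketch.

```python
import traceback


def innermost():
    # Capture the chain of active function calls at this point, which is
    # roughly what a debugger shows when it pauses at a breakpoint here.
    frames = traceback.extract_stack()
    return [frame.name for frame in frames]


def middle():
    return innermost()


def outermost():
    return middle()


stack = outermost()
# The last three entries are the calls that led to the capture point,
# ordered from the oldest call to the most recent one.
print(stack[-3:])  # ['outermost', 'middle', 'innermost']
```

Reading such a list from the end backwards answers the usual debugging question of how execution arrived at the paused line.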
We saw how to debug an application in practice using the Qt Creator debugging mode.

[box type="note" align="" class="" width=""]You read an excerpt from the book Computer Vision with OpenCV 3 and Qt 5, written by Amin Ahmadi Tazehkandi. The book covers development of cross-platform applications using OpenCV 3 and Qt 5.[/box]

3 ways to deploy a QT and OpenCV application
Debugging Your .NET Application

Amey Varangaonkar
27 Apr 2018
5 min read

Top 10 MySQL 8 performance benchmarking aspects to know

[box type="note" align="" class="" width=""]The following excerpt is taken from the book MySQL 8 Administrator’s Guide, co-authored by Chintan Mehta, Ankit Bhavsar, Hetal Oza and Subhash Shah. This book presents an in-depth view of the newly released features of MySQL 8 and how you can leverage them to administer a high-performance MySQL solution.[/box]

Following the best practices for MySQL configuration helps us design and manage an efficient database; they are the cherry on top, without which a setup might seem incomplete. In addition to configuration, benchmarking helps us find and validate bottlenecks in the database system and address them. In this article, we look at specific areas that will help us understand the best practices for configuration and performance benchmarking.

1. Resource utilization

IO activity, CPU, and memory usage are metrics that you should not miss. They tell us how the system performs during benchmarking and at the time of scaling, and they help us derive the impact per transaction.

2. Stretching your benchmarking timelines

We may often like to have a quick glance at performance metrics; however, ensuring that MySQL behaves in the same way over a longer duration of testing is also a key element. Some basic things can affect performance when you stretch your benchmark timelines, such as memory fragmentation, degradation of IO, the impact of data accumulation, cache management, and so on. We don't want our database to be restarted just to clean up junk items, correct? Therefore, it is suggested to run benchmarks for a long duration to validate stability and performance.

3. Replicating production settings

Let's benchmark in a production-replicated environment. Wait! Let's disable database replication in a replica environment until we are done with benchmarking. Gotcha! We have got some good numbers!
It often happens that we don't completely simulate everything that we are going to configure in the production environment. This can prove costly, as we might unintentionally be benchmarking a setup that has an adverse impact once it's in production. Replicate production settings, data, workload, and so on in your replicated environment while you do benchmarking.

4. Consistency of throughput and latency

Throughput and latency go hand in hand. It is important to keep your eyes primarily focused on throughput; however, latency over time is also something to look out for. Performance dips, slowness, and stalls were noticed in InnoDB in its earlier days. It has improved a lot since then, but as there might be other cases depending on your workload, it is always good to keep an eye on throughput along with latency.

5. Sysbench can do more

Sysbench is a wonderful tool for simulating your workloads, whether they involve thousands of tables, are transaction intensive, keep data in memory, and so on. It simulates them well and gives you a clear representation of the results.

6. Virtualization world

I would like to keep this simple: bare metal and virtualization aren't the same. Hence, while doing benchmarking, measure your resources according to your environment. You might be surprised by the difference in results if you compare the two.

7. Concurrency

Big data rests on heavy data workloads, so high concurrency is important. MySQL extends its maximum CPU core support with every new release, so take care to optimize concurrency based on your requirements and hardware resources.

8. Hidden workloads

Do not miss factors that run in the background while you are benchmarking, such as reporting for big data analytics, backups, and on-the-fly operations. The impact of such hidden workloads, or of obsolete benchmarking workloads, can make your days (and nights) miserable.

9. Nerves of your query

Oops! Did we miss the optimizer? Not yet. An optimizer is a powerful tool that will read the nerves of your query and provide recommendations. It's a tool that I use before making changes to a query in production. It's a savior when you have complex queries to be optimized.

These are a few areas that we should look out for. Let's now look at a few benchmarks that we did on MySQL 8 and compare them with the ones on MySQL 5.7.

10. Benchmarks

To start with, let's fetch all the column names from all the InnoDB tables. The following is the query that we executed:

    SELECT t.table_schema, t.table_name, c.column_name
    FROM information_schema.tables t, information_schema.columns c
    WHERE t.table_schema = c.table_schema
      AND t.table_name = c.table_name
      AND t.engine='InnoDB';

In this test, MySQL 8 performed a thousand times faster with four instances.

Following this, we also performed a benchmark to find static table metadata. The following is the query that we executed:

    SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE, ENGINE, ROW_FORMAT
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA LIKE 'chintan%';

In this test, MySQL 8 performed around 30 times faster than MySQL 5.7.

It made us eager to go into a bit more detail. So, we thought of doing one last test to find dynamic table metadata. The following is the query that we executed:

    SELECT TABLE_ROWS
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA LIKE 'chintan%';

Here, too, MySQL 8 performed around 30 times faster than MySQL 5.7.

MySQL 8.0 brings enormous performance improvements to the table. Scaling from one table to a million tables, a need for big data workloads, is now achievable. We look forward to more benchmarks being officially released once MySQL 8 is generally available.

If you found this post useful, make sure to check out the book MySQL 8 Administrator’s Guide for more tips and tricks to manage MySQL 8 effectively.
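
The advice in point 4 to watch throughput and latency together can be sketched with a small timing harness. This is a minimal Python illustration, not the tooling used for the benchmarks above; the names benchmark and fake_query are hypothetical, and fake_query is a stand-in workload that you would replace with a real database call.

```python
import math
import statistics
import time


def benchmark(operation, iterations=1000):
    """Time repeated calls and report throughput plus latency figures."""
    latencies = []
    start = time.perf_counter()
    for _ in range(iterations):
        t0 = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    # Approximate 95th percentile using the nearest-rank method.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "throughput_per_s": iterations / elapsed,
        "mean_latency_s": statistics.mean(latencies),
        "p95_latency_s": latencies[p95_index],
        "max_latency_s": latencies[-1],
    }


def fake_query():
    # Placeholder workload (an assumption); swap in a real query call.
    sum(range(1000))


report = benchmark(fake_query, iterations=500)
for name, value in report.items():
    print(f"{name}: {value:.6f}")
```

A healthy mean combined with a poor 95th percentile or maximum is the kind of dip or stall that a throughput number alone would hide.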
MySQL 8.0 is generally available with added features
New updates to Microsoft Azure services for SQL Server, MySQL, and PostgreSQL

Richard Gall
23 Apr 2018
4 min read

How data scientists test hypotheses and probability

Why hypotheses are important in statistical analysis

Hypothesis testing allows researchers and statisticians to develop hypotheses, which are then assessed to determine the probability or likelihood of their findings. This statistics tutorial has been taken from Basic Statistics and Data Mining for Data Science.

Whenever you wish to make an inference about a population from a sample, you must test a specific hypothesis. It’s common practice to state 2 different hypotheses:

- The null hypothesis, which states that there is no effect
- The alternative/research hypothesis, which states that there is an effect

So, the null hypothesis is the one which says that there is no difference. For example, you might be comparing the mean income of males and females, and the null hypothesis you are testing is that there is no difference between the 2 groups. The alternative hypothesis, meanwhile, is generally, although not exclusively, the one that researchers are really interested in. In this example, you might hypothesize that the mean income of males and females is different.

Read more: How to predict Bitcoin prices from historical and live data.

Why probability is important in statistical analysis

In statistics, nothing is ever certain because we are always dealing with samples rather than populations. This is why we always have to work in probabilities. Hypotheses are assessed by calculating the probability or the likelihood of finding our result. A probability value, which can range from zero to one, corresponding to 0% and 100%, is essentially a way of measuring the likelihood of a particular event occurring. You can use these values to assess how likely it is that any differences you have found are the result of random chance.

How do hypotheses and probability interact?

It starts getting really interesting once we begin looking at how hypotheses and probability interact. Here’s an example.
Suppose you want to know who is going to win the Super Bowl. You ask a fellow statistician, and he tells you that he's built a predictive model and that he knows which team is going to win. Fine - your next question is how confident he is in that prediction. He says he's 50% confident - are you going to trust his prediction? Of course you're not - there are only 2 possible outcomes and 50% is ultimately just random chance. So, say you ask another statistician. He also tells you that he has built a predictive model, and he's 75% confident in the prediction he has made. You're more likely to trust this prediction - you have a 75% chance of being right and a 25% chance of being wrong. But let's say you're feeling cautious - a 25% chance of being wrong is too high. So, you ask another statistician for their prediction. She tells you that she's also built a predictive model, which she has 90% confidence is correct. The more confident the prediction, the more willing you are to act on it; hypothesis testing formalizes the same idea.

So, having formally stated our hypotheses, we then have to select a criterion for acceptance or rejection of the null hypothesis. With probability tests like the chi-squared test, the t-test, or regression or correlation, you're testing the likelihood that a statistic of the magnitude that you obtained or greater would have occurred by chance, assuming that the null hypothesis is true. It's important to remember that you always assess the probability of your result under the assumption that the null hypothesis is true. You only reject the null hypothesis if you can say that the results would have been extremely unlikely under the conditions set by the null hypothesis. In this case, if you can reject the null hypothesis, you have found support for the alternative/research hypothesis. This doesn't prove the alternative hypothesis, but it does tell you that your data would be unlikely if the null hypothesis were true.
The criterion we typically use is whether the p-value sits above or below a significance level of 0.05 (5%). A p-value below 0.05 indicates that a statistic of the size that we obtained would be likely to occur on fewer than 5% of occasions if the null hypothesis were true. By choosing a 5% criterion, you are accepting that, when the null hypothesis is in fact true, you will mistakenly reject it about 1 in 20 times.

Replication and data mining

If in traditional statistics we work with hypotheses and probabilities to deal with the fact that we're always working with a sample rather than a population, in data mining we can work in a slightly different way - we can use something called replication instead. In a data mining project we might have 2 data sets - a training data set and a testing data set. We build our model on the training set and, once we've done that, we apply that model to the testing data set to see if we find similar results.
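The hypothesis-testing logic described in this article can be sketched with a permutation test, which estimates how often a difference in means at least as large as the observed one would arise by chance if the null hypothesis (no difference between the groups) were true. This is only an illustrative sketch: the income figures, the function name `permutation_p_value`, and the choice of a permutation test rather than a t-test are all assumptions, not taken from the original tutorial.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    Under the null hypothesis the group labels are interchangeable, so we
    shuffle the pooled values and count how often the shuffled difference
    is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up incomes (in thousands) for two groups; purely illustrative data.
incomes_a = [52, 48, 61, 55, 58, 49, 63, 57]
incomes_b = [41, 45, 39, 50, 44, 47, 42, 46]

p_value = permutation_p_value(incomes_a, incomes_b)
reject_null = p_value < 0.05  # the 5% criterion discussed above
print(f"p-value = {p_value:.4f}, reject null hypothesis: {reject_null}")
```

With identical groups the observed difference is zero, every shuffle is at least as extreme, and the estimated p-value is 1, so the null hypothesis is not rejected.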

Sugandha Lahoti
19 Apr 2018
11 min read

How to get started with Azure Stream Analytics and 7 reasons to choose it

In this article, we will introduce Azure Stream Analytics and show how to configure it. We will then look at some of the key advantages of the Stream Analytics platform, including how it enhances developer productivity and reduces the Total Cost of Ownership (TCO) of building and maintaining a scalable streaming solution, among other factors.

What is Azure Stream Analytics and how does it work?

Microsoft Azure Stream Analytics falls into the category of PaaS services, where customers don't need to manage the underlying infrastructure. However, they are still responsible for managing the application that they build on top of the PaaS service and, more importantly, the customer data. Azure Stream Analytics is a fully managed serverless PaaS service that is built for real-time analytics computations on streaming data. The service can consume from a multitude of sources. Azure takes care of the hosting, scaling, and management of the underlying hardware and software ecosystem.

When we are designing a solution that involves streaming data, in almost every case Azure Stream Analytics will be part of a larger solution that the customer is trying to deploy. Example use cases include real-time dashboarding for monitoring purposes, real-time monitoring of IT infrastructure equipment, preventive maintenance (auto-manufacturing, vending machines, and so on), and fraud detection. This means that the streaming solution needs to provide out-of-the-box integration with a wide range of services that help build a solution relatively quickly.

Let's review a usage pattern for Azure Stream Analytics using a canonical model. In the illustration of this model, the devices and applications that generate data appear on the left; they can connect directly or through cloud gateways to your stream ingest sources.
Azure Stream Analytics can pick up the data from these ingest sources, augment it with reference data, run the necessary analytics, gather insights, and push them downstream for action. You can trigger business processes, write the data to a database, or view the anomalies directly on a dashboard.

In the preceding canonical pattern, a number of streaming ingest technologies are used; let's review them:

  • Event Hub: A global-scale event ingestion system to which you can publish events from millions of sensors and applications. It is designed so that as soon as an event comes in, a subscriber can pick that event up within a few milliseconds. You can have one or more subscribers, depending on your business requirements. Typical use cases for an Event Hub are real-time financial fraud detection and social media sentiment analytics.
  • IoT Hub: IoT Hub is very similar to Event Hub but takes the concept a lot further, in that you can take bidirectional actions. It will not only ingest data from sensors in real time but can also send commands back to them. It also enables you to do things like device management. Fundamental aspects such as security, a primary need for IoT solutions, are built into it.
  • Azure Blob: Azure Blob is massively scalable object storage for unstructured data, and is accessible through HTTP or HTTPS. Blob storage can expose data publicly to the world or store application data privately.
  • Reference Data: This is auxiliary data that is either static or changes slowly. Reference data can be used to enrich incoming data and to perform correlation and lookups.

On the ingress side, with a few clicks, you can connect to Event Hub, IoT Hub, or Blob storage. The streaming data can be enriched with reference data in the Blob store. Data from the ingress process is consumed by the Azure Stream Analytics service, which can call machine learning (ML) for event scoring in real time.
The data can be egressed to live dashboards in Power BI, or pushed back to Event Hub, from where dashboards and reports can pick it up. The following is a summary of the ingress, egress, and archiving options:

  • Ingress choices: Event Hub, IoT Hub, Blob storage
  • Egress choices for live dashboards: Power BI, Event Hub
  • Egress choices for driving workflows: Event Hubs, Service Bus
  • Archiving and post analysis: Blob storage, Document DB, Data Lake, SQL Server, Table storage, Azure Functions

One key point to note is that a number of customers push data from Stream Analytics processing (the egress point) to Event Hub and then add Azure-hosted website solutions as their own custom dashboards. One can drive workflows by pushing the events to Azure Service Bus and Power BI. For example, a customer can build IoT support solutions that detect an anomaly in connected appliances and push the result into Azure Service Bus. A worker role can run as a daemon to pull the messages and create support tickets using the Dynamics CRM API. Power BI can then be used on the tickets, which can be archived for post analysis. This solution eliminates the need for the customer to log a ticket; the system does it automatically based on predefined anomaly thresholds. This is just one example of a real-time connected solution.

There are a number of use cases that don't even involve real-time alerts. You can also use it to aggregate data, filter data, and store it in Blob storage, Azure Data Lake (ADL), Document DB, or SQL, and then run U-SQL on Azure Data Lake Analytics (ADLA), use HDInsight, or even call ML models for things like predictive maintenance.

Configuring Azure Stream Analytics

Azure Stream Analytics (ASA) is a fully managed, cost-effective real-time event processing engine. Stream Analytics makes it easy to set up real-time analytic computations on data streaming from devices, sensors, websites, social media, applications, infrastructure systems, and more.
The service can be hosted with a few clicks in the Azure portal; users can author a Stream Analytics job by specifying the input source of the streaming data, the output sink for the results of the job, and a data transformation expressed in a SQL-like language. The jobs can be monitored, and you can adjust the scale/speed of the job in the Azure portal to scale from a few kilobytes to a gigabyte or more of events processed per second.

Let's review how to configure Azure Stream Analytics step by step:

1. Log in to the Azure portal using your Azure credentials, click on New, and search for Stream Analytics job.
2. Click on Create to create an Azure Stream Analytics instance.
3. Provide a Job Name and Resource group name for the Azure Stream Analytics job deployment.
4. After a few minutes, the deployment will be complete.
5. Review the audit trail of the creation in the deployment.
6. Scale the streaming job up and down using a simple UI.
7. Use the built-in query interface to run queries.
8. Run queries using a SQL-like interface, with the ability to accept late-arriving events through simple GUI-based configuration.

Key advantages of Azure Stream Analytics

Let's quickly review how traditional streaming solutions are built. The core deployment starts with procuring and setting up the basic infrastructure necessary to host the streaming solution. Once this is done, the ingress and egress solution is built on top of the deployed infrastructure. Once the core infrastructure is built, custom tools are used to build business intelligence (BI) or machine learning integration. After the system goes into production, scaling during runtime needs to be taken care of by capturing telemetry and building and configuring HW/SW resources as necessary. As business needs ramp up, so does the monitoring and troubleshooting effort.
Security

Azure Stream Analytics provides a number of built-in security mechanisms in areas such as authentication, authorization, auditing, segmentation, and data protection. Let's quickly review them.

  • Authentication support: Authentication in Azure Stream Analytics is done at the portal level. Users should have a valid subscription ID and password to access the Azure Stream Analytics job.
  • Authorization: During login, users provide their credentials (for example, user account name and password, smart card and PIN, Secure ID and PIN, and so on) to prove their Microsoft identity and retrieve an access token from the authentication server; authorization then determines what an authenticated user is allowed to access. Authorization is supported by Azure Stream Analytics: only authenticated and authorized users can access the Azure Stream Analytics job.
  • Support for encryption: Data at rest is protected using client-side encryption and Transparent Data Encryption (TDE).
  • Support for key management: Key management is supported through ingress and egress points.

Programmer productivity

One of the key features of Azure Stream Analytics is developer productivity, which is driven largely by a query language based on SQL constructs. It provides a wide array of functions for analytics on streaming data, from simple data manipulation functions to date and time, temporal, mathematical, string, and scaling functions, and much more. It provides two features natively out of the box:

  • Declarative SQL constructs
  • Built-in temporal semantics

Let's review these features in detail.

Declarative SQL constructs

A simple-to-use UI is provided, and queries can be constructed using it.
The following is the feature set of the declarative SQL constructs:

  • Filters (WHERE)
  • Projections (SELECT)
  • Time-window and property-based aggregates (GROUP BY)
  • Time-shifted joins (specifying time bounds within which the joining events must occur)
  • All combinations thereof

The following is a summary of different constructs to manipulate streaming data:

  • Data manipulation: SELECT, FROM, WHERE, GROUP BY, HAVING, CASE WHEN THEN ELSE, INNER/LEFT OUTER JOIN, UNION, CROSS/OUTER APPLY, CAST, INTO, ORDER BY ASC, DESC
  • Date and time functions: DateName, DatePart, Day, Month, Year, DateDiff, DateTimeFromParts, DateAdd
  • Temporal functions: Lag, IsFirst, Last, CollectTop
  • Aggregate functions: SUM, COUNT, AVG, MIN, MAX, STDEV, STDEVP, VAR, VARP, TopOne
  • Mathematical functions: ABS, CEILING, EXP, FLOOR, POWER, SIGN, SQUARE, SQRT
  • String functions: Len, Concat, CharIndex, Substring, Lower, Upper, PatIndex
  • Scaling extensions: WITH, PARTITION BY, OVER
  • Geospatial: CreatePoint, CreatePolygon, CreateLineString, ST_DISTANCE, ST_WITHIN, ST_OVERLAPS, ST_INTERSECTS

Built-in temporal semantics

Azure Stream Analytics provides prebuilt temporal semantics to query time-based information and merge streams with multiple timelines. Here is a list of temporal semantics:

  • Application or ingest timestamp
  • Windowing functions
  • Policies for event ordering
  • Policies to manage latencies between ingress sources
  • Manage streams with multiple timelines
  • Join multiple streams of temporal windows
  • Join streaming data with data-at-rest

Lowest total cost of ownership

Azure Stream Analytics is a fully managed PaaS service on Azure. There are no upfront costs or costs involved in setting up compute clusters and complex hardware wiring, as there would be with an on-prem solution. It's a simple job service where there is no cluster provisioning and customers pay for what they use. A key consideration is variable workloads.
With Azure Stream Analytics, you do not need to design your system for peak throughput and can add more compute footprint as you go. If you have scenarios where data comes in spurts, you do not want to design a system for peak usage and leave it underutilized at other times. Let's say you are building a traffic monitoring solution. Naturally, you would expect peaks during the morning and evening rush hours. However, you would not want to design your system or investments to cater to these extremes. The cloud elasticity that Azure offers is a perfect fit here. Azure Stream Analytics also offers fast recovery through checkpointing, and at-least-once event delivery.

Mission-critical and enterprise-level scalability and availability

Azure Stream Analytics is available across multiple worldwide data centers and sovereign clouds. Azure Stream Analytics promises three-nines (99.9%) availability that is financially guaranteed, with built-in auto recovery so that data is not lost. The good thing is that customers do not need to write a single line of code to achieve this. The bottom line is that enterprise readiness is built into the platform. Here is a summary of the enterprise-ready features:

  • Distributed scale-out architecture
  • Ingests millions of events per second
  • Accommodates variable loads
  • Easily adds incremental resources to scale
  • Available across multiple data centers and sovereign clouds

Global compliance

In addition, Azure Stream Analytics is compliant with many industry and government certifications. HIPAA compliance is built in, which makes it suitable for hosting healthcare applications. That's how customers can scale up their businesses confidently. Here is a summary of global compliance:

  • ISO 27001
  • ISO 27018
  • SOC 1 Type 2
  • SOC 2 Type 2
  • SOC 3 Type 2
  • HIPAA/HITECH
  • PCI DSS Level 1
  • European Union Model Clauses
  • China GB 18030

Thus we reviewed Azure Stream Analytics and understood its key advantages.
These advantages included:

  • Ease in terms of developer productivity
  • Ease of development and how it reduces total cost of ownership
  • Global compliance certifications
  • The value of a PaaS-based streaming solution for hosting mission-critical applications, and security

This post is taken from the book, Stream Analytics with Microsoft Azure, written by Anindita Basak, Krishna Venkataraman, Ryan Murphy, and Manpreet Singh. This book will help you to understand Azure Stream Analytics so that you can develop efficient analytics solutions that can work with any type of data.

Read more: Say hello to Streaming Analytics; How to build a live interactive visual dashboard in Power BI with Azure Stream; Performing Vehicle Telemetry job analysis with Azure Stream Analytics tools
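The support-ticket workflow described in this article (detect an anomaly in a connected appliance, push the result to a queue, and have a worker create a ticket) can be sketched in plain Python. This is a simplified stand-in under stated assumptions: the in-memory `deque` plays the role of Azure Service Bus, `create_tickets` plays the role of the worker role, and the threshold, device names, and ticket text are hypothetical. It does not call any Azure or Dynamics CRM API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Reading:
    device_id: str
    temperature_c: float


# Hypothetical anomaly threshold; a real job would define its own rule.
TEMPERATURE_LIMIT_C = 80.0


def detect_anomalies(readings, limit=TEMPERATURE_LIMIT_C):
    """Yield one message per reading that exceeds the predefined threshold."""
    for reading in readings:
        if reading.temperature_c > limit:
            yield {
                "device_id": reading.device_id,
                "temperature_c": reading.temperature_c,
            }


def create_tickets(message_queue):
    """Drain the queue (stand-in for a worker role) and build ticket texts."""
    tickets = []
    while message_queue:
        message = message_queue.popleft()
        tickets.append(
            f"Ticket for {message['device_id']}: "
            f"temperature {message['temperature_c']} C exceeds limit"
        )
    return tickets


readings = [
    Reading("fridge-1", 4.0),
    Reading("oven-7", 95.5),
    Reading("boiler-2", 81.0),
]

# The deque is an in-memory stand-in for a message queue such as Service Bus.
message_queue = deque(detect_anomalies(readings))
tickets = create_tickets(message_queue)
for ticket in tickets:
    print(ticket)
```

Keeping detection and ticket creation on opposite sides of a queue mirrors the decoupling described in the article: the streaming side only emits anomaly messages, and the consumer decides what to do with them.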
Sugandha Lahoti
18 Apr 2018
8 min read

Performing Vehicle Telemetry job analysis with Azure Stream Analytics tools

This tutorial is a step-by-step blueprint for a Vehicle Telemetry job analysis on Azure using Stream Analytics tools for Visual Studio. For connected car and real-time predictive Vehicle Telemetry Analysis, there's a need to identify opportunities for new solutions. These opportunities include:

  • How a car could be shipped globally with the required smart hardware to connect to the internet within the next few years
  • How the embedded connections could define Vehicle Telemetry predictive health status, so automotive companies will be able to collect data on the performance of cars
  • How to send interactive updates and patches to a car's instrumentation remotely
  • How to avoid car equipment damage through precautionary measures with prior notification

All of these require an intelligent vehicle health telemetry analysis, which you can implement using Azure Stream Analytics.

Stream Analytics tools for Visual Studio

The Stream Analytics tools for Visual Studio help you prepare, build, and deploy real-time event processing jobs on Azure. Optionally, the tools enable you to test the streaming job using local sample data as job input, and they also provide real-time monitoring, job metrics, a diagram view, and so on. The tools provide a complete development setup for the implementation and deployment of real-world Azure Stream Analytics jobs using Visual Studio.

Developing a Stream Analytics job using Visual Studio

After installing the Stream Analytics tools, a new Stream Analytics job can be created in Visual Studio:

1. Get started in the Visual Studio IDE from File | New Project. Under Templates, select Stream Analytics and choose Azure Stream Analytics Application.
2. Next, provide the job name, project, and solution location. Under the Solution menu, you may also select options such as Add to solution or Create new instance, apart from Create new solution, from the available drop-down menu during Visual Studio Stream Analytics job creation.
3. Once the ASA job is created, the job topology folder structure can be viewed in the Solution Explorer as Inputs (job input), Outputs (job output), JobConfig.json, Script.asaql (Stream Analytics query file), Azure Functions (optional), and so on.
4. Next, provide the job topology data input and output event source settings by selecting Input.json and Output.json from the Inputs and Outputs directories, respectively.
5. For a Vehicle Telemetry Predictive Analysis demo using an Azure Stream Analytics job, we need to take two different job data streams. One should be a Stream type for an unbounded sequence of real-time events processed through Azure Event Hub, configured with the Hub policy name, policy key, event serialization format, and so on.

Defining a Stream Analytics query for Vehicle Telemetry job analysis using Stream Analytics tools

To assign the streaming analytics query definition, select the Script.asaql file from the ASA project and specify the join between the data stream and reference stream inputs, along with supplying the analyzed output to Blob storage as configured in the job properties.

Query to define Vehicle Telemetry (Connected Car) engine health status and pollution index over cities

For connected car and real-time predictive Vehicle Telemetry Analysis, the opportunities for new solutions lie in how a car could be shipped globally with the required smart hardware to connect to the internet within the next few years, and in how the embedded connections could define Vehicle Telemetry predictive health status, so that automotive companies can collect data on the performance of cars, send interactive updates and patches to a car's instrumentation remotely, and avoid car equipment damage through precautionary measures with prior notification, using intelligent vehicle health telemetry analysis with Azure Stream Analytics.
The solution architecture of the Connected Car-Vehicle Telemetry Analysis case study used in this demo, with Azure Stream Analytics for real-time predictive analysis, is as follows:

Testing Stream Analytics queries locally or in the cloud

Azure Stream Analytics tools in Visual Studio offer the flexibility to execute the queries either locally or directly in the cloud:

1. In the Script.asaql file, provide the query of your streaming job and test it against a local input stream/reference data before processing in Azure.
2. To run the Stream Analytics job query locally, first right-click on the ASA project in the VS Solution Explorer and choose Add Local Input.
3. Define the local input for each Event Hub data stream and Blob storage data and execute the job query locally before publishing it in Azure.
4. After adding each local input test data, you can test the Stream Analytics job query locally in the VS editor by clicking on the Run Locally button in the top left corner of the VS IDE.

  • Vehicle diagnostic
  • Usage-based insurance
  • Engine emission control
  • Engine performance remapping
  • Eco-driving
  • Roadside assistance call
  • Fleet management

So, specify the following schema when designing a connected car streaming job query with Stream Analytics, using parameters such as vehicle index number, model, outside temperature, engine speed, fuel meter, tire pressure, and brake status, and defining an INNER join between the Event Hub data streams and the Blob storage reference streams containing vehicle model information:

    SELECT input.vin, BlobSource.Model, input.timestamp, input.outsideTemperature,
           input.engineTemperature, input.speed, input.fuel, input.engineoil,
           input.tirepressure, input.odometer, input.city,
           input.accelerator_pedal_position, input.parking_brake_status,
           input.headlamp_status, input.brake_pedal_status,
           input.transmission_gear_position, input.ignition_status,
           input.windshield_wiper_status, input.abs
    INTO output
    FROM input
    JOIN BlobSource ON input.vin = BlobSource.VIN

The query could be further customized for complex event processing analysis by using windowing concepts such as the Tumbling window function, which segments the stream into equal-length, non-overlapping series of events with a fixed time slice. The following Vehicle Telemetry analytics query specifies a smart car health index parameter over complex streams, using a two-second timestamp interval in the form of a fixed-length series of events:

    SELECT BlobSource.Model, input.city, COUNT(input.vin) AS cars,
           AVG(input.engineTemperature) AS engineTemperature,
           AVG(input.speed) AS Speed, AVG(input.fuel) AS Fuel,
           AVG(input.engineoil) AS EngineOil,
           AVG(input.tirepressure) AS TirePressure,
           AVG(input.odometer) AS Odometer
    INTO EventHubOut
    FROM input
    JOIN BlobSource ON input.vin = BlobSource.VIN
    GROUP BY BlobSource.Model, input.city, TumblingWindow(second, 2)

5. The query could be executed locally or submitted to Azure. While running the job locally, a Command Prompt will appear showing the local Stream Analytics job's running status, along with the output data folder location.
6. If run locally, the job output is written to the project disk location, within the ASALocalRun directory, in a folder named with the current date timestamp. It contains two output files, in .csv and .json formats respectively.

If you submit the job to Azure from the Stream Analytics project in Visual Studio, it offers a job dashboard with an interactive job diagram view, a job metrics graph, and errors (if any).
The Vehicle Telemetry Predictive Health Analytics job dashboard in Visual Studio provides a job diagram with real-time insights into events, with a display refreshed at a minimum rate of every 30 minutes:

The Stream Analytics job metrics graph provides interactive insights on input and output events, out-of-order events, late events, runtime errors, and data conversion errors related to the job, as appropriate:

For Connected Car-Predictive Vehicle Telemetry Analytics, you may configure the data input streams to be processed as complex events over a definite timestamp interval in a non-overlapping mode, such as a Tumbling window over a two-second time slice. The output sink should be configured as a Service Bus Event Hub with 32 partitions and a maximum message retention period of 7 days. The events processed to the Event Hub output sink can also be archived in Azure Blob storage for long-term, infrequent access:

The Azure Service Bus Event Hub job output metrics dashboard view configured for vehicle telemetry analysis is as follows:

On the left side of the job dashboard, the Job Summary provides a comprehensive view of job parameters such as job status, creation time, job output start time, start mode, last output timestamp, the output error handling mechanism (provided for quick reference logs), late event arrival tolerance windows, and so on. The job can be stopped, started, deleted, or refreshed by selecting icons from the top left menu of the job view dashboard in VS:

Optionally, a complete Stream Analytics project clone can also be generated by clicking on the Generate Project icon in the top menu of the job dashboard.

This article is an excerpt from the book, Stream Analytics with Microsoft Azure, written by Anindita Basak, Krishna Venkataraman, Ryan Murphy, and Manpreet Singh. This book provides lessons on real-time data processing for quick insights using Azure Stream Analytics.
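To make the tumbling-window aggregation used in this tutorial more concrete, here is a minimal Python sketch of the same idea: events are joined to reference data by VIN, assigned to fixed, non-overlapping two-second windows, and averaged per model and city. It is not the Stream Analytics engine; the reference table `VEHICLE_MODELS`, the sample events, and the integer-second timestamps are hypothetical, and window boundaries are simplified to include the start and exclude the end.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical reference data (stand-in for the Blob storage reference stream).
VEHICLE_MODELS = {"VIN001": "Sedan", "VIN002": "SUV"}


def tumbling_window_averages(events, window_seconds=2):
    """Average speed per (window, model, city) over non-overlapping windows."""
    buckets = defaultdict(list)
    for event in events:
        model = VEHICLE_MODELS.get(event["vin"])
        if model is None:
            # Inner join: events without a matching reference row are dropped.
            continue
        # Every timestamp falls into exactly one fixed-length window.
        window_start = (event["timestamp"] // window_seconds) * window_seconds
        buckets[(window_start, model, event["city"])].append(event)

    results = []
    for (window_start, model, city), group in sorted(buckets.items()):
        results.append({
            "window_end": window_start + window_seconds,
            "model": model,
            "city": city,
            "cars": len(group),  # counts events, like COUNT(input.vin)
            "avg_speed": mean([e["speed"] for e in group]),
        })
    return results


events = [
    {"vin": "VIN001", "city": "Seattle", "timestamp": 0, "speed": 50},
    {"vin": "VIN001", "city": "Seattle", "timestamp": 1, "speed": 70},
    {"vin": "VIN002", "city": "Seattle", "timestamp": 1, "speed": 40},
    {"vin": "VIN001", "city": "Seattle", "timestamp": 2, "speed": 90},
    {"vin": "VIN999", "city": "Seattle", "timestamp": 3, "speed": 10},
]

results = tumbling_window_averages(events)
for row in results:
    print(row)
```

Because the windows do not overlap, each event contributes to exactly one output row, which is what distinguishes a tumbling window from hopping or sliding windows.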
Read more: Say hello to Streaming Analytics; How to get started with Azure Stream Analytics and 7 reasons to choose it

Sugandha Lahoti
17 Apr 2018
4 min read

How to build a live interactive visual dashboard in Power BI with Azure Stream

Azure Stream Analytics is a managed, interactive complex event processing data engine. As a built-in output connector, it offers the facility of building live, interactive, intelligent BI charts and graphics using Microsoft's cloud-based Business Intelligence tool, Power BI. In this tutorial we implement a data architecture pipeline by designing a visual dashboard using Microsoft Power BI and Stream Analytics.

Prerequisites for building an interactive visual live dashboard in Power BI with Stream Analytics:

  • Azure subscription
  • Power BI Office 365 account (the account email ID should be the same for both Azure and Power BI). It can be a work or school account

Integrating Power BI as an output job connector for Stream Analytics

To connect the Power BI portal as an output of an existing Stream Analytics job, follow the given steps:

1. First, select Outputs in the Azure portal under JOB TOPOLOGY:
2. After clicking on Outputs, click on +Add in the top left corner of the job window, as shown in the following screenshot:
3. After selecting +Add, you will be prompted to enter the New output connectors of the job. Provide details such as the job Output name/alias; under Sink, choose Power BI from the drop-down menu.
4. On choosing Power BI as the streaming job output Sink, you will automatically be prompted to authorize the Power BI work/personal account with Azure. Additionally, you may create a new Power BI account by clicking on Signup. By authorizing, you grant the Stream Analytics output permanent access to the Power BI dashboard. You can revoke the access by changing the password of the Power BI account or deleting the output/job.
5. After the successful authorization of the Power BI account with Azure, there will be options to select the Group Workspace, which is the Power BI tenant workspace where you may create the particular dataset to configure processed Stream Analytics events. Furthermore, you also need to define the Table Name as data output. Lastly, click on the Create button to integrate the Power BI data connector for real-time data visuals:

Note: If you don't have any custom workspace defined in the Power BI tenant, the default workspace is My Workspace. If you define a dataset and table name that already exist in another Stream Analytics job/output, they will be overwritten. It is also recommended that you define the dataset and table name under the specific tenant workspace in the job portal only, and not explicitly create them in Power BI tenants, as Stream Analytics automatically creates them once the job starts and output events start to be pushed into the Power BI dashboard.

6. On starting the streaming job with output events, the Power BI dataset will appear under the Datasets tab of the workspace. The dataset can contain a maximum of 200,000 rows and supports real-time streaming events as well as historical BI report visuals:

Further Power BI dashboards and reports can be implemented using the streaming dataset. Alternatively, you may also create tiles in custom dashboards by selecting CUSTOM STREAMING DATA under REAL-TIME DATA, as shown in the following screenshot:

After selecting Next, select the streaming dataset and then define the visual type, respective fields, axis, or legends:

Thus, a complete interactive near real-time Power BI visual dashboard can be implemented with analyzed streamed data from Stream Analytics, as shown in the following screenshot, from the real-world Connected Car-Vehicle Telemetry analytics dashboard:

In this article we saw a step-by-step implementation of a real-time visual dashboard using Microsoft Power BI, with processed data from Azure Stream Analytics as the output data connector. This article is an excerpt from the book, Stream Analytics with Microsoft Azure, written by Anindita Basak, Krishna Venkataraman, Ryan Murphy, and Manpreet Singh. To learn more about designing and managing Stream Analytics jobs using reference data and utilizing a petabyte-scale enterprise data store with Azure Data Lake Store, you may refer to this book.

Read more: Unlocking the secrets of Microsoft Power BI; Ride the third wave of BI with Microsoft Power BI

Packt Editorial Staff
16 Apr 2018
7 min read

What is a support vector machine?

Support vector machines are machine learning algorithms whereby a model 'learns' to categorize data around a linear classifier. The linear classifier is, quite simply, a line that classifies. It's a line that distinguishes between 2 'types' of data, like positive and negative sentiment. This gives you control over data, allowing you to easily categorize and manage different data points in a way that's useful too. This tutorial is an extract from Statistics for Machine Learning.

But support vector machines do more than linear classification - they are multidimensional algorithms, which is why they're so powerful. Using something called a kernel trick, which we'll look at in more detail later, support vector machines are able to create non-linear boundaries. Essentially, they work by constructing a linear classifier in a higher-dimensional space, called a hyperplane.

Support vector machines work on a range of different types of data, but they are most effective on data sets with very high dimensions relative to the number of observations, for example:

  • Text classification, in which language is represented by word vectors with very high dimensions
  • Quality control of DNA sequencing, by labeling chromatograms correctly

Different types of support vector machines

Support vector machines are generally classified into three different groups:

  • Maximum margin classifiers
  • Support vector classifiers
  • Support vector machines

Let's take a look at them now.

Maximum margin classifiers

People often use the term maximum margin classifier interchangeably with support vector machines. They're the most common type of support vector machine, but as you'll see, there are some important differences. The maximum margin classifier tackles the problem of what happens when your data isn't quite clear or clean enough to draw a simple line between two sets - it helps you find the best line, or hyperplane, out of a range of options.
The objective of the algorithm is to find the furthest distance between the two nearest points in two different categories of data. This is the 'maximum margin', and the hyperplane sits comfortably within it. The hyperplane is defined by this equation (written here in its standard form for p features):

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = 0$$

So, this means that any data points that sit directly on the hyperplane have to follow this equation. There are also data points that will, of course, fall either side of this hyperplane. These should follow these equations:

$$\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p > 0 \qquad \text{or} \qquad \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p < 0$$

You can represent the maximum margin classifier like this:

$$\max_{\beta_0, \dots, \beta_p} M \quad \text{subject to} \quad (1)\ \sum_{j=1}^{p} \beta_j^2 = 1, \qquad (2)\ y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M \ \text{ for all } i$$

Constraint 2 ensures that observations will be on the correct side of the hyperplane by taking the product of the coefficients with the x variables and, finally, with the class variable indicator.

In the diagram below, you can see that we could draw a number of separate hyperplanes to separate the two classes (blue and red). However, the maximum margin classifier attempts to fit the widest slab between the two classes (that is, to maximize the margin between the positive and negative hyperplanes). The observations touching the positive and negative hyperplanes are the support vectors.

It's important to note that in non-separable cases, the maximum margin classifier will not have a separating hyperplane: there's no feasible solution. This issue is solved with support vector classifiers.

Support vector classifiers

Support vector classifiers are an extended version of maximum margin classifiers. Here, some violations are 'tolerated' for non-separable cases. This means a best fit can be created. In fact, in real-life scenarios, we hardly find any data with purely separable classes; most classes have at least a few overlapping observations.

The mathematical representation of the support vector classifier is as follows, a slight correction to the constraints to accommodate error terms:

$$\max_{\beta_0, \dots, \beta_p,\ \epsilon_1, \dots, \epsilon_n} M \quad \text{subject to} \quad (1)\ \sum_{j=1}^{p} \beta_j^2 = 1, \qquad (2)\ y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M(1 - \epsilon_i), \qquad (3)\ \epsilon_i \ge 0, \qquad (4)\ \sum_{i=1}^{n} \epsilon_i \le C$$

In constraint 4, the C value is a non-negative tuning parameter that accommodates more or fewer overall errors in the model.
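The support vector classifier constraints can be checked numerically for a candidate hyperplane. The following is a minimal sketch in plain Python, assuming the standard formulation in which the coefficients are scaled to unit length, the slack of each observation is max(0, 1 - y·d / M) (with d its signed distance to the hyperplane and M the margin), and C is a budget on the total slack. The function names, the example points, and the hyperplane x1 + x2 - 3 = 0 are illustrative assumptions, not taken from the book.

```python
from math import sqrt

def signed_distance(beta0, beta, x):
    """Signed distance from point x to the hyperplane beta0 + beta . x = 0."""
    norm = sqrt(sum(b * b for b in beta))
    return (beta0 + sum(b * xi for b, xi in zip(beta, x))) / norm

def slack_values(beta0, beta, points, labels, margin):
    """Smallest slack each observation needs so that y * d >= margin * (1 - slack)."""
    return [
        max(0.0, 1.0 - y * signed_distance(beta0, beta, x) / margin)
        for x, y in zip(points, labels)
    ]

def within_budget(slacks, c):
    """True when the total slack fits inside the budget C."""
    return sum(slacks) <= c

# Hypothetical data: two well-separated points, one inside the margin,
# and one on the wrong side of the hyperplane x1 + x2 - 3 = 0.
points = [(0.0, 0.0), (3.0, 3.0), (2.0, 2.0), (2.5, 2.5)]
labels = [-1, 1, 1, -1]
slacks = slack_values(-3.0, (1.0, 1.0), points, labels, margin=1.0)
```

Under these assumptions, a slack of 0 means the observation sits outside the margin on the correct side, a slack between 0 and 1 means it is inside the margin, and a slack above 1 means it is on the wrong side of the hyperplane. A larger budget C admits more such violations.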
Having a high value of C will lead to a more robust model, whereas a lower value creates a more flexible model, because fewer violations are allowed. In practice, C is a tuning parameter, as is usual with all machine learning models. The impact of changing the C value on margins is shown in the two diagrams below. With a high value of C (left diagram), the model is more tolerant and has room for violations (errors), whereas with a lower value of C there is no scope for accepting violations, which leads to a reduction in margin width. C is a tuning parameter in support vector classifiers:

Support vector machines

Support vector machines are used when the decision boundary is non-linear, that is, when it becomes impossible to separate the classes with support vector classifiers. The diagram below shows non-linearly separable cases in both 1 and 2 dimensions:

Clearly, you can't classify these cases using support vector classifiers, whatever the cost value is. This is why you would then want to introduce something called the kernel trick.

In the diagram below, a polynomial kernel with degree 2 has been applied to transform the data from 1-dimensional to 2-dimensional data. By doing so, the data becomes linearly separable in higher dimensions. In the left diagram, the different classes (red and blue) are plotted on $X_1$ only, whereas after applying degree 2, we now have two dimensions, $X_1$ and $X_1^2$ (the original and a new dimension). The degree of the polynomial kernel is a tuning parameter. You need to try various values to check where higher accuracy might be possible with the model:

In the 2-dimensional case, the kernel trick is applied as shown below, again using a polynomial kernel with degree 2.
Observations have been classified successfully using a linear plane after projecting the data into higher dimensions:

Different types of kernel functions

Kernel functions are functions that, given the original feature vectors, return the same value as the dot product of their corresponding mapped feature vectors. Kernel functions do not explicitly map the feature vectors to a higher-dimensional space, or calculate the dot product of the mapped vectors. Kernels produce the same value through a different series of operations that can often be computed more efficiently.

The main reason for using kernel functions is to eliminate the computational requirement to derive the higher-dimensional vector space from the given basic vector space, so that observations can be separated linearly in higher dimensions. This matters because the derived vector space grows rapidly as dimensions are added, and the computation quickly becomes impractical, even when you have only 30 or so variables. The following example shows how the number of derived variables grows.

When we have two variables, x and y, a polynomial kernel of degree 2 needs the $x^2$, $y^2$, and $xy$ dimensions in addition. If we have three variables, x, y, and z, then we need $x^2$, $y^2$, $z^2$, $xy$, $yz$, and $xz$. You will have realized by now that adding one more dimension creates many more combinations. Hence, care needs to be taken to reduce the computational complexity; this is where kernels do wonders. Kernels are defined more formally in the following equation:

$$K(x, z) = \langle \phi(x), \phi(z) \rangle$$

Polynomial kernels are often used, especially with degree 2. In fact, the inventor of support vector machines, Vladimir N. Vapnik, developed them using a degree 2 kernel for classifying handwritten digits.
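The claim that a kernel returns the same value as the dot product of the mapped feature vectors can be verified directly. Below is a small sketch in plain Python, assuming the degree 2 polynomial kernel K(x, z) = (1 + x·z)² on 2-dimensional inputs. The explicit 6-dimensional feature map and all function names are illustrative choices, not code from the book.

```python
from math import sqrt

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Degree 2 polynomial kernel, computed in the original 2-D space."""
    return (1.0 + dot(x, z)) ** 2

def poly2_feature_map(x):
    """Explicit 6-D mapping whose dot product reproduces poly2_kernel."""
    x1, x2 = x
    r2 = sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

x, z = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly2_kernel(x, z)                                 # one dot product in 2-D
via_mapping = dot(poly2_feature_map(x), poly2_feature_map(z))   # dot product in 6-D
```

Both routes give the same number (up to floating-point rounding), but the kernel never builds the 6-dimensional vectors. That saving grows quickly as the number of input variables increases.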
Polynomial kernels are given by the following equation, where k is the degree:

$$K(x, x') = (1 + x \cdot x')^k$$

Radial Basis Function kernels (sometimes called Gaussian kernels) are a good first choice for problems requiring nonlinear models. A decision boundary that is a hyperplane in the mapped feature space is similar to a decision boundary that is a hypersphere in the original space. The feature space produced by the Gaussian kernel can have an infinite number of dimensions, a feat that would be impossible otherwise. RBF kernels are represented by the following equation:

$$K(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$$

This is sometimes simplified as the following equation:

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right), \qquad \gamma = \frac{1}{2\sigma^2}$$

It is advisable to scale the features when using support vector machines, and it is especially important when using the RBF kernel. When the value of gamma is large, it gives you a pointed bump in the higher dimensions. A smaller value gives you a softer, broader bump. A large gamma will give you low bias and high variance solutions; on the other hand, a small gamma will give you high bias and low variance solutions, and that is how you control the fit of the model using RBF kernels.
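The effect of gamma on the width of the RBF bump can be seen with a few lines of plain Python. This sketch assumes the simplified parameterization K(x, x') = exp(-gamma · ||x - x'||²), which is also the form scikit-learn uses; other texts parameterize the kernel by σ instead, in which case the direction is reversed. The function name and sample points are illustrative.

```python
from math import exp

def rbf_kernel(x, z, gamma):
    """RBF (Gaussian) kernel in the simplified form exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-gamma * sq_dist)

origin, neighbor = (0.0, 0.0), (1.0, 0.0)

# Small gamma: similarity decays slowly, so the bump is broad.
wide = rbf_kernel(origin, neighbor, gamma=0.1)

# Large gamma: similarity collapses within a short distance, so the bump is pointed.
narrow = rbf_kernel(origin, neighbor, gamma=10.0)
```

With a large gamma, only training points very close to a new observation influence its prediction, which is why the model can follow the training data closely (low bias, high variance). With a small gamma, distant points still contribute, which smooths the decision boundary.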

Vijin Boricha
16 Apr 2018
7 min read

4 Encryption options for your SQL Server

In today's tutorial, we will learn about cryptographic elements like T-SQL functions, the service master key, and more.

SQL Server cryptographic elements

Encryption is the process of obfuscating data by the use of a key or password. This can make the data useless without the corresponding decryption key or password. Encryption does not solve access control problems. However, it enhances security by limiting data loss even if access controls are bypassed. For example, if the database host computer is misconfigured and a hacker obtains sensitive data, that stolen information might be useless if it is encrypted.

SQL Server provides the following building blocks for encryption; based on them, you can implement all supported features, such as backup encryption, Transparent Data Encryption, column encryption, and so on.

We already know what symmetric and asymmetric keys are. The basic concept is the same in the SQL Server implementation. Later in the chapter you will practice how to create and implement all the elements from Figure 9-3. Let me explain the rest of the items.

T-SQL functions

SQL Server has built-in support for handling encryption elements and features in the form of T-SQL functions. You don't need any third-party software to do that, as you do with other database platforms.

Certificates

A public key certificate is a digitally signed statement that connects the data of a public key to the identity of the person, device, or service that holds the private key. Certificates are issued and signed by a certification authority (CA). You can work with self-signed certificates, but you should be careful here, because they can be misused in a wide range of network attacks.

SQL Server encrypts data with hierarchical encryption. Each layer encrypts the layer beneath it using certificates, asymmetric keys, and symmetric keys.

In a nutshell, the previous image means that any key in the hierarchy is guarded (encrypted) with the key above it.
In practice, if you miss just one element from the chain, decryption will be impossible. This is an important security feature, because it is really hard for an attacker to compromise all levels of security. Let me explain the most important elements in the hierarchy.

Service Master Key

SQL Server has two primary applications for keys: a Service Master Key (SMK) generated on and for a SQL Server instance, and a database master key (DMK) used for a database.

The SMK is automatically generated during installation and the first time the SQL Server instance is started. It is used to encrypt the next key in the chain. The SMK should be backed up and stored in a secure, off-site location. This is an important step, because this is the first key in the hierarchy. Any damage at this level can prevent access to all encrypted data in the layers below. When the SMK is restored, SQL Server decrypts all the keys and data that have been encrypted with the current SMK, and then encrypts them with the SMK from the backup.

The Service Master Key can be viewed with the following system catalog view:

    SELECT name, create_date
    FROM sys.symmetric_keys
    GO

    name                      create_date
    ------------------------- -----------------------
    ##MS_ServiceMasterKey##   2017-04-17 17:56:20.793

    (1 row(s) affected)

Here is an example of how you can back up your SMK to the /var/opt/mssql/backup folder.

Note: If you don't have the /var/opt/mssql/backup folder, execute all 5 bash lines. If you only lack permissions on the /var/opt/mssql/backup folder, execute all lines except the first one.

    sudo mkdir /var/opt/mssql/backup
    sudo chown mssql /var/opt/mssql/backup/
    sudo chgrp mssql /var/opt/mssql/backup/
    sudo /opt/mssql/bin/mssql-conf set filelocation.defaultbackupdir /var/opt/mssql/backup/
    sudo systemctl restart mssql-server

    USE master
    GO

    Changed database context to 'master'.

    BACKUP SERVICE MASTER KEY TO FILE = '/var/opt/mssql/backup/smk'
    ENCRYPTION BY PASSWORD = 'S0m3C00lp4sw00rd'
    --In real scenarios your password should be more complicated
    GO
    exit

The next example shows how to restore the SMK from the backup location:

    USE master
    GO

    Changed database context to 'master'.

    RESTORE SERVICE MASTER KEY
    FROM FILE = '/var/opt/mssql/backup/smk'
    DECRYPTION BY PASSWORD = 'S0m3C00lp4sw00rd'
    GO

You can examine the contents of your SMK backup file with the ls command or with an internal Linux file viewer, such as the one in Midnight Commander (MC). Basically there is not much to see, but that is the power of encryption.

The SMK is the foundation of the SQL Server encryption hierarchy. You should keep a copy at an offsite location.

Database master key

The DMK is a symmetric key used to protect the private keys of certificates and asymmetric keys that are present in the database. When it is created, the master key is encrypted by using the AES 256 algorithm and a user-supplied password. To enable the automatic decryption of the master key, a copy of the key is encrypted by using the SMK and stored in both the user database and the master database. The copy stored in master is always updated whenever the master key is changed.

The next T-SQL code shows how to create a DMK in the Sandbox database:

    CREATE DATABASE Sandbox
    GO
    USE Sandbox
    GO
    CREATE MASTER KEY
    ENCRYPTION BY PASSWORD = 'S0m3C00lp4sw00rd'
    GO

Let's check where the DMK is with the sys.symmetric_keys system catalog view:

    SELECT name, algorithm_desc
    FROM sys.symmetric_keys
    GO

    name                       algorithm_desc
    -------------------------- ---------------
    ##MS_DatabaseMasterKey##   AES_256

    (1 row(s) affected)

The default behavior (automatic decryption through the SMK) can be changed by using the DROP ENCRYPTION BY SERVICE MASTER KEY option of ALTER MASTER KEY. A master key that is not encrypted by the SMK must be opened by using the OPEN MASTER KEY statement and a password.
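Outside SQL Server, the idea that each key is stored encrypted under the key above it, and that the DMK exists in two wrapped copies, can be modeled in a few lines of Python. This is only a toy sketch under loud assumptions: the `wrap` function XORs a key with a SHA-256 digest, which is NOT real cryptography and is not how SQL Server protects keys (it uses AES, as described above). All names and byte strings here are hypothetical stand-ins.

```python
import hashlib

def wrap(key: bytes, secret: bytes) -> bytes:
    """Toy key wrapping: XOR a 32-byte key with SHA-256(secret). Illustration only."""
    pad = hashlib.sha256(secret).digest()
    return bytes(k ^ p for k, p in zip(key, pad))

unwrap = wrap  # XOR with the same pad undoes the wrapping

# Deterministic stand-ins for the three levels of the hierarchy.
smk = hashlib.sha256(b"service master key").digest()
dmk = hashlib.sha256(b"database master key").digest()
cert_private_key = hashlib.sha256(b"certificate private key").digest()

# The DMK is kept in two wrapped copies: one under a password, one under the SMK.
password = b"S0m3C00lp4sw00rd"
dmk_by_password = wrap(dmk, password)
dmk_by_smk = wrap(dmk, smk)

# Keys lower in the hierarchy are wrapped with the DMK.
cert_by_dmk = wrap(cert_private_key, dmk)

def open_dmk_automatically(current_smk: bytes) -> bytes:
    """Models automatic opening: no password needed while the SMK copy exists."""
    return unwrap(dmk_by_smk, current_smk)

def open_dmk_with_password(supplied: bytes) -> bytes:
    """Models OPEN MASTER KEY ... DECRYPTION BY PASSWORD."""
    return unwrap(dmk_by_password, supplied)
```

In this model, the certificate's private key is recoverable only by first recovering the DMK, either through the SMK copy or through the password. With a different SMK (for example, on another instance without a restored SMK), the automatic route yields garbage, which mirrors why backing up the SMK matters.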
Now that we know why the DMK is important and how to create one, we will continue with the following DMK operations:

  • ALTER
  • OPEN
  • CLOSE
  • BACKUP
  • RESTORE
  • DROP

These operations are important because all other database-level encryption keys depend on the DMK.

We can easily create a new DMK for Sandbox and re-encrypt the keys below it in the encryption hierarchy, assuming that we have the DMK created in the previous steps:

    ALTER MASTER KEY REGENERATE
    WITH ENCRYPTION BY PASSWORD = 'S0m3C00lp4sw00rdforN3wK3y'
    GO

Opening the DMK for use:

    OPEN MASTER KEY
    DECRYPTION BY PASSWORD = 'S0m3C00lp4sw00rdforN3wK3y'
    GO

Note: If the DMK was encrypted with the SMK, it will be automatically opened when it is needed for decryption or encryption. In this case, it is not necessary to use the OPEN MASTER KEY statement.

Closing the DMK after use:

    CLOSE MASTER KEY
    GO

Backing up the DMK:

    USE Sandbox
    GO
    OPEN MASTER KEY
    DECRYPTION BY PASSWORD = 'S0m3C00lp4sw00rdforN3wK3y';
    BACKUP MASTER KEY TO FILE = '/var/opt/mssql/backup/Sandbox-dmk'
    ENCRYPTION BY PASSWORD = 'fk58smk@sw0h%as2'
    GO

Restoring the DMK:

    USE Sandbox
    GO
    RESTORE MASTER KEY
    FROM FILE = '/var/opt/mssql/backup/Sandbox-dmk'
    DECRYPTION BY PASSWORD = 'fk58smk@sw0h%as2'
    ENCRYPTION BY PASSWORD = 'S0m3C00lp4sw00rdforN3wK3y';
    GO

When the master key is restored, SQL Server decrypts all the keys that are encrypted with the currently active master key, and then encrypts these keys with the restored master key.

Dropping the DMK:

    USE Sandbox
    GO
    DROP MASTER KEY
    GO

You read an excerpt from the book SQL Server on Linux, written by Jasmin Azemović. From this book, you will learn to configure and administer database solutions on Linux.
Sunith Shetty
13 Apr 2018
19 min read

How to build an options trading web app using Q-learning

Today we will learn to develop an options trading web app using the Q-learning algorithm, and we will also evaluate the model.

Developing an options trading web app using Q-learning

Algorithmic trading is the process of using computers programmed to follow a defined set of instructions for placing a trade in order to generate profits at a speed and frequency that is impossible for a human trader. The defined sets of rules are based on timing, price, quantity, or any mathematical model.

Problem description

Through this project, we will predict the price of an option on a security N days in the future according to the current set of observed features derived from the time to expiration, the price of the security, and volatility. The question would be: what model should we use for such an option pricing model? The answer is that there are actually many; the Black-Scholes partial differential equation (PDE) is one of the most recognized.

In mathematical finance, the Black-Scholes equation is a PDE governing the price evolution of a European call or a European put under the Black-Scholes model. For a European call or put on an underlying stock paying no dividends, the equation is:

$$\frac{\partial V}{\partial t} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2} + rS\frac{\partial V}{\partial S} - rV = 0$$

where V is the price of the option as a function of stock price S and time t, r is the risk-free interest rate, and σ is the volatility of the stock. One of the key financial insights behind the equation is that anyone can perfectly hedge the option by buying and selling the underlying asset in just the right way, without any risk. This hedge implies that there is only one right price for the option, as returned by the Black-Scholes formula.

Consider a January maturity call option on IBM with an exercise price of $95. You write a January IBM put option with an exercise price of $85. Let us focus on the call options of a given security, IBM.
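For reference, the closed-form Black-Scholes price of a European call on a non-dividend-paying stock can be computed with the Python standard library alone. This is a minimal sketch of the textbook formula, not part of the article's Scala application; the function names and the sample inputs (spot 100, strike 100, 5% rate, 20% volatility, one year) are illustrative assumptions.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(spot, strike, rate, sigma, years):
    """Price of a European call under Black-Scholes (no dividends)."""
    d1 = (log(spot / strike) + (rate + 0.5 * sigma ** 2) * years) / (sigma * sqrt(years))
    d2 = d1 - sigma * sqrt(years)
    return spot * norm_cdf(d1) - strike * exp(-rate * years) * norm_cdf(d2)

def black_scholes_put(spot, strike, rate, sigma, years):
    """Price of the matching European put, obtained through put-call parity."""
    call = black_scholes_call(spot, strike, rate, sigma, years)
    return call - spot + strike * exp(-rate * years)

call_price = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
put_price = black_scholes_put(100.0, 100.0, 0.05, 0.2, 1.0)
```

The formula needs only the price of the underlying, the strike, the risk-free rate, the volatility, and the time to expiration. The Q-learning approach below builds its state from a comparable set of inputs, but learns from historical prices instead of assuming a pricing model.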
The following chart plots the daily price of the IBM stock and its derivative call option for May 2014, with a strike price of $190:

Figure 1: IBM stock and call $190 May 2014 pricing in May-Oct 2013

Now, what will the profit and loss be for this position if IBM is selling at $87 on the option maturity date? Alternatively, what if IBM is selling at $100? Well, it is not easy to compute or predict the answer. However, in options trading, the price of an option depends on a few parameters, such as time decay, price, and volatility:

  • Time to expiration of the option (time decay)
  • The price of the underlying security
  • The volatility of returns of the underlying asset

A pricing model usually does not consider the variation in trading volume of the underlying security. Therefore, some researchers have included it in the option trading model. As we have described, any RL-based algorithm should have an explicit state (or states), so let us define the state of an option using the following four normalized features:

  • Time decay (timeToExp): This is the time to expiration once normalized in the range of (0, 1).
  • Relative volatility (volatility): Within a trading session, this is the relative variation of the price of the underlying security. It is different from the more complex volatility of returns defined in the Black-Scholes model, for example.
  • Volatility relative to volume (vltyByVol): This is the relative volatility of the price of the security adjusted for its trading volume.
  • Relative difference between the current price and the strike price (priceToStrike): This measures the ratio of the difference between the price and the strike price to the strike price.

The following graph shows the four normalized features that can be used for the IBM option strategy:

Figure 2: Normalized relative stock price volatility, volatility relative to trading volume, and price relative to strike price for the IBM stock

Now let us look at the stock and the option price dataset.
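The four features can be sketched in plain Python to make the normalization concrete. The exact formulas are not spelled out at this point in the article, so this sketch makes several assumptions that are labeled in the comments: session volatility is taken as 1 - low / high, time to expiration as the fraction of sessions remaining, and min-max scaling is used as the normalization. The dictionary keys and function names are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant series")
    return [(v - lo) / (hi - lo) for v in values]

def option_features(sessions, strike):
    """Build the four features from a list of trading sessions.

    Each session is a dict with hypothetical keys: high, low, close, volume.
    """
    n = len(sessions)
    # Assumption: fraction of sessions left, so the first session maps to 1.0.
    time_to_exp = [(n - i) / n for i in range(n)]
    # Assumption: relative intraday range as the session volatility.
    volatility = [1.0 - s["low"] / s["high"] for s in sessions]
    vlty_by_vol = [v / s["volume"] for v, s in zip(volatility, sessions)]
    # Definition from the text: (price - strike) / strike.
    price_to_strike = [(s["close"] - strike) / strike for s in sessions]
    return {
        "timeToExp": time_to_exp,
        "volatility": min_max_normalize(volatility),
        "vltyByVol": min_max_normalize(vlty_by_vol),
        "priceToStrike": min_max_normalize(price_to_strike),
    }

# Made-up sessions, for illustration only.
sessions = [
    {"high": 182.0, "low": 178.0, "close": 180.0, "volume": 1000.0},
    {"high": 193.0, "low": 189.0, "close": 190.0, "volume": 2000.0},
    {"high": 204.0, "low": 198.0, "close": 200.0, "volume": 1500.0},
]
features = option_features(sessions, strike=190.0)
```

Putting every feature on the same 0-to-1 scale is what later allows each one to be split into the same number of equal-width buckets.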
Two files, IBM.csv and IBM_O.csv, contain the IBM stock prices and option prices, respectively. The stock price dataset has the date, the opening price, the high and low price, the closing price, the trade volume, and the adjusted closing price. A snapshot of the dataset is given in the following diagram:

Figure 3: IBM stock data

On the other hand, IBM_O.csv has 127 option prices for IBM Call 190 Oct 18, 2014. A few values are 1.41, 2.24, 2.42, 2.78, 3.46, 4.11, 4.51, 4.92, 5.41, 6.01, and so on. At this point, can we develop a predictive model using a Q-learning algorithm that helps us answer the previously mentioned question: can it tell us how to make the maximum profit on IBM by utilizing all the available features?

Well, we know how to implement Q-learning, and we know what option trading is.

Implementing an options trading web application

The goal of this project is to create an options trading web application that creates a Q-learning model from the IBM stock data. Then the app will extract the output from the model as a JSON object and show the result to the user. Figure 4 shows the overall workflow:

Figure 4: Workflow of the options trading Scala web app

The compute API prepares the input for the Q-learning algorithm, and the algorithm starts by extracting the data from the files to build the option model. Then it performs operations on the data such as normalization and discretization. It passes all of this to the Q-learning algorithm to train the model. After that, the compute API gets the model from the algorithm, extracts the best policy data, and puts it into JSON to be returned to the web browser.
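The training step in this workflow rests on the standard tabular Q-learning update, Q(s, a) ← Q(s, a) + α · (r + γ · max Q(s', ·) - Q(s, a)). The following is a generic textbook sketch in Python, not the article's Scala QLearning class (whose internals are not shown in this excerpt). It assumes a toy chain of five states, where an action means moving to a neighboring state, and uses a learning rate of 0.2 and a discount rate of 0.6. All names are hypothetical.

```python
import random

def train_q_table(rewards, neighbors, goal, alpha=0.2, gamma=0.6,
                  episodes=500, max_steps=50, seed=0):
    """Tabular Q-learning where each action is a move to an allowed neighbor state."""
    rng = random.Random(seed)
    n_states = len(rewards)
    # One Q-value per (state, reachable next state) pair, initialized to zero.
    q = {s: {t: 0.0 for t in neighbors(s)} for s in range(n_states)}
    for _ in range(episodes):
        state = rng.randrange(n_states)
        for _ in range(max_steps):
            if state == goal:
                break
            nxt = rng.choice(sorted(q[state]))  # explore uniformly at random
            target = rewards[nxt] + gamma * max(q[nxt].values())
            q[state][nxt] += alpha * (target - q[state][nxt])
            state = nxt
    return q

def best_policy(q):
    """For every state, pick the next state with the highest learned Q-value."""
    return {s: max(actions, key=actions.get) for s, actions in q.items()}

# Toy problem: states 0..4 in a row, reward only for reaching state 4.
N = 5
chain_neighbors = lambda s: [t for t in (s - 1, s + 1) if 0 <= t < N]
q_table = train_q_table([0.0, 0.0, 0.0, 0.0, 1.0], chain_neighbors, goal=4)
policy = best_policy(q_table)
```

After training, the policy moves every non-goal state one step closer to the rewarding state. In the options application, the states are the discretized feature buckets and the reward comes from the change in the option price, but the update rule has the same shape.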
The implementation of the options trading strategy using Q-learning consists of the following steps:

  • Describing the property of an option
  • Defining the function approximation
  • Specifying the constraints on the state transition

Creating an option property

Considering the market volatility, we need to be a bit more realistic, because any longer-term prediction is quite unreliable. The reason is that it would fall outside the constraint of the discrete Markov model. So, suppose we want to predict the price for the next two days, that is, N = 2. That means the price of the option two days in the future is the value of the reward (profit or loss). So, let us encapsulate the following four parameters:

  • timeToExp: Time left until expiration as a percentage of the overall duration of the option
  • volatility: Normalized relative volatility of the underlying security for a given trading session
  • vltyByVol: Volatility of the underlying security for a given trading session relative to the trading volume for the session
  • priceToStrike: Price of the underlying security relative to the strike price for a given trading session

The OptionProperty class defines the property of a traded option on a security. The constructor creates the property for an option:

    class OptionProperty(
        timeToExp: Double,
        volatility: Double,
        vltyByVol: Double,
        priceToStrike: Double) {
      // The four features as an array, in a fixed order
      val toArray = Array[Double](timeToExp, volatility, vltyByVol, priceToStrike)
      require(timeToExp > 0.01,
        s"OptionProperty time to expiration found $timeToExp required > 0.01")
    }

Creating an option model

Now we need to create an OptionModel to act as the container and the factory for the properties of the option. It takes the following parameters and creates a list of option properties, propsList, by accessing the data source of the four features described earlier:

  • The symbol of the security.
  • The strike price for the option, strikePrice.
  • The source of the data, src.
  • The minimum time decay or time to expiration, minTDecay (named minExpT in the constructor shown later).
Out-of-the-money options expire worthless, and in-the-money options have a very different price behavior as they get closer to expiration. Therefore, the last minTDecay trading sessions prior to the expiration date are not used in the training process.

  • The number of steps (or buckets), nSteps, used in approximating the values of each feature. For instance, an approximation of four steps creates four buckets: (0, 25), (25, 50), (50, 75), and (75, 100).

Then it assembles the OptionProperty instances and computes the normalized minimum time to the expiration of the option. Then it computes an approximation of the value of options by discretizing the actual value into multiple levels from an array of option prices; finally, it returns a map of an array of levels for the option price and accuracy. Here is the constructor of the class:

    class OptionModel(
        symbol: String,
        strikePrice: Double,
        src: DataSource,
        minExpT: Int,
        nSteps: Int)

Inside this class implementation, a validation is first done using the check() method, which checks the following:

  • strikePrice: A positive price is required
  • minExpT: This has to be between 2 and 16
  • nSteps: Requires a minimum of two steps

Here's the invocation of this method:

    check(strikePrice, minExpT, nSteps)

The preceding method is shown in the following code:

    def check(strikePrice: Double, minExpT: Int, nSteps: Int): Unit = {
      require(strikePrice > 0.0,
        s"OptionModel.check price found $strikePrice required > 0")
      require(minExpT > 2 && minExpT < 16,
        s"OptionModel.check Minimum expiration time found $minExpT required ]2, 16[")
      require(nSteps > 1,
        s"OptionModel.check, number of steps found $nSteps required > 1")
    }

Once the preceding constraints are satisfied, the list of option properties, named propsList, is created as follows:

    val propsList = (for {
      price <- src.get(adjClose)
      volatility <- src.get(volatility)
      nVolatility <- normalize[Double](volatility)
      vltyByVol <- src.get(volatilityByVol)
      nVltyByVol <- normalize[Double](vltyByVol)
      priceToStrike <- normalize[Double](price.map(p => 1.0 - strikePrice / p))
    } yield {
      nVolatility.zipWithIndex./:(List[OptionProperty]()) {
        case (xs, (v, n)) =>
          // n is the index of the trading session
          val normDecay = (n + minExpT).toDouble / (price.size + minExpT)
          new OptionProperty(normDecay, v, nVltyByVol(n), priceToStrike(n)) :: xs
      }.drop(2).reverse
    }).get

In the preceding code block, the factory uses the zipWithIndex Scala method to represent the index of the trading sessions. All feature values are normalized over the interval (0, 1), including the time decay (or time to expiration) of the option, normDecay.

The quantize() method of the OptionModel class converts the normalized value of each option property into an array of bucket indices. It returns a map of profit and loss for each bucket, keyed on the array of bucket indices:

    def quantize(o: Array[Double]): Map[Array[Int], Double] = {
      val mapper = new mutable.HashMap[Int, Array[Int]]
      val acc: NumericAccumulator[Int] = propsList.view
        .map(_.toArray)
        .map(toArrayInt(_))
        .map(ar => {
          val enc = encode(ar)
          mapper.put(enc, ar)
          enc
        })
        .zip(o)
        ./:(new NumericAccumulator[Int]) {
          case (_acc, (t, y)) => _acc += (t, y); _acc
        }
      // v is the number of occurrences and w the sum, so the average is w / v
      acc.map { case (k, (v, w)) => (k, w / v) }
        .map { case (k, v) => (mapper(k), v) }
        .toMap
    }

The method also creates a mapper instance to index the array of buckets. An accumulator, acc, of type NumericAccumulator, extends Map[Int, (Int, Double)] and computes the tuple (number of occurrences of features on each bucket, sum of the increase or decrease of the option price).

The toArrayInt method converts the value of each option property (timeToExp, volatility, and so on) into the index of the appropriate bucket. The array of indices is then encoded to generate the id or index of a state. The method updates the accumulator with the number of occurrences and the total profit and loss for a trading session for the option.
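The bucketing, encoding, and averaging performed by quantize() can be restated compactly in Python. This is a sketch of the idea described above, not a port of the article's class: it takes the number of steps as an explicit argument, clamps a feature value of exactly 1.0 into the last bucket (an addition that is not in the Scala code), and treats the second argument as one observed value per trading session. All names are hypothetical.

```python
from collections import defaultdict
from math import floor

def to_bucket_indices(features, n_steps):
    """Map each normalized feature in [0, 1] to a bucket index in 0..n_steps-1."""
    return [min(floor(n_steps * x), n_steps - 1) for x in features]

def encode(indices, n_steps):
    """Pack bucket indices into one state id: sum of indices[i] * n_steps**i."""
    state_id, scale = 0, 1
    for index in indices:
        state_id += scale * index
        scale *= n_steps
    return state_id

def quantize(properties, values, n_steps):
    """Average the observed values of all sessions that fall into the same state."""
    totals = defaultdict(lambda: [0, 0.0])  # state id -> [count, sum]
    buckets = {}                            # state id -> bucket indices
    for features, value in zip(properties, values):
        indices = to_bucket_indices(features, n_steps)
        state_id = encode(indices, n_steps)
        buckets[state_id] = tuple(indices)
        totals[state_id][0] += 1
        totals[state_id][1] += value
    return {buckets[k]: total / count for k, (count, total) in totals.items()}

# Three made-up sessions; the first two land in the same state.
properties = [(0.1, 0.5, 0.9, 0.3), (0.15, 0.55, 0.95, 0.35), (0.8, 0.2, 0.1, 0.6)]
averaged = quantize(properties, [1.0, 3.0, -2.0], n_steps=4)
```

With four features and nSteps buckets each, there are at most nSteps⁴ distinct states, which is why the choice of nSteps directly controls the size of the Q-learning problem.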
Finally, quantize() computes the reward on each action by averaging the profit and loss on each bucket. The encode() and toArrayInt() methods, together with the NumericAccumulator class, are given in the following code:

    private def encode(arr: Array[Int]): Int =
      arr./:((1, 0)) { case ((s, t), n) => (s * nSteps, t + s * n) }._2

    private def toArrayInt(feature: Array[Double]): Array[Int] =
      feature.map(x => (nSteps * x).floor.toInt)

    final class NumericAccumulator[T] extends mutable.HashMap[T, (Int, Double)] {
      // Increments the count and adds x to the running sum for the given key
      def +=(key: T, x: Double): Option[(Int, Double)] = {
        val newValue =
          if (contains(key)) (get(key).get._1 + 1, get(key).get._2 + x)
          else (1, x)
        super.put(key, newValue)
      }
    }

Finally, and most importantly, if the preceding constraints are satisfied (you can modify these constraints, though) and the constructor succeeds, the instantiation of the OptionModel class generates a list of OptionProperty elements; otherwise, it generates an empty list.

Putting it all together

Because we have implemented the Q-learning algorithm, we can now develop the options trading application using Q-learning. However, we first need to load the data using the DataSource class (we will see its implementation later on). Then we can create an option model from the data for a given stock with default strike and minimum expiration time parameters, using OptionModel, which defines the model for a traded option on a security. Then we have to create the model for the profit and loss on an option given the underlying security.

The profit and loss are adjusted to produce positive values. We then instantiate the Q-learning class, a generic parameterized class that implements the Q-learning algorithm. The Q-learning model is initialized and trained during the instantiation of the class, so it can be in the correct state for the runtime prediction. Therefore, an instance of the class has only two states: successfully trained, or failed to train.
Then the model is returned to get processed and visualized. So, let us create a Scala object and name it QLearningMain. Then, inside the QLearningMain object, define and initialize the following parameters: Name: Used to indicate the reinforcement algorithm's name (for our case, it's Q- learning) STOCK_PRICES: File that contains the stock data OPTION_PRICES: File that contains the available option data STRIKE_PRICE: Option strike price MIN_TIME_EXPIRATION: Minimum expiration time for the option recorded QUANTIZATION_STEP: Steps used in discretization or approximation of the value of the security ALPHA: Learning rate for the Q-learning algorithm DISCOUNT (gamma): Discount rate for the Q-learning algorithm MAX_EPISODE_LEN:Maximum number of states visited per episode NUM_EPISODES: Number of episodes used during training MIN_COVERAGE: Minimum coverage allowed during the training of the Q- learning model NUM_NEIGHBOR_STATES: Number of states accessible from any other state REWARD_TYPE: Maximum reward or Random Tentative initializations for each parameter are given in the following code: val name: String = "Q-learning"// Files containing the historical prices for the stock and option val STOCK_PRICES = "/static/IBM.csv" val OPTION_PRICES = "/static/IBM_O.csv"// Run configuration parameters val STRIKE_PRICE = 190.0 // Option strike price val MIN_TIME_EXPIRATION = 6 // Min expiration time for option recorded val QUANTIZATION_STEP = 32 // Quantization step (Double => Int) val ALPHA = 0.2 // Learning rate val DISCOUNT = 0.6 // Discount rate used in Q-Value update equation val MAX_EPISODE_LEN = 128 // Max number of iteration for an episode val NUM_EPISODES = 20 // Number of episodes used for training. val NUM_NEIGHBHBOR_STATES = 3 // No. 
of states from any other state Now the run() method accepts as input the reward type (Maximum  reward in our case), quantized step (in our case, QUANTIZATION_STEP), alpha (the learning rate, ALPHA in our case) and gamma (in our case, it's DISCOUNT, the discount rate for the Q-learning algorithm). It displays the distribution of values in the model. Additionally, it displays the estimated Q-value for the best policy on a Scatter plot (we will see this later). Here is the workflow of the preceding method: First, it extracts the stock price from the IBM.csv file Then it creates an option model createOptionModel using the stock prices and quantization, quantizeR (see the quantize method for more and the main method invocation later) The option prices are extracted from the IBM_o.csv file After that, another model, model, is created using the option model to evaluate it on the option prices, oPrices Finally, the estimated Q-Value (that is, Q-value = value * probability) is displayed 0n a Scatter plot using the display method By amalgamating the preceding steps, here's the signature of the run() method: private def run(rewardType:  String,quantizeR: Int,alpha:  Double,gamma: Double): Int = { val sPath  = getClass.getResource(STOCK_PRICES).getPath val src  = DataSource(sPath,  false, false, 1).get val option  = createOptionModel(src,  quantizeR) val oPricesSrc  = DataSource(OPTION_PRICES,  false, false, 1).get val oPrices  = oPricesSrc.extract.get val model  = createModel(option,  oPrices, alpha, gamma)model.map(m  => {if (rewardType  != "Random") display(m.bestPolicy.EQ,m.toString,s"$rewardType  with quantization order $quantizeR")1}).getOrElse(-1) } Now here is the signature of the createOptionModel() method that creates an option model using (see the OptionModel class): private def createOptionModel(src:  DataSource, quantizeR: Int): OptionModel = new OptionModel("IBM",  STRIKE_PRICE, src, MIN_TIME_EXPIRATION, quantizeR) Then the createModel() method creates a model 
for the profit and loss on an option given the underlying security. Note that the option prices are quantized using the quantize() method defined earlier. Then the constraining method is used to limit the number of actions available to any given state. This simple implementation computes the list of all the states within a radius of this state, that is, it identifies the neighboring states within a predefined radius. Finally, it uses the input data to train the Q-learning model. The profit is normalized so that the minimum value (the maximum loss) is converted to a null profit. Note that the profit and loss are thus adjusted to produce positive values. Now let us see this method:

def createModel(ibmOption: OptionModel, oPrice: Seq[Double], alpha: Double, gamma: Double): Try[QLModel] = {
  val qPriceMap = ibmOption.quantize(oPrice.toArray)
  val numStates = qPriceMap.size
  val neighbors = (n: Int) => {
    def getProximity(idx: Int, radius: Int): List[Int] = {
      val idx_max = if (idx + radius >= numStates) numStates - 1 else idx + radius
      val idx_min = if (idx < radius) 0 else idx - radius
      Range(idx_min, idx_max + 1).filter(_ != idx)./:(List[Int]())((xs, n) => n :: xs)
    }
    getProximity(n, NUM_NEIGHBHBOR_STATES)
  }
  val qPrice: DblVec = qPriceMap.values.toVector
  val profit: DblVec = normalize(zipWithShift(qPrice, 1).map { case (x, y) => y - x }).get
  val maxProfitIndex = profit.zipWithIndex.maxBy(_._1)._2
  val reward = (x: Double, y: Double) => Math.exp(30.0 * (y - x))
  val probabilities = (x: Double, y: Double) => if (y < 0.3 * x) 0.0 else 1.0
  println(s"$name Goal state index: $maxProfitIndex")
  if (!QLearning.validateConstraints(profit.size, neighbors))
    throw new IllegalStateException("QLearningEval Incorrect states transition constraint")
  val instances = qPriceMap.keySet.toSeq.drop(1)
  val config = QLConfig(alpha, gamma, MAX_EPISODE_LEN, NUM_EPISODES, 0.1)
  val qLearning = QLearning[Array[Int]](config, Array[Int](maxProfitIndex), profit, reward, probabilities, instances, Some(neighbors))
  val modelO = qLearning.getModel
  if (modelO.isDefined) {
    val numTransitions = numStates * (numStates - 1)
    println(s"$name Coverage ${modelO.get.coverage} for $numStates states and $numTransitions transitions")
    val profile = qLearning.dump
    println(s"$name Execution profile\n$profile")
    display(qLearning)
    Success(modelO.get)
  } else Failure(new IllegalStateException(s"$name model undefined"))
}

Note that if the preceding invocation cannot create a model, the code returns a Failure carrying a message that the model is undefined. Nonetheless, remember that the minCoverage used in the following line is important, considering the small dataset we used (because the algorithm will converge very quickly):

val config = QLConfig(alpha, gamma, MAX_EPISODE_LEN, NUM_EPISODES, 0.0)

Although we've already stated that it is not assured that the model creation and training will be successful, a naive rule of thumb is to use a very small minCoverage value, between 0.0 and 0.22. Now, if the preceding invocation is successful, then the model is trained and ready for making predictions. If so, then the display method is used to display the estimated Q-value = value * probability in a scatter plot. Here is the method:

private def display(eq: Vector[DblPair], results: String, params: String): Unit = {
  import org.scalaml.plots.{ScatterPlot, BlackPlotTheme, Legend}
  val labels = Legend(name, s"Q-learning config: $params", "States", "States")
  ScatterPlot.display(eq, labels, new BlackPlotTheme)
}

Hang on and do not lose patience! We are finally ready to see a simple run and inspect the result.
So let us do it:

def main(args: Array[String]): Unit = {
  run("Maximum reward", QUANTIZATION_STEP, ALPHA, DISCOUNT)
}

Action: state 71 => state 74
Action: state 71 => state 73
Action: state 71 => state 72
Action: state 71 => state 70
Action: state 71 => state 69
Action: state 71 => state 68
...
Instance: [I@1f021e6c - state: 124
Action: state 124 => state 125
Action: state 124 => state 123
Action: state 124 => state 122
Action: state 124 => state 121
Q-learning Coverage 0.1 for 126 states and 15750 transitions
Q-learning Execution profile
Q-Value -> 5.572310105096295, 0.013869013819834967, 4.5746487300071825, 0.4037703812585325, 0.17606260549479869, 0.09205272504875522, 0.023205692430068765, 0.06363082458984902, 50.405283888218435... 6.5530411130514015
Model: Success(Optimal policy: Reward - 1.00,204.28,115.57,6.05,637.58,71.99,12.34,0.10,4939.71,521.30,402.73, with coverage: 0.1)

Evaluating the model

The preceding output shows the transitions from one state to another. For the 0.1 coverage, the Q-learning model had 15,750 transitions for 126 states to reach goal state 37 with optimal rewards. The training set is quite small and only a few buckets have actual values, so we can see that the size of the training set has an impact on the number of states. Q-learning will converge too fast for a small training set (like the one we have for this example). However, for a larger training set, Q-learning will take time to converge; it will provide at least one value for each bucket created by the approximation. Also, from those values alone, it is difficult to understand the relation between Q-values and states. So what if we could see the Q-values per state? Why not! We can see them on a scatter plot:

Figure 5: Q-value per state

Now let us display the profile of the log of the Q-value (QLData.value) as the recursive search (or training) progresses for different episodes or epochs.
The test uses a learning rate α = 0.1 and a discount rate γ = 0.9 (see more in the deployment section): Figure 6: Profile of the logarithmic Q-Value for different  epochs during Q-learning training The preceding chart illustrates the fact that the Q-value for each profile is independent of the order of the epochs during training. However, the number of iterations to reach the goal state depends on the initial state selected randomly in this example. To get more insights, inspect the output on your editor or access the API endpoint at http://localhost:9000/api/compute (see following). Now, what if we display the distribution of values in the model and display the estimated Q-value for the best policy on a Scatter plot for the given configuration parameters? Figure 7: Maximum reward with quantization 32 with the QLearning The final evaluation consists of evaluating the impact of the learning rate and discount rate on the coverage of the training: Figure 7: Impact of the learning rate and discount rate on the coverage of the training The coverage decreases as the learning rate increases. This result confirms the general rule of using learning rate < 0.2. A similar test to evaluate the impact of the discount rate on the coverage is inconclusive. We learned to develop a real-life application for options trading using a reinforcement learning algorithm called Q-learning. You read an excerpt from a book written by Md. Rezaul Karim, titled Scala Machine Learning Projects. In this book, you will learn to develop, build, and deploy research or commercial machine learning projects in a production-ready environment. Check out other related posts: Getting started with Q-learning using TensorFlow How to implement Reinforcement Learning with TensorFlow How Reinforcement Learning works  

Vijin Boricha
12 Apr 2018
15 min read

Image filtering techniques in OpenCV

In the world of computer vision, image filtering is used to modify images. These modifications essentially allow you to clarify an image in order to get the information you want. This could involve anything from extracting edges from an image to blurring it or removing unwanted objects. There are, of course, lots of reasons why you might want to use image filtering to modify an image. For example, taking a picture in sunlight or darkness will impact an image's clarity; you can use image filters to modify the image to get what you want from it. Similarly, you might have a blurred or 'noisy' image that needs clarification and focus. Let's use an example to see how to do image filtering in OpenCV. This image filtering tutorial is an extract from Practical Computer Vision.

Here's an example with considerable salt and pepper noise. This occurs when there is a disturbance in the quality of the signal that's used to generate the image. The image above can be easily generated using OpenCV as follows:

import numpy as np
import cv2

# initialize noise image with zeros
noise = np.zeros((400, 600))
# fill the image with random numbers in given range
cv2.randu(noise, 0, 256)

Let's add weighted noise to a grayscale image (on the left) so the resulting image will look like the one on the right. The code for this is as follows:

# add noise to existing image
# (the built-in int is used because recent NumPy releases no longer provide np.int)
noisy_gray = gray + np.array(0.2*noise, dtype=int)

Here, 0.2 is used as the noise weight; increase or decrease the value to create noise of a different intensity. In several applications, noise plays an important role in improving a system's capabilities. This is particularly true when you're using deep learning models. The noise becomes a way of testing the precision of the deep learning application, and of building robustness into the computer vision algorithm.

Linear image filtering

The simplest filter is a point operator. Each pixel value is multiplied by a scalar value.
This operation can be written as follows:

$$g(i,j) = K \cdot f(i,j)$$

Here:

The input image is F and the value of the pixel at (i,j) is denoted as f(i,j)
The output image is G and the value of the pixel at (i,j) is denoted as g(i,j)
K is a scalar constant

This type of operation on an image is what is known as a linear filter. In addition to multiplication by a scalar value, each pixel can also be increased or decreased by a constant value L. So the overall point operation can be written like this:

$$g(i,j) = K \cdot f(i,j) + L$$

This operation can be applied both to grayscale images and RGB images. For RGB images, each channel will be modified with this operation separately. The following is the result of varying both K and L. The first image, on the left, is the input. In the second image, K=0.5 and L=0.0, while in the third image, K is set to 1.0 and L is 10. For the final image on the right, K=0.7 and L=25. As you can see, K scales the pixel values and therefore changes the contrast of the image (values below 1 also darken it), while L adds a constant offset and therefore changes the brightness of the image:

This image can be generated with the following code:

import numpy as np
import matplotlib.pyplot as plt
import cv2

def point_operation(img, K, L):
    """
    Applies point operation to given grayscale image
    """
    # np.float64 and the built-in int replace np.float and np.int,
    # which recent NumPy releases no longer provide
    img = np.asarray(img, dtype=np.float64)
    img = img*K + L
    # clip pixel values
    img[img > 255] = 255
    img[img < 0] = 0
    return np.asarray(img, dtype=int)

def main():
    # read an image
    img = cv2.imread('../figures/flower.png')
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # k = 0.5, l = 0
    out1 = point_operation(gray, 0.5, 0)
    # k = 1., l = 10
    out2 = point_operation(gray, 1., 10)
    # k = 0.7, l = 25
    out3 = point_operation(gray, 0.7, 25)
    res = np.hstack([gray, out1, out2, out3])
    plt.imshow(res, cmap='gray')
    plt.axis('off')
    plt.show()

if __name__ == '__main__':
    main()

2D linear image filtering

While the preceding filter is a point-based filter, image pixels have information around the pixel as well. In the previous image of the flower, the pixel values in the petal are all yellow.
If we choose a pixel of the petal and move around, the values will be quite close. This gives some more information about the image. To extract this information in filtering, there are several neighborhood filters. In neighborhood filters, there is a kernel matrix which captures local region information around a pixel. To explain these filters, let's start with an input image, as follows: This is a simple binary image of the number 2. To get certain information from this image, we can directly use all the pixel values. But instead, to simplify, we can apply filters on this. We define a matrix smaller than the given image which operates in the neighborhood of a target pixel. This matrix is termed kernel; an example is given as follows: The operation is defined first by superimposing the kernel matrix on the original image, then taking the product of the corresponding pixels and returning a summation of all the products. In the following figure, the lower 3 x 3 area in the original image is superimposed with the given kernel matrix and the corresponding pixel values from the kernel and image are multiplied. The resulting image is shown on the right and is the summation of all the previous pixel products: This operation is repeated by sliding the kernel along image rows and then image columns. This can be implemented as in following code. We will see the effects of applying this on an image in coming sections. # design a kernel matrix, here is uniform 5x5 kernel = np.ones((5,5),np.float32)/25 # apply on the input image, here grayscale input dst = cv2.filter2D(gray,-1,kernel) However, as you can see previously, the corner pixel will have a drastic impact and results in a smaller image because the kernel, while overlapping, will be outside the image region. This causes a black region, or holes, along with the boundary of an image. 
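The superimpose, multiply, and sum operation described above can be sketched in plain NumPy. This is only an illustration of the idea, not how cv2.filter2D is implemented; the function name filter2d_valid and the toy 5 x 5 image are made up for this example.

```python
import numpy as np

def filter2d_valid(image, kernel):
    """Slide the kernel over the image; at each position multiply the
    overlapping values and sum the products. Only positions where the
    kernel fits entirely inside the image are evaluated."""
    image = np.asarray(image, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):          # slide along image rows
        for j in range(out_w):      # slide along image columns
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# hypothetical toy image: a bright 3x3 block on a dark 5x5 background
toy = np.zeros((5, 5))
toy[1:4, 1:4] = 9.0

# uniform 3x3 averaging kernel
box = np.ones((3, 3)) / 9.0

result = filter2d_valid(toy, box)
print(result.shape)
print(result)
```

Because the sketch only evaluates positions where the kernel lies fully inside the image, a 5 x 5 input with a 3 x 3 kernel yields a 3 x 3 output, which is the boundary problem mentioned above.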
To rectify this, there are some common techniques used:

Padding the border with a constant value, such as 0 or 255 (cv2.BORDER_CONSTANT)
Mirroring the pixels along the edge into the external area; this is what OpenCV filtering functions such as cv2.filter2D use by default (cv2.BORDER_REFLECT_101)
Creating a pattern of pixels around the image

The choice of these will depend on the task at hand. In common cases, padding will be able to generate satisfactory results. The values in the kernel matter most, as changing them changes the output significantly. We will first see simple kernel-based filters and also see their effects on the output when changing the size.

Box filtering

This filter averages the pixel values under the kernel. The kernel matrix is a normalized matrix of ones; for example, for a 3 x 3 kernel:

$$K = \frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

Applying this filter results in blurring the image. The results are shown as follows: In frequency domain analysis of the image, this filter is a low pass filter. The frequency domain analysis is done using Fourier transformation of the image, which is beyond the scope of this introduction. As we increase the size of the kernel, the resulting image gets more and more blurred: This is due to the averaging out of peak values in the small neighbourhood where the kernel is applied. The result of applying a kernel of size 20x20 can be seen in the following image. However, if we use a very small filter of size (3,3), there is a negligible effect on the output, because the kernel size is quite small compared to the photo size.
In most applications, kernel size is heuristically set according to image size: The complete code to generate box filtered photos is as follows: def plot_cv_img(input_image, output_image): """ Converts an image from BGR to RGB and plots """ fig, ax = plt.subplots(nrows=1, ncols=2) ax[0].imshow(cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)) ax[0].set_title('Input Image') ax[0].axis('off') ax[1].imshow(cv2.cvtColor(output_image, cv2.COLOR_BGR2RGB)) ax[1].set_title('Box Filter (5,5)') ax[1].axis('off') plt.show() def main(): # read an image img = cv2.imread('../figures/flower.png') # To try different kernel, change size here. kernel_size = (5,5) # opencv has implementation for kernel based box blurring blur = cv2.blur(img,kernel_size) # Do plot plot_cv_img(img, blur) if __name__ == '__main__': main() Properties of linear filters Several computer vision applications are composed of step by step transformations of an input photo to output. This is easily done due to several properties associated with a common type of filters, that is, linear filters: The linear filters are commutative such that we can perform multiplication operations on filters in any order and the result still remains the same: a * b = b * a They are associative in nature, which means the order of applying the filter does not affect the outcome: (a * b) * c = a * (b * c) Even in cases of summing two filters, we can perform the first summation and then apply the filter, or we can also individually apply the filter and then sum the results. The overall outcome still remains the same: Applying a scaling factor to one filter and multiplying to another filter is equivalent to first multiplying both filters and then applying scaling factor These properties play a significant role in other computer vision tasks such as object detection and segmentation. A suitable combination of these filters enhances the quality of information extraction and as a result, improves the accuracy. 
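The properties listed above can be checked numerically. The following is a minimal sketch that uses 1D convolution with np.convolve as a stand-in for image filtering (the same identities hold in 2D); the filters a, b, c and the signal are arbitrary values chosen for illustration.

```python
import numpy as np

# arbitrary example filters and signal (illustrative values only)
a = np.array([1.0, 2.0, 1.0])
b = np.array([1.0, 0.0, -1.0])
c = np.array([0.5, 0.5])
signal = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

# commutative: a * b == b * a
commutative = np.allclose(np.convolve(a, b), np.convolve(b, a))

# associative: (a * b) * c == a * (b * c)
associative = np.allclose(
    np.convolve(np.convolve(a, b), c),
    np.convolve(a, np.convolve(b, c)),
)

# distributive over addition: signal * (a + b) == signal * a + signal * b
distributive = np.allclose(
    np.convolve(signal, a + b),
    np.convolve(signal, a) + np.convolve(signal, b),
)

# scaling: (2a) * b == 2 (a * b)
scaling = np.allclose(np.convolve(2.0 * a, b), 2.0 * np.convolve(a, b))

print(commutative, associative, distributive, scaling)
```

In practice, associativity is what allows two small kernels to be combined into one before touching the image, so the image only has to be filtered once.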
Non-linear image filtering

While in many cases linear filters are sufficient to get the required results, in several other use cases performance can be significantly increased by using non-linear image filtering. Non-linear image filtering is more complex than linear filtering. This complexity can, however, give you more control and better results in your computer vision tasks. Let's take a look at how non-linear image filtering works when applied to different images.

Smoothing a photo

Applying a box filter with hard edges doesn't result in a smooth blur on the output photo. To improve this, the filter can be made smoother around the edges. One popular filter of this kind is the Gaussian filter. It gives the most weight to the center pixel and gradually reduces the weight as pixels get farther from the center (it is a weighted average, so it is still a linear filter). Mathematically, a Gaussian function is given as:

$$g(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where μ is the mean and σ² is the variance (σ is the standard deviation). An example kernel matrix for this kind of filter in 2D discrete domain is given as follows: This 2D array is used in normalized form. The effect of this filter also depends on its width; changing the kernel width has varying effects on the output, as discussed in a later section.
Applying gaussian kernel as filter removes high-frequency components which results in removing strong edges and hence a blurred photo: While this filter performs better blurring than a box filter, the implementation is also quite simple with OpenCV: def plot_cv_img(input_image, output_image): """ Converts an image from BGR to RGB and plots """ fig, ax = plt.subplots(nrows=1, ncols=2) ax[0].imshow(cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)) ax[0].set_title('Input Image') ax[0].axis('off') ax[1].imshow(cv2.cvtColor(output_image, cv2.COLOR_BGR2RGB)) ax[1].set_title('Gaussian Blurred') ax[1].axis('off') plt.show() def main(): # read an image img = cv2.imread('../figures/flower.png') # apply gaussian blur, # kernel of size 5x5, # change here for other sizes kernel_size = (5,5) # sigma values are same in both direction blur = cv2.GaussianBlur(img,(5,5),0) plot_cv_img(img, blur) if __name__ == '__main__': main() The histogram equalization technique The basic point operations, to change the brightness and contrast, help in improving photo quality but require manual tuning. Using histogram equalization technique, these can be found algorithmically and create a better-looking photo. Intuitively, this method tries to set the brightest pixels to white and the darker pixels to black. The remaining pixel values are similarly rescaled. This rescaling is performed by transforming original intensity distribution to capture all intensity distribution. An example of this equalization is as following: The preceding image is an example of histogram equalization. On the right is the output and, as you can see, the contrast is increased significantly. The input histogram is shown in the bottom figure on the left and it can be observed that not all the colors are observed in the image. After applying equalization, resulting histogram plot is as shown on the right bottom figure. 
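The rescaling step can be sketched with a cumulative histogram. This is the textbook formulation, written as a hypothetical helper equalize_hist; cv2.equalizeHist follows the same idea, but its rounding details may differ, so the two are not guaranteed to match pixel for pixel.

```python
import numpy as np

def equalize_hist(gray):
    """Histogram equalization of a uint8 grayscale image using the
    cumulative distribution of its intensities (textbook sketch)."""
    hist = np.bincount(gray.ravel(), minlength=256)  # pixels per intensity
    cdf = hist.cumsum()                              # cumulative counts
    cdf_min = cdf[cdf > 0][0]                        # count at darkest used level
    total = gray.size
    if total == cdf_min:
        # constant image: nothing to stretch
        return gray.copy()
    # map the darkest used level to 0 and the brightest to 255
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]

# hypothetical low-contrast image: all values squeezed into 100..103
low_contrast = np.array([[100, 101],
                         [102, 103]], dtype=np.uint8)
equalized = equalize_hist(low_contrast)
print(equalized)
```

The four nearly identical input levels are spread over the full 0 to 255 range, which is the contrast increase seen in the equalized photo.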
To visualize the results of equalization in the image, the input and results are stacked together in the following figure. Code for the preceding photos is as follows:

def plot_gray(input_image, output_image):
    """
    Plots the grayscale input and output images side by side
    """
    fig, ax = plt.subplots(nrows=1, ncols=2)
    ax[0].imshow(input_image, cmap='gray')
    ax[0].set_title('Input Image')
    ax[0].axis('off')
    ax[1].imshow(output_image, cmap='gray')
    ax[1].set_title('Histogram Equalized ')
    ax[1].axis('off')
    plt.savefig('../figures/03_histogram_equalized.png')
    plt.show()

def main():
    # read an image
    img = cv2.imread('../figures/flower.png')
    # grayscale image is used for equalization
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # following function performs equalization on input image
    equ = cv2.equalizeHist(gray)
    # for visualizing input and output side by side
    plot_gray(gray, equ)

if __name__ == '__main__':
    main()

Median image filtering

Median image filtering is a similar technique to neighborhood filtering. The key difference is the use of a median value, which makes the filter non-linear. It is quite useful in removing sharp noise such as salt and pepper. Instead of using a product or sum of neighborhood pixel values, this filter computes the median value of the region. This results in the removal of random peak values in the region, which can be due to noise such as salt and pepper noise. This is further shown in the following figure, where different kernel sizes are used to create the output.
For this image, channel-wise random noise is first added to the input as follows:

# read the image
flower = cv2.imread('../figures/flower.png')
# initialize noise image with zeros
noise = np.zeros(flower.shape[:2])
# fill the image with random numbers in given range
cv2.randu(noise, 0, 256)
# add noise to existing image, apply channel wise
noise_factor = 0.1
noisy_flower = np.zeros(flower.shape)
for i in range(flower.shape[2]):
    # the built-in int replaces np.int, which recent NumPy releases no longer provide
    noisy_flower[:,:,i] = flower[:,:,i] + np.array(noise_factor*noise, dtype=int)
# clip to the valid range, then convert data type for use
noisy_flower = np.asarray(np.clip(noisy_flower, 0, 255), dtype=np.uint8)

The created noisy image is used for median image filtering as follows:

# apply median filter of kernel size 5
kernel_5 = 5
median_5 = cv2.medianBlur(noisy_flower, kernel_5)
# apply median filter of kernel size 3
kernel_3 = 3
median_3 = cv2.medianBlur(noisy_flower, kernel_3)

In the following photo, you can see the result of varying the kernel size (indicated in brackets). The rightmost photo is the smoothest of them all: A common application for median blur is in smartphone applications that filter the input image and add artifacts for artistic effect.
The code to generate the preceding photograph is as follows:

def plot_cv_img(input_image, output_image1, output_image2, output_image3):
    """
    Converts an image from BGR to RGB and plots
    """
    fig, ax = plt.subplots(nrows=1, ncols=4)
    ax[0].imshow(cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB))
    ax[0].set_title('Input Image')
    ax[0].axis('off')
    ax[1].imshow(cv2.cvtColor(output_image1, cv2.COLOR_BGR2RGB))
    ax[1].set_title('Median Filter (3,3)')
    ax[1].axis('off')
    ax[2].imshow(cv2.cvtColor(output_image2, cv2.COLOR_BGR2RGB))
    ax[2].set_title('Median Filter (5,5)')
    ax[2].axis('off')
    ax[3].imshow(cv2.cvtColor(output_image3, cv2.COLOR_BGR2RGB))
    ax[3].set_title('Median Filter (7,7)')
    ax[3].axis('off')
    plt.show()

def main():
    # read an image
    img = cv2.imread('../figures/flower.png')
    # compute median filtered image varying kernel size
    median1 = cv2.medianBlur(img, 3)
    median2 = cv2.medianBlur(img, 5)
    median3 = cv2.medianBlur(img, 7)
    # Do plot
    plot_cv_img(img, median1, median2, median3)

if __name__ == '__main__':
    main()

Image filtering and image gradients

Image gradients act as detectors of edges, or sharp changes, in a photograph. They are widely used in object detection and segmentation tasks. In this section, we will look at how to compute image gradients. First, the image derivative is computed by applying a kernel matrix which measures the change in a given direction. The Sobel filter is one such filter, and its kernel in the x-direction is given as follows: Here, in the y-direction: This is applied in a similar fashion to the linear box filter, by computing values on a kernel superimposed on the photo. The filter is then shifted along the image to compute all values. Following are some example results, where X and Y denote the direction of the Sobel kernel: This is also termed an image derivative with respect to the given direction (here X or Y). In the resulting photographs (middle and right), lighter regions denote positive gradients, darker regions denote negative gradients, and gray is zero.
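To make the gradient computation concrete, here is a minimal NumPy sketch. It assumes the widely used 3 x 3 Sobel kernels, which may not be the exact matrices the original figures showed, and correlate_valid is a hypothetical helper written for this example rather than an OpenCV function.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# common 3x3 Sobel kernels (assumed here for illustration)
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
sobel_y = sobel_x.T  # the y-direction kernel is the transpose

def correlate_valid(image, kernel):
    """Superimpose the kernel at every position where it fits inside
    the image, multiply element-wise and sum the products."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum('ijkl,kl->ij', windows, kernel)

# toy image with a vertical edge: dark left half, bright right half
step = np.zeros((5, 6))
step[:, 3:] = 1.0

grad_x = correlate_valid(step, sobel_x)  # responds to the vertical edge
grad_y = correlate_valid(step, sobel_y)  # no change along y, so all zeros
print(grad_x)
print(grad_y)
```

The x-direction output is non-zero only in the columns where the kernel straddles the edge, while the y-direction output stays at zero because the toy image does not change from row to row.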
While Sobel filters correspond to first order derivatives of a photo, the Laplacian filter gives a second-order derivative of a photo. The Laplacian filter is also applied in a similar way to Sobel: The code to get Sobel and Laplacian filters is as follows: # sobel x_sobel = cv2.Sobel(img,cv2.CV_64F,1,0,ksize=5) y_sobel = cv2.Sobel(img,cv2.CV_64F,0,1,ksize=5) # laplacian lapl = cv2.Laplacian(img,cv2.CV_64F, ksize=5) # gaussian blur blur = cv2.GaussianBlur(img,(5,5),0) # laplacian of gaussian log = cv2.Laplacian(blur,cv2.CV_64F, ksize=5) We learnt about types of filters and how to perform image filtering in OpenCV. To know more about image transformation and 3D computer vision check out this book Practical Computer Vision. Check out for more: Fingerprint detection using OpenCV 3 3 ways to deploy a QT and OpenCV application OpenCV 4.0 is on schedule for July release  

Vijin Boricha
11 Apr 2018
9 min read

Understanding SQL Server recovery models to effectively backup and restore your database

Before you even think about your backups, you need to understand the SQL Server recovery model that is internally used while the database is in operational mode. In this regard, today we will learn about SQL Server recovery models, backup and restore. SQL Server recovery models A recovery model is about maintaining data in the event of a server failure. Also, it defines the amount of information that SQL Server writes to the log file for the purpose of recovery. SQL Server has three database recovery models: Simple recovery model Full recovery model Bulk-logged recovery model Simple recovery model This model is typically used for small databases and scenarios where data changes are infrequent. It is limited to restoring the database to the point when the last backup was created. It means that all changes made after the backup are lost. You will need to recreate all changes manually. The major benefit of this model is that the log file takes only a small amount of storage space. How and when to use it depends on the business scenario. Full recovery model This model is recommended when recovery from damaged storage is the highest priority and data loss should be minimal. SQL Server uses copies of database and log files to restore the database. The database engine logs all changes to the database, including bulk operation and most DDL statements. If the transaction log file is not damaged, SQL Server can recover all data except any transaction which are in process at the time of failure (that is, not committed in to the database file). All logged transactions give you the opportunity of point-in-time recovery, which is a really cool feature. A major limitation of this model is the large size of the log files which leads you to performance and storage issues. Use it only in scenarios where every insert is important and loss of data is not an option. Bulk-logged recovery model This model is somewhere between simple and full. 
It uses database and log backups to recreate the database. Compared to the full recovery model, it uses less log space for CREATE INDEX and bulk load operations, such as SELECT INTO. Let's look at this example. SELECT INTO can load a table with 1,000,000 records with a single statement. The log will only record the occurrence of these operations but not the details. This approach uses less storage space compared to the full recovery model. The bulk-logged recovery model is good for databases which are used for ETL processes and data migrations.

SQL Server has a system database called model. This database is the template for each new one you create. If you use only the CREATE DATABASE statement without any additional parameters, it simply copies the model database with all the properties and metadata. The new database also inherits the default recovery model, which is full. So, the conclusion is that each new database will be in full recovery mode. This can be changed during and after the creation process.

Here is a SQL statement to check the recovery models of all your databases on a SQL Server on Linux instance:

1> SELECT name, recovery_model, recovery_model_desc
2> FROM sys.databases
3> GO
name                     recovery_model recovery_model_desc
------------------------ -------------- -------------------
master                   3              SIMPLE
tempdb                   3              SIMPLE
model                    1              FULL
msdb                     3              SIMPLE
AdventureWorks           3              SIMPLE
WideWorldImporters       3              SIMPLE
(6 rows affected)

The following DDL statement will change the recovery model for the model database from full to simple:

1> USE master
2> ALTER DATABASE model
3> SET RECOVERY SIMPLE
4> GO

If you now execute the SELECT statement again to check recovery models, you will notice that model now has different properties.

Backup and restore

Now it's time for SQL coding and implementing backup/restore operations in our own environments.
First let's create a full database backup of our University database:

    BACKUP DATABASE University
    TO DISK = '/var/opt/mssql/data/University.bak'
    GO

    Processed 376 pages for database 'University', file 'University' on file 1.
    Processed 7 pages for database 'University', file 'University_log' on file 1.
    BACKUP DATABASE successfully processed 383 pages in 0.562 seconds (5.324 MB/sec).

2. Now let's check the content of the table Students:

    USE University
    GO

    Changed database context to 'University'

    SELECT LastName, FirstName
    FROM Students
    GO

    LastName        FirstName
    --------------- ----------
    Azemovic        Imran
    Avdic           Selver
    Azemovic        Sara
    Doe             John

    (4 rows affected)

3. As you can see, there are four records. Let's now simulate a large import from the Person.Person table of the AdventureWorks database. We will import only rows whose values fit our column sizes (for example, PhoneNumber values shorter than 13 characters). But first we will drop the unique index UQ_user_name so that we can quickly import a large amount of data:

    DROP INDEX UQ_user_name
    ON dbo.Students
    GO

    INSERT INTO Students (LastName, FirstName, Email, Phone, UserName)
    SELECT T1.LastName, T1.FirstName, T2.PhoneNumber, NULL, 'user.name'
    FROM AdventureWorks.Person.Person AS T1
    INNER JOIN AdventureWorks.Person.PersonPhone AS T2
    ON T1.BusinessEntityID = T2.BusinessEntityID
    WHERE LEN (T2.PhoneNumber) < 13
    AND LEN (T1.LastName) < 15 AND LEN (T1.FirstName) < 10
    GO

    (10661 rows affected)

4. Let's check the new row count:

    SELECT COUNT (*) FROM Students
    GO

    -----------
          10665

    (1 rows affected)

Note: As you can see, the table now has 10,665 rows (10,661 + 4). But don't forget that we created a full database backup before the import procedure.

5. Now, we will create a differential backup of the University database:

    BACKUP DATABASE University
    TO DISK = '/var/opt/mssql/data/University-diff.bak'
    WITH DIFFERENTIAL
    GO

    Processed 216 pages for database 'University', file 'University' on file 1.
    Processed 3 pages for database 'University', file 'University_log' on file 1.
    BACKUP DATABASE WITH DIFFERENTIAL successfully processed 219 pages in 0.365 seconds (4.676 MB/sec).

6. If you want to see the state of the .bak files on the disk, first enter superuser mode with sudo su, because a regular user does not have access to the data folder, and then list the contents of that folder.

7. Now let's test the transaction log backup of the University database log file. However, first you will need to make some changes inside the Students table:

    UPDATE Students
    SET Phone = 'N/A'
    WHERE Phone IS NULL
    GO

    BACKUP LOG University
    TO DISK = '/var/opt/mssql/data/University-log.bak'
    GO

    Processed 501 pages for database 'University', file 'University_log' on file 1.
    BACKUP LOG successfully processed 501 pages in 0.620 seconds (6.313 MB/sec).

Note: The next steps test the restore options for full and differential backups.

8. First, restore the full database backup of the University database. Remember that the Students table had four records before the first backup, and it currently has 10,665 (as we checked in step 4):

    ALTER DATABASE University
    SET SINGLE_USER WITH ROLLBACK IMMEDIATE
    RESTORE DATABASE University
    FROM DISK = '/var/opt/mssql/data/University.bak'
    WITH REPLACE
    ALTER DATABASE University SET MULTI_USER
    GO

    Nonqualified transactions are being rolled back. Estimated rollback completion: 0%.
    Nonqualified transactions are being rolled back. Estimated rollback completion: 100%.
    Processed 376 pages for database 'University', file 'University' on file 1.
    Processed 7 pages for database 'University', file 'University_log' on file 1.
    RESTORE DATABASE successfully processed 383 pages in 0.520 seconds (5.754 MB/sec).

Note: Before the restore procedure, the database is switched to single-user mode. This way we close all connections that could abort the restore procedure. In the last step, we switch the database back to multi-user mode.

9. Let's check the number of rows again. You will see the database is restored to its initial state, before the import of more than 10,000 records from the AdventureWorks database:

    SELECT COUNT (*) FROM Students
    GO

    -------
          4

    (1 rows affected)

10. Now it's time to restore the content of the differential backup and return the University database to its state after the import procedure:

    USE master
    ALTER DATABASE University
    SET SINGLE_USER WITH ROLLBACK IMMEDIATE
    RESTORE DATABASE University
    FROM DISK = N'/var/opt/mssql/data/University.bak'
    WITH FILE = 1, NORECOVERY, NOUNLOAD, REPLACE, STATS = 5
    RESTORE DATABASE University
    FROM DISK = N'/var/opt/mssql/data/University-diff.bak'
    WITH FILE = 1, NOUNLOAD, STATS = 5
    ALTER DATABASE University SET MULTI_USER
    GO

    Processed 376 pages for database 'University', file 'University' on file 1.
    Processed 7 pages for database 'University', file 'University_log' on file 1.
    RESTORE DATABASE successfully processed 383 pages in 0.529 seconds (5.656 MB/sec).
    Processed 216 pages for database 'University', file 'University' on file 1.
    Processed 3 pages for database 'University', file 'University_log' on file 1.
    RESTORE DATABASE successfully processed 219 pages in 0.309 seconds (5.524 MB/sec).

We'll look at a really cool feature of SQL Server: backup compression. A backup can be a very large file, and if companies create backups on a daily basis, then you can do the math on the amount of storage required. Disk space is cheap today, but it is not free. As a database administrator on SQL Server on Linux, you should consider any possible option to optimize and save money. Backup compression is just that kind of feature. It compresses the backup as it is written, so you do not need a separate compression step (ZIP, RAR) after creating regular backups. So, you save time, space, and money.

Let's consider a full database backup of the University database. The uncompressed file is about 3 MB. After we create a new one with compression, the size should be reduced. The compression ratio mostly depends on the data types inside the database. It is not a magic wand, but it can save space. The following SQL command will create a full database backup of the University database and compress it:

    BACKUP DATABASE University
    TO DISK = '/var/opt/mssql/data/University-compress.bak'
    WITH NOFORMAT, INIT, SKIP, NOREWIND, NOUNLOAD, COMPRESSION, STATS = 10
    GO

Now exit to bash, enter superuser mode, and type the following ls command to compare the size of the backup files:

    tumbleweed:/home/dba # ls -lh /var/opt/mssql/data/U*.bak

As you can see, the compressed backup is 676 KB, roughly four and a half times smaller than the uncompressed file of about 3 MB. That is a huge space saving without any additional tools.

SQL Server on Linux has one security feature with backup.

We learned about the SQL Server recovery model and how to efficiently back up and restore our database. You can learn more about transaction logs and the elements of a backup strategy in the book SQL Server on Linux.

Also, check out:
Getting to know SQL Server options for disaster recovery
Get SQL Server user management right
How to integrate SharePoint with SQL Server Reporting Services
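The order of the restores in step 10 of the recipe above matters: the full backup is restored WITH NORECOVERY so that the differential backup can still be applied on top of it. The following is a minimal Python sketch of that ordering, not a definitive implementation. The helper name build_restore_script is hypothetical (it is not part of SQL Server or any library), it only builds the T-SQL text, and it writes RECOVERY explicitly on the last restore, where the recipe relies on it being the default.

```python
def build_restore_script(database, full_backup, differential_backup=None):
    """Return T-SQL statements that restore a full backup and, optionally,
    a differential backup taken after it (hypothetical helper)."""
    backups = [full_backup]
    if differential_backup:
        backups.append(differential_backup)

    statements = [
        "USE master;",
        # Close other connections so they cannot block the restore.
        f"ALTER DATABASE [{database}] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;",
    ]
    for position, path in enumerate(backups):
        is_last = position == len(backups) - 1
        # Every restore except the last one must leave the database restoring.
        options = ["RECOVERY" if is_last else "NORECOVERY"]
        if position == 0:
            options.append("REPLACE")  # overwrite the existing database
        option_list = ", ".join(options)
        statements.append(
            f"RESTORE DATABASE [{database}] FROM DISK = N'{path}' WITH {option_list};"
        )
    statements.append(f"ALTER DATABASE [{database}] SET MULTI_USER;")
    return statements


if __name__ == "__main__":
    for line in build_restore_script(
        "University",
        "/var/opt/mssql/data/University.bak",
        "/var/opt/mssql/data/University-diff.bak",
    ):
        print(line)
```

The printed statements can be pasted into a sqlcmd session, each batch followed by GO. Passing only the full backup path yields a single restore that ends WITH RECOVERY, REPLACE, which corresponds to step 8.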
Richard Gall
11 Apr 2018
2 min read

Recurrent neural networks and the LSTM architecture

Recurrent neural networks (RNNs) are a class of artificial neural network made up of a network-like series of nodes, each with a directed, or one-way, connection to every other node. These nodes can be classified as input, output, or hidden. Input nodes receive data from outside the network, hidden nodes modify the input data, and output nodes provide the intended results. RNNs are well known for their extensive use in NLP tasks.

The video tutorial above has been taken from Natural Language Processing with Python.

Why are recurrent neural networks well suited for NLP?

What makes RNNs so popular and effective for natural language processing tasks is that they operate sequentially over data sets. For example, a movie review is an arbitrary sequence of letters and characters, which the RNN can take as an input. The subsequent hidden and output layers are also capable of working with sequences. In a basic sentiment analysis example, you might just have a binary output - like classifying movie reviews as positive or negative. RNNs can do more than this - they are capable of generating a sequential output, such as taking an input sentence in English and translating it into Spanish. This ability to process data sequentially is what makes recurrent neural networks so well suited for NLP tasks.

RNNs and long short-term memory

Recurrent neural networks can sometimes become unstable due to the complexity of the connections they are built upon. That's where the LSTM architecture helps. LSTM introduces something called a memory cell. The memory cell simplifies what could be incredibly complex by using a series of different gates to govern the way it changes within the network:

The input gate manages inputs
The output gate manages outputs
The self-recurrent connection keeps the memory cell in a consistent state between different steps
The forget gate simply allows the memory cell to 'forget' its previous state

Read Next

High-level concepts
Neural Network Architectures 101: Understanding Perceptrons
4 ways to enable Continual learning into Neural Networks

Tutorials
Build a generative chatbot using recurrent neural networks (LSTM RNNs)
Training RNNs for Time Series Forecasting
Implement Long-short Term Memory (LSTM) with TensorFlow
How to auto-generate texts from Shakespeare writing using deep recurrent neural networks

Research in this area
Paper in Two minutes: Attention Is All You Need
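The gates described in the LSTM section above can be made concrete with a short sketch. This is a minimal illustration that assumes the commonly used LSTM formulation, reduced to scalar values so it runs with the Python standard library alone. The function name lstm_step and the layout of the weights dictionary are assumptions made for this example, not code from the video or the book.

```python
import math


def sigmoid(x):
    """Squash a value into the range (0, 1) so it can act as a gate."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """Run one time step of a scalar LSTM cell (illustrative layout).

    weights maps "input", "forget", "output" and "candidate" to a tuple
    (weight on x, weight on h_prev, bias).
    """
    def pre_activation(name):
        w_x, w_h, bias = weights[name]
        return w_x * x + w_h * h_prev + bias

    input_gate = sigmoid(pre_activation("input"))     # how much new input to admit
    forget_gate = sigmoid(pre_activation("forget"))   # how much old state to keep
    output_gate = sigmoid(pre_activation("output"))   # how much state to expose
    candidate = math.tanh(pre_activation("candidate"))

    # Self-recurrent connection: the previous cell state is carried forward,
    # scaled by the forget gate, and the gated candidate is added to it.
    c = forget_gate * c_prev + input_gate * candidate
    h = output_gate * math.tanh(c)
    return h, c


if __name__ == "__main__":
    # Hypothetical weights: forget gate wide open, input gate almost closed,
    # so the memory cell keeps its previous state of 0.5.
    weights = {
        "input": (0.0, 0.0, -20.0),
        "forget": (0.0, 0.0, 20.0),
        "output": (0.0, 0.0, 20.0),
        "candidate": (1.0, 0.0, 0.0),
    }
    print(lstm_step(1.0, 0.0, 0.5, weights))
```

With the forget gate open and the input gate closed, the cell state passes through almost unchanged. Swapping the two biases makes the cell drop its previous state and take on the new candidate value instead.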

Richard Gall
11 Apr 2018
8 min read

Mark Zuckerberg's Congressional testimony: 5 things we learned

Mark Zuckerberg testified in front of Congress yesterday (April 10, 2018). That's a pretty big deal. Congress has been waiting some time for the chance to grill the Facebook chief, with "Zuck" resisting. So the fact that he finally had his day in D.C. indicates the level of pressure currently on him. Some have lamented the fact that senators were given so little time to respond to Zuckerberg - there was no time to really get deep into the issues at hand. However, although it's true that there was a lot that was superficial about the event, if you looked closely, there was plenty to take away from it. Here are 5 of the most important things we learned from Mark Zuckerberg's testimony in front of Congress.

Policy makers don't really understand that much about tech

The most shocking thing to come out of Zuckerberg's testimony was also unsurprising: some of the most powerful people in the U.S. don't really understand the technology that's being discussed. More importantly, this is technology they're going to have to make decisions on. One senator brought printouts of Facebook pages and asked Zuckerberg if these were examples of Russian propaganda groups. Another was confused about Facebook's business model - how could it run a free service and still make money? Those are just two pretty funny examples, but the senators' lack of understanding could be forgiven due to their age. However, there surely isn't any excuse for 45-year-old Senator Brian Schatz to misunderstand the relationship between WhatsApp and Facebook.

https://twitter.com/pdmcleod/status/983809717116993537

Chris Cillizza argued on CNN that "the senate's tech illiteracy saved Zuckerberg". He explained:

"The problem was that once Zuckerberg responded - and he largely stuck to a very strict script in doing so - the lack of tech knowledge among those asking him questions was exposed. The result? Zuckerberg was rarely pressed, rarely forced off his talking points, almost never made to answer for the very real questions his platform faces."

This lack of knowledge led to proceedings being less than satisfactory for onlookers. Until this knowledge gap is tackled, it's always going to be a challenge for political institutions to keep up with technological innovators. Ultimately, that's what makes regulation hard.

Zuckerberg is still held up as the gatekeeper of tech in 2018

Zuckerberg is still held up as a gatekeeper or oracle of modern technology. That is probably a consequence of the point above. Because there's such a knowledge gap within the institutions that govern and regulate, it's more manageable for them to look to a figurehead. That, of course, goes both ways - on the one hand, Zuckerberg is a fountain of knowledge, someone who can solve these problems. On the other hand, he is part of a Silicon Valley axis of evil, nefariously plotting the downfall of democracy and how to read your WhatsApp messages. Most people know that neither is true.

The key point, though, is that however you feel about Zuckerberg, he is not the man you need to ask about regulation. This is something that Zephyr Teachout argues in the Guardian. "We shouldn’t be begging for Facebook’s endorsement of laws, or for Mark Zuckerberg’s promises of self-regulation," she writes.

In fact, one of the interesting subplots of the hearing was the fact that Zuckerberg didn't actually know that much. For example, a lot has been made of how extensive his notes were. And yes, you certainly would expect someone facing a panel of senators in Washington to be well-briefed. But it nevertheless underlines an important point - Facebook is a complex and multi-faceted organization that far exceeds the knowledge of its founder and CEO. In turn, this tells you something about technology that's often lost within the discourse: it's hard to consider what's happening at a superficial or abstract level without completely missing the point.

There's a lot you could say about Zuckerberg's notes. One of the most interesting was the point around GDPR. The note is very prescriptive: it says "Don't say we already do what GDPR requires." Many have noted that this throws up a lot of issues, not least how Facebook plans to tackle GDPR in just over a month if it hasn't moved on it already. But it's the suggestion that Zuckerberg was completely unaware of the situation that is most remarkable here. He doesn't even know where his company is on one of the most important pieces of data legislation for decades.

Facebook is incredibly naive

If senators were often naive - or plain ignorant - on matters of technology during the hearing, there was plenty of evidence to indicate that Zuckerberg is just as naive. The GDPR issue mentioned above is just one example. But there are other problems too. You can't, for example, get much more naive than thinking that Cambridge Analytica had deleted the data that Facebook had passed to it. Zuckerberg's initial explanation was that he didn't realize that Cambridge Analytica was "not an app developer or advertiser", but he corrected this, saying that his team told him they were an advertiser back in 2015, which meant they did have reason to act on it but chose not to. Zuckerberg apologized for this mistake, but it's really difficult to see how this would happen.

There almost appears to be a culture of naivety within Facebook, whereby the organization generally, and Zuckerberg specifically, don't fully understand the nature of the platform it has built and what it could be used for. It's only now, with Zuckerberg talking about an "arms race" with Russia, that this naivety is disappearing. But it's clear there was an organizational blind spot that has got us to where we are today.

Facebook still thinks AI can solve all of its problems

The fact that Facebook believes AI is the solution to so many of its problems is indicative of this ingrained naivety. When talking to Congress about the 'arms race' with Russian intelligence, and the wider problem of hate speech, Zuckerberg signaled that the solution lies in the continued development of better AI systems. However, he conceded that building systems actually capable of detecting such speech could be 5 to 10 years away. This is a problem. It's proving a real challenge for Facebook to keep up with the 'misuse' of its platform. Foreign Policy reports that:

"...just last week, the company took down another 70 Facebook accounts, 138 Facebook pages, and 65 Instagram accounts controlled by Russia’s Internet Research Agency, a baker’s dozen of whose executives and operatives have been indicted by Special Counsel Robert Mueller for their role in Russia’s campaign to propel Trump into the White House."

However, the more AI comes to be deployed on Facebook, the more the company is going to have to rethink how it describes itself. By using algorithms to regulate the way the platform is used, there comes to be an implicit editorializing of content. That's not necessarily a bad thing, but it does mean we again return to this final problem...

There's still confusion about the difference between a platform and a publisher

Central to every issue that was raised in Zuckerberg's testimony was the fact that Facebook remains confused about whether it is a platform or a publisher. Or, more specifically, the extent to which it is responsible for the content on the platform. It's hard to single out Zuckerberg here because everyone seems to be confused on this point. But it's interesting that he seems to have never really thought about the problem.

That does seem to be changing, however. In his testimony, Zuckerberg said that "Facebook was responsible" for the content on its platforms. This statement marks a big change from the typical line used by every social media platform - that platforms are just platforms and bear no responsibility for what is published on them. However, just when you think Zuckerberg is making a definitive statement, he steps back. He went on to say that "I agree that we are responsible for the content, but we don't produce the content." This statement hints that he still wants to keep the distinction between platform and publisher. Unfortunately for Zuckerberg, that might be too late.

Read Next
OpenAI charter puts safety, standards, and transparency first
‘If tech is building the future, let’s make that future inclusive and representative of all of society’ – An interview with Charlotte Jee
What your organization needs to know about GDPR
20 lessons on bias in machine learning systems by Kate Crawford at NIPS 2017