How-To Tutorials


04 Jan 2017

15 min read

Hyper-V Architecture and Components



Aaron Lazar

16 Aug 2018

11 min read

Task parallel library for easy multi-threading in .NET Core [Tutorial]


Compared to the classic threading model in .NET, the Task Parallel Library (TPL) reduces the complexity of using threads and provides an abstraction through a set of APIs that lets developers focus on the application logic rather than on how threads are provisioned. In this article, we'll learn how TPL improves on traditional threading techniques for concurrency and high performance.

There are several benefits of using TPL over threads:

- It autoscales the concurrency to a multicore level
- It autoscales LINQ queries to a multicore level
- It handles the partitioning of the work and uses ThreadPool where required
- It is easy to use and reduces the complexity of working with threads directly

This tutorial is an extract from the book C# 7 and .NET Core 2.0 High Performance, authored by Ovais Mehboob Ahmed Khan.

Creating a task using TPL

TPL APIs are available in the System.Threading and System.Threading.Tasks namespaces. They are built around the task, which is a program or a block of code that runs asynchronously. An asynchronous task can be run by calling either the Task.Run or the TaskFactory.StartNew method. When we create a task, we provide a named delegate, an anonymous method, or a lambda expression that the task executes.

Here is a code snippet that uses a lambda expression to execute the ExecuteLongRunningTask method using Task.Run:

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class Program
    {
        static void Main(string[] args)
        {
            // Queue the work on the thread pool and block until it finishes.
            Task t = Task.Run(() => ExecuteLongRunningTask(5000));
            t.Wait();
        }

        public static void ExecuteLongRunningTask(int millis)
        {
            Thread.Sleep(millis);
            Console.WriteLine("Hello World");
        }
    }

In the preceding code snippet, we have executed the ExecuteLongRunningTask method asynchronously using the Task.Run method. Task.Run returns a Task object that can be used to wait for the asynchronous code to finish before the program ends. To wait for the task, we have used the Wait method.
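Task.Run can also return a value. The following is a minimal sketch, not taken from the book: SumTo is a hypothetical helper introduced only for illustration. It shows Task.Run producing a Task<int> whose Result property blocks, like Wait, until the value is available.

```csharp
using System;
using System.Threading.Tasks;

public static class TaskRunResultSketch
{
    // Hypothetical CPU-bound helper, used only for illustration.
    public static int SumTo(int n)
    {
        int total = 0;
        for (int i = 1; i <= n; i++)
        {
            total += i;
        }
        return total;
    }

    public static void Main()
    {
        // Task.Run queues the delegate on the thread pool and returns a Task<int>.
        Task<int> t = Task.Run(() => SumTo(100));

        // Result blocks until the task completes and then yields the value.
        Console.WriteLine(t.Result); // prints 5050
    }
}
```

Blocking on Result is acceptable in a console Main method like this one; inside an async method, awaiting the task would normally be preferred.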
Alternatively, we can use the Task.Factory.StartNew method, which is more advanced and provides more options. When calling Task.Factory.StartNew, we can specify a state object, a CancellationToken, TaskCreationOptions, and a TaskScheduler to control how the task is created and scheduled.

TPL uses multiple cores of the CPU out of the box. When work is executed through the TPL API, the scheduler distributes tasks across thread-pool threads and uses multiple processors if they are available. The number of threads to create is decided at runtime by the CLR. A manually created thread, by contrast, runs on one processor at a time, so spreading a job across multiple processors needs a proper manual implementation.

Task-based asynchronous pattern (TAP)

When developing any software, it is always good to follow best practices while designing its architecture. The task-based asynchronous pattern is one of the recommended patterns to use when working with TPL. There are, however, a few things to bear in mind while implementing TAP.

Naming convention

A method that executes asynchronously should have the suffix Async. For example, if the method name starts with ExecuteLongRunningOperation, the resulting name should be ExecuteLongRunningOperationAsync.

Return type

The method signature should return either a System.Threading.Tasks.Task or a System.Threading.Tasks.Task<TResult>. Task is the counterpart of a synchronous method that returns void, whereas Task<TResult> is used when the method returns a value of type TResult.

Parameters

The out and ref parameters are not allowed in the method signature. If multiple values need to be returned, tuples or a custom data structure can be used. The method should always return Task or Task<TResult>, as discussed previously.
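The naming, return-type, and parameter rules above can be sketched together. This is an illustrative example under stated assumptions: TryParsePort and TryParsePortAsync are hypothetical names, not from the book. The synchronous version uses an out parameter, while the TAP-style version carries the Async suffix and returns both values in a tuple wrapped in a Task.

```csharp
using System;
using System.Threading.Tasks;

public static class TupleReturnSketch
{
    // Synchronous shape with an out parameter; this signature is not allowed for TAP.
    public static bool TryParsePort(string text, out int port)
    {
        return int.TryParse(text, out port);
    }

    // TAP-style counterpart: Async suffix, Task<TResult> return type, tuple result.
    public static Task<Tuple<bool, int>> TryParsePortAsync(string text)
    {
        return Task.Run(() =>
        {
            int port;
            bool ok = int.TryParse(text, out port);
            return Tuple.Create(ok, port);
        });
    }

    public static void Main()
    {
        Tuple<bool, int> result = TryParsePortAsync("8080").Result;
        Console.WriteLine($"{result.Item1} {result.Item2}"); // prints "True 8080"
    }
}
```

Parsing a string is far too cheap to justify a task in practice; it stands in here for a longer operation so that the signature change is the only thing to look at.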
Here are a few signatures for both synchronous and asynchronous methods:

    Synchronous method                         Asynchronous method
    void Execute();                            Task ExecuteAsync();
    List<string> GetCountries();               Task<List<string>> GetCountriesAsync();
    Tuple<int, string> GetState(int stateID);  Task<Tuple<int, string>> GetStateAsync(int stateID);
    Person GetPerson(int personID);            Task<Person> GetPersonAsync(int personID);

Exceptions

An asynchronous method should assign the exceptions raised by its operation to the returned task. Usage errors, however, such as passing null arguments to the asynchronous method, should be handled properly.

Let's suppose we want to generate several documents dynamically from a predefined list of templates, where each template has its placeholders populated with dynamic values and is written to the filesystem. We assume that generating a document for each template takes a significant amount of time. Here is a code snippet showing how the exceptions can be handled:

    // Template and GetTemplates are assumed to be defined elsewhere.
    static void Main(string[] args)
    {
        List<Template> templates = GetTemplates();
        IEnumerable<Task> asyncDocs = from template in templates
                                      select GenerateDocumentAsync(template);
        try
        {
            // WaitAll rethrows task failures wrapped in an AggregateException.
            Task.WaitAll(asyncDocs.ToArray());
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex);
        }
        Console.Read();
    }

    private static async Task<int> GenerateDocumentAsync(Template template)
    {
        // Simulate a long-running operation
        Thread.Sleep(3000);
        // Throw an exception intentionally
        throw new Exception();
    }

In the preceding code, we have a GenerateDocumentAsync method that stands in for a long-running operation, such as reading the template from the database, populating placeholders, and writing a document to the filesystem. To simulate this process, we used Thread.Sleep to pause the thread for three seconds and then throw an exception that is propagated to the calling method. The Main method iterates over the templates list and calls the GenerateDocumentAsync method for each template. Each GenerateDocumentAsync call returns a task.
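When several tasks fail, Task.WaitAll reports them together in a single AggregateException. As a minimal sketch (FailAsync and CollectFailures are hypothetical names, not from the book), the individual failures can be read from the InnerExceptions collection:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class AggregateExceptionSketch
{
    // Hypothetical stand-in for a document generator that always fails.
    public static async Task<int> FailAsync(string name)
    {
        await Task.Delay(10);
        throw new InvalidOperationException("Failed: " + name);
    }

    // Waits for all tasks and returns the messages of any failures, sorted.
    public static string[] CollectFailures(Task[] tasks)
    {
        try
        {
            Task.WaitAll(tasks);
            return new string[0];
        }
        catch (AggregateException ae)
        {
            // Flatten unwraps nested AggregateExceptions;
            // InnerExceptions then holds one entry per faulted task.
            return ae.Flatten().InnerExceptions
                     .Select(e => e.Message)
                     .OrderBy(m => m)
                     .ToArray();
        }
    }

    public static void Main()
    {
        Task[] tasks = { FailAsync("a"), FailAsync("b") };
        foreach (string message in CollectFailures(tasks))
        {
            Console.WriteLine(message);
        }
    }
}
```

Catching AggregateException rather than Exception makes it explicit that more than one failure may be present.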
When calling an asynchronous method, the exception stays hidden until Wait, WaitAll, WhenAll, or a similar method is called. In the preceding example, the exception is thrown once the Task.WaitAll method is called, and it is then logged to the console.

Task status

The task object provides a TaskStatus that tells us whether the task is currently running its method, has completed it, has encountered a fault, or is in some other state. A task created with the Task constructor initially has the status Created; once it is started or scheduled (for example, through Start or Task.Run), it moves through WaitingToRun to Running. When applying the TAP pattern, every method returns a Task object that must already be activated, whether or not the method body uses Task.Run internally. That means the status should be anything other than Created. TAP thereby assures the consumer that the task is active and does not need to be started.

Task cancellation

Cancellation is optional for TAP-based asynchronous methods. If the method accepts a CancellationToken as a parameter, the caller can use it to cancel the task. When a TAP method supports cancellation, however, it should handle it properly. Here is a basic example showing how cancellation can be implemented:

    static void Main(string[] args)
    {
        // Illustrative inputs; the original snippet assumes path and bytes exist.
        string path = "output.bin";
        byte[] bytes = new byte[] { 1, 2, 3 };

        CancellationTokenSource tokenSource = new CancellationTokenSource();
        CancellationToken token = tokenSource.Token;
        Task.Factory.StartNew(() => SaveFileAsync(path, bytes, token));
    }

    static Task<int> SaveFileAsync(string path, byte[] fileBytes, CancellationToken cancellationToken)
    {
        if (cancellationToken.IsCancellationRequested)
        {
            Console.WriteLine("Cancellation is requested...");
            cancellationToken.ThrowIfCancellationRequested();
        }
        // Do some file save operation
        File.WriteAllBytes(path, fileBytes);
        return Task.FromResult<int>(0);
    }

In the preceding code, we have a SaveFileAsync method that takes the file path, the byte array, and the CancellationToken as parameters.
In the Main method, we initialize the CancellationTokenSource that can be used to cancel the asynchronous operation later in the program. To test the cancellation scenario, we just call the Cancel method of the tokenSource after the Task.Factory.StartNew call, and the operation will be canceled. Moreover, when a task is canceled, its status is set to Canceled and its IsCompleted property is set to true.

Task progress reporting

With TPL, we can use the IProgress<T> interface to get real-time progress notifications from asynchronous operations. This is useful in scenarios where we need to update the user interface or the console output with the progress of asynchronous operations. When defining TAP-based asynchronous methods, accepting an IProgress<T> parameter is optional. We can provide overloads with and without it so that consumers can pick the one that suits their needs. However, such overloads should only be offered if the asynchronous method actually supports progress reporting. Here is the modified version of SaveFileAsync that updates the user about the real progress:

    // path and bytes are assumed to be defined elsewhere.
    static void Main(string[] args)
    {
        var progressHandler = new Progress<string>(value =>
        {
            Console.WriteLine(value);
        });
        var progress = progressHandler as IProgress<string>;

        CancellationTokenSource tokenSource = new CancellationTokenSource();
        CancellationToken token = tokenSource.Token;
        Task.Factory.StartNew(() => SaveFileAsync(path, bytes, token, progress));
        Console.Read();
    }

    static Task<int> SaveFileAsync(string path, byte[] fileBytes, CancellationToken cancellationToken, IProgress<string> progress)
    {
        if (cancellationToken.IsCancellationRequested)
        {
            progress.Report("Cancellation is called");
            Console.WriteLine("Cancellation is requested...");
        }
        progress.Report("Saving File");
        File.WriteAllBytes(path, fileBytes);
        progress.Report("File Saved");
        return Task.FromResult<int>(0);
    }

Implementing TAP using compilers

Any method marked with the async keyword (C#) or Async (Visual Basic) is called an asynchronous method.
The async keyword can be applied to a method, an anonymous method, or a lambda expression, and the compiler then generates the code needed to run it asynchronously. Here is a simple implementation of a TAP method using the compiler approach:

    static void Main(string[] args)
    {
        var t = ExecuteLongRunningOperationAsync(100000);
        Console.WriteLine("Called ExecuteLongRunningOperationAsync method, now waiting for it to complete");
        t.Wait();
        Console.Read();
    }

    public static async Task<int> ExecuteLongRunningOperationAsync(int millis)
    {
        Task t = Task.Factory.StartNew(() => RunLoopAsync(millis));
        // Control returns to the caller here until t completes.
        await t;
        Console.WriteLine("Executed RunLoopAsync method");
        return 0;
    }

    public static void RunLoopAsync(int millis)
    {
        Console.WriteLine("Inside RunLoopAsync method");
        for (int i = 0; i < millis; i++)
        {
            Debug.WriteLine($"Counter = {i}");
        }
        Console.WriteLine("Exiting RunLoopAsync method");
    }

In the preceding code, we have the ExecuteLongRunningOperationAsync method, which is implemented as per the compiler approach. It calls RunLoopAsync, which runs a loop for the number of iterations passed in the millis parameter. The async keyword on the ExecuteLongRunningOperationAsync method tells the compiler that this method can be executed asynchronously. Once the await statement is reached, the method returns to the Main method, which writes the line to the console and waits for the task to complete. Once RunLoopAsync has finished, control comes back to the await and the remaining statements in ExecuteLongRunningOperationAsync are executed.

Implementing TAP with greater control over Task

As we know, TPL is centered on the Task and Task<TResult> objects. We can execute an asynchronous task by calling the Task.Run method to run a delegate or a block of code asynchronously, and then use Wait or other methods on that task.
However, this approach is not always adequate. There are scenarios where asynchronous operations are executed in other ways, for example, through the Event-based Asynchronous Pattern (EAP) or the Asynchronous Programming Model (APM). To apply TAP principles here, and to get the same control over asynchronous operations that run under different models, we can use the TaskCompletionSource<TResult> object.

The TaskCompletionSource<TResult> object is used to create a task that represents an asynchronous operation. When the asynchronous operation completes, we can use the TaskCompletionSource<TResult> object to set the result, exception, or state of the task. Here is a basic example that calls the ExecuteTask method, which returns a Task. ExecuteTask uses the TaskCompletionSource<TResult> object to wrap the response as a Task and runs ExecuteLongRunningTask through the Task.Factory.StartNew method:

    static void Main(string[] args)
    {
        var t = ExecuteTask();
        t.Wait();
        Console.Read();
    }

    public static Task<int> ExecuteTask()
    {
        var tcs = new TaskCompletionSource<int>();
        Task.Factory.StartNew(() =>
        {
            try
            {
                ExecuteLongRunningTask(10000);
                // Complete the wrapped task successfully.
                tcs.SetResult(1);
            }
            catch (Exception ex)
            {
                // Surface the failure through the wrapped task.
                tcs.SetException(ex);
            }
        });
        return tcs.Task;
    }

    public static void ExecuteLongRunningTask(int millis)
    {
        Thread.Sleep(millis);
        Console.WriteLine("Executed");
    }

So now, we've been able to use TPL and TAP in place of traditional threads, thus improving performance. If you liked this article and would like to learn more such techniques, pick up the book C# 7 and .NET Core 2.0 High Performance, authored by Ovais Mehboob Ahmed Khan.
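To make the EAP case concrete, here is a minimal sketch of wrapping an event-raising component in a task with TaskCompletionSource<TResult>. LegacyDownloader and its Completed and Failed events are hypothetical, invented for this illustration; they are not part of .NET or of the book.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical EAP-style component: starts work and raises an event when done.
public class LegacyDownloader
{
    public event Action<string> Completed;
    public event Action<Exception> Failed;

    public void Start(string url)
    {
        ThreadPool.QueueUserWorkItem(_ =>
        {
            try
            {
                if (string.IsNullOrEmpty(url))
                {
                    throw new ArgumentException("url is empty");
                }
                Completed?.Invoke("payload from " + url);
            }
            catch (Exception ex)
            {
                Failed?.Invoke(ex);
            }
        });
    }
}

public static class EapToTapSketch
{
    // TAP wrapper: the events complete or fault the task held by the TaskCompletionSource.
    public static Task<string> DownloadAsync(string url)
    {
        var tcs = new TaskCompletionSource<string>();
        var downloader = new LegacyDownloader();
        downloader.Completed += payload => tcs.TrySetResult(payload);
        downloader.Failed += ex => tcs.TrySetException(ex);
        downloader.Start(url);
        return tcs.Task;
    }

    public static void Main()
    {
        Console.WriteLine(DownloadAsync("http://example.com").Result);
        // prints "payload from http://example.com"
    }
}
```

The sketch uses TrySetResult and TrySetException rather than SetResult and SetException so that a second event cannot throw if the task has already completed.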


Vincy Davis

20 Dec 2019

6 min read

Why choose OpenCV over MATLAB for your next Computer Vision project


Scientific computing relies on executing computer algorithms coded in different programming languages. One such interdisciplinary scientific field is the study of Computer Vision, often abbreviated as CV. Computer Vision is used to develop techniques that can automate tasks like acquiring, processing, analyzing, and understanding digital images. It is also used for extracting high-dimensional data from the real world to produce symbolic information. In simple words, Computer Vision gives computers the ability to see, understand, and process images and videos like humans.

The vast advances in hardware, machine learning tools, and frameworks have resulted in the implementation of Computer Vision in various fields like IoT, manufacturing, healthcare, and security. Major tech firms like Amazon, Google, Microsoft, and Facebook are investing heavily in the research and development of this field.

Out of the many tools and libraries available for Computer Vision nowadays, two major tools, OpenCV and MATLAB, stand out in terms of their speed and efficiency. In this article, we will have a detailed look at both of them.

Further Reading

To learn how to build interesting image recognition models like setting up license plate recognition using OpenCV, read the book “Computer Vision Projects with OpenCV and Python 3” by author Matthew Rever. The book will also guide you to design and develop production-grade Computer Vision projects by tackling real-world problems.

OpenCV: An open-source multiplatform solution tailored for Computer Vision

OpenCV, developed by Intel and now supported by Willow Garage, is released under the BSD 3-Clause license and is free for commercial use. It is one of the most popular computer vision tools and aims to provide a well-optimized, well-tested, open-source, C++-based implementation of computer vision algorithms.
The open-source library has interfaces for multiple languages like C++, Python, and Java and supports Linux, macOS, Windows, iOS, and Android. Many of its functions are implemented on GPU. The first stable release, OpenCV version 1.0, came out in 2006. The OpenCV community has grown rapidly ever since, and its latest release, OpenCV version 4.1.1, brings improvements to the dnn (Deep Neural Networks) module, a popular module in the library that implements the forward pass (inferencing) with deep networks that are pre-trained using popular deep learning frameworks.

Some of the features offered by OpenCV include:

- An imread function that reads images in the BGR (Blue-Green-Red) format by default.
- Easy upscaling and downscaling for resizing an image.
- Support for various interpolation and downsampling methods, such as INTER_NEAREST for nearest-neighbor interpolation.
- Support for multiple variations of thresholding, such as adaptive thresholding, as well as bitwise operations, edge detection, image filtering, image contours, and more.
- Image segmentation (Watershed Algorithm) to classify each pixel in an image as background or foreground.
- Multiple feature-matching algorithms, like brute-force matching and knn feature matching, among others.

With its active community and regular updates for Machine Learning, OpenCV is only going to grow by leaps and bounds in the field of Computer Vision projects.

MATLAB: A licensed quick prototyping tool with OpenCV integration

One disadvantage of OpenCV, which makes novice computer vision users lean towards MATLAB, is its complexity. OpenCV is comparatively harder to learn because of sparse documentation and error-handling code. MATLAB, developed by MathWorks, is a proprietary programming language with a multi-paradigm numerical computing environment. It has over 3 million users worldwide and is considered one of the easiest and most productive software packages for engineers and scientists.
It has a very powerful and fast matrix library. MATLAB also integrates with OpenCV, which enables MATLAB users to explore, analyze, and debug designs that incorporate OpenCV algorithms. The MATLAB support package includes the data type conversions necessary between MATLAB and OpenCV.

The Computer Vision Toolbox provided by MathWorks offers algorithms, functions, and apps for designing and testing computer vision, 3D vision, and video processing systems. It also allows detection, tracking, feature extraction, and matching of objects. MATLAB can also train custom object detectors using deep learning and machine learning algorithms such as YOLO v2, Faster R-CNN, and ACF. Most of the toolbox algorithms in MATLAB support C/C++ code generation for integrating with existing code, desktop prototyping, and embedded vision system deployment.

However, MATLAB does not contain as many functions for computer vision as OpenCV, which also has more of its functions implemented on GPU. Another issue with MATLAB is that it is not open source, its license is costly, and its programs are not portable. A further factor that matters a lot in computer vision is the performance of the code, especially when working on real-time video processing.

Which has a faster execution time? OpenCV or MATLAB?

Computer Vision is not the only field in which execution speed matters when choosing a programming language or library. This factor is analyzed in detail in a paper titled “Matlab vs. OpenCV: A Comparative Study of Different Machine Learning Algorithms”. The paper provides a very practical comparative study between MATLAB and OpenCV using 20 different real datasets. The comparison is based on the execution time of various machine learning algorithms like Classification and Regression Trees (CART), Naive Bayes, Boosting, Random Forest, and K-Nearest Neighbor (KNN).
The experiments were run on an Intel Core 2 Duo P7450 machine with 3 GB of RAM and the Ubuntu 11.04 32-bit operating system, using MATLAB version 7.12.0.635 (R2011a) and OpenCV C++ version 2.1. The paper states, “To compare the speed of Matlab and OpenCV for a particular machine learning algorithm, we run the algorithm 1000 times and take the average of the execution times. Averaging over 1000 experiments is more than necessary since convergence is reached after a few hundred.”

The outcome of the experiments revealed that, though MATLAB is a successful scientific computing environment, it is outrun by OpenCV in almost all the experiments when execution time is considered. The paper points out that this could be due to a combination of the number of dimensions, the sample size, and the use of training sets. One of the listed machine learning algorithms, KNN, produced a log time ratio of 0.8 and 0.9 on datasets D16 and D17, respectively.

Clearly, MATLAB is great for exploring and experimenting with computer vision concepts for researchers and students at universities that can afford the software. However, when it comes to building production-ready, real-world computer vision projects, OpenCV beats MATLAB hands down.

You can learn about building more Computer Vision projects, like human pose estimation using TensorFlow, from our book ‘Computer Vision Projects with OpenCV and Python 3’.


Packt

25 Mar 2011

7 min read

Anatomy of a WordPress Plugin


WordPress is a popular content management system (CMS), most renowned for its use as a blogging / publishing application. According to usage statistics tracker BuiltWith (http://builtWith.com), WordPress is considered to be the most popular blogging software on the planet, which is not bad for something that has only been around officially since 2003.

Before we develop any substantial plugins of our own, let's take a few moments to look at what other people have done, so we get an idea of what the final product might look like. By this point, you should have a fresh version of WordPress installed and running somewhere for you to play with. It is important that your installation of WordPress is one with which you can tinker. In this article by Brian Bondari and Everett Griffiths, authors of WordPress 3 Plugin Development Essentials, we will purposely break a few things to help see how they work, so please don't try anything in this article on a live production site.

Deconstructing an existing plugin: "Hello Dolly"

WordPress ships with a simple plugin named "Hello Dolly". Its name is a whimsical take on the programmer's obligatory "Hello, World!", and it is trotted out only for pedantic explanations like the one that follows (unless, of course, you really do want random lyrics by Jerry Herman to grace your administration screens).

Activating the plugin

Let's activate this plugin so we can have a look at what it does:

1. Browse to your WordPress Dashboard at http://yoursite.com/wp-admin/.
2. Navigate to the Plugins section.
3. Under the Hello Dolly title, click on the Activate link.

You should now see a random lyric appear in the top-right portion of the Dashboard. Refresh the page a few times to get the full effect.

Examining the hello.php file

Now that we've tried out the "Hello Dolly" plugin, let's have a closer look. In your favorite text editor, open up the /wp-content/plugins/hello.php file. Can you identify the following integral parts?

- The Information Header, which describes details about the plugin (author and description). This is contained in a large PHP /* comment */.
- User-defined functions, such as the hello_dolly() function.
- The add_action() and/or add_filter() functions, which hook a WordPress event to a user-defined function.

It looks pretty simple, right? That's all you need for a plugin:

- An information header
- Some user-defined functions
- add_action() and/or add_filter() functions

Now that we've identified the critical component parts, let's examine them in more detail.

Information header

Don't just skim this section thinking it's a waste of breath on the self-explanatory header fields. Unlike a normal PHP file in which the comments are purely optional, in WordPress plugin and theme files, the Information Header is required! It is this block of text that causes a file to show up on WordPress' radar so that you can activate it or deactivate it. If your plugin is missing a valid information header, you cannot use it!

Exercise: breaking the header

To reinforce that the information header is an integral part of a plugin, try the following exercise:

1. In your WordPress Dashboard, ensure that the "Hello Dolly" plugin has been activated.
2. If applicable, use your preferred (s)FTP program to connect to your WordPress installation.
3. Using your text editor, temporarily delete the information header from wp-content/plugins/hello.php and save the file (you can save the header elsewhere for now).
4. Refresh the Plugins page in your browser.
5. You should get a warning from WordPress stating that the plugin does not have a valid header.

After you've seen the tragic consequences, put the header information back into the hello.php file. This should make it abundantly clear to you that the information header is absolutely vital for every WordPress plugin. If your plugin has multiple files, the header should be inside the primary file. In this article we use index.php as our primary file, but many plugins use a file named after the plugin name as their primary file.

Location, name, and format

The header itself is similar in form and function to other content management systems, such as Drupal's module.info files or Joomla's XML module configurations: it offers a way to store additional information about a plugin in a standardized format. The values can be extended, but the most common header values are listed below:

- Author: Listed below the plugin name
- Author URI: Together with "Author", this creates a link to the author's site
- Description: Main block of text describing the plugin
- Plugin Name: The displayed name of the plugin
- Plugin URI: Destination of the "Visit plugin site" link
- Version: Use this to track your changes over time

For more information about header blocks, see the WordPress codex at: http://codex.wordpress.org/File_Header.

In order for a PHP file to show up in WordPress' Plugins menu:

- The file must have a .php extension.
- The file must contain the information header somewhere in it (preferably at the beginning).
- The file must be either in the /wp-content/plugins directory, or in a subdirectory of the plugins directory. It cannot be more deeply nested.

Understanding the Includes

When you activate a plugin, the name of the file containing the information header is stored in the WordPress database. Each time a page is requested, WordPress goes through a laundry list of PHP files it needs to load, so activating a plugin ensures that your own files are on that list. To help illustrate this concept, let's break WordPress again.

Exercise: parse errors

Try the following exercise:

1. Ensure that the "Hello Dolly" plugin is active.
2. Open the /wp-content/plugins/hello.php file in your text editor.
3. Immediately before the line that contains function hello_dolly_get_lyric, type in some gibberish text, such as "asdfasdf", and save the file.
4. Reload the Plugins page in your browser.

This should generate a parse error, something like:

    Parse error: syntax error, unexpected T_FUNCTION in /path/to/wordpress/html/wp-content/plugins/hello.php on line 16

Yikes! Your site is now broken. Why did this happen? We introduced errors into the plugin's main file (hello.php), so including it caused PHP and WordPress to choke. Delete the gibberish line from the hello.php file and save to return the plugin back to normal.

The parse error only occurs if there is an error in an active plugin. Deactivated plugins are not included by WordPress and therefore their code is not parsed. You can try the same exercise after deactivating the plugin and you'll notice that WordPress does not raise any errors.

Bonus for the curious

In case you're wondering exactly where and how WordPress stores the information about activated plugins, have a look in the database. Using your MySQL client, you can browse the wp_options table or execute the following query:

    SELECT option_value FROM wp_options WHERE option_name='active_plugins';

The active plugins are stored as a serialized PHP array, referencing the file containing the header. The following is an example of what the serialized array might contain if you had activated a plugin named "Bad Example". You can use PHP's unserialize() function to parse the contents of this string into a PHP variable, as in the following script:

    <?php
    $active_plugin_str = 'a:1:{i:0;s:27:"bad-example/bad-example.php";}';
    print_r( unserialize($active_plugin_str) );
    ?>

And here's its output:

    Array
    (
        [0] => bad-example/bad-example.php
    )


Savia Lobo

05 Jun 2019

5 min read

Jim Balsillie on Data Governance Challenges and 6 Recommendations to tackle them


The Canadian Parliament's Standing Committee on Access to Information, Privacy and Ethics hosted the hearing of the International Grand Committee on Big Data, Privacy and Democracy from Monday, May 27 to Wednesday, May 29. Witnesses from at least 11 countries appeared before representatives to testify on how governments can protect democracy and citizen rights in the age of big data. This section of the hearing, which took place on May 28, includes Jim Balsillie’s take on data governance.

Jim Balsillie, Chair of the Centre for International Governance Innovation and retired Chairman and co-CEO of BlackBerry, starts off by saying that data governance is the most important public policy issue of our time. It is cross-cutting, with economic, social, and security dimensions. It requires both national policy frameworks and international coordination. He applauded the seriousness and integrity of Mr. Zimmer, Mr. Angus, and Mr. Erskine-Smith, who have spearheaded a Canadian bipartisan effort to deal with data governance over the past three years.

“My perspective is that of a capitalist and global tech entrepreneur for 30 years and counting. I'm the retired Chairman and co-CEO of Research in Motion, a Canadian technology company [that] we scaled from an idea to $20 billion in sales. While most are familiar with the iconic BlackBerry smartphones, ours was actually a platform business that connected tens of millions of users to thousands of consumer and enterprise applications via some 600 cellular carriers in over 150 countries. We understood how to leverage Metcalfe's law of network effects to create a category-defining company, so I'm deeply familiar with multi-sided platform business model strategies as well as navigating the interface between business and public policy,” he adds.
He further talks about his different observations about the nature, scale, and breadth of some collective challenges for the committee’s consideration: Disinformation in fake news is just two of the negative outcomes of unregulated attention based business models. They cannot be addressed in isolation; they have to be tackled horizontally as part of an integrated whole. To agonize over social media’s role in the proliferation of online hate, conspiracy theories, politically motivated misinformation, and harassment, is to miss the root and scale of the problem. Social media’s toxicity is not a bug, it's a feature. Technology works exactly as designed. Technology products services and networks are not built in a vacuum. Usage patterns drive product development decisions. Behavioral scientists involved with today's platforms helped design user experiences that capitalize on negative reactions because they produce far more engagement than positive reactions. Among the many valuable insights provided by whistleblowers inside the tech industry is this quote, “the dynamics of the attention economy are structurally set up to undermine the human will.” Democracy and markets work when people can make choices align with their interests. The online advertisement driven business model subverts choice and represents a fundamental threat to markets election integrity and democracy itself. Technology gets its power through the control of data. Data at the micro-personal level gives technology unprecedented power to influence. “Data is not the new oil, it's the new plutonium amazingly powerful dangerous when it spreads difficult to clean up and with serious consequences when improperly used.” Data deployed through next-generation 5G networks are transforming passive in infrastructure into veritable digital nervous systems. Our current domestic and global institutions rules and regulatory frameworks are not designed to deal with any of these emerging challenges. 
Because cyberspace knows no natural borders, digital transformation effects cannot be hermetically sealed within national boundaries; international coordination is critical.
With these observations, Balsillie provided six recommendations:
Eliminate tax deductibility of specific categories of online ads.
Ban personalized online advertising for elections.
Implement strict data governance regulations for political parties.
Provide effective whistleblower protections.
Add explicit personal liability alongside corporate responsibility to affect the decision-making of CEOs and boards of directors.
Create a new institution for like-minded nations to address digital cooperation and stability.
Technology is becoming the new 4th Estate
Technology is disrupting governance and, if left unchecked, could render liberal democracy obsolete. By displacing the print and broadcast media and influencing public opinion, technology is becoming the new Fourth Estate. In our system of checks and balances, this makes technology co-equal with the executive, the legislative, and the judiciary. When this new Fourth Estate declines to appear before this committee, as Silicon Valley executives are currently doing, it is symbolically asserting this aspirational co-equal status. But it is asserting the status and claiming its privileges without the traditions, disciplines, legitimacy, or transparency that checked the power of the traditional Fourth Estate. The work of this international grand committee is a vital first step towards redress of this untenable current situation. Referring to what Professor Zuboff said the previous night, he adds that Canadians are currently in a historic battle for the future of their democracy with a charade called Sidewalk Toronto. He concludes by saying, “I'm here to tell you that we will win that battle.” To know more, you can listen to the full hearing video titled “Meeting No. 152 ETHI - Standing Committee on Access to Information, Privacy, and Ethics” on ParlVU.
Speech2Face: A neural network that “imagines” faces from hearing voices. Is it too soon to worry about ethnic profiling?
UK lawmakers to social media: “You’re accessories to radicalization, accessories to crimes”, hearing on spread of extremist content
Key Takeaways from Sundar Pichai’s Congress hearing over user data, political bias, and Project Dragonfly


Packt

10 Nov 2016

16 min read

Internationalization



Packt

25 Sep 2015

9 min read

Introducing R, RStudio, and Shiny


In this article, by Hernán G. Resnizky, author of the book Learning Shiny, the main objective is to learn how to install all the components needed to build an application in R with Shiny. Additionally, some general ideas about what R is will be covered in order to be able to dive deeper into programming using R. The following topics will be covered:
A brief introduction to R, RStudio, and Shiny
Installation of R and Shiny
General tips and tricks
About R
As stated on the R-project main website: "R is a language and environment for statistical computing and graphics." R is a successor of S and is a GNU project. This means, briefly, that anyone can access its source code and modify or adapt it to their needs. Nowadays, it is gaining territory over classic commercial software, and it is, along with Python, the most used language for statistics and data science. Regarding R's main characteristics, the following can be considered:
Object oriented: R is a language that is composed mainly of objects and functions.
Can be easily contributed to: As with other GNU projects, R is constantly being enriched by users' contributions, either by making their code accessible via "packages" or libraries, or by editing/improving its source code. There are currently almost 7,000 packages in the common R repository, the Comprehensive R Archive Network (CRAN). Additionally, there are publicly accessible R repositories, such as the Bioconductor project, which contains packages for bioinformatics.
Runtime execution: Unlike C or Java, R does not need compilation. This means that you can, for instance, write 2 + 2 in the console and it will return the value.
Extensibility: R's functionality can be extended through the installation of packages and libraries. Standard proven libraries can be found in CRAN repositories and are accessible directly from R by typing install.packages().
Installing R
R can be installed on every major operating system. It is highly recommended to download the program directly from http://cran.rstudio.com/ when working on Windows or Mac OS. On Ubuntu, R can be easily installed from the terminal as follows:
sudo apt-get update
sudo apt-get install r-base
sudo apt-get install r-base-dev
The installation of r-base-dev is highly recommended as it is a package that enables users to compile R packages from source, that is, to maintain packages or install additional R packages directly from the R console using the install.packages() command. To install R on other UNIX-based operating systems, visit the following links:
http://cran.rstudio.com/
http://cran.r-project.org/doc/manuals/r-release/R-admin.html#Obtaining-R
A quick guide to R
When working on Windows, R can be launched via its application. After the installation, it is available like any other program on Windows. When opening the program, a window like this will appear: When working on Linux, you can access the R console directly by typing R on the command line: In both cases, R executes code interactively. This means that you can type in code, press Enter, and the result will be given immediately as follows:
> 2+2
[1] 4
The R application does not provide a comfortable environment for developing code on any operating system. For this reason, it is highly recommended (not only to write web applications in R with Shiny, but for any task you want to perform in R) to use an Integrated Development Environment (IDE).
About RStudio
As with other programming languages, there is a huge variety of IDEs available for R. IDEs are applications that make code development easier and clearer for the programmer. RStudio is one of the most important ones for R, and it is especially recommended for writing web applications in R with Shiny because it contains features specially designed for R.
Additionally, RStudio provides facilities to write C++, LaTeX, or HTML documents and also integrates them with the R code. RStudio also provides version control, project management, and debugging features among many others.
Installing RStudio
RStudio for desktop computers can be downloaded from its official website at http://www.rstudio.com/products/rstudio/download/ where you can get versions of the software for Windows, Mac OS X, Ubuntu, Debian, and Fedora.
Quick guide to RStudio
Before installing and running RStudio, it is important to have R installed. As RStudio is an IDE and not the programming language itself, it will not work at all without R. The following screenshot shows RStudio's starting view: At first glance, the following four main windows are available:
Text editor: This provides facilities to write R scripts, such as highlighting and a code completer (when hitting Tab, you can see the available options to complete the code written). It is also possible to include R code in an HTML, LaTeX, or C++ piece of code.
Environment and history: They are defined as follows: In the Environment section, you can see the active objects in each environment. By clicking on Global Environment (which is the environment shown by default), you can change the environment and see the active objects. In the History tab, the pieces of code executed are stored line by line. You can select one or more lines and send them either to the editor or to the console. In addition, you can look up a specific piece of code by typing it in the textbox in the top right part of this window.
Console: This is an exact equivalent of the R console, as described in A quick guide to R.
Tabs: The different tabs are defined as follows:
Files: This consists of a file browser with several additional features (renaming, deleting, and copying). Clicking on a file will open it in the editor or the Environment tab depending on the type of the file. If it is a .rda or .RData file, it will open in both.
If it is a text file, it will open in one of them.
Plots: Whenever a plot is executed, it will be displayed in that tab.
Packages: This shows a list of available and active packages. When the package is active, it will appear as clicked. Packages can also be installed interactively by clicking on Install Packages.
Help: This is a window to seek and read active packages' documentation.
Viewer: This enables us to see the HTML-generated content within RStudio.
Along with numerous features, RStudio also provides keyboard shortcuts. A few of them are listed as follows:
Description | Windows/Linux | OSX
Complete the code. | Tab | Tab
Run the selected piece of code. If no piece of code is selected, the active line is run. | Ctrl + Enter | ⌘ + Enter
Comment the selected block of code. | Ctrl + Shift + C | ⌘ + /
Create a section of code, which can be expanded or compressed by clicking on the arrow to the left. Additionally, it can be accessed by clicking on it in the bottom left menu. | ##### | #####
Find and replace. | Ctrl + F | ⌘ + F
The following screenshots show how a block of code can be collapsed by clicking on the arrow and how it can be accessed quickly by clicking on its name in the bottom-left part of the window: Clicking on the circled arrow will collapse the Section 1 block, as follows: The full list of shortcuts can be found at https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts. For further information about other RStudio features, the full documentation is available at https://support.rstudio.com/hc/en-us/categories/200035113-Documentation.
About Shiny
Shiny is a package created by RStudio that makes it easy to interface R with a web browser. As stated in its official documentation, Shiny is a web application framework for R that makes it incredibly easy to build interactive web applications with R.
One of its main advantages is that there is no need to combine R code with HTML/JavaScript code, as the framework already contains prebuilt features that cover the most commonly used functionalities in an interactive web application. There is a wide range of software that has web application functionalities, especially oriented to interactive data visualization. What are the advantages of using R/Shiny then, you ask? They are as follows:
It is free not only in terms of money but, like all GNU projects, in terms of freedom. As stated in the GNU main page: To understand the concept (GNU), you should think of free as in free speech, not as in free beer. Free software is a matter of the users' freedom to run, copy, distribute, study, change, and improve the software.
All the possibilities of a powerful language such as R are available. Thanks to its contributive essence, you can develop a web application that can display any R-generated output. This means that you can, for instance, run complex statistical models and return the output in a friendly way in the browser, obtain and integrate data from various sources and formats (for instance, SQL, XML, JSON, and so on) the way you need, and subset, process, and dynamically aggregate the data the way you want. These options are not available (or are much more difficult to accomplish) under most commercial BI tools.
Installing and loading Shiny
As with any other package available in the CRAN repositories, the easiest way to install Shiny is by executing install.packages("shiny"). The following output should appear on the console: Due to R's extensibility, many of its packages use elements (mostly functions) from other packages. For this reason, these packages are loaded or installed when the package that depends on them is loaded or installed. This is called a dependency. Shiny (as of version 0.10.2.1) depends on Rcpp, httpuv, mime, htmltools, and R6. An R session is started with only the minimal packages loaded.
So if functions from other packages are used, those packages need to be loaded before using them. The corresponding command for this is as follows:
library(shiny)
When installing a package, the package name must be quoted; when loading the package with library(), the quotes can be omitted.
Summary
After these instructions, the reader should be able to install all the fundamental elements needed to create a web application with Shiny. Additionally, he or she should have acquired at least a general idea of what R and the R project are.
Resources for Article: Further resources on this subject:
R ─ Classification and Regression Trees [article]
An overview of common machine learning tasks [article]
Taking Control of Reactivity, Inputs, and Outputs [article]


Sugandha Lahoti

01 Dec 2017

7 min read

4 popular algorithms for Distance-based outlier detection


The article is an excerpt from our book titled Mastering Java Machine Learning by Dr. Uday Kamath and Krishna Choppella. This book introduces you to an array of expert machine learning techniques, including classification, clustering, anomaly detection, stream learning, active learning, semi-supervised learning, probabilistic graph modelling and a lot more. The article given below is extracted from Chapter 5 of the book - Real-time Stream Machine Learning, explaining 4 popular algorithms for Distance-based outlier detection. Distance-based outlier detection is the most studied, researched, and implemented method in the area of stream learning. There are many variants of the distance-based methods, based on sliding windows, the number of nearest neighbors, radius and thresholds, and other measures for considering outliers in the data. We will try to give a sampling of the most important algorithms in this article.
Inputs and outputs
Most algorithms take the following parameters as inputs:
Window size w, corresponding to the fixed size on which the algorithm looks for outlier patterns.
Sliding size s, corresponding to the number of new instances that will be added to the window, and old ones removed.
The count threshold k of instances when using nearest neighbor computation.
The distance threshold R used to define the outlier threshold in distances.
Outliers as labels or scores (based on neighbors and distance) are outputs.
How does it work?
We present different variants of distance-based stream outlier algorithms, giving insights into what they do differently or uniquely. The unique elements in each algorithm define what happens when the slide expires, how a new slide is processed, and how outliers are reported.
Exact Storm
Exact Storm stores the data in the current window w in a well-known index structure, so that the range query search, or query to find neighbors within the distance R for a given point, is done efficiently. It also stores k preceding and succeeding neighbors of all data points:
Expired Slide: Instances in expired slides are removed from the index structure that affects range queries but are preserved in the preceding list of neighbors.
New Slide: For each data point in the new slide, range query R is executed, results are used to update the preceding and succeeding list for the instance, and the instance is stored in the index structure.
Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, any instance with fewer than k elements from the succeeding list and non-expired preceding list is reported as an outlier.
Abstract-C
Abstract-C keeps the index structure similar to Exact Storm, but instead of preceding and succeeding lists for every object it just maintains a list of counts of neighbors for the windows the instance is participating in:
Expired Slide: Instances in expired slides are removed from the index structure that affects range queries, and the first element from the list of counts is removed, corresponding to the last window.
New Slide: For each data point in the new slide, range query R is executed and results are used to update the list count. For existing instances, the count gets updated with new neighbors and instances are added to the index structure.
Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, all instances with a neighbors count less than k in the current window are considered outliers.
Direct Update of Events (DUE)
DUE keeps the index structure for efficient range queries exactly like the other algorithms but makes a different assumption: when a slide expires, not every instance is affected in the same way.
It maintains two structures: the unsafe inlier priority queue and the outlier list. The unsafe inlier queue holds instances sorted in increasing order of the smallest expiration time of their preceding neighbors. The outlier list has all the outliers in the current window:
Expired Slide: Instances in expired slides are removed from the index structure that affects range queries and the unsafe inlier queue is updated for expired neighbors. Those unsafe inliers which become outliers are removed from the priority queue and moved to the outlier list.
New Slide: For each data point in the new slide, range query R is executed, results are used to update the succeeding neighbors of the point, and only the most recent preceding points are updated for the instance. Based on the updates, the point is added to the unsafe inlier priority queue or removed from the queue and added to the outlier list.
Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, all instances in the outlier list are reported as outliers.
Micro Clustering based Algorithm (MCOD)
Micro-clustering based outlier detection overcomes the computational issues of performing range queries for every data point. The micro-cluster data structure is used instead of range queries in these algorithms. A micro-cluster is centered around an instance and has a radius of R. All the points belonging to the micro-clusters become inliers. The points that are outside can be outliers or inliers and are stored in a separate list. It also has a data structure similar to DUE to keep a priority queue of unsafe inliers:
Expired Slide: Instances in expired slides are removed from both micro-clusters and the data structure with outliers and inliers. The unsafe inlier queue is updated for expired neighbors as in the DUE algorithm. Micro-clusters are also updated for non-expired data points.
New Slide: For each data point in the new slide, the instance either becomes the center of a micro-cluster, becomes part of a micro-cluster, or is added to the event queue and the data structure of the outliers. If the point is within the distance R of an existing micro-cluster, it gets assigned to that micro-cluster; otherwise, if there are k points within R, it becomes the center of a new micro-cluster; if not, it goes into the two structures of the event queue and possible outliers.
Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, any instance in the outlier structure with fewer than k neighboring instances is reported as an outlier.
Advantages and limitations
The advantages and limitations are as follows:
Exact Storm is demanding in storage and CPU for storing lists and retrieving neighbors. Also, it introduces delays; even though they are implemented in efficient data structures, range queries can be slow.
Abstract-C has a small advantage over Exact Storm, as no time is spent on finding active neighbors for each instance in the window. The storage and time spent are still very much dependent on the window and slide chosen.
DUE has some advantage over Exact Storm and Abstract-C as it can efficiently re-evaluate the "inlierness" of points (that is, whether unsafe inliers remain inliers or become outliers), but sorting the structure impacts both CPU and memory.
MCOD has distinct advantages in memory and CPU owing to the use of the micro-cluster structure and removing the pairwise distance computation. Storing the neighborhood information in micro-clusters helps memory too.
Validation and evaluation of stream-based outliers is still an open research area.
By varying parameters such as window size, neighbors within radius, and so on, we determine how sensitive the performance metrics are (evaluation time in terms of CPU time per object, number of outliers detected in the streams, TP/precision/recall/area under the PRC curve) and so assess robustness.
If you liked the above article, check out our book Mastering Java Machine Learning to explore more on advanced machine learning techniques using the best Java-based tools available.
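The windowed, distance-based idea behind all four algorithms can be sketched in a few lines of Python. This is a deliberately naive, brute-force illustration and not an implementation of Exact Storm, Abstract-C, DUE, or MCOD: it recomputes every pairwise distance each time the window slides, which is exactly the cost those algorithms try to avoid. The function name `sliding_window_outliers` and the sample stream are assumptions made for illustration; the parameters w, s, R, and k follow the inputs described in the article.

```python
from collections import deque
from math import dist  # Euclidean distance, Python 3.8+


def sliding_window_outliers(stream, w, s, R, k):
    """Yield (position, outliers) after every slide of a count-based window.

    w: window size, s: slide size, R: distance threshold,
    k: minimum number of neighbors within R for a point to be an inlier.
    Hypothetical helper for illustration only (brute force, O(w^2) per slide).
    """
    window = deque(maxlen=w)  # oldest points expire automatically
    for position, point in enumerate(stream, start=1):
        window.append(point)
        # Report once the first window is full, then after every s new points.
        if position >= w and (position - w) % s == 0:
            outliers = []
            for i, p in enumerate(window):
                # Count neighbors of p within distance R, excluding p itself.
                neighbors = sum(
                    1 for j, q in enumerate(window)
                    if j != i and dist(p, q) <= R
                )
                if neighbors < k:
                    outliers.append(p)
            yield position, outliers


# Made-up one-dimensional stream with a single far-away point.
sample = [(0.0,), (0.1,), (0.2,), (5.0,), (0.3,), (0.15,)]
for position, outliers in sliding_window_outliers(sample, w=4, s=2, R=0.5, k=2):
    print(position, outliers)
```

The algorithms in the article keep the same reporting rule (too few neighbors within R) but replace the inner loops with index structures, neighbor counts, event queues, or micro-clusters so that work is not repeated on every slide.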


Richard Gall

10 Jun 2019

4 min read

Salesforce is buying Tableau in a $15.7 billion all-stock deal


Salesforce, one of the world's leading CRM platforms, is buying data visualization software Tableau in an all-stock deal worth $15.7 billion. The news comes just days after it emerged that Google is buying one of Tableau's competitors in the data visualization market, Looker. Taken together, the stories highlight the importance of analytics to some of the planet's biggest companies. They suggest that despite years of the big data revolution, it's only now that market-leading platforms are starting to realise that their customers want the level of capabilities offered by the best in the data visualization space. Salesforce shareholders will use their stock to purchase Tableau. As the press release published on the Salesforce site explains "each share of Tableau Class A and Class B common stock will be exchanged for 1.103 shares of Salesforce common stock, representing an enterprise value of $15.7 billion (net of cash), based on the trailing 3-day volume weighted average price of Salesforce's shares as of June 7, 2019." The acquisition is expected to be completed by the end of October 2019. https://twitter.com/tableau/status/1138040596604575750 Why is Salesforce buying Tableau? The deal is an incredible result for Tableau shareholders. At the end of last week, its market cap was $10.7 billion. This has led to some scepticism about just how good a deal this is for Salesforce. One commenter on Hacker News said "this seems really high for a company without earnings and a weird growth curve. Their ticker is cool and maybe sales force [sic] wants to be DATA on nasdaq. Otherwise, it will be hard to justify this high markup for a tool company." With Salesforce shares dropping 4.5% as markets opened this week, it seems investors are inclined to agree - Salesforce is certainly paying a premium for Tableau. 
However, whatever the long-term impact of the acquisition, the price paid underlines the fact that Salesforce views Tableau as exceptionally important to its long-term strategy. It opens up an opportunity for Salesforce to reposition and redefine itself as much more than just a CRM platform. It means it can start to compete with the likes of Microsoft, which has a full suite of professional and business intelligence tools. Moreover, it also provides the platform with another way of potentially onboarding customers: given Tableau is well-known as a powerful yet accessible data visualization tool, it creates an avenue through which new users can find their way to the Salesforce product. Marc Benioff, Chair and co-CEO of Salesforce, said "we are bringing together the world’s #1 CRM with the #1 analytics platform. Tableau helps people see and understand data, and Salesforce helps people engage and understand customers. It’s truly the best of both worlds for our customers--bringing together two critical platforms that every customer needs to understand their world." Tableau has been a target for Salesforce for some time. Leaked documents from 2016 showed that the data visualization company was one of 14 companies that Salesforce had an interest in (another was LinkedIn, which would eventually be purchased by Microsoft).
Read next: Alteryx vs. Tableau: Choosing the right data analytics tool for your business
What's in it for Tableau (aside from the money...)?
For Tableau, there are many other benefits of being purchased by Salesforce alongside the money. Primarily this is about expanding the platform's reach: Salesforce users are people who are interested in data with a huge range of use cases. By joining up with Salesforce, Tableau will become their go-to data visualization tool. "As our two companies began joint discussions," Tableau CEO Adam Selipsky said, "the possibilities of what we might do together became more and more intriguing.
They have leading capabilities across many CRM areas including sales, marketing, service, application integration, AI for analytics and more. They have a vast number of field personnel selling to and servicing customers. They have incredible reach into the fabric of so many customers, all of whom need rich analytics capabilities and visual interfaces... On behalf of our customers, we began to dream about what we might accomplish if we could combine our ability to help people see and understand data with their ability to help people engage and understand customers."
What will happen to Tableau?
Tableau won't be going anywhere. It will continue to exist under its own brand with the current leadership all remaining, including Selipsky.
What does this all mean for the technology market?
At the moment, it's too early to say, but the last year or so has seen some major high-profile acquisitions by tech companies. Perhaps we're seeing the emergence of a tooling arms race as the biggest organizations attempt to arm themselves with ecosystems of established market-leading tools. Whether this is good or bad for users remains to be seen, however.


Packt

05 Oct 2009

7 min read

Create a Local Ubuntu Repository using Apt-Mirror and Apt-Cacher



Savia Lobo

26 Jan 2018

9 min read

Working with Kibana in Elasticsearch 5.x



Sunith Shetty

27 Feb 2018

11 min read

How SQL Server handles data under the hood


This article is an excerpt from a book written by Marek Chmel and Vladimír Mužný titled SQL Server 2017 Administrator's Guide. In this book, you will learn the skills needed to successfully create, design, and deploy databases using SQL Server 2017. Today, we will explore how SQL Server handles data, as it is of utmost importance to understand what data should be backed up, when, and why.
Data structures and transaction logging
We can think of a database as a physical database structure consisting of tables and indexes. However, this is just a human point of view. From SQL Server's perspective, a database is a set of precisely structured files described in the form of metadata, which is also saved in database structures. A conceptual picture of how every database works is very helpful when the database has to be backed up correctly.
How data is stored
Every database on SQL Server must have at least two files:
The primary data file with the usual suffix, mdf
The transaction log file with the usual suffix, ldf
For lots of databases, this minimal set of files is not enough. When the database contains large amounts of data, such as historical tables, or the database has heavy data contention, such as production tracking systems, it's good practice to design more data files. Another situation in which a basic set of files is not sufficient can arise when documents or pictures are saved along with relational data. SQL Server is still able to store all of our data in the basic file set, but this can lead to performance bottlenecks and management issues. That's why we need to know all the possible storage types useful for different deployment scenarios.
A complete structure of files is depicted in the following image:
Database
A relational database is defined as a complex data type consisting of tables with a given number of columns, and each column has its domain, which is actually a data type (such as an integer or a date) optionally complemented by some constraints. From SQL Server's perspective, the database is a record written in metadata and containing the name of the database, properties of the database, and names and locations of all files or folders representing storage for the database. This is the same for user databases as well as for system databases. System databases are created automatically during SQL Server installation and are crucial for the correct running of SQL Server. There are five system databases.
Database master
Database master is crucial for the correct running of the SQL Server service. This database stores data about logins, all databases and their files, instance configurations, linked servers, and so on. SQL Server finds this database at startup via two startup parameters, -d and -l, followed by paths to the mdf and ldf files. These parameters are very important in situations when the administrator wants to move the master's files to a different location. Changing their values is possible in SQL Server Configuration Manager, in the SQL Server service Properties dialog, on the tab called Startup Parameters.
Database msdb
The database msdb serves the SQL Server Agent service, Database Mail, and Service Broker. This database stores job definitions, operators, and other objects needed for administration automation. It also stores some logs, such as backup and restore events of each database. If this database is corrupted or missing, SQL Server Agent cannot start.
Database model
Database model can be understood as a template for every new database while it is created.
During database creation (see the CREATE DATABASE statement on MSDN), files are created on the defined paths, and all objects, data, and properties of database model are created, copied, and set in the new database. This database must always exist on the instance, because when it's corrupted, database tempdb cannot be created at instance startup!
Database tempdb
Even if database tempdb seems to be a regular database like many others, it plays a very special role in every SQL Server instance. This database is used by SQL Server itself as well as by developers to save temporary data such as table variables or static cursors. As this database is intended for a short lifespan (temporary data only, which can be stored during execution of a stored procedure or until the session is disconnected), SQL Server clears this database by truncating all data from it or by dropping and recreating it every time SQL Server is started. As the tempdb database will never contain durable data, it has some special internal behavior, which is the reason why accessing data in this database is several times faster than accessing durable data in other databases. If this database is corrupted, restart SQL Server.
Database resourcedb
The resourcedb is fifth in our enumeration and consists of definitions for all system objects of SQL Server, for example, sys.objects. This database is hidden and we don't need to care about it that much. It is not configurable and we don't use regular backup strategies for it. It is always placed in the installation path of SQL Server (in the binn directory) and it's backed up within the filesystem backup. In case of an accident, it is recovered as a part of the filesystem as well.
Filegroup
Filegroup is an organizational metadata object containing one or more data files. A filegroup does not have its own representation in the filesystem; it's just a group of files. When any database is created, a filegroup called primary is always created.
This primary filegroup always contains the primary data file. Filegroups can be divided into the following:

Row storage filegroups: These filegroups can contain data files (mdf or ndf).
Filestream filegroups: This kind of filegroup contains not files but folders to store binary data.
In-memory filegroup: Only one instance of this kind of filegroup can be created in a database. Internally, it is a special case of a filestream filegroup and it's used by SQL Server to persist data from in-memory tables.

Every filegroup has three simple properties:

Name: This is a descriptive name of the filegroup. The name must fulfill the naming convention criteria.
Default: In a set of filegroups of the same type, one of these filegroups has this option set to on. This means that when a new table or index is created without explicitly specifying the filegroup in which to store its data, the default filegroup is used. By default, the primary filegroup is the default one.
Read-only: Every filegroup, except the primary filegroup, can be set to read-only. Let's say that a filegroup is created for last year's history. When data is moved from the current period to tables created in this historical filegroup, the filegroup can be set as read-only, and from then on the filegroup does not need to be backed up again and again.

It is a very good approach to divide the database into smaller parts, that is, filegroups with more files. It helps in distributing data across more physical storage and also makes the database more manageable; backups can be done part by part in shorter times, which fit better into a service window.

Data files

Every database must have at least one data file, called the primary data file. This file is always bound to the primary filegroup. This file holds all the metadata of the database, such as structure descriptions (which can be seen through views such as sys.objects, sys.columns, and others), users, and so on.
If the database does not have other data files (in the same or other filegroups), all user data is also stored in this file, but this approach is good enough just for smaller databases. Considering how the volume of data in the database grows over time, it is a good practice to add more data files. These files are called secondary data files. Secondary data files are optional and contain user data only. Both types of data files have the same internal structure. Every file is divided into small 8 KB parts called data pages. SQL Server maintains several types of pages, such as data pages, index pages, index allocation map (IAM) pages to locate data pages of tables or indexes, global allocation map (GAM) and shared global allocation map (SGAM) pages to address objects in the database, and so on. Regardless of the type of a certain data page, SQL Server uses a data page as the smallest unit of I/O operations between hard disk and memory. Let's describe some common properties:

A data page never contains data of several objects
Data pages don't know each other (and that's why SQL Server uses IAMs to allocate all pages of an object)
Data pages don't have any special physical ordering
A data row must always fit within a single data page

These properties may seem useless, but knowing them helps us to better optimize and manage our databases. Did you know that a data page is the smallest storage unit that can be restored from backup? As a data page is quite a small storage unit, SQL Server groups data pages into bigger logical units called extents. An extent is a logical allocation unit containing eight contiguous data pages. When SQL Server requests data from disk, extents are read into memory. This is the reason why 64 KB NTFS clusters are recommended when formatting disk volumes for data files. Extents can be uniform or mixed.
Uniform extent is a kind of extent containing data pages belonging to one object only; on the other hand, a mixed extent contains data pages of several objects. Transaction log When SQL Server processes any transaction, it works in a way called two-phase commit. When a client starts a transaction by sending a single DML request or by calling the BEGIN TRAN command, SQL Server requests data pages from disk to memory called buffer cache and makes the requested changes in these data pages in memory. When the DML request is fulfilled or the COMMIT command comes from the client, the first phase of the commit is finished, but data pages in memory differ from their original versions in a data file on disk. The data page in memory is in a state called dirty. When a transaction runs, a transaction log file is used by SQL Server for a very detailed chronological description of every single action done during the transaction. This description is called write-ahead-logging, shortly WAL, and is one of the oldest processes known on SQL Server. The second phase of the commit usually does not depend on the client's request and is an internal process called checkpoint. Checkpoint is a periodical action that: searches for dirty pages in buffer cache, saves dirty pages to their original data file location, marks these data pages as clean (or drops them out of memory to free memory space), marks the transaction as checkpoint or inactive in the transaction log. Write-ahead-logging is needed for SQL Server during recovery process. Recovery process is started on every database every time SQL Server service starts. When SQL Server service stops, some pages could remain in a dirty state and they are lost from memory. 
This can lead to two possible situations:

The transaction is completely described in the transaction log, the new content of the data page is lost from memory, and the data pages are not changed in the data file.

The transaction was not completed at the moment SQL Server stopped, so the transaction cannot be completely described in the transaction log either; the data pages in memory were not in a stable state (because the transaction was not finished and SQL Server cannot know whether COMMIT or ROLLBACK would occur), and the original version of the data pages in the data files is intact.

SQL Server distinguishes between these two situations when it's starting. If a transaction is complete in the transaction log but was not marked as checkpoint, SQL Server executes this transaction again with both phases of COMMIT. If the transaction was not complete in the transaction log when SQL Server stopped, SQL Server will never know what the user's intention with the transaction was, and the incomplete transaction is erased from the transaction log as if it had never started. The recovery process described above ensures that every database is in the last known consistent state after SQL Server's startup. It's crucial for DBAs to understand write-ahead logging when planning a backup strategy, because when restoring the database, the administrator has to recognize whether it's time to run the recovery process or not.

To summarize, we introduced internal data handling, as it is important not only when performing backups and restores but also for optimizing a database. If you are interested in knowing more about how to back up, recover, and secure SQL Server, check out the book SQL Server 2017 Administrator's Guide.
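The startup decision described above can be illustrated with a small toy model. This is only a sketch of the reasoning, not how SQL Server is implemented: the `LogRecord` type, the `recover` function, and the record kinds ("update", "commit", "checkpoint") are hypothetical names invented for this example, and a data file is modeled as a plain dictionary of page contents.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    """One entry of a toy transaction log (hypothetical structure)."""
    txn: str          # transaction identifier
    kind: str         # "update", "commit", or "checkpoint"
    page: str = ""    # page touched by an "update" record
    value: int = 0    # new page content written by an "update" record


def recover(data_file, log):
    """Return the page contents after a toy startup recovery.

    Transactions that are committed in the log but not yet marked as
    checkpoint are applied again; transactions without a commit record
    are ignored, as if they had never started.
    """
    committed = {r.txn for r in log if r.kind == "commit"}
    checkpointed = {r.txn for r in log if r.kind == "checkpoint"}
    recovered = dict(data_file)
    for r in log:
        if r.kind != "update":
            continue
        if r.txn in committed and r.txn not in checkpointed:
            recovered[r.page] = r.value  # redo the logged change
    return recovered


# T1 committed but was never checkpointed; T2 never committed.
data_file = {"page1": 10, "page2": 20}
log = [
    LogRecord("T1", "update", "page1", 11),
    LogRecord("T1", "commit"),
    LogRecord("T2", "update", "page2", 99),
]
result = recover(data_file, log)
print(result)  # {'page1': 11, 'page2': 20}
```

In this toy run, the change made by T1 is applied to the data file again, while the unfinished change made by T2 never reaches it, which mirrors the two situations listed above.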


Natasha Mathur

10 Jul 2018

9 min read

Writing test functions in Golang [Tutorial]


Sugandha Lahoti

14 Dec 2017

8 min read

Getting started with big data analysis using Google's PageRank algorithm


The article given below is a book excerpt from Java Data Analysis written by John R. Hubbard. Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the aim of discovering useful information. Java is one of the most popular languages to perform your data analysis tasks. This book will help you learn the tools and techniques in Java to conduct data analysis without any hassle.

This post aims to help you learn how to analyse big data using Google’s PageRank algorithm. The term big data generally refers to algorithms for the storage, retrieval, and analysis of massive datasets that are too large to be managed by a single file server. Commercially, these algorithms were pioneered by Google; one of them, Google’s PageRank, is considered in this article.

Google PageRank algorithm

Within a few years of the birth of the web in 1990, there were over a dozen search engines that users could use to search for information. Shortly after it was introduced in 1995, AltaVista became the most popular among them. These search engines would categorize web pages according to the topics that the pages themselves specified. But the problem with these early search engines was that unscrupulous web page writers used deceptive techniques to attract traffic to their pages. For example, a local rug-cleaning service might list "pizza" as a topic in their web page header, just to attract people looking to order a pizza for dinner. These and other tricks rendered early search engines nearly useless. To overcome the problem, various page ranking systems were attempted. The objective was to rank a page based upon its popularity among users who really did want to view its contents. One way to estimate that is to count how many other pages have a link to that page.
For example, there might be 100,000 links to https://en.wikipedia.org/wiki/Renaissance, but only 100 to https://en.wikipedia.org/wiki/Ernest_Renan, so the former would be given a much higher rank than the latter. But simply counting the links to a page will not work either. For example, the rug-cleaning service could simply create 100 bogus web pages, each containing a link to the page they want users to view. In 1996, Larry Page and Sergey Brin, while students at Stanford University, invented their PageRank algorithm. It simulates the web itself, represented by a very large directed graph, in which each web page is represented by a node in the graph, and each page link is represented by a directed edge in the graph. The directed graph shown in the figure below could represent a very small network with the same properties: This has four nodes, representing four web pages, A, B, C, and D. The arrows connecting them represent page links. So, for example, page A has a link to each of the other three pages, but page B has a link only to A. To analyze this tiny network, we first identify its transition matrix, M : This square has 16 entries, mij, for 1 ≤ i ≤ 4 and 1 ≤ j ≤ 4. If we assume that a web crawler always picks a link at random to move from one page to another, then mij, equals the probability that it will move to node i from node j, (numbering the nodes A, B, C, and D as 1, 2, 3, and 4). So m12 = 1 means that if it's at node B, there's a 100% chance that it will move next to A. Similarly, m13 = m43 = ½ means that if it's at node C, there's a 50% chance of it moving to A and a 50% chance of it moving to D. Suppose a web crawler picks one of those four pages at random, and then moves to another page, once a minute, picking each link at random. After several hours, what percentage of the time will it have spent at each of the four pages? Here is a similar question. 
Suppose there are 1,000 web crawlers who obey that transition matrix as we've just described, and that 250 of them start at each of the four pages. After several hours, how many will be on each of the four pages? This process is called a Markov chain. It is a mathematical model that has many applications in physics, chemistry, computer science, queueing theory, economics, and even finance. The diagram in the above figure is called the state diagram for the process, and the nodes of the graph are called the states of the process. Once the state diagram is given, the meaning of the nodes (web pages, in this case) becomes irrelevant. Only the structure of the diagram defines the transition matrix M, and from that we can answer the question. A more general Markov chain would also specify transition probabilities between the nodes, instead of assuming that all transition choices are made at random. In that case, those transition probabilities become the non-zero entries of the M. A Markov chain is called irreducible if it is possible to get to any state from any other state. According to the mathematical theory of Markov chains, if the chain is irreducible, then we can compute the answer to the preceding question using the transition matrix. What we want is the steady state solution; that is, a distribution of crawlers that doesn't change. The crawlers themselves will change, but the number at each node will remain the same. To calculate the steady state solution mathematically, we first have to realize how to apply the transition matrix M. The fact is that if x = (x1 , x2 , x3 , x4 ) is the distribution of crawlers at one minute, and the next minute the distribution is y = (y1 , y2 , y3 , y4 ), then y = Mx , using matrix multiplication. So now, if x is the steady state solution for the Markov chain, then Mx = x. This vector equation gives us four scalar equations in four unknowns: One of these equations is redundant (linearly dependent). 
But we also know that x1 + x2 + x3 + x4 = 1, since x is a probability vector. So, we're back to four equations in four unknowns. The solution is:

The point of that example is to show that we can compute the steady state solution to a static Markov chain by solving an n × n matrix equation, where n is the number of states. By static here, we mean that the transition probabilities mij do not change. Of course, that does not mean that we can mathematically compute the web. In the first place, n > 30,000,000,000,000 nodes! And in the second place, the web is certainly not static. Nevertheless, this analysis does give some insight about the web; and it clearly influenced the thinking of Larry Page and Sergey Brin when they invented the PageRank algorithm.

The purpose of the PageRank algorithm is to rank the web pages according to some criteria that would resemble their importance, or at least their frequency of access. The original simple (pre-PageRank) idea was to count the number of links to each page and use something proportional to that count for the rank. Following that line of thought, we can imagine that, if x = (x1, x2, ..., xn)ᵀ is the page rank for the web (that is, if xj is the relative rank of page j and ∑xj = 1), then Mx = x, at least approximately. Another way to put that is that repeated applications of M to x should nudge x closer and closer to that (unattainable) steady state.

That brings us (finally) to the PageRank formula:

f(x) = (1 − ε)Mx + εz/n

where ε is a very small positive constant, z is the vector of all 1s, and n is the number of nodes. The vector expression on the right defines the transformation function f, which replaces a page rank estimate x with an improved page rank estimate. Repeated applications of this function gradually converge to the unknown steady state. Note that in the formula, f is a function of more than just x. There are really four inputs: x, M, ε, and n. Of course, x is being updated, so it changes with each iteration.
But M, ε, and n change too. M is the transition matrix, n is the number of nodes, and ε is a coefficient that determines how much influence the z/n vector has. For example, if we set ε to 0.00005, then the formula becomes:

f(x) = 0.99995Mx + 0.00005z/n

This is how Google's PageRank algorithm can be utilized for the analysis of very large datasets. To learn how to implement this algorithm and various other machine learning algorithms for big data, data visualization, and more using Java, check out the book Java Data Analysis.
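The repeated application of the transformation f(x) = (1 − ε)Mx + εz/n can be sketched on the four-page example. The following is a minimal illustration, not the book's implementation. The text states the links of pages A, B, and C, but not those of page D, so the sketch assumes (purely hypothetically) that D links to B and C. The function name `pagerank_step` and the iteration count are also choices made for this example.

```python
# Column j of M holds the probabilities of moving from page j to each page i.
# Pages are numbered A, B, C, D = 0, 1, 2, 3.
# A links to B, C, D; B links to A; C links to A and D (as in the text).
# ASSUMPTION: D links to B and C (not stated in the text).
M = [
    [0.0, 1.0, 0.5, 0.0],   # into A
    [1/3, 0.0, 0.0, 0.5],   # into B
    [1/3, 0.0, 0.0, 0.5],   # into C
    [1/3, 0.0, 0.5, 0.0],   # into D
]


def pagerank_step(M, x, eps):
    """Apply f(x) = (1 - eps) * M x + eps * z / n once, with z the all-ones vector."""
    n = len(x)
    return [
        (1 - eps) * sum(M[i][j] * x[j] for j in range(n)) + eps / n
        for i in range(n)
    ]


eps = 0.00005
rank = [0.25, 0.25, 0.25, 0.25]   # start from a uniform distribution
for _ in range(200):               # iteration count chosen for illustration
    rank = pagerank_step(M, rank, eps)

for page, value in zip("ABCD", rank):
    print(page, round(value, 4))
```

Under the assumed links for D, the estimate settles close to (1/3, 2/9, 2/9, 2/9), which is also what solving Mx = x together with x1 + x2 + x3 + x4 = 1 gives for this particular matrix. A different choice of links for D would give a different matrix and a different steady state.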


Packt

30 Aug 2013

8 min read

Getting Started with Kinect


Before the birth of Microsoft Kinect, few people were familiar with the technology of motion sensing. Similar devices were originally invented and developed for monitoring aerial and undersea aggressors in wars. In non-military cases, motion sensors are widely used in alarm systems, lighting systems, and so on, where they detect whether someone or something disrupts the waves throughout a room and trigger predefined events. Although radar sensors and modern infrared motion sensors are widely used in our lives, we seldom notice their existence, and can hardly make use of these devices in our own applications. But Kinect changed everything from the time it was launched in North America at the end of 2010. Unlike most other user input controllers, Kinect enables users to interact with programs without really touching a mouse or a pad, but only through gestures. In a top-level view, a Kinect sensor is made up of an RGB camera, a depth sensor, an IR emitter, and a microphone array, which consists of several microphones for sound and voice recognition. A standard Kinect (for Windows) device is shown as follows:

The Kinect device

The Kinect drivers and software, which are either from Microsoft or from third-party companies, can even track and analyze advanced gestures and skeletons of multiple players. All these features make it possible to design brilliant and exciting applications with hands-free user inputs. Kinect has already brought a lot of games and software to an entirely new level. It is believed to be the bridge between the physical world we exist in and the virtual reality we create, a completely new way of interacting with arts, and a profitable business opportunity for individuals and companies.
In this article, we will try to make an interesting game with the popular Kinect technology for user inputs. As Kinect captures the camera and depth images as video streams, we can also merge this view of our real-world environment with virtual elements, which is called Augmented Reality (AR). This enables users to feel as if they appear and live in a nonexistent world, or as if something unbelievable exists in the physical world. In this article, we will first introduce the installation of Kinect hardware and software on personal computers, and then consider an idea that combines Kinect and augmented reality elements.

Before installing the Kinect device on your PC, obviously you should buy Kinect equipment first. In this article, we will depend on Kinect for Windows or Kinect for Xbox 360, which can be learned about and bought at:

http://www.microsoft.com/en-us/kinectforwindows/
http://www.xbox.com/en-US/kinect

Please note that you don't need to buy an Xbox 360 at all. Kinect will be connected to PCs so that we can make custom programs for it. An alternative choice is Kinect for Windows, which is located at:

http://www.microsoft.com/en-us/kinectforwindows/purchase/

Using and developing for either device makes no difference for our purposes.

Installation of Kinect

It is strongly suggested that you have a Windows 7 operating system or higher. It can be either 32-bit or 64-bit, with a dual-core or faster processor. Linux developers can also benefit from third-party drivers and SDKs to manipulate Kinect components. Before we start to discuss the software installation, you can download both the Microsoft Kinect SDK and the Developer Toolkit from:

http://www.microsoft.com/en-us/kinectforwindows/develop/developerdownloads.aspx

In this article, we prefer to develop Kinect-based applications using Kinect SDK Version 1.5 (or higher versions) and the C++ language.
Later versions should be backward compatible so that the source code provided in this article doesn't need to be changed. Setting up your Kinect software on PCs After we have downloaded the SDK and the Developer Toolkit, it's time for us to install them on the PC and ensure that they can work with the Kinect hardware. Let's perform the following steps: Run the setup executable with administrator permissions. Select I agree to the license terms and conditions after reading the License Agreement. The Kinect SDK setup dialog Follow the steps until the SDK installation has finished. And then, install the toolkit following similar instructions. The hardware installation is easy: plug the ends of the cable into the USB port and a power point, and plug the USB into your PC. Wait for the drivers to be found automatically. Now, start the Developer Toolkit Browser, choose Samples: C++ from the tabs, and find and run the sample with the name Skeletal Viewer. You should be able to see a new window demonstrating the depth/ skeleton/color images of the current physical scene, which is similar to the following image: The depth (left), skeleton (middle), and color (right) images read from Kinect Why did I do that? We chose to set up the SDK software at first so that it will install the motor and camera drivers, the APIs, and the documentations, as well as the toolkit including resources and samples onto the PC. If the operation steps are inversed, that is, the hardware is connected before installing the SDK, your Windows OS may not be able to recognize the device. Just start the SDK setup at this time and the device should be identified again during the installation process. But before actually using Kinect, you still have to ensure there is nothing between the device and you (the player). And it's best to keep the play space at least 1.8 m wide and about 1.8 m to 3.6 m long from the sensor. 
If you have more than one Kinect device, don't keep them face-to-face, as there may be infrared interference between them. If you have multiple Kinects to install on the same PC, please note that one USB root hub can have one and only one Kinect connected. The problem happens because Kinect takes over 50 percent of the USB bandwidth, and it needs an individual USB controller to run. So plugging more than one device into the same USB hub means only one of them will work.

The depth image at the left in the preceding image shows a human (in fact, the author) standing in front of the camera. Some parts may be totally black if they are too near (often less than 80 cm) or too far (often more than 4 m). If you are using Kinect for Windows, you can turn on Near Mode to show objects that are near the camera; however, Kinect for Xbox 360 doesn't have such a feature. You can read more about the software and hardware setup at:

http://www.microsoft.com/en-us/kinectforwindows/purchase/sensor_setup.aspx

The idea of the AR-based Fruit Ninja game

Now it's time for us to define the goal we are going to achieve in this article. As a quick but practical guide for Kinect and augmented reality, we should be able to make use of the depth detection, video streaming, and motion tracking functionalities in our project. 3D graphics APIs are also important here, because virtual elements should also be included and should interact with irregular user inputs (not common mouse or keyboard inputs). A fine example is the Fruit Ninja game, which is already a very popular game all over the world. Especially on mobile devices like smartphones and pads, you can see people destroy different kinds of fruits by touching and swiping their fingers on the screen. With the help of Kinect, our arms can act as blades to cut flying fruits, and our images can also be shown along with the virtual environment so that we can determine the posture of our bodies and the position of our arms through the screen display.
Unfortunately, this idea is no longer new. There are already commercial products with similar purposes available in the market; for example:

http://marketplace.xbox.com/en-US/Product/Fruit-Ninja-Kinect/66acd000-77fe-1000-9115-d80258410b79

But please note that we are not going to design a completely different product here, or even bring it to the market after finishing this article. We will only learn how to develop Kinect-based applications, work in our own way from the very beginning, and benefit from the experience in our professional work or as amateurs. So it is okay to reinvent the wheel this time, and have fun in the process and the results.

Summary

Kinect, which is a portmanteau of the words "kinetic" and "connect", is a motion sensor developed and released by Microsoft. It provides a natural user interface (NUI) for tracking and manipulating hands-free user inputs such as gestures and skeleton motions. It can be considered one of the most successful consumer electronics devices in recent years, and we will be using this novel device to build the Fruit Ninja game in this article. We will focus on developing Kinect and AR-based applications on Windows 7 or higher using the Microsoft Kinect SDK 1.5 (or higher) and the C++ programming language. Mainly, we have introduced how to install Kinect for Windows SDK in this article.
