Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1228 Articles
article-image-ireport-netbeans
Packt
19 Mar 2010
3 min read
Save for later

iReport in NetBeans

Packt
19 Mar 2010
3 min read
Creating different types of reports inside the NetBeans IDE The first step is to download the NetBeans IDE and the iReport plugin for this. The iReport plugin for NetBeans is available for free download at the following locations: https://sourceforge.net/projects/ireport/files or http://plugins.netbeans.org/PluginPortal/faces/PluginDetailPage.jsp?pluginid=4425 After downloading the plugin, follow the listed steps to install the plugin in NetBeans: Start the NetBeans IDE. Go to Tools | Plugins. Select the Downloaded tab. Press Add Plugins…. Select the plugin files. For iReport 3.7.0 the plugins are: iReport-3.7.0.nbm, jasperreports-components-plugin-3.7.0.nbm, jasperreportsextensions-plugin-3.7.0.nbm, and jasperserver-plugin-3.7.0.nbm. After opening the plugin files you will see the following screen: Check the Install checkbox of ireport-designer, and press the Install button at the bottom of the window. The following screen will appear: Press Next >, and accept the terms of the License Agreement. If the Verify Certificate dialog box appears, click Continue. Press Install, and wait for the installer to complete the installation. After the installation is done, press Finish and close the Plugins dialog. If the IDE requests for a restart, then do it. Now the IDE is ready for creating reports. Creating reports We have already learnt about creating various types of reports, such as reports without parameters, reports with parameters, reports with variables, subreports, crosstab reports, reports with charts and images, and so on. We have also attained knowledge associated with these types of reports. Now, we will learn quickly how to create these reports using NetBeans with the help of the installed iReport plugins. Creating a NetBeans database JDBC connection The first step is to create a database connection, which will be used by the report data sources. Follow the listed steps: Select the Services tab from the left side of the project window. Select Databases. Right-click on Databases, and press New Connection…. In the New Database Connection dialog, set the following under Basic setting, and check the Remember password checkbox: Option Value Driver Name MySQL (Connector/J Driver) Host localhost Port 3306 Database inventory User Name root Password packt Press OK. Now the connection is created, and you can see this under the Services | Databases section, as shown in the following screenshot: Creating a report data source The NetBeans database JDBC connection created previously will be used by a report data source that will be used by the report. Follow the listed steps to create the data source: From the NetBeans toolbar, press the Report Datasources button. You will see the following dialog box: Press New. Select NetBeans Database JDBC connection, and press Next >. Enter inventory in the Name field, and from the Connection drop-down list, select jdbc:mysql://localhost:3306/inventory [root on Default schema]. Press Test, and if the connection is successful, press Save and close the Connections / Datasources dialog box.
Read more
  • 0
  • 0
  • 27718

article-image-putting-your-database-heart-azure-solutions
Packt
28 Oct 2015
19 min read
Save for later

Putting Your Database at the Heart of Azure Solutions

Packt
28 Oct 2015
19 min read
In this article by Riccardo Becker, author of the book Learning Azure DocumentDB, we will see how to build a real scenario around an Internet of Things scenario. This scenario will build a basic Internet of Things platform that can help to accelerate building your own. In this article, we will cover the following: Have a look at a fictitious scenario Learn how to combine Azure components with DocumentDB Demonstrate how to migrate data to DocumentDB (For more resources related to this topic, see here.) Introducing an Internet of Things scenario Before we start exploring different capabilities to support a real-life scenario, we will briefly explain the scenario we will use throughout this article. IoT, Inc. IoT, Inc. is a fictitious start-up company that is planning to build solutions in the Internet of Things domain. The first solution they will build is a registration hub, where IoT devices can be registered. These devices can be diverse, ranging from home automation devices up to devices that control traffic lights and street lights. The main use case for this solution is offering the capability for devices to register themselves against a hub. The hub will be built with DocumentDB as its core component and some Web API to expose this functionality. Before devices can register themselves, they need to be whitelisted in order to prevent malicious devices to start registering. In the following screenshot, we see the high-level design of the registration requirement: The first version of the solution contains the following components: A Web API containing methods to whitelist, register, unregister, and suspend devices DocumentDB, containing all the device information including information regarding other Microsoft Azure resources Event Hub, a Microsoft Azure asset that enables scalable publish-subscribe mechanism to ingress and egress millions of events per second Power BI, Microsoft’s online offering to expose reporting capabilities and the ability to share reports Obviously, we will focus on the core of the solution which is DocumentDB but it is nice to touch some of the Azure components, as well to see how well they co-operate and how easy it is to set up a demonstration for IoT scenarios. The devices on the left-hand side are chosen randomly and will be mimicked by an emulator written in C#. The Web API will expose the functionality required to let devices register themselves at the solution and start sending data afterwards (which will be ingested to the Event Hub and reported using Power BI). Technical requirements To be able to service potentially millions of devices, it is necessary that registration request from a device is being stored in a separate collection based on the country where the device is located or manufactured. Every device is being modeled in the same way, whereas additional metadata can be provided upon registration or afterwards when updating. To achieve country-based partitioning, we will create a custom PartitionResolver to achieve this goal. To extend the basic security model, we reduce the amount of sensitive information in our configuration files. Enhance searching capabilities because we want to service multiple types of devices each with their own metadata and device-specific information. Querying on all the information is desired to support full-text search and enable users to quickly search and find their devices. Designing the model Every device is being modeled similar to be able to service multiple types of devices. The device model contains at least the deviceid and a location. Furthermore, the device model contains a dictionary where additional device properties can be stored. The next code snippet shows the device model: [JsonProperty("id")]         public string DeviceId { get; set; }         [JsonProperty("location")]         public Point Location { get; set; }         //practically store any metadata information for this device         [JsonProperty("metadata")]         public IDictionary<string, object> MetaData { get; set; } The Location property is of type Microsoft.Azure.Documents.Spatial.Point because we want to run spatial queries later on in this section, for example, getting all the devices within 10 kilometers of a building. Building a custom partition resolver To meet the first technical requirement (partition data based on the country), we need to build a custom partition resolver. To be able to build one, we need to implement the IPartitionResolver interface and add some logic. The resolver will take the Location property of the device model and retrieves the country that corresponds with the latitude and longitude provided upon registration. In the following code snippet, you see the full implementation of the GeographyPartitionResolver class: public class GeographyPartitionResolver : IPartitionResolver     {         private readonly DocumentClient _client;         private readonly BingMapsHelper _helper;         private readonly Database _database;           public GeographyPartitionResolver(DocumentClient client, Database database)         {             _client = client;             _database = database;             _helper = new BingMapsHelper();         }         public object GetPartitionKey(object document)         {             //get the country for this document             //document should be of type DeviceModel             if (document.GetType() == typeof(DeviceModel))             {                 //get the Location and translate to country                 var country = _helper.GetCountryByLatitudeLongitude(                     (document as DeviceModel).Location.Position.Latitude,                     (document as DeviceModel).Location.Position.Longitude);                 return country;             }             return String.Empty;         }           public string ResolveForCreate(object partitionKey)         {             //get the country for this partitionkey             //check if there is a collection for the country found             var countryCollection = _client.CreateDocumentCollectionQuery(database.SelfLink).            ToList().Where(cl => cl.Id.Equals(partitionKey.ToString())).FirstOrDefault();             if (null == countryCollection)             {                 countryCollection = new DocumentCollection { Id = partitionKey.ToString() };                 countryCollection =                     _client.CreateDocumentCollectionAsync(_database.SelfLink, countryCollection).Result;             }             return countryCollection.SelfLink;         }           /// <summary>         /// Returns a list of collectionlinks for the designated partitionkey (one per country)         /// </summary>         /// <param name="partitionKey"></param>         /// <returns></returns>         public IEnumerable<string> ResolveForRead(object partitionKey)         {             var countryCollection = _client.CreateDocumentCollectionQuery(_database.SelfLink).             ToList().Where(cl => cl.Id.Equals(partitionKey.ToString())).FirstOrDefault();               return new List<string>             {                 countryCollection.SelfLink             };         }     } In order to have the DocumentDB client use this custom PartitionResolver, we need to assign it. The code is as follows: GeographyPartitionResolver resolver = new GeographyPartitionResolver(docDbClient, _database);   docDbClient.PartitionResolvers[_database.SelfLink] = resolver; //Adding a typical device and have the resolver sort out what //country is involved and whether or not the collection already //exists (and create a collection for the country if needed), use //the next code snippet. var deviceInAmsterdam = new DeviceModel             {                 DeviceId = Guid.NewGuid().ToString(),                 Location = new Point(4.8951679, 52.3702157)             };   Document modelAmsDocument = docDbClient.CreateDocumentAsync(_database.SelfLink,                 deviceInAmsterdam).Result;             //get all the devices in Amsterdam            var doc = docDbClient.CreateDocumentQuery<DeviceModel>(                 _database.SelfLink, null, resolver.GetPartitionKey(deviceInAmsterdam)); Now that we have created a country-based PartitionResolver, we can start working on the Web API that exposes the registration method. Building the Web API A Web API is an online service that can be used by any clients running any framework that supports the HTTP programming stack. Currently, REST is a way of interacting with APIs so that we will build a REST API. Building a good API should aim for platform independence. A well-designed API should also be able to extend and evolve without affecting existing clients. First, we need to whitelist the devices that should be able to register themselves against our device registry. The whitelist should at least contain a device ID, a unique identifier for a device that is used to match during the whitelisting process. A good candidate for a device ID is the mac address of the device or some random GUID. Registering a device The registration Web API contains a POST method that does the actual registration. First, it creates access to an Event Hub (not explained here) and stores the credentials needed inside the DocumentDB document. The document is then created inside the designated collection (based on the location). To learn more about Event Hubs, please visit https://azure.microsoft.com/en-us/services/event-hubs/.  [Route("api/registration")]         [HttpPost]         public async Task<IHttpActionResult> Post([FromBody]DeviceModel value)         {             //add the device to the designated documentDB collection (based on country)             try             { var serviceUri = ServiceBusEnvironment.CreateServiceUri("sb", serviceBusNamespace,                     String.Format("{0}/publishers/{1}", "telemetry", value.DeviceId))                     .ToString()                     .Trim('/');                 var sasToken = SharedAccessSignatureTokenProvider.GetSharedAccessSignature(EventHubKeyName,                     EventHubKey, serviceUri, TimeSpan.FromDays(365 * 100)); // hundred years will do                 //this token can be used by the device to send telemetry                 //this token and the eventhub name will be saved with the metadata of the document to be saved to DocumentDB                 value.MetaData.Add("Namespace", serviceBusNamespace);                 value.MetaData.Add("EventHubName", "telemetry");                 value.MetaData.Add("EventHubToken", sasToken);                 var document = await docDbClient.CreateDocumentAsync(_database.SelfLink, value);                 return Created(document.ContentLocation, value);            }             catch (Exception ex)             {                 return InternalServerError(ex);             }         } After this registration call, the right credentials on the Event Hub have been created for this specific device. The device is now able to ingress data to the Event Hub and have consumers like Power BI consume the data and present it. Event Hubs is a highly scalable publish-subscribe event ingestor. It can collect millions of events per second so that you can process and analyze the massive amounts of data produced by your connected devices and applications. Once collected into Event Hubs, you can transform and store the data by using any real-time analytics provider or with batching/storage adapters. At the time of writing, Microsoft announced the release of Azure IoT Suite and IoT Hubs. These solutions offer internet of things capabilities as a service and are well-suited to build our scenario as well. Increasing searching We have seen how to query our documents and retrieve the information we need. For this approach, we need to understand the DocumentDB SQL language. Microsoft has an online offering that enables full-text search called Azure Search service. This feature enables us to perform full-text searches and it also includes search behaviours similar to search engines. We could also benefit from so called type-ahead query suggestions based on the input of a user. Imagine a search box on our IoT Inc. portal that offers free text searching while the user types and search for devices that include any of the search terms on the fly. Azure Search runs on Azure; therefore, it is scalable and can easily be upgraded to offer more search and storage capacity. Azure Search stores all your data inside an index, offering full-text search capabilities on your data. Setting up Azure Search Setting up Azure Search is pretty straightforward and can be done by using the REST API it offers or on the Azure portal. We will set up the Azure Search service through the portal and later on, we will utilize the REST API to start configuring our search service. We set up the Azure Search service through the Azure portal (http://portal.azure.com). Find the Search service and fill out some information. In the following screenshot, we can see how we have created the free tier for Azure Search: You can see that we use the Free tier for this scenario and that there are no datasources configured yet. We will do that know by using the REST API. We will use the REST API, since it offers more insight on how the whole concept works. We use Fiddler to create a new datasource inside our search environment. The following screenshot shows how to use Fiddler to create a datasource and add a DocumentDB collection: In the Composer window of Fiddler, you can see we need to POST a payload to the Search service we created earlier. The Api-Key is mandatory and also set the content type to be JSON. Inside the body of the request, the connection information to our DocumentDB environment is need and the collection we want to add (in this case, Netherlands). Now that we have added the collection, it is time to create an Azure Search index. Again, we use Fiddler for this purpose. Since we use the free tier of Azure Search, we can only add five indexes at most. For this scenario, we add an index on ID (device ID), location, and metadata. At the time of writing, Azure Search does not support complex types. Note that the metadata node is represented as a collection of strings. We could check in the portal to see if the creation of the index was successful. Go to the Search blade and select the Search service we have just created. You can check the indexes part to see whether the index was actually created. The next step is creating an indexer. An indexer connects the index with the provided data source. Creating this indexer takes some time. You can check in the portal if the indexing process was successful. We actually find that documents are part of the index now. If your indexer needs to process thousands of documents, it might take some time for the indexing process to finish. You can check the progress of the indexer using the REST API again. https://iotinc.search.windows.net/indexers/deviceindexer/status?api-version=2015-02-28 Using this REST call returns the result of the indexing process and indicates if it is still running and also shows if there are any errors. Errors could be caused by documents that do not have the id property available. The final step involves testing to check whether the indexing works. We will search for a device ID, as shown in the next screenshot: In the Inspector tab, we can check for the results. It actually returns the correct document also containing the location field. The metadata is missing because complex JSON is not supported (yet) at the time of writing. Indexing complex JSON types is not supported yet. It is possible to add SQL queries to the data source. We could explicitly add a SELECT statement to surface the properties of the complex JSON we have like metadata or the Point property. Try adding additional queries to your data source to enable querying complex JSON types. Now that we have created an Azure Search service that indexes our DocumentDB collection(s), we can build a nice query-as-you-type field on our portal. Try this yourself. Enhancing security Microsoft Azure offers a capability to move your secrets away from your application towards Azure Key Vault. Azure Key Vault helps to protect cryptographic keys, secrets, and other information you want to store in a safe place outside your application boundaries (connectionstring are also good candidates). Key Vault can help us to protect the DocumentDB URI and its key. DocumentDB has no (in-place) encryption feature at the time of writing, although a lot of people already asked for it to be on the roadmap. Creating and configuring Key Vault Before we can use Key Vault, we need to create and configure it first. The easiest way to achieve this is by using PowerShell cmdlets. Please visit https://msdn.microsoft.com/en-us/mt173057.aspx to read more about PowerShell. The following PowerShell cmdlets demonstrate how to set up and configure a Key Vault: Command Description Get-AzureSubscription This command will prompt you to log in using your Microsoft Account. It returns a list of all Azure subscriptions that are available to you. Select-AzureSubscription -SubscriptionName "Windows Azure MSDN Premium" This tells PowerShell to use this subscription as being subject to our next steps. Switch-AzureMode AzureResourceManager New-AzureResourceGroup –Name 'IoTIncResourceGroup' –Location 'West Europe' This creates a new Azure Resource Group with a name and a location. New-AzureKeyVault -VaultName 'IoTIncKeyVault' -ResourceGroupName 'IoTIncResourceGroup' -Location 'West Europe' This creates a new Key Vault inside the resource group and provide a name and location. $secretvalue = ConvertTo-SecureString '<DOCUMENTDB KEY>' -AsPlainText –Force This creates a security string for my DocumentDB key. $secret = Set-AzureKeyVaultSecret -VaultName 'IoTIncKeyVault' -Name 'DocumentDBKey' -SecretValue $secretvalue This creates a key named DocumentDBKey into the vault and assigns it the secret value we have just received. Set-AzureKeyVaultAccessPolicy -VaultName 'IoTIncKeyVault' -ServicePrincipalName <SPN> -PermissionsToKeys decrypt,sign This configures the application with the Service Principal Name <SPN> to get the appropriate rights to decrypt and sign Set-AzureKeyVaultAccessPolicy -VaultName 'IoTIncKeyVault' -ServicePrincipalName <SPN> -PermissionsToSecrets Get This configures the application with SPN to also be able to get a key. Key Vault must be used together with Azure Active Directory to work. The SPN we need in the steps for powershell is actually is a client ID of an application I have set up in my Azure Active Directory. Please visit https://azure.microsoft.com/nl-nl/documentation/articles/active-directory-integrating-applications/ to see how you can create an application. Make sure to copy the client ID (which is retrievable afterwards) and the key (which is not retrievable afterwards). We use these two pieces of information to take the next step. Using Key Vault from ASP.NET In order to use the Key Vault we have created in the previous section, we need to install some NuGet packages into our solution and/or projects: Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory -Version 2.16.204221202   Install-Package Microsoft.Azure.KeyVault These two packages enable us to use AD and Key Vault from our ASP.NET application. The next step is to add some configuration information to our web.config file: <add key="ClientId" value="<CLIENTID OF THE APP CREATED IN AD" />     <add key="ClientSecret" value="<THE SECRET FROM AZURE AD PORTAL>" />       <!-- SecretUri is the URI for the secret in Azure Key Vault -->     <add key="SecretUri" value="https://iotinckeyvault.vault.azure.net:443/secrets/DocumentDBKey" /> If you deploy the ASP.NET application to Azure, you could even configure these settings from the Azure portal itself, completely removing this from the web.config file. This technique adds an additional ring of security around your application. The following code snippet shows how to use AD and Key Vault inside the registration functionality of our scenario: //no more keys in code or .config files. Just a appid, secret and the unique URL to our key (SecretUri). When deploying to Azure we could             //even skip this by setting appid and clientsecret in the Azure Portal.             var kv = new KeyVaultClient(new KeyVaultClient.AuthenticationCallback(Utils.GetToken));             var sec = kv.GetSecretAsync(WebConfigurationManager.AppSettings["SecretUri"]).Result.Value; The Utils.GetToken method is shown next. This method retrieves an access token from AD by supplying the ClientId and the secret. Since we configured Key Vault to allow this application to get the keys, the call to GetSecretAsync() will succeed. The code is as follows: public async static Task<string> GetToken(string authority, string resource, string scope)         {             var authContext = new AuthenticationContext(authority);             ClientCredential clientCred = new ClientCredential(WebConfigurationManager.AppSettings["ClientId"],                         WebConfigurationManager.AppSettings["ClientSecret"]);             AuthenticationResult result = await authContext.AcquireTokenAsync(resource, clientCred);               if (result == null)                 throw new InvalidOperationException("Failed to obtain the JWT token");             return result.AccessToken;         } Instead of storing the key to DocumentDB somewhere in code or in the web.config file, it is now moved away to Key Vault. We could do the same with the URI to our DocumentDB and with other sensitive information as well (for example, storage account keys or connection strings). Encrypting sensitive data The documents we created in the previous section contains sensitive data like namespaces, Event Hub names, and tokens. We could also use Key Vault to encrypt those specific values to enhance our security. In case someone gets hold of a document containing the device information, he is still unable to mimic this device since the keys are encrypted. Try to use Key Vault to encrypt the sensitive information that is stored in DocumentDB before it is saved in there. Migrating data This section discusses how to use a tool to migrate data from an existing data source to DocumentDB. For this scenario, we assume that we already have a large datastore containing existing devices and their registration information (Event Hub connection information). In this section, we will see how to migrate an existing data store to our new DocumentDB environment. We use the DocumentDB Data Migration Tool for this. You can download this tool from the Microsoft Download Center (http://www.microsoft.com/en-us/download/details.aspx?id=46436) or from GitHub if you want to check the code. The tool is intuitive and enables us to migrate from several datasources: JSON files MongoDB SQL Server CSV files Azure Table storage Amazon DynamoDB HBase DocumentDB collections To demonstrate the use, we migrate our existing Netherlands collection to our United Kingdom collection. Start the tool and enter the right connection string to our DocumentDB database. We do this for both our source and target information in the tool. The connection strings you need to provide should look like this: AccountEndpoint=https://<YOURDOCDBURL>;AccountKey=<ACCOUNTKEY>;Database=<NAMEOFDATABASE>. You can click on the Verify button to make sure these are correct. In the Source Information field, we provide the Netherlands as being the source to pull data from. In the Target Information field, we specify the United Kingdom as the target. In the following screenshot, you can see how these settings are provided in the migration tool for the source information: The following screenshot shows the settings for the target information: It is also possible to migrate data to a collection that is not created yet. The migration tool can do this if you enter a collection name that is not available inside your database. You also need to select the pricing tier. Optionally, setting the partition key could help to distribute your documents based on this key across all collections you add in this screen. This information is sufficient to run our example. Go to the Summary tab and verify the information you entered. Press Import to start the migration process. We can verify a successful import on the Import results pane. This example is a simple migration scenario but the tool is also capable of using complex queries to only migrate those documents that need to moved or migrated. Try migrating data from an Azure Table storage table to DocumentDB by using this tool. Summary In this article, we saw how to integrate DocumentDB with other Microsoft Azure features. We discussed how to setup the Azure Search service and how create an index to our collection. We also covered how to use the Azure Search feature to enable full-text search on our documents which could enable users to query while typing. Next, we saw how to add additional security to our scenario by using Key Vault. We also discussed how to create and configure Key Vault by using PowerShell cmdlets, and we saw how to enable our ASP.NET scenario application to make use of the Key Vault .NET SDK. Then, we discussed how to retrieve the sensitive information from Key Vault instead of configuration files. Finally, we saw how to migrate an existing data source to our collection by using the DocumentDB Data Migration Tool. Resources for Article: Further resources on this subject: Microsoft Azure – Developing Web API For Mobile Apps [article] Introduction To Microsoft Azure Cloud Services [article] Security In Microsoft Azure [article]
Read more
  • 0
  • 0
  • 27672

article-image-how-we-think-ai-urge-ai-founding-fathers
Neil Aitken
31 May 2018
9 min read
Save for later

We must change how we think about AI, urge AI founding fathers

Neil Aitken
31 May 2018
9 min read
In Manhattan, nearly 15,000 Taxis make around 30 journeys each, per day. That’s nearly half a million paid trips. The yellow cabs are part of the never ending, slow progression of vehicles which churn through the streets of New York. The good news is, after a century of worsening traffic, congestion is about to be ameliorated, at least to a degree. Researchers at MIT announced this week, that they have developed an algorithm to optimise the way taxis find their customers. Their product is allegedly so efficient, it can reduce the required number of cabs (for now, the ones with human drivers) in Manhattan, by a third. That’s a non trivial improvement. The trick, apparently, is to use the cabs as a hustler might cue the ball in Pool – lining the next pick up to start where the last drop off ended. The technology behind the improvement offered by the MIT research team, is the same one that is behind most of the incredible technology news stories of the last 3 years – Artificial Intelligence. AI is now a part of most of the digital interactions we have. It fuels the recommendation engines in YouTube, Spotify and Netflix. It shows you products you might like in Google’s search results and on Amazon’s homepage. Undoubtedly, AI is the hot topic of the time – as you cannot possibly have failed to notice. How AI was created – and nearly died AI was, until recently, a long forgotten scientific curiosity, employed seriously only in Sci-Fi movies. The technology fell in to a ‘Winter’– a time when AI related projects couldn’t get funding and decision makers had given up on the technology - in the late 1980s. It was at that time that much of the fundamental work which underpins today’s AI, concepts like neural networks and backpropagation were codified. Artificial Intelligence is now enjoying a rebirth. Almost every new idea funded by Venture Capitalists has AI baked in. The potential excites business owners, especially those involved in the technology sphere, and scares governments in equal measure. It offers better profits and the potential for mass unemployment as if they are two sides of the same coin. Is is a one in a generation technology improvement, similar to Air Conditioning, mass produced motor car and the smartphone, in that it can be applied to all aspects of the economy at the same time. Just as the iPhone has propelled telecommunications technology forward, and created billions of dollars of sales for phone companies selling mobile data plans, AI is fueling totally new businesses and making existing operations significantly more efficient. Behind the fanfare associated with AI, however, lies a simple truth. Today’s AI algorithms use what’s called ‘narrow’ or ‘domain specific’ intelligence. In simple terms, each current AI implementation is specific to the job it is given. IBM trained their AI system ‘Watson’, to beat human contestants at ‘Jeopardy!’ When Google want to build an ‘AI product’ that can be used to beat a living counterpart at the Chinese board game ‘Go’, they create a new AI system. And so on. A new task requires a new AI system. Judea Pearl, inventor of Bayesian networks and Turing Awardee On AI systems that can move from predicting what will happen to what will cause something Now, one of the people behind those original concepts from the 1980s, which underpin today’s AI solutions is back with an even bigger idea which might push AI forward. Judea Pearl, Chancellor's professor of computer science and statistics at UCLA, and a distinguished visiting professor at the Technion, Israel Institute of Technology was awarded the Turing Award 30 years ago. This award was given to him for the Bayesian mathematical models, which gave modern AI its strength. Pearl’s fundamental contribution to computer science was in providing the logic and decision making framework for computers to operate under uncertainty. Some say it was he who provided the spark which thawed that AI winter. Today, he laments the current state of AI, concerned that the field has evolved very little in the last 3 decades since his important theory was presented. Pearl likens current AI implementations to simple tools which can tell you what’s likely to come next, based on the recognition of a familiar pattern. For example, a medical AI algorithm might be able to look at X-Rays of a human chest and ‘discern’ that the patient has, or does not have, lung cancer based on patterns it has learnt from its training datasets. The AI in this scenario doesn’t ‘know’ what lung cancer is or what a tumor is. Importantly, it is a very long way from understanding that smoking can cause the affliction. What’s needed in AI next, says Pearl, is a critical difference: AIs which are evolved to the point where they can determine not just what will happen next, but what will cause it. It’s a fundamental improvement, of the same magnitude as his earlier contributions. Causality – what Pearl is proposing - is one of the most basic units of scientific thought and progress. The ability to conduct a repeatable experiment, showing that A caused B, in multiple locations and have independent peers review the results is one of the fundamentals of establishing truth. In his most recent publication, ‘The Book Of Why’,  Pearl outlines how we can get AI, from where it is now, to where it can develop an understanding of these causal relationships. He believes the first step is to cement the building blocks of reality – ‘what is a lung’, ‘what is smoke’ and that we’ll be able to do in the next 10 years. Geoff Hinton, Inventor of backprop and capsule nets On AI which more closely mimics the human brain Geoff Hinton’s was the mind behind backpropagation, another of the fundamental technologies which has brought AI to the point it is at today. To progress AI, however, he says we might have to start all over again. Hinton has developed (and produced two papers for the University of Toronto to articulate) a new way of training AI systems, involving something he calls ‘Capsule Networks’ – a concept he’s been working on for 30 years, in an effort to improve the capabilities of the backpropagation algorithms he developed. Capsule networks operate in a manner similar to the human brain. When we see an image, our brains breaks it down to it’s components and processes them in parallel. Some brain neurons recognise edges through contrast differences. Others look for corners by examining the points at which edges intersect. Capsule Networks are similar, several acting on a picture at one time, identifying, for example, an ear or a nose on an animal, irrespective of the angle from which it is being viewed. This is a big deal as until now, CNNs (convolution neural networks), the set of AI algorithms that are most often used in image and video recognition systems, could recognize images as well as humans do. CNNs, however, find it hard to recognize images if their angle is changed. It’s too early to judge whether capsule networks are the key to the next step in the AI revolution, but in many tasks, Capsule Networks are identifying images faster and more accurately than current capabilities allow. Andrew Ng, Chief Scientist at Baidu On AI that can learn without humans Andrew Ng is the co-inventor of Google Brain, the team and project that Alphabet put together in 2011 to explore Artificial Intelligence. He now works for Baidu, China’s most successful search engine – analogous in size and scope to Google in the rest of the world. At the moment, he heads up Baidu’s Silicon Valley AI research facility. Beyond concerns over potential job displacement caused by AI, an issue so significant he says it is perhaps all we should be thinking about when it comes to Artificial Intelligence, he suggests that, in the future, the most progress will be made when AI systems can team themselves without human involvement. At the moment, training an AI, even on something that, to us is simple, such as what a cat looks like, is a complicated process. The procedure involves ‘supervised learning.’ It’s shown a lot of pictures (when they did this at Google, they used 10 million images), some of which are cats - labelled appropriately by humans. Once a sufficient level of ‘education’ has been undertaken, the AI can then accurately label cats, most of the time. Ng thinks supervision is problematic, he describes it as having an Achilles heel in the form of the quantity of data that is required. To go beyond current capabilities, says Ng, will require a completely new type of technology – one which can learn through ‘unsupervised learning’ -  machines learning from data that has not been classified by humans. Progress on unsupervised learning is slow. At both Baidu and Google, engineers are focussing on constrained versions of unsupervised learning such as training AI systems to learn about a human face and then using them to create a face themselves. The activity requires that the AI develops what we would call an ‘internal representation’ of a face – something which is required in any unsupervised learning. Other avenues to train without supervision include, ingeniously, pitting an AI system against a computer game – an environment in which they receive feedback (through points awarded in the game) for ‘constructive’ activities, but within which they are not taught directly by a human. Next generation AI depends on ‘scrubbing away’ existing assumptions Artificial Intelligence, as it stands will deliver economy wide efficiency improvements, the likes of which we have not seen in decades. It seems incredible to think that the field is still in its infancy when it can deliver such substantial benefits – like reduced traffic congestion, lower carbon emissions and saved time in New York Taxis. But it is. Isaac Azimov who developed his own concepts behind how Artificial Intelligence might be trained with simple rules said “Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in.” The author should rest assured. Between them, Pearl, Hinton and Ng are each taking revolutionary approaches to elevate AI beyond even the incredible heights it has reached, and starting without reference to the concepts which have brought us this far. 5 polarizing Quotes from Professor Stephen Hawking on artificial intelligence Toward Safe AI – Maximizing your control over Artificial Intelligence Decoding the Human Brain for Artificial Intelligence to make smarter decisions
Read more
  • 0
  • 0
  • 27545

article-image-salesforce-crm-lightning-experience
Richa Tripathi
02 May 2018
4 min read
Save for later

Build a custom Admin Home page in Salesforce CRM Lightning Experience

Richa Tripathi
02 May 2018
4 min read
Today, we will learn how to build a custom Admin Home page and a Home page template with the Salesforce Lightning App Builder. [box type="note" align="" class="" width=""]This article is an excerpt from a book written by Paul Goodey, titled Salesforce CRM Admin Cookbook - Second Edition. This book will help you implement advanced user interface techniques to improve the look and feel of Salesforce CRM.[/box] The sales performance or opportunity top deals components, which are provided out of the box on the Homepage, give sales people a very good insight into their sale activities. However, sales performance may not necessarily be relevant to or desired by non-sales users with job functions in other areas, such as marketing, service, finance, or even Salesforce system administration. Rather than presenting the default Home page to all users in all business functions, you can create custom Home pages with features relevant to specific types of users, and assign the customized pages to different user profiles with Lightning App Builder in Lightning Experience. There are two methods of creating custom Home pages in Salesforce CRM Lightning Experience. These methods involve either editing an existing Home page or creating a new page using a Home page template. In this recipe, we will create a new custom Home page using Lightning App Builder and a Home page template. How to do it... Carry out the following steps to build a new custom Home page using Lightning App Builder: Click on the Setup gear icon at the top right of the main Home page, as shown in the following screenshot:      Click the Setup option, as shown in the following screenshot: Type app builder in the Quick Find search box, as shown in the following screenshot: Select the Lightning App Builder option. Click the New button, as shown in the following screenshot: In the resulting Create a New Lightning Page dialog, choose the Home page Lightning Experience page type, as shown in the following screenshot: Click on Next. Now enter Admin in the Label box presented in the next dialog and then click on Next. In the final dialog, keep the tab option set as CHOOSE PAGE TEMPLATE, which shows the selection of Standard Home Page  as default, as shown in the following screenshot: 10. Click on Finish. 11. In the resulting Home page Layout screen, drag the desired components from the left-hand components pane, which contains all the standard components available for the Home page, onto the canvas section. Here, we will drag the Recent Items to the top section, the Chatter Feed to the bottom left, the Chatter Publisher to the bottom right, and the App Launcher component to the right-hand section. 12. Enter This is a custom Home page created for use by Salesforce CRM Administrators in the Description box of the page, as shown in the following screenshot: 13. Finally, click on Save, as shown in the following screenshot: 14. After the page has been saved, the Home page must be activated. Upon saving the page, the Activation... option will be visible, as shown in the following screenshot: 15. Click on Activation... . When clicking on Save for the very first time you will be presented with a Page Saved dialog that provides an Activate button to active the page. The dialog also presents a message saying Activate this page to make it visible to your users along with a checkbox with the caption Don't show this message again, which, when checked, prevents the dialog from reappearing. 16. In the resulting Activation dialog, choose the Assign this Home page to specific profiles option, as shown in the following screenshot: 17. Click on Next. 18. In the resulting Select Profiles dialog, choose the System Administrator profile, as shown in the following screenshot: 19. Click on Next. 20. Finally, in the resulting Review Assignments confirmation dialog, click on Activate, as shown in the following screenshot: How it works... When system administrators navigate to the Salesforce CRM Home page, they are presented with a custom Home page, as shown in the following screenshot: If you found our post useful, do check out this book Salesforce CRM Admin Cookbook - Second Edition, to explore advanced features of the Salesforce CRM’s Lightning interface. Getting Started with Salesforce Lightning Experience How to create and prepare your first dataset in Salesforce Einstein  
Read more
  • 0
  • 0
  • 27455

article-image-bad-metadata-can-get-you-in-legal-hot-water
Guest Contributor
21 Sep 2019
6 min read
Save for later

Bad Metadata can get you in legal hot water

Guest Contributor
21 Sep 2019
6 min read
Metadata isn't just something that concerns business intelligence and IT teams; but lawyers are extremely interested in it as well. Metadata, it turns out, can win or lose lawsuits, send politicians to jail, and even decide medical malpractice cases. It's not uncommon for attorneys who conduct discovery of electronic records in organizations to find that the claims of plaintiffs or defendants are contradicted by metadata, like time and date, type of data, etc. If a discovery process is initiated against them, an organization had better be sure that its metadata is in order. All it would take for an organization to lose a case would be for an attorney to discover a discrepancy in different databases – a different time stamp on some communication, a different job title for a principal in the case. Such discrepancies could lead to accusations of data tampering, fraud, or worse – and would most definitely put the organization in a very tough position versus a judge or jury. Metadata errors are difficult to spot The problem with that, of course, is that catching metadata errors is extremely difficult. In large organizations, data is stored in repositories that are spread throughout the organization, maybe even the world – in different departments. Each department is responsible for maintaining its own database, and the metadata in it; and on different cloud storage repositories, which may have their own system of classifying data. An enterprising attorney could have a field day with the different categories and tags data is stored under, making claims that the organization is trying to “hide something.” The organization's only defense: We're poor administrators. That may not be enough to impress the court. Types of Metadata Metadata is “data about data,” and comes in three flavors: System Metadata, which is data that is automatically generated from the computer and includes specific labeled criteria, like the date and time of creation and date a document was modified, etc. Substantive Metadata reflects changes to a document, like tracked changes. Embedded metadata is data entered into a document or file but not normally visible, such as formulas in cells in an Excel spreadsheet. All of these have increasingly become targets for attorneys in recent years. Metadata has been used in thousands of cases – medical, financial, patent and trademark law, product liability, civil rights, and many more. Metadata is both discoverable and admissible as evidence. According to one New York court, “General information about the creation of a document, including who authored a document and when it was created, is pedigree information often important for purposes of determining admissibility at trial.” According to legal experts, “from a legal standpoint metadata is evidence… that describes the characteristics, origins, usage, and validity of other electronic evidence.” The biggest metadata-linked payout until now - $10.8 million – occurred in 2017, when a jury awarded a plaintiff $8 million (eventually this was increased to nearly $11 million) after claiming he was fired from a biotechnology company after telling authorities about potential bribery in China. The key piece of evidence was the metadata timestamp on a performance review that was written after the plaintiff was fired; with that evidence, the court increased the defendant's payout for violating laws against firing whistleblowers. In that case, records claiming that the employee was fired for cause were belied by the metadata in the performance review. That, of course, was a case in which there was clear wrongdoing by an organization. But the same metadata errors could have cropped up in any number of scenarios, even if no laws were broken. The precedent in this case, and others like it, might be enough to convince the court to penalize an organization based on claims of a plaintiff. How can organizations defend themselves from this legal bind of metadata The answer would seem obvious; get control of your metadata and make sure it corresponds to the data it represents. With that kind of control over data, organizations would discover for themselves if something was amiss that could cost them in a settlement later. But execution of that obvious answer is a different story. With reams of data to pore through, it would take an organization's business intelligence team months, or even years, to manually sift through the databases. And because to err is human, there would be no guarantee they hadn't missed something. Clearly Business Intelligence and Data Analysis teams need some help in doing this. One solution would be to hire more staff, expanding teams at least temporarily to make sense of the data and metadata that could prove problematic. There are services that will lend their staff to an organization to do just that, and for companies that prefer the “human touch,” adding that temporary staff may be the best solution. Another idea is to automate the process, with advanced tools that will do a full examination of data, both across systems and within repositories themselves. Such automated tools would examine the data in the various repositories and find where the metadata for the same information is different – pointing BI teams in the right direction and cutting down on the time needed to determine what needs to be fixed. Using automated metadata management tools, companies can ensure that they remain secure. If a company is being sued and discovery has commenced, it will be too late for the organization to fix anything. Honest mistakes or disorganized file keeping can no longer be corrected, and the fate of the organization will be in the hands of a jury or a judge. Automated metadata management tools can help Business Intelligence and Data Analysis teams figure out which metadata entries are not consistent across the repositories, ensuring that things are fixed before discovery takes place. There are a variety of tools on the market, with various strengths and weaknesses. Companies will need to decide whether a data dictionary, a business glossary, or a more all-encompassing product best answers their needs. They’ll also need to make sure the enterprise software they currently use is supported by the metadata management solution they are after. As the market develops, AI will be a huge distinguishing factor between metadata solutions, as machine learning will reduce the cost and manpower investment of solution onboarding significantly. With the success of recent metadata-based lawsuits, you can be sure more attorneys will be using metadata in their discovery processes. Organizations that want to defend themselves need to get their data in order, and ensure that they won't end up losing lots of money because of their own errors. Author Bio Amnon Drori is the Co-Founder and CEO of Octopai and has over 20 years of leadership experience in technology companies. Before co-founding Octopai he led sales efforts at companies like Panaya (Acquired by Infosys), Zend Technologies (Acquired by Rogue Wave Software), ModusNovo and Alvarion. Other interesting news in Tech Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data and Society report. Open AI researchers advance multi-agent competition by training AI agents in a hide and seek environment. France and Germany reaffirm blocking Facebook’s Libra cryptocurrency
Read more
  • 0
  • 0
  • 27245

article-image-data-explorationusing-spark-sql
Packt
08 Mar 2018
9 min read
Save for later

Data Exploration using Spark SQL

Packt
08 Mar 2018
9 min read
In this article, Aurobindo Sarkar, the author of the book, Learning Spark SQL, we will be covering the following points to introduce you to using Spark SQL for exploratory data analysis. What is exploratory Data Analysis (EDA)? Why EDA is important? Using Spark SQL for basic data analysis Visualizing data with Apache Zeppelin Introducing Exploratory Data Analysis (EDA) Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other more sophisticated analysis. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that a more sophisticated model-based analysis is not justified. Typically, the graphical techniques used in EDA are simple, consisting of plotting the raw data and simple statistics. The focus is on the structures and models revealed by the data or best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. In most EDA techniques, we use all of the data, without making any underlying assumptions. The analyst builds intuition, or gets a “feel”, for the data set as a result of such exploration. More specifically, the graphical techniques allow us to, efficiently, select and validate appropriate models, test our assumptions, identify relationships, select estimators, and detect outliers. EDA involves a lot of trial and error, and several iterations.The best way is to start simple and then build in complexity as you go along. There is a major trade-off in modeling between the simple and the more accurate ones. Simple models may be much easier to interpret and understand. These models can get you to 90% accuracy very quickly, versus a more complex model that might take weeks or months to get you an additional 2% improvement.For example, you should plot simple histograms and scatterplots to quickly start developing an intuition for your data. Using Spark SQL for basic data analysis Interactively, processing and visualizing large data is challenging as the queries can take long to execute and the visual interface cannot accommodate as many pixels as data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java, and R APIs, and libraries for distributed statistics and machine learning. For data that fits into a single computer there are many good tools available such as R, MATLAB, and others. However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data then this section will offer some good tools and techniques data exploration. Here, we will do some basic data exploration exercises to understand a sample dataset. We will use a dataset that contains data related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The marketing campaigns were based on phone calls to customers. We use the bank-additional-full.csv file that contains 41188 records and 20 input fields, ordered by date (from May 2008 to November 2010). As a first step, let’s define a schema and read in the CSV file to create a DataFrame. You can use :paste to paste-in the initial set of statements in the Spark shell, as shown in the following figure: After the DataFrame is created, we first verify the number of records. We can also define a case class called Call for our input records, and then create a strongly-typed Dataset as follows: In the next section, we will begin our data exploration by identifying missing data in our dataset. Identifying Missing Data Missing data can occur in datasets due to reasons ranging from negligence to a refusal on part of respondants to provide a specific data point. However, in all cases missing data is a common occurrence in real-world datasets. Missing data can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies for dealing with it. Here, weanalyzethe numbers of records with missing data fields in our sample dataset. In order to simulate missing data, we will edit our sample dataset by replacing fields containing “unknown” values with empty strings. First, we created a DataFrame / Dataset from our edited file, as shown in the following figure: The following two statements give us a count of rows with certain fields having missing data. Later, we will look at effective ways of dealing with missing data and compute some basic statistics for sample dataset to improve our understanding of the data. Computing basic statistics Computing basic statistics is essential for a good preliminary understanding of our data. First, for convenience, we create a case class and a dataset containing a subset of fields from our original DataFrame. In our following example, we choose some of the numeric fields and the outcome field that is the “term deposit subscribed” field. Next, we use describe to quickly compute the count, mean, standard deviation, min and max values for the numeric columns in our dataset. Further, we use the stat package to compute additional statistics like covariance, correlation, creating crosstabs, examining items that occur most frequently in data columns, and computing quantiles. These computations are shown in the following figure: Next, we use the typed aggregation functions to summarize our data to understand our data better. In the following statement, we aggregate the results by whether a term deposit was subscribed along with total customers contacted, average number of calls made per customer, the average duration of the calls and the average number of previous calls made to such customers. The results are rounded to two decimal points. Similarly, executing the following statement givessimilar results by customers’ age.   After getting a better understanding of our data by computing basic statistics, we shift our focus to identifying outliers in our data. Identifying data outliers An outlier or an anomalyis an observation of the data that deviates significantly from other observations in the dataset. Erroneous outliers are observations thatare distorted due to possible errors in the data-collection process. These outliers may exert undue influence on the results of statistical analysis, sothey should be identified using reliable detection methods prior to performing data analysis. Many algorithms find outliers as a side-product of clustering algorithms. However these techniques define outliers as points, which do not lie in clusters. The user has to model the data points using a statistical distributions, and the outliers identified depending on how they appear in relation to the underlying model. The main problem with these approaches is that during EDA, the user typically doesnot have enough knowledge about the underlying data distribution. EDA, using a modeling and visualizing approach, is a good way of achieving a deeper intuition of our data. SparkMLlib supports a large (and growing) set of distributed machine learning algorithms to make this task simpler. In the following example, we use the k-means clustering algorithm to compute two clusters in our data. Other distributed algorithms useful for EDA includeclassification, regression, dimensionality reduction, correlation and hypothesis testing. Visualizing data with Apache Zeppelin Typically, we will generate many graphs to verify our hunches about the data.A lot of thesequick and dirty graphs used during EDA are,ultimately, discarded. Exploratory data visualization is critical for data analysis and modeling. However, we often skip exploratory visualization with large data because it is hard. For instance, browsers, typically, cannot handle millions of data points.Hence we have to summarize, sample or model our data before we can effectively visualize it. Traditionally, BI tools provided extensive aggregation and pivoting features to visualize the data. However, these tools typically used nightly jobs to summarize large volumes of data. The summarized data wassubsequently downloaded and visualizedon the practitioner’s workstations. Spark can eliminate many of these batch jobs to support interactive data visualization. Here, we will explore some basic data visualization techniques using Apache Zeppelin. Apache Zeppelin is a web-based tool that supports interactive data analysis and visualization. It supports several language interpreters and comes with built-in Spark integration. Hence, it is quick and easy to get started with exploratory data analysis using Apache Zeppelin. You can download Appache Zeppelin from https://zeppelin.apache.org/. Unzip the package on your hard drive and start Zeppelin using the following command: Aurobindos-MacBook-Pro-2:zeppelin-0.6.2-bin-allaurobindosarkar$ bin/zeppelin-daemon.sh start You should see the following message: Zeppelin start                                             [ OK] You should be able to see the Zeppelin home page at: http://localhost:8080/ Click on Create new note link, and specify a path and name for your notebook, as shown in the following figure: In the next step, we paste the same code as in the beginning of this article to create a DataFrame for our sample dataset. We can execute typical DataFrameoperations as shown in the following figure: Next, we create a table from our DataFrame and execute some SQL on it. The results of the SQL statements execution can be charted by clicking on the appropriatechart-type required. Here, we create bar charts as an illustrative example of summarizing and visualizing data: We can also plot a scatter plot, and read the coordinate values of each of the points plotted, as shown in the following two figures. Additionally, we can create a textbox that accepts input values to make experience interactive. In the following figure we create a textbox that can accept different values for the age parameter and the bar chart is updated, accordingly. Similarly, we can also create dropdown lists where the user can select the appropriate option, and the table of values or chart, automatically gets updated. Summary In this article, we demonstrated using Spark SQL for exploring datasets, performing basic data quality checks, generating samples and pivot tables, and visualizing data with Apache Zeppelin.
Read more
  • 0
  • 0
  • 27233
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at ₹800/month. Cancel anytime
article-image-how-to-build-a-cold-start-friendly-content-based-recommender-using-apache-spark-sql
Sunith Shetty
14 Dec 2017
12 min read
Save for later

How to build a cold-start friendly content-based recommender using Apache Spark SQL

Sunith Shetty
14 Dec 2017
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Big Data Analytics with Java written by Rajat Mehta. In this book, you will learn how to perform analytics on big data using Java, machine learning and other big data tools. [/box] From the below given article, you will learn how to create the content-based recommendation system using movielens dataset. Getting started with content-based recommendation systems In content-based recommendations, the recommendation systems check for similarity between the items based on their attributes or content and then propose those items to the end users. For example, if there is a movie and the recommendation system has to show similar movies to the users, then it might check for the attributes of the movie such as the director name, the actors in the movie, the genre of the movie, and so on or if there is a news website and the recommendation system has to show similar news then it might check for the presence of certain words within the news articles to build the similarity criteria. As such the recommendations are based on actual content whether in the form of tags, metadata, or content from the item itself (as in the case of news articles). Dataset For this article, we are using the very popular movielens dataset, which was collected by the GroupLens Research Project at the University of Minnesota. This dataset contains a list of movies that are rated by their customers. It can be downloaded from the site https:/grouplens.org/datasets/movielens. There are few files in this dataset, they are: u.item:This contains the information about the movies in the dataset. The attributes in this file are: Movie Id This is the ID of the movie Movie title This is the title of the movie Video release date This is the release date of the movie Genre (isAction, isComedy, isHorror) This is the genre of the movie. This is represented by multiple flats such as isAction, isComedy, isHorror, and so on and as such it is just a Boolean flag with 1 representing users like this genre and 0 representing user do not like this genre. u.data: This contains the data about the users rating for the movies that they have watched. As such there are three main attributes in this file and they are: Userid This is the ID of the user rating the movie Item id This is the ID of the movie Rating This is the rating given to the movie by the user u.user: This contains the demographic information about the users. The main attributes from this file are: User id This is the ID of the user rating the movie Age This is the age of the user Gender This is the rating given to the movie by the user ZipCode This is the zip code of the place of the user As the data is pretty clean and we are only dealing with a few parameters, we are not doing any extensive data exploration in this article for this dataset. However, we urge the users to try and practice data exploration on this dataset and try creating bar charts using the ratings given, and find patterns such as top-rated movies or top-rated action movies, or top comedy movies, and so on. Content-based recommender on MovieLens dataset We will build a simple content-based recommender using Apache Spark SQL and the MovieLens dataset. For figuring out the similarity between movies, we will use the Euclidean Distance. This Euclidean Distance would be run using the genre properties, that is, action, comedy, horror, and so on of the movie items and also run on the average rating for each movie. First is the usual boiler plate code to create the SparkSession object. For this we first build the Spark configuration object and provide the master which in our case is local since we are running this locally in our computer. Next using this Spark configuration object we build the SparkSession object and also provide the name of the application here and that is ContentBasedRecommender. SparkConf sc = new SparkConf().setMaster("local"); SparkSession spark = SparkSession      .builder()      .config(sc)      .appName("ContentBasedRecommender")      .getOrCreate(); Once the SparkSession is created, load the rating data from the u.data file. We are using the Spark RDD API for this, the user can feel free to change this code and use the dataset API if they prefer to use that. On this Java RDD of the data, we invoke a map function and within the map function code we extract the data from the dataset and populate the data into a Java POJO called RatingVO: JavaRDD<RatingVO> ratingsRDD = spark.read().textFile("data/movie/u.data").javaRDD() .map(row -> { RatingVO rvo = RatingVO.parseRating(row); return rvo; }); As you can see, the data from each row of the u.data dataset is passed to the RatingVO.parseRating method in the lambda function. This parseRating method is declared in the POJO RatingVO. Let's look at the RatingVO POJO first. For maintaining brevity of the code, we are just showing the first few lines of the POJO here: public class RatingVO implements Serializable {   private int userId; private int movieId; private float rating; private long timestamp;   private int like; As you can see in the preceding code we are storing movieId, rating given by the user and the userId attribute in this POJO. This class RatingVO contains one useful method parseRating as shown next. In this method, we parse the actual data row (that is tab separated) and extract the values of rating, movieId, and so on from it. In the method, do note the if...else block, it is here that we check whether the rating given is greater than three. If it is, then we take it as a 1 or like and populate it in the like attribute of RatingVO, else we mark it as 0 or dislike for the user: public static RatingVO parseRating(String str) {      String[] fields = str.split("t");      ...      int userId = Integer.parseInt(fields[0]);      int movieId = Integer.parseInt(fields[1]);      float rating = Float.parseFloat(fields[2]); if(rating > 3) return new RatingVO(userId, movieId, rating, timestamp,1);      return new RatingVO(userId, movieId, rating, timestamp,0); } Once we have the RDD object built with the POJO's RatingVO contained per row, we are now ready to build our beautiful dataset object. We will use the createDataFrame method on the spark session and provide the RDD containing data and the corresponding POJO class, that is, RatingVO. We also register this dataset as a temporary view called ratings so that we can fire SQL queries on it: Dataset<Row> ratings = spark.createDataFrame(ratingsRDD, RatingVO.class); ratings.createOrReplaceTempView("ratings"); ratings.show(); The show method in the preceding code would print the first few rows of this dataset as shown next: Next we will fire a Spark SQL query on this view (ratings) and in this SQL query we will find the average rating in this dataset for each movie. For this, we will group by on the movieId in this query and invoke an average function on the rating  column as shown here: Dataset<Row> moviesLikeCntDS = spark.sql("select movieId,avg(rating) likesCount from ratings group by movieId"); The moviesLikeCntDS dataset now contains the results of our group by query. Next we load the data for the movies from the u.item data file. As we did for users, we store this data for movies in a MovieVO POJO. This MovieVO POJO contains the data for the movies such as the MovieId, MovieTitle and it also stores the information about the movie genre such as action, comedy, animation, and so on. The genre information is stored as 1 or 0. For maintaining the brevity of the code, we are not showing the full code of the lambda function here: JavaRDD<MovieVO> movieRdd = spark.read().textFile("data/movie/u.item").javaRDD() .map(row -> {  String[] strs = row.split("|");  MovieVO mvo = new MovieVO();  mvo.setMovieId(strs[0]);        mvo.setMovieTitle(strs[1]);  ...  mvo.setAction(strs[6]); mvo.setAdventure(strs[7]);     ...  return mvo; }); As you can see in the preceding code we split the data row and from the split result, which is an array of strings, we extract our individual values and store in the MovieVO object. The results of this operation are stored in the movieRdd object, which is a Spark RDD. Next, we convert this RDD into a Spark dataset. To do so, we invoke the Spark createDataFrame function and provide our movieRdd to it and also supply the MovieVO POJO here. After creating the dataset for the movie RDD, we perform an important step here of combining our movie dataset with the moviesLikeCntDS dataset we created earlier (recall that moviesLikeCntDS dataset contains our movie ID and the average rating for that review in this movie dataset). We also register this new dataset as a temporary view so that we can fire Spark SQL queries on it: Dataset<Row> movieDS = spark.createDataFrame(movieRdd.rdd(), MovieVO.class).join(moviesLikeCntDS, "movieId"); movieDS.createOrReplaceTempView("movies"); Before we move further on this program, we will print the results of this new dataset. We will invoke the show method on this new dataset: movieDS.show(); This would print the results as (for brevity we are not showing the full columns in the screenshot): Now comes the turn for the meat of this content recommender program. Here we see the power of Spark SQL, we will now do a self join within a Spark SQL query to the temporary view movie. Here, we will make a combination of every movie to every other movie in the dataset except to itself. So if you have say three movies in the set as (movie1, movie2, movie3), this would result in the combinations (movie1, movie2), (movie1, movie3), (movie2, movie3), (movie2, movie1), (movie3, movie1), (movie3, movie2). You must have noticed by now that this query would produce duplicates as (movie1, movie2) is same as (movie2, movie1), so we will have to write separate code to remove those duplicates. But for now the code for fetching these combinations is shown as follows, for maintaining brevity we have not shown the full code: Dataset<Row> movieDataDS = spark.sql("select m.movieId movieId1,m.movieTitle movieTitle1,m.action action1,m.adventure adventure1, ... " + "m2.movieId movieId2,m2.movieTitle movieTitle2,m2.action action2,m2.adventure adventure2, ..." If you invoke show on this dataset and print the result, it would print a lot of columns and their first few values. We will show some of the values next (note that we show values for movie1 on the left-hand side and the corresponding movie2 on the right-hand side): Did you realize by now that on a big data dataset this operation is massive and requires extensive computing resources. Thankfully on Spark you can distribute this operation to run on multiple computer notes. For this, you can tweak the spark-submit job with the following parameters so as to extract the last bit of juice from all the clustered nodes: spark-submit executor-cores <Number of Cores> --num-executor <Number of Executors> The next step is the most important one in this content management program. We will run over the results of the previous query and from each row we will pull out the data for movie1 and movie2 and cross compare the attributes of both the movies and find the Euclidean Distance between them. This would show how similar both the movies are; the greater the distance, the less similar the movies are. As expected, we convert our dataset to an RDD and invoke a map function and within the lambda function for that map we go over all the dataset rows and from each row we extract the data and find the Euclidean Distance. For maintaining conciseness, we depict only a portion of the following code. Also note that we pull the information for both movies and store the calculated Euclidean Distance in a EuclidVO Java POJO object: JavaRDD<EuclidVO> euclidRdd = movieDataDS.javaRDD().map( row -> {    EuclidVO evo = new EuclidVO(); evo.setMovieId1(row.getString(0)); evo.setMovieTitle1(row.getString(1)); evo.setMovieTitle2(row.getString(22)); evo.setMovieId2(row.getString(21));    int action = Math.abs(Integer.parseInt(row.getString(2)) – Integer.parseInt(row.getString(23)) ); ... double likesCnt = Math.abs(row.getDouble(20) - row.getDouble(41)); double euclid = Math.sqrt(action * action + ... + likesCnt * likesCnt); evo.setEuclidDist(euclid); return evo; }); As you can see in the bold text in the preceding code, we are calculating the Euclid Distance and storing it in a variable. Next, we convert our RDD to a dataset object and again register it in a temporary view movieEuclids and now are ready to fire queries for our predictions: Dataset<Row> results = spark.createDataFrame(euclidRdd.rdd(), EuclidVO.class); results.createOrReplaceTempView("movieEuclids"); Finally, we are ready to make our predictions using this dataset. Let's see our first prediction; let's find the top 10 movies that are closer to Toy Story, and this movie has movieId of 1. We will fire a simple query on the view movieEuclids to find this: spark.sql("select * from movieEuclids where movieId1 = 1 order by euclidDist asc").show(20); As you can see in the preceding query, we order by Euclidean Distance in ascending order as lesser distance means more similarity. This would print the first 20 rows of the result as follows: As you can see in the preceding results, they are not bad for content recommender system with little to no historical data. As you can see the first few movies returned are also famous animation movies like Toy Story and they are Aladdin, Winnie the Pooh, Pinocchio, and so on. So our little content management system we saw earlier is relatively okay. You can do many more things to make it even better by trying out more properties to compare with and using a different similarity coefficient such as Pearson Coefficient, Jaccard Distance, and so on. From the article, we learned how content recommenders can be built on zero to no historical data. These recommenders are based on the attributes present on the item itself using which we figure out the similarity with other items and recommend them. To know more about data analysis, data visualization and machine learning techniques you can refer to the book Big Data Analytics with Java.  
Read more
  • 0
  • 0
  • 27156

article-image-implement-neural-network-single-layer-perceptron
Pravin Dhandre
28 Dec 2017
10 min read
Save for later

How to Implement a Neural Network with Single-Layer Perceptron

Pravin Dhandre
28 Dec 2017
10 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rodolfo Bonnin titled Machine Learning for Developers. This book is a systematic developer’s guide for various machine learning algorithms and techniques to develop more efficient and intelligent applications.[/box] In this article we help you go through a simple implementation of a neural network layer by modeling a binary function using basic python techniques. It is the first step in solving some of the complex machine learning problems using neural networks. Take a look at the following code snippet to implement a single function with a single-layer perceptron: import numpy as np import matplotlib.pyplot as plt plt.style.use('fivethirtyeight') from pprint import pprint %matplotlib inline from sklearn import datasets import matplotlib.pyplot as plt Defining and graphing transfer function types The learning properties of a neural network would not be very good with just the help of a univariate linear classifier. Even some mildly complex problems in machine learning involve multiple non-linear variables, so many variants were developed as replacements for the transfer functions of the perceptron. In order to represent non-linear models, a number of different non-linear functions can be used in the activation function. This implies changes in the way the neurons will react to changes in the input variables. In the following sections, we will define the main different transfer functions and define and represent them via code. In this section, we will start using some object-oriented programming (OOP) techniques from Python to represent entities of the problem domain. This will allow us to represent concepts in a much clearer way in the examples. Let's start by creating a TransferFunction class, which will contain the following two methods: getTransferFunction(x): This method will return an activation function determined by the class type getTransferFunctionDerivative(x): This method will clearly return its derivative For both functions, the input will be a NumPy array and the function will be applied element by element, as follows: >class TransferFunction: def getTransferFunction(x): raise NotImplementedError def getTransferFunctionDerivative(x): raise NotImplementedError Representing and understanding the transfer functions Let's take a look at the following code snippet to see how the transfer function works: def graphTransferFunction(function): x = np.arange(-2.0, 2.0, 0.01) plt.figure(figsize=(18,8)) ax=plt.subplot(121) ax.set_title(function.  name  ) plt.plot(x, function.getTransferFunction(x)) ax=plt.subplot(122) ax.set_title('Derivative of ' + function.  name  ) plt.plot(x, function.getTransferFunctionDerivative(x)) Sigmoid or logistic function A sigmoid or logistic function is the canonical activation function and is well-suited for calculating probabilities in classification properties. Firstly, let's prepare a function that will be used to graph all the transfer functions with their derivatives, from a common range of -2.0 to 2.0, which will allow us to see the main characteristics of them around the y axis. The classical formula for the sigmoid function is as follows: class Sigmoid(TransferFunction): #Squash 0,1 def getTransferFunction(x): return 1/(1+np.exp(-x)) def getTransferFunctionDerivative(x): return x*(1-x) graphTransferFunction(Sigmoid) Take a look at the following graph: Playing with the sigmoid Next, we will do an exercise to get an idea of how the sigmoid changes when multiplied by the weights and shifted by the bias to accommodate the final function towards its minimum. Let's then vary the possible parameters of a single sigmoid first and see it stretch and move: ws=np.arange(-1.0, 1.0, 0.2) bs=np.arange(-2.0, 2.0, 0.2) xs=np.arange(-4.0, 4.0, 0.1) plt.figure(figsize=(20,10)) ax=plt.subplot(121) for i in ws: plt.plot(xs, Sigmoid.getTransferFunction(i *xs),label= str(i)); ax.set_title('Sigmoid variants in w') plt.legend(loc='upper left'); ax=plt.subplot(122) for i in bs: plt.plot(xs, Sigmoid.getTransferFunction(i +xs),label= str(i)); ax.set_title('Sigmoid variants in b') plt.legend(loc='upper left'); Take a look at the following graph: Let's take a look at the following code snippet: class Tanh(TransferFunction): #Squash -1,1 def getTransferFunction(x): return np.tanh(x) def getTransferFunctionDerivative(x): return np.power(np.tanh(x),2) graphTransferFunction(Tanh) Lets take a look at the following graph: Rectified linear unit or ReLU ReLU is called a rectified linear unit, and one of its main advantages is that it is not affected by vanishing gradient problems, which generally consist of the first layers of a network tending to be values of zero, or a tiny epsilon: class Relu(TransferFunction): def getTransferFunction(x): return x * (x>0) def getTransferFunctionDerivative(x): return 1 * (x>0) graphTransferFunction(Relu) Let's take a look at the following graph: Linear transfer function Let's take a look at the following code snippet to understand the linear transfer function: class Linear(TransferFunction): def getTransferFunction(x): return x def getTransferFunctionDerivative(x): return np.ones(len(x)) graphTransferFunction(Linear) Let's take a look at the following graph: Defining loss functions for neural networks As with every model in machine learning, we will explore the possible functions that we will use to determine how well our predictions and classification went. The first type of distinction we will do is between the L1 and L2 error function types. L1, also known as least absolute deviations (LAD) or least absolute errors (LAE), has very interesting properties, and it simply consists of the absolute difference between the final result of the model and the expected one, as follows: L1 versus L2 properties Now it's time to do a head-to-head comparison between the two types of loss function:   Robustness: L1 is a more robust loss function, which can be expressed as the resistance of the function when being affected by outliers, which projects a quadratic function to very high values. Thus, in order to choose an L2 function, we should have very stringent data cleaning for it to be efficient. Stability: The stability property assesses how much the error curve jumps for a large error value. L1 is more unstable, especially for non-normalized datasets (because numbers in the [-1, 1] range diminish when squared). Solution uniqueness: As can be inferred by its quadratic nature, the L2 function ensures that we will have a unique answer for our search for a minimum. L2 always has a unique solution, but L1 can have many solutions, due to the fact that we can find many paths with minimal length for our models in the form of piecewise linear functions, compared to the single line distance in the case of L2. Regarding usage, the summation of the past properties allows us to use the L2 error type in normal cases, especially because of the solution uniqueness, which gives us the required certainty when starting to minimize error values. In the first example, we will start with a simpler L1 error function for educational purposes. Let's explore these two approaches by graphing the error results for a sample L1 and L2 loss error function. In the next simple example, we will show you the very different nature of the two errors. In the first two examples, we have normalized the input between -1 and 1 and then with values outside that range. As you can see, from samples 0 to 3, the quadratic error increases steadily and continuously, but with non-normalized data it can explode, especially with outliers, as shown in the following code snippet: sampley_=np.array([.1,.2,.3,-.4, -1, -3, 6, 3]) sampley=np.array([.2,-.2,.6,.10, 2, -1, 3, -1]) ax.set_title('Sigmoid variants in b') plt.figure(figsize=(10,10)) ax=plt.subplot() plt.plot(sampley_ - sampley, label='L1') plt.plot(np.power((sampley_ - sampley),2), label="L2") ax.set_title('L1 vs L2 initial comparison') plt.legend(loc='best') plt.show() Let's take a look at the following graph: Let's define the loss functions in the form of a LossFunction class and a getLoss method for the L1 and L2 loss function types, receiving two NumPy arrays as parameters, y_, or the estimated function value, and y, the expected value: class LossFunction: def getLoss(y_ , y ): raise NotImplementedError class L1(LossFunction): def getLoss(y_, y): return np.sum (y_ - y) class L2(LossFunction): def getLoss(y_, y): return np.sum (np.power((y_ - y),2)) Now it's time to define the goal function, which we will define as a simple Boolean. In order to allow faster convergence, it will have a direct relationship between the first input variable and the function's outcome: # input dataset X = np.array([ [0,0,1], [0,1,1], [1,0,1], [1,1,1] ]) # output dataset y = np.array([[0,0,1,1]]).T The first model we will use is a very minimal neural network with three cells and a weight for each one, without bias, in order to keep the model's complexity to a minimum: # initialize weights randomly with mean 0 W = 2*np.random.random((3,1)) - 1 print (W) Take a look at the following output generated by running the preceding code: [[ 0.52014909] [-0.25361738] [ 0.165037 ]] Then we will define a set of variables to collect the model's error, the weights, and training results progression: errorlist=np.empty(3); weighthistory=np.array(0) resultshistory=np.array(0) Then it's time to do the iterative error minimization. In this case, it will consist of feeding the whole true table 100 times via the weights and the neuron's transfer function, adjusting the weights in the direction of the error. Note that this model doesn't use a learning rate, so it should converge (or diverge) quickly: for iter in range(100): # forward propagation l0 = X l1 = Sigmoid.getTransferFunction(np.dot(l0,W)) resultshistory = np.append(resultshistory , l1) # Error calculation l1_error = y - l1 errorlist=np.append(errorlist, l1_error) # Back propagation 1: Get the deltas l1_delta = l1_error * Sigmoid.getTransferFunctionDerivative(l1) # update weights W += np.dot(l0.T,l1_delta) weighthistory=np.append(weighthistory,W) Let's simply review the last evaluation step by printing the output values at l1. Now we can see that we are reflecting quite literally the output of the original function: print (l1) Take a look at the following output, which is generated by running the preceding code: [[ 0.11510625] [ 0.08929355] [ 0.92890033] [ 0.90781468]] To better understand the process, let's have a look at how the parameters change over time. First, let's graph the neuron weights. As you can see, they go from a random state to accepting the whole values of the first column (which is always right), going to almost 0 for the second column (which is right 50% of the time), and then going to -2 for the third (mainly because it has to trigger 0 in the first two elements of the table): plt.figure(figsize=(20,20)) print (W) plt.imshow(np.reshape(weighthistory[1:],(-1,3))[:40], cmap=plt.cm.gray_r, interpolation='nearest'); Take a look at the following output, which is generated by running the preceding code: [[ 4.62194116] [-0.28222595] [-2.04618725]] Let's take a look at the following screenshot: Let's also review how our solutions evolved (during the first 40 iterations) until we reached the last iteration; we can clearly see the convergence to the ideal values: plt.figure(figsize=(20,20)) plt.imshow(np.reshape(resultshistory[1:], (-1,4))[:40], cmap=plt.cm.gray_r, interpolation='nearest'); Let's take a look at the following screenshot: We can see how the error evolves and tends to be zero through the different epochs. In this case, we can observe that it swings from negative to positive, which is possible because we first used an L1 error function: plt.figure(figsize=(10,10)) plt.plot(errorlist); Let's take a look at the following screenshot: The above explanation of implementing neural network using single-layer perceptron helps to create and play with the transfer function and also explore how accurate did the classification and prediction of the dataset took place. To know how classification is generally done on complex and large datasets, you may read our article on multi-layer perceptrons. To get hands on with advanced concepts and powerful tools for solving complex computational machine learning techniques, do check out this book Machine Learning for Developers and start building smart applications in your machine learning projects.      
Read more
  • 0
  • 0
  • 27029

article-image-bitcoin
Packt
10 Feb 2017
13 min read
Save for later

Bitcoin

Packt
10 Feb 2017
13 min read
In this article by Imran Bashir, the author of the book Mastering Blockchain, will see about bitcoin and it's importance in electronic cash system. (For more resources related to this topic, see here.) Bitcoin is the first application of blockchain technology. In this article readers will be introduced to the bitcoin technology in detail. Bitcoin has started a revolution with the introduction of very first fully decentralized digital currency that has been proven to be extremely secure and stable. This has also sparked great interest in academic and industrial research and introduced many new research areas. Since its introduction in 2008 bitcoin has gained much popularity and currently is the most successful digital currency in the world with billions of dollars invested in it. It is built on decades of research in the field of cryptography, digital cash and distributed computing. In the following section brief history is presented in order to provide background required to understand the foundations behind the invention of bitcoin. Digital currencies have always been an active area of research for many decades. Early proposals to create digital cash goes as far back as the early 1980s. In 1982 David Chaum proposed a scheme that used blind signatures to build untraceable digital currency. In this scheme a bank would issue digital money by signing a blinded and random serial number presented to it by the user. The user can then use the digital token signed by the bank as currency. The limitation in this scheme is that the bank has to keep track of all used serial numbers. This is a central system by design and requires to be trusted by the users. Later on in 1990 David Chaum proposed a refined version named ecash that not only used blinded signature but also some private identification data to craft a message that was then sent to the bank. This scheme allowed detection of double spending but did not prevent it. If the same token is used at two different location then the identity of the double spender would be revealed. ecash could only represent fixed amount of money. Adam Back's hashcash introduced in 1997 was originally proposed to thwart the email spam. The idea behind hashcash is to solve a computational puzzle that is easy to verify but is comparatively difficult to compute. The idea is that for a single user and single email extra computational effort it not noticeable but someone sending large number of spam emails would be discouraged as the time and resources required to run the spam campaign will increase substantially. B-money was proposed by Wei Dai in 1998 which introduced the idea of using proof of work to create money. Major weakness in the system was that some adversary with higher computational power could generate unsolicited money without giving the chance to the network to adjust to an appropriate difficulty level. The system was lacking details on the consensus mechanism between nodes and some security issues like Sybil attacks were also not addressed. At the same time Nick Szabo introduced the concept of bit gold which was also based on proof of work mechanism but had same problems as b-money had with one exception that network difficulty level was adjustable. Tomas Sander and Ammon TaShama introduced an ecash scheme in 1999 that for the first time used merkle trees to represent coins and zero knowledge proofs to prove possession of coins. In the scheme a central bank was required who kept record of all used serial numbers. This scheme allowed users to be fully anonymous albeit at some computational cost. RPOW (Reusable Proof of Work) was introduced by Hal Finney in 2004 that used hash cash scheme by Adam Back as a proof of computational resources spent to create the money. This was also a central system that kept a central database to keep track of all used PoW tokens. This was an online system that used remote attestation made possible by trusted computing platform (TPM hardware). All the above mentioned schemes are intelligently designed but were weak from one aspect or another. Especially all these schemes rely on a central server which is required to be trusted by the users. Bitcoin In 2008 bitcoin paper Bitcoin: A Peer-to-Peer Electronic Cash System was written by Satoshi Nakamoto. First key idea introduced in the paper is that it is a purely peer to peer electronic cash that does need an intermediary bank to transfer payments between peers. Bitcoin is built on decades of Cryptographic research like merkle trees, hash functions, public key cryptography and digital signatures. Moreover ideas like bit gold, b-money, hashcash and cryptographic time stamping have provided the foundations for bitcoin invention. All these technologies are cleverly combined in bitcoin to create world's first decentralized currency. Key issue that has been addressed in bitcoin is an elegant solution to Byzantine Generals problem along with a practical solution of double spend problem. Value of bitcoin has increased significantly since 2011 as shown in the graph below: Bitcoin price and volume since 2012 (on logarithmic scale) Regulation of bitcoin is a controversial subject and as much as it is a libertarian's dream law enforcement agencies and governments are proposing various regulations to control it such as bitlicense issued by NewYorks state department of financial services. This is a license issued to businesses which perform activities related to virtual currencies. Growth of bitcoin is also due to so called Network Effect. Also called demand-side economies of scale, it is a concept which basically means that more users who use the network the more valuable it becomes. Over time exponential increase has been seen in bitcoin network growth. Even though the price of bitcoin is quite volatile it has increased significantly over a period of last few years. Currently (at the time of writing) bitcoin price is 815 GBP. Bitcoin definition Bitcoin can be defined in various ways, it's a protocol, a digital currency and a platform. It is a combination of peer to peer network, protocols and software that facilitate the creation and usage of digital currency named bitcoin. Note that Bitcoin with capital B is used to refer to Bitcoin protocol whereas bitcoin with lower case b is used to refer to bitcoin, the currency. Nodes in this peer to peer to network talk to each other using the Bitcoin protocol. Decentralization of currency was made possible for the first time with the invention of Bitcoin. Moreover double spending problem was solved in an elegant and ingenious way in bitcoin. Double spending problem arises when for example a user sends coins to two different users at the same time and they will be verified independently as valid transactions. Keys and addresses Elliptic curve cryptography is used to generate public and private key pairs in the Bitcoin network. The Bitcoin address is created by taking the corresponding public key of a private key and hashing it twice, first with SHA256 algorithm and then with RIPEMD160. The resultant 160-bit hash is then prefixed with a version number and finally encoded with Base58Check encoding scheme. The bitcoin addresses are 26 to 35 characters long and begin with digit 1 or 3. A typical bitcoin address looks like a string shown as follows: 1ANAguGG8bikEv2fYsTBnRUmx7QUcK58wt This is also commonly encoded in a QR code for easy sharing. The QR code of the preceding shown address is as follows: QR code of a bitcoin address 1ANAguGG8bikEv2fYsTBnRUmx7QUcK58wt There are currently two types of addresses, commonly used P2PKH and another P2SH type starting with 1 and 3 respectively. In early days bitcoin used direct Pay-to-Pubkey which is now superseded by P2PKH. However direct Pay-to-Pubkey is still used in bitcoin for coinbase addresses. Addresses should not be used for more than once otherwise privacy and security issues can arise. Avoiding address reuse circumvents anonymity issues to some extent, bitcoin has some other security issues also, such as transaction malleability which requires different approach to resolve. from bitaddress.org private key and bitcoin address in a paper wallet Public keys in bitcoin In Public key cryptography, public keys are generated from private keys. Bitcoin uses ECC based on SECP256K1 standard. A private key is randomly selected and is 256-bit in length. Public keys can be presented in uncompressed or compressed format. Public keys are basically x and y coordinates on an elliptic curve and in uncompressed format are presented with a prefix of 04 in hexadecimal format. X and Y co-ordinates are both 32-bit in length. In total the compressed public key is 33 bytes long as compared to 65 bytes in uncompressed format. Compressed version of public keys basically include only X part, since Y part can be derived from it. The reason why compressed version of public keys works is that bitcoin client initially used uncompressed keys, but starting from bitcoin core client 0.6 compressed keys are used as standard. Keys are identified by various prefixes described as follows: Uncompressed public keys used 0x04 as prefix. Compressed public key starts with 0x03 if the y 32-bit part of the public key is odd. Compressed public key starts with 0x02 if the y 32-bit part of the public key is even. More mathematical description and reason why it works is described later. If the ECC graph is visualized it reveals that the y co-ordinate can be either below the x-axis or above the x-axis and as the curve is symmetric only the location in the prime field is required to be stored. Private keys in bitcoin Private keys are basically 256-bit numbers chosen in the range specified by SECP256K1 ECDSA recommendation. Any randomly chosen 256-bit number from 0x1 to 0xFFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFE BAAE DCE6 AF48 A03B BFD2 5E8C D036 4140 is a valid private key. Private keys are usually encoded using Wallet Import Format (WIF) in order to make them easier to copy and use. WIF can be converted to private key and vice versa. Steps are described later. Also Mini Private Key Format is used sometimes to encode the key in under 30 characters to allow storage where physical space is limited, for example, etching on physical coins or damage resistant QR codes. Bitcoin core client also allows encryption of the wallet which contains the private keys. Bitcoin currency units Bitcoin currency units are described as follows. Smallest bitcoin denomination is Satoshi. Base58Check encoding This encoding is used to limit the confusion between various characters such as 0OIl as they can look same in different fonts. The encoding basically takes the binary byte arrays and converts them into human readable string. This string is composed by utilizing a set of 58 alphanumeric symbols. More explanation and logic can be found in base58.h source file in bitcoin source code. Explanation from bitcoin source code Bitcoin addresses are encoded using Base58check encoding. Vanity addresses As the bitcoin addresses are based on base 58 encoding, it is possible to generate addresses that contain human readable messages. An example is shown as follows: Public address encoded in QR Vanity addresses are generated by using a purely brute force method. An example is shown as follows: Vanity address generated from https://bitcoinvanitygen.com/ Transactions Transactions are at the core of bitcoin ecosystem. Transactions can be as simple as just sending some bitcoins to a bitcoin address or it can be quite complex depending on the requirements. Each transaction is composed of at least one input and output. Inputs can be thought of as coins being spent that have been created in a previous transaction and outputs as coins being created. If a transaction is minting new coins then there is no input and therefore no signature is needed. If a transaction is to send coins to some other user (a bitcoin address), then it needs to be signed by the sender with their private key and also a reference is required to the previous transaction to show the origin of the coins. Coins are in fact unspent transactions outputs represented in Satoshis. Transactions are not encrypted and are publicly visible in the blockchain. Blocks are made up of transactions and these can be viewed by using any online blockchain explorer. Transaction life cycle A user/sender sends a transaction using wallet software or some other interface. Wallet software signs the transaction using the sender's private key. Transaction is broadcasted to the Bitcoin network using a flooding algorithm. Mining nodes include this transaction in the next block to be mined. Mining starts and once a miner who solves the Proof of Work problem broadcasts the newly mined block to the network. Proof of Work is explained in detail later. The nodes verify the block and propagate the block further and confirmation start to generate. Finally the confirmations start to appear in the receiver's wallet and after approximately 6 confirmations the transaction is considered finalized and confirmed. However 6 is just a recommended number , the transaction can be considered final even after first confirmation. The key idea behind waiting for six confirmations is that the probability of double spending virtually eliminates after 6 confirmations. Transaction structure A transaction at a high level contains metadata, inputs and outputs. Transactions are combined together to create a block. The transaction structure is shown in the following table: Field Size Description Version Number 4 bytes Used to specify rules to be used by the miners and nodes for transaction processing. Input counter 1 to 9 bytes Number of inputs included in the transaction. list of inputs variable Each input is composed of several fields including Previous Transaction hash, Previous Txout-index, Txin-script length, Txin-script and optional sequence number. The first transaction in a block is also called coinbase transaction. Specifies on or more transaction inputs. Out-counter 1 to 9 bytes positive integer representing the number of outputs. list of outputs variable Outputs included in the transaction. lock_time 4 bytes It defines the earliest time when a transaction becomes valid. It is either a Unix timestamp or block number. MetaData: This part of the transaction contains some values like size of transaction, number of inputs and outputs, hash of the transaction and a lock_time field. Every transaction has a prefix specifying the version number. Inputs: Generally each input spends a previous output. Each output is considered a UTXO, Unspent transaction output until an input consumes it. Outputs: Outputs have only two fields and it contains instructions for sending bitcoins. First field contains the amount of Satoshis where as second field is a locking script which contains the conditions that needs to be met in order for the output to be spent. More information about transaction spending by using locking and unlocking scripts and producing outputs is discussed later. Verification: Verification is performed by using Bitcoin's scripting language Summary In this article, we learned the importance of bitcoin in digital currency and how bitcoins are encoded using various private keys and encoding techniques. Resources for Article: Further resources on this subject: Bitcoins – Pools and Mining [article] Protecting Your Bitcoins [article] FPGA Mining [article]
Read more
  • 0
  • 0
  • 26924

article-image-convolutional-neural-networks-reinforcement-learning
Packt
06 Apr 2017
9 min read
Save for later

Convolutional Neural Networks with Reinforcement Learning

Packt
06 Apr 2017
9 min read
In this article by Antonio Gulli, Sujit Pal, the authors of the book Deep Learning with Keras, we will learn about reinforcement learning, or more specifically deep reinforcement learning, that is, the application of deep neural networks to reinforcement learning. We will also see how convolutional neural networks leverage spatial information and they are therefore very well suited for classifying images. (For more resources related to this topic, see here.) Deep convolutional neural network A Deep Convolutional Neural Network (DCCN) consists of many neural network layers. Two different types of layers, convolutional and pooling, are typically alternated. The depth of each filter increases from left to right in the network. The last stage is typically made of one or more fully connected layers as shown here: There are three key intuitions beyond ConvNets: Local receptive fields Shared weights Pooling Let's review them together. Local receptive fields If we want to preserve the spatial information, then it is convenient to represent each image with a matrix of pixels. Then, a simple way to encode the local structure is to connect a submatrix of adjacent input neurons into one single hidden neuron belonging to the next layer. That single hidden neuron represents one local receptive field. Note that this operation is named convolution and it gives the name to this type of networks. Of course we can encode more information by having overlapping submatrices. For instance let's suppose that the size of each single submatrix is 5 x 5 and that those submatrices are used with MNIST images of 28 x 28 pixels, then we will be able to generate 23 x 23 local receptive field neurons in the next hidden layer. In fact it is possible to slide the submatrices by only 23 positions before touching the borders of the images. In Keras, the size of each single submatrix is called stride-length and this is an hyper-parameter which can be fine-tuned during the construction of our nets. Let's define the feature map from one layer to another layer. Of course we can have multiple feature maps which learn independently from each hidden layer. For instance we can start with 28 x 28 input neurons for processing MINST images, and then recall k feature maps of size 23 x 23 neurons each (again with stride of 5 x 5) in the next hidden layer. Shared weights and bias Let's suppose that we want to move away from the pixel representation in a row by gaining the ability of detecting the same feature independently from the location where it is placed in the input image. A simple intuition is to use the same set of weights and bias for all the neurons in the hidden layers. In this way each layer will learn a set of position-independent latent features derived from the image. Assuming that the input image has shape (256, 256) on 3 channels with tf (Tensorflow) ordering, this is represented as (256, 256, 3). Note that with th (Theano) mode the channels dimension (the depth) is at index 1, in tf mode is it at index 3. In Keras if we want to add a convolutional layer with dimensionality of the output 32 and extension of each filter 3 x 3 we will write: model = Sequential() model.add(Convolution2D(32, 3, 3, input_shape=(256, 256, 3)) This means that we are applying a 3 x 3 convolution on 256 x 256 image with 3 input channels (or input filters) resulting in 32 output channels (or output filters). An example of convolution is provided in the following diagram: Pooling layers Let's suppose that we want to summarize the output of a feature map. Again, we can use the spatial contiguity of the output produced from a single feature map and aggregate the values of a submatrix into one single output value synthetically describing the meaning associated with that physical region. Max pooling One easy and common choice is the so-called max pooling operator which simply outputs the maximum activation as observed in the region. In Keras, if we want to define a max pooling layer of size 2 x 2 we will write: model.add(MaxPooling2D(pool_size = (2, 2))) An example of max pooling operation is given in the following diagram: Average pooling Another choice is the average pooling which simply aggregates a region into the average values of the activations observed in that region. Keras implements a large number of pooling layers and a complete list is available online. In short, all the pooling operations are nothing more than a summary operation on a given region. Reinforcement learning Our objective is to build a neural network to play the game of catch. Each game starts with a ball being dropped from a random position from the top of the screen. The objective is to move a paddle at the bottom of the screen using the left and right arrow keys to catch the ball by the time it reaches the bottom. As games go, this is quite simple. At any point in time, the state of this game is given by the (x, y) coordinates of the ball and paddle. Most arcade games tend to have many more moving parts, so a general solution is to provide the entire current game screen image as the state. The following diagram shows four consecutive screenshots of our catch game: Astute readers might note that our problem could be modeled as a classification problem, where the input to the network are the game screen images and the output is one of three actions - move left, stay, or move right. However, this would require us to provide the network with training examples, possibly from recordings of games played by experts. An alternative and simpler approach might be to build a network and have it play the game repeatedly, giving it feedback based on whether it succeeds in catching the ball or not. This approach is also more intuitive and is closer to the way humans and animals learn. The most common way to represent such a problem is through a Markov Decision Process (MDP). Our game is the environment within which the agent is trying to learn. The state of the environment at time step t is given by st (and contains the location of the ball and paddle). The agent can perform certain actions at (such as moving the paddle left or right). These actions can sometimes result in a reward rt, which can be positive or negative (such as an increase or decrease in the score). Actions change the environment and can lead to a new state st+1, where the agent can perform another action at+1, and so on. The set of states, actions and rewards, together with the rules for transitioning from one state to the other, make up a Markov decision process. A single game is one episode of this process, and is represented by a finite sequence of states, actions and rewards: Since this is a Markov decision process, the probability of state st+1 depends only on current state st and action at. Maximizing future rewards As an agent, our objective is to maximize the total reward from each game. The total reward can be represented as follows: In order to maximize the total reward, the agent should try to maximize the total reward from any time point t in the game. The total reward at time step t is given by Rt and is represented as: However, it is harder to predict the value of the rewards the further we go into the future. In order to take this into consideration, our agent should try to maximize the total discounted future reward at time t instead. This is done by discounting the reward at each future time step by a factor γ over the previous time step. If γ is 0, then our network does not consider future rewards at all, and if γ is 1, then our network is completely deterministic. A good value for γ is around 0.9. Factoring the equation allows us to express the total discounted future reward at a given time step recursively as the sum of the current reward and the total discounted future reward at the next time step: Q-learning Deep reinforcement learning utilizes a model-free reinforcement learning technique called Q-learning. Q-learning can be used to find an optimal action for any given state in a finite Markov decision process. Q-learning tries to maximize the value of the Q-function which represents the maximum discounted future reward when we perform action a in state s: Once we know the Q-function, the optimal action a at a state s is the one with the highest Q-value. We can then define a policy π(s) that gives us the optimal action at any state: We can define the Q-function for a transition point (st, at, rt, st+1) in terms of the Q-function at the next point (st+1, at+1, rt+1, st+2) similar to how we did with the total discounted future reward. This equation is known as the Bellmann equation. The Q-function can be approximated using the Bellman equation. You can think of the Q-function as a lookup table (called a Q-table) where the states (denoted by s) are rows and actions (denoted by a) are columns, and the elements (denoted by Q(s, a)) are the rewards that you get if you are in the state given by the row and take the action given by the column. The best action to take at any state is the one with the highest reward. We start by randomly initializing the Q-table, then carry out random actions and observe the rewards to update the Q-table iteratively according to the following algorithm: initialize Q-table Q observe initial state s repeat select and carry out action a observe reward r and move to new state s' Q(s, a) = Q(s, a) + α(r + γ maxa' Q(s', a') - Q(s, a)) s = s' until game over You will realize that the algorithm is basically doing stochastic gradient descent on the Bellman equation, backpropagating the reward through the state space (or episode) and averaging over many trials (or epochs). Here α is the learning rate that determines how much of the difference between the previous Q-value and the discounted new maximum Q-value should be incorporated. Summary We have seen the application of deep neural networks, reinforcement learning. We have also seen convolutional neural networks and how they are well suited for classifying images. Resources for Article: Further resources on this subject: Deep learning and regression analysis [article] Training neural networks efficiently using Keras [article] Implementing Artificial Neural Networks with TensorFlow [article]
Read more
  • 0
  • 0
  • 26890
article-image-google-ai-releases-cirq-and-open-fermion-cirq-to-boost-quantum-computation
Savia Lobo
19 Jul 2018
3 min read
Save for later

Google AI releases Cirq and Open Fermion-Cirq to boost Quantum computation

Savia Lobo
19 Jul 2018
3 min read
Google AI Quantum team announced two releases at the First International Workshop on Quantum Software and Quantum Machine Learning(QSML) yesterday. Firstly the public alpha release of Cirq, an open source framework for NISQ computers. The second release is OpenFermion-Cirq, an example of a Cirq-based application enabling near-term algorithms. Noisy Intermediate Scale Quantum (NISQ) computers are devices including ~50 - 100 qubits and high fidelity quantum gates enhance the quantum algorithms such that they can understand the power that these machines uphold. However, quantum algorithms for the quantum computers have their limitations such as A poor mapping between the algorithms and the machines Also, some quantum processors have complex geometric constraints These and other nuances inevitably lead to wasted resources and faulty computations. Cirq comes as a great help for researchers here. It is focussed on near-term questions, which help researchers to understand whether NISQ quantum computers are capable of solving computational problems of practical importance. It is licensed under Apache 2 and is free to be either embedded or modified within any commercial or open source package. Cirq highlights With Cirq, researchers can write quantum algorithms for specific quantum processors. It provides a fine-tuned user control over quantum circuits by, specifying gate behavior using native gates, placing these gates appropriately on the device, and scheduling the timing of these gates within the constraints of the quantum hardware. Other features of Cirq include: Allows users to leverage the most out of NISQ architectures with optimized data structures to write and compile the quantum circuits. Supports running of the algorithms locally on a simulator Designed to easily integrate with future quantum hardware or larger simulators via the cloud. OpenFermion-Cirq highlights Google AI Quantum team also released OpenFermion-Cirq, which is an example of a CIrq-based application that enables the near-term algorithms.  OpenFermion is a platform for developing quantum algorithms for chemistry problems. OpenFermion-Cirq extends the functionality of OpenFermion by providing routines and tools for using Cirq for compiling and composing circuits for quantum simulation algorithms. An instance of the OpenFermion-Cirq is, it can be used to easily build quantum variational algorithms for simulating properties of molecules and complex materials. While building Cirq, the Google AI Quantum team worked with early testers to gain feedback and insight into algorithm design for NISQ computers. Following are some instances of Cirq work resulting from the early adopters: Zapata Computing: simulation of a quantum autoencoder (example code, video tutorial) QC Ware: QAOA implementation and integration into QC Ware’s AQUA platform (example code, video tutorial) Quantum Benchmark: integration of True-Q software tools for assessing and extending hardware capabilities (video tutorial) Heisenberg Quantum Simulations: simulating the Anderson Model Cambridge Quantum Computing: integration of proprietary quantum compiler t|ket> (video tutorial) NASA: architecture-aware compiler based on temporal-planning for QAOA (slides) and simulator of quantum computers (slides) The team also announced that it is using Cirq to create circuits that run on Google’s Bristlecone processor. Their future plans include making the Bristlecone processor available in cloud with Cirq as the interface for users to write programs for this processor. To know more about both the releases, check out the GitHub repositories of each Cirq and OpenFermion-Cirq. Q# 101: Getting to know the basics of Microsoft’s new quantum computing language Google Bristlecone: A New Quantum processor by Google’s Quantum AI lab Quantum A.I. : An intelligent mix of Quantum+A.I.
Read more
  • 0
  • 0
  • 26786

article-image-how-to-create-multithreaded-applications-in-qt
Gebin George
02 May 2018
10 min read
Save for later

How to create multithreaded applications in Qt

Gebin George
02 May 2018
10 min read
In today’s tutorial, we will learn how to use QThread and its affiliate classes to create multithreaded applications. We will go through this by creating an example project, which processes and displays the input and output frames from a video source using a separate thread. This helps leave the GUI thread (main thread) free and responsive while more intensive processes are handled with the second thread. As it was mentioned earlier, we will focus mostly on the use cases common to computer vision and GUI development; however, the same (or a very similar) approach can be applied to any multithreading problem. [box type="note" align="" class="" width=""]This article is an excerpt from the book, Computer Vision with OpenCV 3 and Qt5 written by Amin Ahmadi Tazehkandi. This book will help you blend the power of Qt with OpenCV to build cross-platform computer vision applications.[/box] We will use this example project to implement multithreading using two different approaches available in Qt for working with QThread classes. First, subclassing and overriding the run method, and second, using the moveToThread function available in all Qt objects, or, in other words, QObject subclasses. Subclassing QThread Let's start by creating an example Qt Widgets application in the Qt Creator named MultithreadedCV. To start with, add an OpenCV framework to this project: win32: { include("c:/dev/opencv/opencv.pri") } unix: !macx{ CONFIG += link_pkgconfig PKGCONFIG += opencv } unix: macx{ INCLUDEPATH += /usr/local/include LIBS += -L"/usr/local/lib" -lopencv_world } Then, add two label widgets to your mainwindow.ui file, shown as follows. We will use these labels to display the original and processed video from the default webcam on the computer: Make sure to set the objectName property of the label on the left to inVideo and the one on the right to outVideo. Also, set their alignment/Horizontal property to AlignHCenter. Now, create a new class called VideoProcessorThread by right-clicking on the project PRO file and selecting Add New from the menu. Then, choose C++ Class and make sure the combo boxes and checkboxes in the new class wizard look like the following screenshot: After your class is created, you'll have two new files in your project called videoprocessorthread.h and videoprocessor.cpp, in which you'll implement a video processor that works in a thread separate from the mainwindow files and GUI threads. First, make sure that this class inherits QThread by adding the relevant include line and class inheritance, as seen here (just replace QObject with QThread in the header file). Also, make sure you include OpenCV headers: #include <QThread> #include "opencv2/opencv.hpp" class VideoProcessorThread : public QThread You need to similarly update the videoprocessor.cpp file so that it calls the correct constructor: VideoProcessorThread::VideoProcessorThread(QObject *parent) : QThread(parent) Now, we need to add some required declarations to the videoprocessor.h file. Add the following line to the private members area of your class: void run() override; And then, add the following to the signals section: void inDisplay(QPixmap pixmap); void outDisplay(QPixmap pixmap); And finally, add the following code block to the videoprocessorthread.cpp file: void VideoProcessorThread::run() { using namespace cv; VideoCapture camera(0); Mat inFrame, outFrame; while(camera.isOpened() && !isInterruptionRequested()) { camera >> inFrame; if(inFrame.empty()) continue; bitwise_not(inFrame, outFrame); emit inDisplay( QPixmap::fromImage( QImage( inFrame.data, inFrame.cols, inFrame.rows, inFrame.step, QImage::Format_RGB888) .rgbSwapped())); emit outDisplay( QPixmap::fromImage( QImage( outFrame.data, outFrame.cols, outFrame.rows, Multithreading Chapter 8 [ 309 ] outFrame.step, QImage::Format_RGB888) .rgbSwapped())); } } The run function is overridden and it's implemented to do the required video processing task. If you try to do the same inside the mainwindow.cpp code in a loop, you'll notice that your program becomes unresponsive, and, eventually, you have to terminate it. However, with this approach, the same code is now in a separate thread. You just need to make sure you start this thread by calling the start function, not run! Note that the run function is meant to be called internally, so you only need to reimplement it as seen in this example; however, to control the thread and its execution behavior, you need to use the following functions: start: This can be used to start a thread if it is not already started. This function starts the execution by calling the run function we implemented. You can pass one of the following values to the start function to control the priority of the thread: QThread::IdlePriority (this is scheduled when no other thread is running) QThread::LowestPriority QThread::LowPriority QThread::NormalPriority QThread::HighPriority QThread::HighestPriority QThread::TimeCriticalPriority (this is scheduled as much as possible) QThread::InheritPriority (this is the default value, which simply inherits priority from the parent) terminate: This function, which should only be used in extreme cases (means never, hopefully), forces a thread to terminate. setTerminationEnabled: This can be used to allow or disallow the terminate Function. wait: This function can be used to block a thread (force waiting) until the thread is finished or the timeout value (in milliseconds) is reached. requestInterruption and isRequestInterrupted: These functions can be used to set and get the interruption request status. Using these functions is a useful approach to make sure the thread is stopped safely in the middle of a process that can go on forever. isRunning and isFinished: These functions can be used to request the execution status of the thread. Apart from the functions we mentioned here, QThread contains other functions useful for dealing with multithreading, such as quit, exit, idealThreadCount, and so on. It is a good idea to check them out for yourself and think about use cases for each one of them. QThread is a powerful class that can help maximize the efficiency of your applications. Let's continue with our example. In the run function, we used an OpenCV VideoCapture class to read the video frames (forever) and apply a simple bitwise_not operator to the Mat frame (we can do any other image processing at this point, so bitwise_not is only an example and a fairly simple one to explain our point), then converted that to QPixmap via QImage, and then sent the original and modified frames using two signals. Notice that in our loop that will go on forever, we will always check if the camera is still open and also check if there is an interruption request to this thread. Now, let's use our thread in MainWindow. Start by including its header file in the mainwindow.h file: #include "videoprocessorthread.h" Then, add the following line to the private members' section of MainWindow, in the mainwindow.h file: VideoProcessorThread processor; Now, add the following code to the MainWindow constructor, right after the setupUi line: connect(&processor, SIGNAL(inDisplay(QPixmap)), ui->inVideo, SLOT(setPixmap(QPixmap))); connect(&processor, SIGNAL(outDisplay(QPixmap)), ui->outVideo, SLOT(setPixmap(QPixmap))); processor.start(); And then add the following lines to the MainWindow destructor, right before the delete ui; line: processor.requestInterruption(); processor.wait(); We simply connected the two signals from the VideoProcessorThread class to two labels we had added to the MainWindow GUI, and then started the thread as soon as the program started. We also request the thread to stop as soon as MainWindow is closed and right before the GUI is deleted. The wait function call makes sure to wait for the thread to clean up and safely finish executing before continuing with the delete instruction. Try running this code to check it for yourself. You should see something similar to the following image as soon as the program starts: The video from your default webcam on the computer should start as soon as the program starts, and it will be stopped as soon as you close the program. Try extending the VideoProcessorThread class by passing a camera index number or a video file path into it. You can instantiate as many VideoProcessorThread classes as you want. You just need to make sure to connect the signals to correct widgets on the GUI, and this way you can have multiple videos or cameras processed and displayed dynamically at runtime. Using the moveToThread function As we mentioned earlier, you can also use the moveToThread function of any QObject subclass to make sure it is running in a separate thread. To see exactly how this works, let's repeat the same example by creating exactly the same GUI, and then creating a new C++ class (same as before), but, this time, naming it VideoProcessor. This time though, after the class is created, you don't need to inherit it from QThread, leave it as QObject (as it is). Just add the following members to the videoprocessor.h file: signals: void inDisplay(QPixmap pixmap); void outDisplay(QPixmap pixmap); public slots: void startVideo(); void stopVideo(); private: bool stopped; The signals are exactly the same as before. The stopped is a flag we'll use to help us stop the video so that it does not go on forever. The startVideo and stopVideo are functions we'll use to start and stop the processing of the video from the default webcam. Now, we can switch to the videoprocessor.cpp file and add the following code blocks. Quite similar to what we had before, with the obvious difference that we don't need to implement the run function since it's not a QThread subclass, and we have named our functions as we like: void VideoProcessor::startVideo() { using namespace cv; VideoCapture camera(0); Mat inFrame, outFrame; stopped = false; while(camera.isOpened() && !stopped) { camera >> inFrame; if(inFrame.empty()) continue; bitwise_not(inFrame, outFrame); emit inDisplay( QPixmap::fromImage( QImage( inFrame.data, inFrame.cols, inFrame.rows, inFrame.step, QImage::Format_RGB888) .rgbSwapped())); emit outDisplay( QPixmap::fromImage( QImage( outFrame.data, outFrame.cols, outFrame.rows, outFrame.step, QImage::Format_RGB888) .rgbSwapped())); } } void VideoProcessor::stopVideo() { stopped = true; } Now we can use it in our MainWindow class. Make sure to add the include file for the VideoProcessor class, and then add the following to the private members' section of MainWindow: VideoProcessor *processor; Now, add the following piece of code to the MainWindow constructor in the mainwindow.cpp file: processor = new VideoProcessor(); processor->moveToThread(new QThread(this)); connect(processor->thread(), SIGNAL(started()), processor, SLOT(startVideo())); connect(processor->thread(), SIGNAL(finished()), processor, SLOT(deleteLater())); connect(processor, SIGNAL(inDisplay(QPixmap)), ui->inVideo, SLOT(setPixmap(QPixmap))); connect(processor, SIGNAL(outDisplay(QPixmap)), ui->outVideo, SLOT(setPixmap(QPixmap))); processor->thread()->start(); In the preceding code snippet, first, we created an instance of VideoProcessor. Note that we didn't assign any parents in the constructor, and we also made sure to define it as a pointer. This is extremely important when we intend to use the moveToThread function. An object that has a parent cannot be moved into a new thread. The second highly important lesson in this code snippet is the fact that we should not directly call the startVideo function of VideoProcessor, and it should only be invoked by connecting a proper signal to it. In this case, we have used its own thread's started signal; however, you can use any other signal that has the same signature. The rest is all about connections. In the MainWindow destructor function, add the following lines: processor->stopVideo(); processor->thread()->quit(); processor->thread()->wait(); If you found our post useful, do check out this book Computer Vision with OpenCV 3 and Qt5  which will help you understand how to build a multithreaded computer vision applications with the help of OpenCV 3 and Qt5. Top 10 Tools for Computer Vision OpenCV Primer: What can you do with Computer Vision and how to get started? Image filtering techniques in OpenCV  
Read more
  • 0
  • 0
  • 26553

article-image-what-can-you-expect-at-neurips-2019
Sugandha Lahoti
06 Sep 2019
5 min read
Save for later

What can you expect at NeurIPS 2019?

Sugandha Lahoti
06 Sep 2019
5 min read
Popular machine learning conference NeurIPS 2019 (Conference on Neural Information Processing Systems) will be held on Sunday, December 8 through Saturday, December 14 at the Vancouver Convention Center. The conference invites papers tutorials, and submissions on cross-disciplinary research where machine learning methods are being used in other fields, as well as methods and ideas from other fields being applied to ML.  NeurIPS 2019 accepted papers Yesterday, the conference published the list of their accepted papers. A total of 1429 papers have been selected. Submissions opened from May 1 on a variety of topics such as Algorithms, Applications, Data implementations, Neuroscience, and Cognitive Science, Optimization, Probabilistic Methods, Reinforcement Learning and Planning, and Theory. (The full list of Subject Areas are available here.) This year at NeurIPS 2019, authors of accepted submissions were mandatorily required to prepare either a 3-minute video or a PDF of slides summarizing the paper or prepare a PDF of the poster used at the conference. This was done to make NeurIPS content accessible to those unable to attend the conference. NeurIPS 2019 also introduced a mandatory abstract submission deadline, a week before final submissions are due. Only a submission with a full abstract was allowed to have the full paper uploaded. The authors were also asked to answer questions from the Reproducibility Checklist during the submission process. NuerIPS 2019 tutorial program NeurIPS also invites experts to present tutorials that feature topics that are of interest to a sizable portion of the NeurIPS community and are different from the ones already presented at other ML conferences like ICML or ICLR. They looked for tutorial speakers that cover topics beyond their own research in a comprehensive manner that encompasses multiple perspectives.  The tutorial chairs for NeurIPS 2019 are Danielle Belgrave and Alice Oh. They initially compiled a list based on the last few years’ publications, workshops, and tutorials presented at NeurIPS and at related venues. They asked colleagues for recommendations and conducted independent research. In reviewing the potential candidates, the chair read papers to understand their expertise and watch their videos to appreciate their style of delivery. The list of candidates was emailed to the General Chair, Diversity & Inclusion Chairs, and the rest of the Organizing Committee for their comments on this shortlist. Following a few adjustments based on their input, the potential speakers were selected. A total of 9 tutorials have been selected for NeurIPS 2019: Deep Learning with Bayesian Principles - Emtiyaz Khan Efficient Processing of Deep Neural Network: from Algorithms to Hardware Architectures - Vivienne Sze Human Behavior Modeling with Machine Learning: Opportunities and Challenges - Nuria Oliver, Albert Ali Salah Interpretable Comparison of Distributions and Models - Wittawat Jitkrittum, Dougal Sutherland, Arthur Gretton Language Generation: Neural Modeling and Imitation Learning -  Kyunghyun Cho, Hal Daume III Machine Learning for Computational Biology and Health - Anna Goldenberg, Barbara Engelhardt Reinforcement Learning: Past, Present, and Future Perspectives - Katja Hofmann Representation Learning and Fairness - Moustapha Cisse, Sanmi Koyejo Synthetic Control - Alberto Abadie, Vishal Misra, Devavrat Shah NeurIPS 2019 Workshops NeurIPS Workshops are primarily used for discussion of work in progress and future directions. This time the number of Workshop Chairs doubled, from two to four; selected chairs are Jenn Wortman Vaughan, Marzyeh Ghassemi, Shakir Mohamed, and Bob Williamson. However, the number of workshop submissions went down from 140 in 2018 to 111 in 2019. Of these 111 submissions, 51 workshops were selected. The full list of selected Workshops is available here.  The NeurIPS 2019 chair committee introduced new guidelines, expectations, and selection criteria for the Workshops. This time workshops had an important focus on the nature of the problem, intellectual excitement of the topic, diversity, and inclusion, quality of proposed invited speakers, organizational experience and ability of the team and more.  The Workshop Program Committee consisted of 37 reviewers with each workshop proposal assigned to two reviewers. The reviewer committee included more senior researchers who have been involved with the NeurIPS community. Reviewers were asked to provide a summary and overall rating for each workshop, a detailed list of pros and cons, and specific ratings for each of the new criteria. After all reviews were submitted, each proposal was assigned to two of the four chair committee members. The chair members looked through assigned proposals and their reviews to form an educated assessment of the pros and cons of each. Finally, the entire chair held a meeting to discuss every submitted proposal to make decisions.  You can check more details about the conference on the NeurIPS website. As always keep checking this space for more content about the conference. In the meanwhile, you can read our previous year coverage: NeurIPS Invited Talk: Reproducible, Reusable, and Robust Reinforcement Learning NeurIPS 2018: How machine learning experts can work with policymakers to make good tech decisions [Invited Talk] NeurIPS 2018: Rethinking transparency and accountability in machine learning NeurIPS 2018: Developments in machine learning through the lens of Counterfactual Inference [Tutorial] Accountability and algorithmic bias: Why diversity and inclusion matters [NeurIPS Invited Talk]
Read more
  • 0
  • 0
  • 26494
article-image-implement-memory-oltp-sql-server-linux
Fatema Patrawala
17 Feb 2018
11 min read
Save for later

How to implement In-Memory OLTP on SQL Server in Linux

Fatema Patrawala
17 Feb 2018
11 min read
[box type="note" align="" class="" width=""]Below given article is an excerpt from the book SQL Server on Linux, authored by Jasmin Azemović. This book is a handy guide to setting up and implementing your SQL Server solution on the open source Linux platform.[/box] Today we will learn about the basics of In-Memory OLTP and how to implement it on SQL Server on Linux through the following topics: Elements of performance What is In-Memory OLTP Implementation Elements of performance How do you know if you have a performance issue in your database environment? Well, let's put it in these terms. You notice it (the good), users start calling technical support and complaining about how everything is slow (the bad) or you don't know about your performance issues (the ugly). Try to never get in to the last category. The good Achieving best performance is an iterative process where you need to define a set of tasks that you will execute on a regular basics and monitor their results. Here is a list that will give you an idea and guide you through this process: Establish the baseline Define the problem Fix one thing at a time Test and re-establish the baseline Repeat everything Establishing the baseline is the critical part. In most case scenarios, it is not possible without real stress testing. Example: How many users' systems can you handle on the current configuration? The next step is to measure the processing time. Do your queries or stored procedures require milliseconds, seconds, or minutes to execute? Now you need to monitor your database server using a set of tools and correct methodologies. During that process, you notice that some queries show elements of performance degradation. This is the point that defines the problem. Let's say that frequent UPDATE and DELETE operations are resulting in index fragmentation. The next step is to fix this issue with REORGANIZE or REBUILD index operations. Test your solution in the control environment and then in the production. Results can be better, same, or worse. It depends and there is no magic answer here. Maybe now something else is creating the problem: disk, memory, CPU, network, and so on. In this step, you should re-establish the old or a new baseline. Measuring performance process is something that never ends. You should keep monitoring the system and stay alert. The bad If you are in this category, then you probably have an issue with establishing the baseline and alerting the system. So, users are becoming your alerts and that is a bad thing. The rest of the steps are the same except re-establishing the baseline. But this can be your wake-up call to move yourself in the good category. The ugly This means that you don't know or you don't want to know about performance issues. The best case scenario is a headline on some news portal, but that is the ugly thing. Every decent DBA should try to be light years away from this category. What do you need to start working with performance measuring, monitoring, and fixing? Here are some tips that can help you: Know the data and the app Know your server and its capacity Use dynamic management views—DMVs: Sys.dm_os_wait_stats Sys.dm_exec_query_stats sys.dm_db_index_operational_stats Look for top queries by reads, writes, CPU, execution count Put everything in to LibreOffice Calc or another spreadsheet application and do some basic comparative math Fortunately, there is something in the field that can make your life really easy. It can boost your environment to the scale of warp speed (I am a Star Trek fan). What is In-Memory OLTP? SQL Server In-Memory feature is unique in the database world. The reason is very simple; because it is built-in to the databases' engine itself. It is not a separate database solution and there are some major benefits of this. One of these benefits is that in most cases you don't have to rewrite entire SQL Server applications to see performance benefits. On average, you will see 10x more speed while you are testing the new In-Memory capabilities. Sometimes you will even see up to 50x improvement, but it all depends on the amount of business logic that is done in the database via stored procedures. The greater the logic in the database, the greater the performance increase. The more the business logic sits in the app, the less opportunity there is for performance increase. This is one of the reasons for always separating database world from the rest of the application layer. It has built-in compatibility with other non-memory tables. This way you can optimize thememory you have for the most heavily used tables and leave others on the disk. This also means you won't have to go out and buy expensive new hardware to make large InMemory databases work; you can optimize In-Memory to fit your existing hardware. In-Memory was started in SQL Server 2014. One of the first companies that has started to use this feature during the development of the 2014 version was Bwin. This is an online gaming company. With In-Memory OLTP they improved their transaction speed by 16x, without investing in new expensive hardware. The same company has achieved 1.2 Million requests/second on SQL Server 2016 with a single machine using In-Memory OLTP: https://blogs.msdn.microsoft.com/sqlcat/2016/10/26/how-bwin-is-using-sql-server-2016-in-memory-oltp-to-achieve-unprecedented-performance-and-scale/ Not every application will benefit from In-Memory OLTP. If an application is not suffering from performance problems related to concurrency, IO pressure, or blocking, it's probably not a good candidate. If the application has long-running transactions that consume large amounts of buffer space, such as ETL processing, it's probably not a good candidate either. The best applications for consideration would be those that run high volumes of small fast transactions, with repeatable query plans such as order processing, reservation systems, stock trading, and ticket processing. The biggest benefits will be seen on systems that suffer performance penalties from tables that are having concurrency issues related to a large number of users and locking/blocking. Applications that heavily use the tempdb for temporary tables could benefit from In-Memory OLTP by creating the table as memory optimized, and performing the expensive sorts, and groups, and selective queries on the tables that are memory optimized. In-Memory OLTP quick start An important thing to remember is that the databases that will contain memory-optimized tables must have a MEMORY_OPTIMIZED_DATA filegroup. This filegroup is used for storing the checkpoint needed by SQL Server to recover the memory-optimized tables. Here is a simple DDL SQL statement to create a database that is prepared for In-Memory tables: 1> USE master 2> GO 1> CREATE DATABASE InMemorySandbox 2> ON 3> PRIMARY (NAME = InMemorySandbox_data, 4> FILENAME = 5> '/var/opt/mssql/data/InMemorySandbox_data_data.mdf', 6> size=500MB), 7> FILEGROUP InMemorySandbox_fg 8> CONTAINS MEMORY_OPTIMIZED_DATA 9> (NAME = InMemorySandbox_dir, 10> FILENAME = 11> '/var/opt/mssql/data/InMemorySandbox_dir') 12> LOG ON (name = InMemorySandbox_log, 13> Filename= 14>'/var/opt/mssql/data/InMemorySandbox_data_data.ldf', 15> size=500MB) 16 GO   The next step is to alter the existing database and configure it to access memory-optimized tables. This part is helpful when you need to test and/or migrate current business solutions: --First, we need to check compatibility level of database. -- Minimum is 130 1> USE AdventureWorks 2> GO 3> SELECT T.compatibility_level 4> FROM sys.databases as T 5> WHERE T.name = Db_Name(); 6> GO compatibility_level ------------------- 120 (1 row(s) affected) --Change the compatibility level 1> ALTER DATABASE CURRENT 2> SET COMPATIBILITY_LEVEL = 130; 3> GO --Modify the transaction isolation level 1> ALTER DATABASE CURRENT SET 2> MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT=ON 3> GO --Finlay create memory optimized filegroup 1> ALTER DATABASE AdventureWorks 2> ADD FILEGROUP AdventureWorks_fg CONTAINS 3> MEMORY_OPTIMIZED_DATA 4> GO 1> ALTER DATABASE AdventureWorks ADD FILE 2> (NAME='AdventureWorks_mem', 3> FILENAME='/var/opt/mssql/data/AdventureWorks_mem') 4> TO FILEGROUP AdventureWorks_fg 5> GO   How to create memory-optimized table? The syntax for creating memory-optimized tables is almost the same as the syntax for creating classic disk-based tables. You will need to specify that the table is a memory-optimized table, which is done using the MEMORY_OPTIMIZED = ON clause. A memory-optimized table can be created with two DURABILITY values: SCHEMA_AND_DATA (default) SCHEMA_ONLY If you defined a memory-optimized table with DURABILITY=SCHEMA_ONLY, it means that changes to the table's data are not logged and the data is not persisted on disk. However, the schema is persisted as part of the database metadata. A side effect is that an empty table will be available after the database is recovered during a restart of SQL Server on Linux service.   The following table is a summary of key differences between those two DURABILITY Options. When you create a memory-optimized table, the database engine will generate DML routines just for accessing that table, and load them as DLLs files. SQL Server itself does not perform data manipulation, instead it calls the appropriate DLL: Now let's add some memory-optimized tables to our sample database:     1> USE InMemorySandbox 2> GO -- Create a durable memory-optimized table 1> CREATE TABLE Basket( 2> BasketID INT IDENTITY(1,1) 3> PRIMARY KEY NONCLUSTERED, 4> UserID INT NOT NULL INDEX ix_UserID 5> NONCLUSTERED HASH WITH (BUCKET_COUNT=1000000), 6> CreatedDate DATETIME2 NOT NULL,   7> TotalPrice MONEY) WITH (MEMORY_OPTIMIZED=ON) 8> GO -- Create a non-durable table. 1> CREATE TABLE UserLogs ( 2> SessionID INT IDENTITY(1,1) 3> PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT=400000), 4> UserID int NOT NULL, 5> CreatedDate DATETIME2 NOT NULL, 6> BasketID INT, 7> INDEX ix_UserID 8> NONCLUSTERED HASH (UserID) WITH (BUCKET_COUNT=400000)) 9> WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY) 10> GO -- Add some sample records 1> INSERT INTO UserLogs VALUES 2> (432, SYSDATETIME(), 1), 3> (231, SYSDATETIME(), 7), 4> (256, SYSDATETIME(), 7), 5> (134, SYSDATETIME(), NULL), 6> (858, SYSDATETIME(), 2), 7> (965, SYSDATETIME(), NULL) 8> GO 1> INSERT INTO Basket VALUES 2> (231, SYSDATETIME(), 536), 3> (256, SYSDATETIME(), 6547), 4> (432, SYSDATETIME(), 23.6), 5> (134, SYSDATETIME(), NULL) 6> GO -- Checking the content of the tables 1> SELECT SessionID, UserID, BasketID 2> FROM UserLogs 3> GO 1> SELECT BasketID, UserID 2> FROM Basket 3> GO   What is natively compiled stored procedure? This is another great feature that comes comes within In-Memory package. In a nutshell, it is a classic SQL stored procedure, but it is compiled into machine code for blazing fast performance. They are stored as native DLLs, enabling faster data access and more efficient query execution than traditional T-SQL. Now you will create a natively compiled stored procedure to insert 1,000,000 rows into Basket: 1> USE InMemorySandbox 2> GO 1> CREATE PROCEDURE dbo.usp_BasketInsert @InsertCount int 2> WITH NATIVE_COMPILATION, SCHEMABINDING AS 3> BEGIN ATOMIC 4> WITH 5> (TRANSACTION ISOLATION LEVEL = SNAPSHOT, 6> LANGUAGE = N'us_english') 7> DECLARE @i int = 0 8> WHILE @i < @InsertCount 9> BEGIN 10> INSERT INTO dbo.Basket VALUES (1, SYSDATETIME() , NULL) 11> SET @i += 1 12> END 13> END 14> GO --Add 1000000 records 1> EXEC dbo.usp_BasketInsert 1000000 2> GO   The insert part should be blazing fast. Again, it depends on your environment (CPU, RAM, disk, and virtualization). My insert was done in less than three seconds, on an average machine. But significant improvement should be visible now. Execute the following SELECT statement and count the number of records:   1> SELECT COUNT(*) 2> FROM dbo.Basket 3> GO ----------- 1000004 (1 row(s) affected)   In my case, counting of one million records was less than one second. It is really hard to achieve this performance on any kind of disk. Let's try another query. We want to know how much time it will take to find the top 10 records where the insert time was longer than 10 microseconds:   1> SELECT TOP 10 BasketID, CreatedDate 2> FROM dbo.Basket 3> WHERE DATEDIFF 4> (MICROSECOND,'2017-05-30 15:17:20.9308732', CreatedDate) 5> >10 6> GO   Again, query execution time was less than a second. Even if you remove TOP and try to get all the records it will take less than a second (in my case scenario). Advantages of InMemory tables are more than obvious.   We learnt about the basic concepts of In-Memory OLTP and how to implement it on new and existing database. We also got to know that a memory-optimized table can be created with two DURABILITY values and finally, we created an In-Memory table. If you found this article useful, check out the book SQL Server on Linux, which covers advanced SQL Server topics, demonstrating the process of setting up SQL Server database solution in the Linux environment.        
Read more
  • 0
  • 0
  • 26486

article-image-highest-paying-data-science-jobs-2017
Amey Varangaonkar
27 Nov 2017
10 min read
Save for later

Highest Paying Data Science Jobs in 2017

Amey Varangaonkar
27 Nov 2017
10 min read
It is no secret that this is the age of data. More data has been created in the last 2 years than ever before. Within the dumps of data created every second, businesses are looking for useful, action worthy insights which they can use to enhance their processes and thereby increase their revenue and profitability. As a result, the demand for data professionals, who sift through terabytes of data for accurate analysis and extract valuable business insights from it, is now higher than ever before. Think of Data Science as a large tree from which all things related to data branch out - from plain old data management and analysis to Big Data, and more.  Even the recently booming trends in Artificial Intelligence such as machine learning and deep learning are applied in many ways within data science. Data science continues to be a lucrative and growing job market in the recent years, as evidenced by the graph below: Source: Indeed.com In this article, we look at some of the high paying, high trending job roles in the data science domain that you should definitely look out for if you’re considering data science as a serious career opportunity. Let’s get started with the obvious and the most popular role. Data Scientist Dubbed as the sexiest job of the 21st century, data scientists utilize their knowledge of statistics and programming to turn raw data into actionable insights. From identifying the right dataset to cleaning and readying the data for analysis, to gleaning insights from said analysis, data scientists communicate the results of their findings to the decision makers. They also act as advisors to executives and managers by explaining how the data affects a particular product or process within the business so that appropriate actions can be taken by them. Per Salary.com, the median annual salary for the role of a data scientist today is $122,258, with a range between $106,529 to $137,037. The salary is also accompanied by a whole host of benefits and perks which vary from one organization to the other, making this job one of the best and the most in-demand, in the job market today. This is a clear testament to the fact that an increasing number of businesses are now taking the value of data seriously, and want the best talent to help them extract that value. There are over 20,000 jobs listed for the role of data scientist, and the demand is only growing. Source: Indeed.com To become a data scientist, you require a bachelor’s or a master’s degree in mathematics or statistics and work experience of more than 5 years in a related field. You will need to possess a unique combination of technical and analytical skills to understand the problem statement and propose the best solution, good programming skills to develop effective data models, and visualization skills to communicate your findings with the decision makers. Interesting in becoming a data scientist? Here are some resources to help you get started: Principles of Data Science Data Science Algorithms in a Week Getting Started with R for Data Science [Video] For a more comprehensive learning experience, check out our skill plan for becoming a data scientist on Mapt, our premium skills development platform. Data Analyst Probably a term you are quite familiar with, Data Analysts are responsible for crunching large amounts of data and analyze it to come to appropriate logical conclusions. Whether it’s related to pure research or working with domain-specific data, a data analyst’s job is to help the decision-makers’ job easier by giving them useful insights. Effective data management, analyzing data, and reporting results are some of the common tasks associated with this role. How is this role different than a data scientist, you might ask. While data scientists specialize in maths, statistics and predictive analytics for better decision making, data analysts specialize in the tools and components of data architecture for better analysis. Per Salary.com, the median annual salary for an entry-level data analyst is $55,804, and the range usually falls between $50,063 to $63,364 excluding bonuses and benefits. For more experienced data analysts, this figure rises to around a mean annual salary of $88,532. With over 83,000 jobs listed on Indeed.com, this is one of the most popular job roles in the data science community today. This profile requires a pretty low starting point, and is justified by the low starting salary packages. As you gain more experience, you can move up the ladder and look at becoming a data scientist or a data engineer. Source: Indeed.com You may also come across terms such as business data analyst or simply business analyst which are sometimes interchangeably used with the role of a data analyst. While their primary responsibilities are centered around data crunching, business analysts model company infrastructure, while data analysts model business data structures. You can find more information related to the differences in this interesting article. If becoming a data analyst is something that interests you, here are some very good starting points: Data Analysis with R Python Data Analysis, Second Edition Learning Python Data Analysis [Video] Data Architect Data architects are responsible for creating a solid data management blueprint for an organization. They are primarily responsible for designing the data architecture and defining how data is stored, consumed and managed by different applications and departments within the organization. Because of these critical responsibilities, a data architect’s job is a very well-paid one. Per Salary.com, the median annual salary for an entry-level data architect is $74,809, with a range between $57,964 to $91,685. For senior-level data architects, the median annual salary rises up to $136,856, with a range usually between $121,969 to $159,212. These high figures are justified by the critical nature of the role of a data architect - planning and designing the right data infrastructure after understanding the business considerations to get the most value out of the data. At present, there are over 23,000 jobs for the role listed on Indeed.com, with a stable trend in job seeker interest, as shown: Source: Indeed.com To become a data architect, you need a bachelor’s degree in computer science, mathematics, statistics or a related field, and loads of real-world skills to qualify for even the entry-level positions. Technical skills such as statistical modeling, knowledge of languages such as Python and R, database architectures, Hadoop-based skills, knowledge of NoSQL databases, and some machine learning and data mining are required to become a data architect. You also need strong collaborative skills, problem-solving, creativity and the ability to think on your feet, to solve the trickiest of problems on the go. Suffice to say it’s not an easy job, but it is definitely a lucrative one! Get ahead of the curve, and start your journey to becoming a data architect now: Big Data Analytics Hadoop Blueprints PostgreSQL for Data Architects Data Engineer Data engineers or Big Data engineers are a crucial part of the organizational workforce and work in tandem with data architects and data scientists to ensure appropriate data management systems are deployed and the right kind of data is being used for analysis. They deal with messy, unstructured Big Data and strive to provide clean, usable data to the other teams within the organization. They build high-performance analytics pipelines and develop set of processes for efficient data mining. In many companies, the role of a data engineer is closely associated with that of a data architect. While an architect is responsible for the planning and designing stages of the data infrastructure project, a data engineer looks after the construction, testing and maintenance of the infrastructure. As such data engineers tend to have a more in-depth understanding of different data tools and languages than data architects. There are over 90,000 jobs listed on Indeed.com, suggesting there is a very high demand in the organizations for this kind of a role. An entry level data engineer has a median annual salary of $90,083 per Payscale.com, with a range of $60,857 to $131,851. For Senior Data Engineers, the average salary shoots up to $123,749 as per Glassdoor estimates. Source: Indeed.com With the unimaginable rise in the sheer volume of data, the onus is on the data engineers to build the right systems that empower the data analysts and data scientists to sift through the messy data and derive actionable insights from it. If becoming a data engineer is something that interests you, here are some of our products you might want to look at: Real-Time Big Data Analytics Apache Spark 2.x Cookbook Big Data Visualization You can also check out our detailed skill plan on becoming a Big Data Engineer on Mapt. Chief Data Officer There is a countless number of organizations that build their businesses on data, but don’t manage it that well. This is where a senior executive popularly known as the Chief Data Officer (CDO) comes into play - bearing the responsibility for implementing the organization’s data and information governance and assisting with data-driven business strategies. They are primarily responsible for ensuring that their organization gets the most value out of their data and put appropriate plans in place for effective data quality and its life-cycle management. The role of a CDO is one of the most lucrative and highest paying jobs in the data science frontier. An average median annual pay for a CDO per Payscale.com is around $192,812. Indeed.com lists just over 8000 job postings too - this is not a very large number, but understandable considering the recent emergence of the role and because it’s a high profile, C-suite job. Source: Indeed.com According to a Gartner research, almost 50% companies in a variety of regulated industries will have a CDO in place, by 2017. Considering the demand for the role and the fact that it is only going to rise in the future, the role of a CDO is one worth vying for.   To become a CDO, you will obviously need a solid understanding of statistical, mathematical and analytical concepts. Not just that, extensive and high-value experience in managing technical teams and information management solutions is also a prerequisite. Along with a thorough understanding of the various Big Data tools and technologies, you will need strong communication skills and deep understanding of the business. If you’re planning to know more about how you can become a Chief Data Officer, you can browse through our piece on the role of CDO. Why demand for data science professionals will rise It’s hard to imagine an organization which doesn’t have to deal with data, but it’s harder to imagine the state of an organization with petabytes of data and not knowing what to do with it. With the vast amounts of data, organizations deal with these days, the need for experts who know how to handle the data and derive relevant and timely insights from it is higher than ever. In fact, IBM predicts there’s going to be a severe shortage of data science professionals, and thereby, a tremendous growth in terms of job offers and advertised openings, by 2020. Not everyone is equipped with the technical skills and know-how associated with tasks such as data mining, machine learning and more. This is slowly creating a massive void in terms of talent that organizations are looking to fill quickly, by offering lucrative salaries and added benefits. Without the professional expertise to turn data into actionable insights, Big Data becomes all but useless.      
Read more
  • 0
  • 0
  • 26415
Modal Close icon
Modal Close icon