Data | Tech News, Tutorials & Expert Insights

19 Mar 2010

3 min read

iReport in NetBeans

19 Mar 2010

Creating different types of reports inside the NetBeans IDE The first step is to download the NetBeans IDE and the iReport plugin for this. The iReport plugin for NetBeans is available for free download at the following locations: https://sourceforge.net/projects/ireport/files or http://plugins.netbeans.org/PluginPortal/faces/PluginDetailPage.jsp?pluginid=4425 After downloading the plugin, follow the listed steps to install the plugin in NetBeans: Start the NetBeans IDE. Go to Tools | Plugins. Select the Downloaded tab. Press Add Plugins…. Select the plugin files. For iReport 3.7.0 the plugins are: iReport-3.7.0.nbm, jasperreports-components-plugin-3.7.0.nbm, jasperreportsextensions-plugin-3.7.0.nbm, and jasperserver-plugin-3.7.0.nbm. After opening the plugin files you will see the following screen: Check the Install checkbox of ireport-designer, and press the Install button at the bottom of the window. The following screen will appear: Press Next >, and accept the terms of the License Agreement. If the Verify Certificate dialog box appears, click Continue. Press Install, and wait for the installer to complete the installation. After the installation is done, press Finish and close the Plugins dialog. If the IDE requests for a restart, then do it. Now the IDE is ready for creating reports. Creating reports We have already learnt about creating various types of reports, such as reports without parameters, reports with parameters, reports with variables, subreports, crosstab reports, reports with charts and images, and so on. We have also attained knowledge associated with these types of reports. Now, we will learn quickly how to create these reports using NetBeans with the help of the installed iReport plugins. Creating a NetBeans database JDBC connection The first step is to create a database connection, which will be used by the report data sources. Follow the listed steps: Select the Services tab from the left side of the project window. Select Databases. Right-click on Databases, and press New Connection…. In the New Database Connection dialog, set the following under Basic setting, and check the Remember password checkbox: Option Value Driver Name MySQL (Connector/J Driver) Host localhost Port 3306 Database inventory User Name root Password packt Press OK. Now the connection is created, and you can see this under the Services | Databases section, as shown in the following screenshot: Creating a report data source The NetBeans database JDBC connection created previously will be used by a report data source that will be used by the report. Follow the listed steps to create the data source: From the NetBeans toolbar, press the Report Datasources button. You will see the following dialog box: Press New. Select NetBeans Database JDBC connection, and press Next >. Enter inventory in the Name field, and from the Connection drop-down list, select jdbc:mysql://localhost:3306/inventory [root on Default schema]. Press Test, and if the connection is successful, press Save and close the Connections / Datasources dialog box.

0
0
27718

article-image-putting-your-database-heart-azure-solutions

Packt

28 Oct 2015

19 min read

Putting Your Database at the Heart of Azure Solutions

Packt

28 Oct 2015

19 min read

In this article by Riccardo Becker, author of the book Learning Azure DocumentDB, we will see how to build a real scenario around an Internet of Things scenario. This scenario will build a basic Internet of Things platform that can help to accelerate building your own. In this article, we will cover the following: Have a look at a fictitious scenario Learn how to combine Azure components with DocumentDB Demonstrate how to migrate data to DocumentDB (For more resources related to this topic, see here.) Introducing an Internet of Things scenario Before we start exploring different capabilities to support a real-life scenario, we will briefly explain the scenario we will use throughout this article. IoT, Inc. IoT, Inc. is a fictitious start-up company that is planning to build solutions in the Internet of Things domain. The first solution they will build is a registration hub, where IoT devices can be registered. These devices can be diverse, ranging from home automation devices up to devices that control traffic lights and street lights. The main use case for this solution is offering the capability for devices to register themselves against a hub. The hub will be built with DocumentDB as its core component and some Web API to expose this functionality. Before devices can register themselves, they need to be whitelisted in order to prevent malicious devices to start registering. In the following screenshot, we see the high-level design of the registration requirement: The first version of the solution contains the following components: A Web API containing methods to whitelist, register, unregister, and suspend devices DocumentDB, containing all the device information including information regarding other Microsoft Azure resources Event Hub, a Microsoft Azure asset that enables scalable publish-subscribe mechanism to ingress and egress millions of events per second Power BI, Microsoft’s online offering to expose reporting capabilities and the ability to share reports Obviously, we will focus on the core of the solution which is DocumentDB but it is nice to touch some of the Azure components, as well to see how well they co-operate and how easy it is to set up a demonstration for IoT scenarios. The devices on the left-hand side are chosen randomly and will be mimicked by an emulator written in C#. The Web API will expose the functionality required to let devices register themselves at the solution and start sending data afterwards (which will be ingested to the Event Hub and reported using Power BI). Technical requirements To be able to service potentially millions of devices, it is necessary that registration request from a device is being stored in a separate collection based on the country where the device is located or manufactured. Every device is being modeled in the same way, whereas additional metadata can be provided upon registration or afterwards when updating. To achieve country-based partitioning, we will create a custom PartitionResolver to achieve this goal. To extend the basic security model, we reduce the amount of sensitive information in our configuration files. Enhance searching capabilities because we want to service multiple types of devices each with their own metadata and device-specific information. Querying on all the information is desired to support full-text search and enable users to quickly search and find their devices. Designing the model Every device is being modeled similar to be able to service multiple types of devices. The device model contains at least the deviceid and a location. Furthermore, the device model contains a dictionary where additional device properties can be stored. The next code snippet shows the device model: [JsonProperty("id")] public string DeviceId { get; set; } [JsonProperty("location")] public Point Location { get; set; } //practically store any metadata information for this device [JsonProperty("metadata")] public IDictionary<string, object> MetaData { get; set; } The Location property is of type Microsoft.Azure.Documents.Spatial.Point because we want to run spatial queries later on in this section, for example, getting all the devices within 10 kilometers of a building. Building a custom partition resolver To meet the first technical requirement (partition data based on the country), we need to build a custom partition resolver. To be able to build one, we need to implement the IPartitionResolver interface and add some logic. The resolver will take the Location property of the device model and retrieves the country that corresponds with the latitude and longitude provided upon registration. In the following code snippet, you see the full implementation of the GeographyPartitionResolver class: public class GeographyPartitionResolver : IPartitionResolver { private readonly DocumentClient _client; private readonly BingMapsHelper _helper; private readonly Database _database; public GeographyPartitionResolver(DocumentClient client, Database database) { _client = client; _database = database; _helper = new BingMapsHelper(); } public object GetPartitionKey(object document) { //get the country for this document //document should be of type DeviceModel if (document.GetType() == typeof(DeviceModel)) { //get the Location and translate to country var country = _helper.GetCountryByLatitudeLongitude( (document as DeviceModel).Location.Position.Latitude, (document as DeviceModel).Location.Position.Longitude); return country; } return String.Empty; } public string ResolveForCreate(object partitionKey) { //get the country for this partitionkey //check if there is a collection for the country found var countryCollection = _client.CreateDocumentCollectionQuery(database.SelfLink). ToList().Where(cl => cl.Id.Equals(partitionKey.ToString())).FirstOrDefault(); if (null == countryCollection) { countryCollection = new DocumentCollection { Id = partitionKey.ToString() }; countryCollection = _client.CreateDocumentCollectionAsync(_database.SelfLink, countryCollection).Result; } return countryCollection.SelfLink; } /// <summary> /// Returns a list of collectionlinks for the designated partitionkey (one per country) /// </summary> /// <param name="partitionKey"></param> /// <returns></returns> public IEnumerable<string> ResolveForRead(object partitionKey) { var countryCollection = _client.CreateDocumentCollectionQuery(_database.SelfLink). ToList().Where(cl => cl.Id.Equals(partitionKey.ToString())).FirstOrDefault(); return new List<string> { countryCollection.SelfLink }; } } In order to have the DocumentDB client use this custom PartitionResolver, we need to assign it. The code is as follows: GeographyPartitionResolver resolver = new GeographyPartitionResolver(docDbClient, _database); docDbClient.PartitionResolvers[_database.SelfLink] = resolver; //Adding a typical device and have the resolver sort out what //country is involved and whether or not the collection already //exists (and create a collection for the country if needed), use //the next code snippet. var deviceInAmsterdam = new DeviceModel { DeviceId = Guid.NewGuid().ToString(), Location = new Point(4.8951679, 52.3702157) }; Document modelAmsDocument = docDbClient.CreateDocumentAsync(_database.SelfLink, deviceInAmsterdam).Result; //get all the devices in Amsterdam var doc = docDbClient.CreateDocumentQuery<DeviceModel>( _database.SelfLink, null, resolver.GetPartitionKey(deviceInAmsterdam)); Now that we have created a country-based PartitionResolver, we can start working on the Web API that exposes the registration method. Building the Web API A Web API is an online service that can be used by any clients running any framework that supports the HTTP programming stack. Currently, REST is a way of interacting with APIs so that we will build a REST API. Building a good API should aim for platform independence. A well-designed API should also be able to extend and evolve without affecting existing clients. First, we need to whitelist the devices that should be able to register themselves against our device registry. The whitelist should at least contain a device ID, a unique identifier for a device that is used to match during the whitelisting process. A good candidate for a device ID is the mac address of the device or some random GUID. Registering a device The registration Web API contains a POST method that does the actual registration. First, it creates access to an Event Hub (not explained here) and stores the credentials needed inside the DocumentDB document. The document is then created inside the designated collection (based on the location). To learn more about Event Hubs, please visit https://azure.microsoft.com/en-us/services/event-hubs/. [Route("api/registration")] [HttpPost] public async Task<IHttpActionResult> Post([FromBody]DeviceModel value) { //add the device to the designated documentDB collection (based on country) try { var serviceUri = ServiceBusEnvironment.CreateServiceUri("sb", serviceBusNamespace, String.Format("{0}/publishers/{1}", "telemetry", value.DeviceId)) .ToString() .Trim('/'); var sasToken = SharedAccessSignatureTokenProvider.GetSharedAccessSignature(EventHubKeyName, EventHubKey, serviceUri, TimeSpan.FromDays(365 * 100)); // hundred years will do //this token can be used by the device to send telemetry //this token and the eventhub name will be saved with the metadata of the document to be saved to DocumentDB value.MetaData.Add("Namespace", serviceBusNamespace); value.MetaData.Add("EventHubName", "telemetry"); value.MetaData.Add("EventHubToken", sasToken); var document = await docDbClient.CreateDocumentAsync(_database.SelfLink, value); return Created(document.ContentLocation, value); } catch (Exception ex) { return InternalServerError(ex); } } After this registration call, the right credentials on the Event Hub have been created for this specific device. The device is now able to ingress data to the Event Hub and have consumers like Power BI consume the data and present it. Event Hubs is a highly scalable publish-subscribe event ingestor. It can collect millions of events per second so that you can process and analyze the massive amounts of data produced by your connected devices and applications. Once collected into Event Hubs, you can transform and store the data by using any real-time analytics provider or with batching/storage adapters. At the time of writing, Microsoft announced the release of Azure IoT Suite and IoT Hubs. These solutions offer internet of things capabilities as a service and are well-suited to build our scenario as well. Increasing searching We have seen how to query our documents and retrieve the information we need. For this approach, we need to understand the DocumentDB SQL language. Microsoft has an online offering that enables full-text search called Azure Search service. This feature enables us to perform full-text searches and it also includes search behaviours similar to search engines. We could also benefit from so called type-ahead query suggestions based on the input of a user. Imagine a search box on our IoT Inc. portal that offers free text searching while the user types and search for devices that include any of the search terms on the fly. Azure Search runs on Azure; therefore, it is scalable and can easily be upgraded to offer more search and storage capacity. Azure Search stores all your data inside an index, offering full-text search capabilities on your data. Setting up Azure Search Setting up Azure Search is pretty straightforward and can be done by using the REST API it offers or on the Azure portal. We will set up the Azure Search service through the portal and later on, we will utilize the REST API to start configuring our search service. We set up the Azure Search service through the Azure portal (http://portal.azure.com). Find the Search service and fill out some information. In the following screenshot, we can see how we have created the free tier for Azure Search: You can see that we use the Free tier for this scenario and that there are no datasources configured yet. We will do that know by using the REST API. We will use the REST API, since it offers more insight on how the whole concept works. We use Fiddler to create a new datasource inside our search environment. The following screenshot shows how to use Fiddler to create a datasource and add a DocumentDB collection: In the Composer window of Fiddler, you can see we need to POST a payload to the Search service we created earlier. The Api-Key is mandatory and also set the content type to be JSON. Inside the body of the request, the connection information to our DocumentDB environment is need and the collection we want to add (in this case, Netherlands). Now that we have added the collection, it is time to create an Azure Search index. Again, we use Fiddler for this purpose. Since we use the free tier of Azure Search, we can only add five indexes at most. For this scenario, we add an index on ID (device ID), location, and metadata. At the time of writing, Azure Search does not support complex types. Note that the metadata node is represented as a collection of strings. We could check in the portal to see if the creation of the index was successful. Go to the Search blade and select the Search service we have just created. You can check the indexes part to see whether the index was actually created. The next step is creating an indexer. An indexer connects the index with the provided data source. Creating this indexer takes some time. You can check in the portal if the indexing process was successful. We actually find that documents are part of the index now. If your indexer needs to process thousands of documents, it might take some time for the indexing process to finish. You can check the progress of the indexer using the REST API again. https://iotinc.search.windows.net/indexers/deviceindexer/status?api-version=2015-02-28 Using this REST call returns the result of the indexing process and indicates if it is still running and also shows if there are any errors. Errors could be caused by documents that do not have the id property available. The final step involves testing to check whether the indexing works. We will search for a device ID, as shown in the next screenshot: In the Inspector tab, we can check for the results. It actually returns the correct document also containing the location field. The metadata is missing because complex JSON is not supported (yet) at the time of writing. Indexing complex JSON types is not supported yet. It is possible to add SQL queries to the data source. We could explicitly add a SELECT statement to surface the properties of the complex JSON we have like metadata or the Point property. Try adding additional queries to your data source to enable querying complex JSON types. Now that we have created an Azure Search service that indexes our DocumentDB collection(s), we can build a nice query-as-you-type field on our portal. Try this yourself. Enhancing security Microsoft Azure offers a capability to move your secrets away from your application towards Azure Key Vault. Azure Key Vault helps to protect cryptographic keys, secrets, and other information you want to store in a safe place outside your application boundaries (connectionstring are also good candidates). Key Vault can help us to protect the DocumentDB URI and its key. DocumentDB has no (in-place) encryption feature at the time of writing, although a lot of people already asked for it to be on the roadmap. Creating and configuring Key Vault Before we can use Key Vault, we need to create and configure it first. The easiest way to achieve this is by using PowerShell cmdlets. Please visit https://msdn.microsoft.com/en-us/mt173057.aspx to read more about PowerShell. The following PowerShell cmdlets demonstrate how to set up and configure a Key Vault: Command Description Get-AzureSubscription This command will prompt you to log in using your Microsoft Account. It returns a list of all Azure subscriptions that are available to you. Select-AzureSubscription -SubscriptionName "Windows Azure MSDN Premium" This tells PowerShell to use this subscription as being subject to our next steps. Switch-AzureMode AzureResourceManager New-AzureResourceGroup –Name 'IoTIncResourceGroup' –Location 'West Europe' This creates a new Azure Resource Group with a name and a location. New-AzureKeyVault -VaultName 'IoTIncKeyVault' -ResourceGroupName 'IoTIncResourceGroup' -Location 'West Europe' This creates a new Key Vault inside the resource group and provide a name and location. $secretvalue = ConvertTo-SecureString '<DOCUMENTDB KEY>' -AsPlainText –Force This creates a security string for my DocumentDB key. $secret = Set-AzureKeyVaultSecret -VaultName 'IoTIncKeyVault' -Name 'DocumentDBKey' -SecretValue $secretvalue This creates a key named DocumentDBKey into the vault and assigns it the secret value we have just received. Set-AzureKeyVaultAccessPolicy -VaultName 'IoTIncKeyVault' -ServicePrincipalName <SPN> -PermissionsToKeys decrypt,sign This configures the application with the Service Principal Name <SPN> to get the appropriate rights to decrypt and sign Set-AzureKeyVaultAccessPolicy -VaultName 'IoTIncKeyVault' -ServicePrincipalName <SPN> -PermissionsToSecrets Get This configures the application with SPN to also be able to get a key. Key Vault must be used together with Azure Active Directory to work. The SPN we need in the steps for powershell is actually is a client ID of an application I have set up in my Azure Active Directory. Please visit https://azure.microsoft.com/nl-nl/documentation/articles/active-directory-integrating-applications/ to see how you can create an application. Make sure to copy the client ID (which is retrievable afterwards) and the key (which is not retrievable afterwards). We use these two pieces of information to take the next step. Using Key Vault from ASP.NET In order to use the Key Vault we have created in the previous section, we need to install some NuGet packages into our solution and/or projects: Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory -Version 2.16.204221202 Install-Package Microsoft.Azure.KeyVault These two packages enable us to use AD and Key Vault from our ASP.NET application. The next step is to add some configuration information to our web.config file: <add key="ClientId" value="<CLIENTID OF THE APP CREATED IN AD" /> <add key="ClientSecret" value="<THE SECRET FROM AZURE AD PORTAL>" />  <add key="SecretUri" value="https://iotinckeyvault.vault.azure.net:443/secrets/DocumentDBKey" /> If you deploy the ASP.NET application to Azure, you could even configure these settings from the Azure portal itself, completely removing this from the web.config file. This technique adds an additional ring of security around your application. The following code snippet shows how to use AD and Key Vault inside the registration functionality of our scenario: //no more keys in code or .config files. Just a appid, secret and the unique URL to our key (SecretUri). When deploying to Azure we could //even skip this by setting appid and clientsecret in the Azure Portal. var kv = new KeyVaultClient(new KeyVaultClient.AuthenticationCallback(Utils.GetToken)); var sec = kv.GetSecretAsync(WebConfigurationManager.AppSettings["SecretUri"]).Result.Value; The Utils.GetToken method is shown next. This method retrieves an access token from AD by supplying the ClientId and the secret. Since we configured Key Vault to allow this application to get the keys, the call to GetSecretAsync() will succeed. The code is as follows: public async static Task<string> GetToken(string authority, string resource, string scope) { var authContext = new AuthenticationContext(authority); ClientCredential clientCred = new ClientCredential(WebConfigurationManager.AppSettings["ClientId"], WebConfigurationManager.AppSettings["ClientSecret"]); AuthenticationResult result = await authContext.AcquireTokenAsync(resource, clientCred); if (result == null) throw new InvalidOperationException("Failed to obtain the JWT token"); return result.AccessToken; } Instead of storing the key to DocumentDB somewhere in code or in the web.config file, it is now moved away to Key Vault. We could do the same with the URI to our DocumentDB and with other sensitive information as well (for example, storage account keys or connection strings). Encrypting sensitive data The documents we created in the previous section contains sensitive data like namespaces, Event Hub names, and tokens. We could also use Key Vault to encrypt those specific values to enhance our security. In case someone gets hold of a document containing the device information, he is still unable to mimic this device since the keys are encrypted. Try to use Key Vault to encrypt the sensitive information that is stored in DocumentDB before it is saved in there. Migrating data This section discusses how to use a tool to migrate data from an existing data source to DocumentDB. For this scenario, we assume that we already have a large datastore containing existing devices and their registration information (Event Hub connection information). In this section, we will see how to migrate an existing data store to our new DocumentDB environment. We use the DocumentDB Data Migration Tool for this. You can download this tool from the Microsoft Download Center (http://www.microsoft.com/en-us/download/details.aspx?id=46436) or from GitHub if you want to check the code. The tool is intuitive and enables us to migrate from several datasources: JSON files MongoDB SQL Server CSV files Azure Table storage Amazon DynamoDB HBase DocumentDB collections To demonstrate the use, we migrate our existing Netherlands collection to our United Kingdom collection. Start the tool and enter the right connection string to our DocumentDB database. We do this for both our source and target information in the tool. The connection strings you need to provide should look like this: AccountEndpoint=https://<YOURDOCDBURL>;AccountKey=<ACCOUNTKEY>;Database=<NAMEOFDATABASE>. You can click on the Verify button to make sure these are correct. In the Source Information field, we provide the Netherlands as being the source to pull data from. In the Target Information field, we specify the United Kingdom as the target. In the following screenshot, you can see how these settings are provided in the migration tool for the source information: The following screenshot shows the settings for the target information: It is also possible to migrate data to a collection that is not created yet. The migration tool can do this if you enter a collection name that is not available inside your database. You also need to select the pricing tier. Optionally, setting the partition key could help to distribute your documents based on this key across all collections you add in this screen. This information is sufficient to run our example. Go to the Summary tab and verify the information you entered. Press Import to start the migration process. We can verify a successful import on the Import results pane. This example is a simple migration scenario but the tool is also capable of using complex queries to only migrate those documents that need to moved or migrated. Try migrating data from an Azure Table storage table to DocumentDB by using this tool. Summary In this article, we saw how to integrate DocumentDB with other Microsoft Azure features. We discussed how to setup the Azure Search service and how create an index to our collection. We also covered how to use the Azure Search feature to enable full-text search on our documents which could enable users to query while typing. Next, we saw how to add additional security to our scenario by using Key Vault. We also discussed how to create and configure Key Vault by using PowerShell cmdlets, and we saw how to enable our ASP.NET scenario application to make use of the Key Vault .NET SDK. Then, we discussed how to retrieve the sensitive information from Key Vault instead of configuration files. Finally, we saw how to migrate an existing data source to our collection by using the DocumentDB Data Migration Tool. Resources for Article: Further resources on this subject: Microsoft Azure – Developing Web API For Mobile Apps [article] Introduction To Microsoft Azure Cloud Services [article] Security In Microsoft Azure [article]

0
0
27672

article-image-how-we-think-ai-urge-ai-founding-fathers

Neil Aitken

31 May 2018

9 min read

We must change how we think about AI, urge AI founding fathers

Neil Aitken

31 May 2018

9 min read

0
0
27545

article-image-salesforce-crm-lightning-experience

Richa Tripathi

02 May 2018

4 min read

Build a custom Admin Home page in Salesforce CRM Lightning Experience

Richa Tripathi

02 May 2018

4 min read

0
0
27455

article-image-bad-metadata-can-get-you-in-legal-hot-water

Guest Contributor

21 Sep 2019

6 min read

Bad Metadata can get you in legal hot water

Guest Contributor

21 Sep 2019

6 min read

Metadata isn't just something that concerns business intelligence and IT teams; but lawyers are extremely interested in it as well. Metadata, it turns out, can win or lose lawsuits, send politicians to jail, and even decide medical malpractice cases. It's not uncommon for attorneys who conduct discovery of electronic records in organizations to find that the claims of plaintiffs or defendants are contradicted by metadata, like time and date, type of data, etc. If a discovery process is initiated against them, an organization had better be sure that its metadata is in order. All it would take for an organization to lose a case would be for an attorney to discover a discrepancy in different databases – a different time stamp on some communication, a different job title for a principal in the case. Such discrepancies could lead to accusations of data tampering, fraud, or worse – and would most definitely put the organization in a very tough position versus a judge or jury. Metadata errors are difficult to spot The problem with that, of course, is that catching metadata errors is extremely difficult. In large organizations, data is stored in repositories that are spread throughout the organization, maybe even the world – in different departments. Each department is responsible for maintaining its own database, and the metadata in it; and on different cloud storage repositories, which may have their own system of classifying data. An enterprising attorney could have a field day with the different categories and tags data is stored under, making claims that the organization is trying to “hide something.” The organization's only defense: We're poor administrators. That may not be enough to impress the court. Types of Metadata Metadata is “data about data,” and comes in three flavors: System Metadata, which is data that is automatically generated from the computer and includes specific labeled criteria, like the date and time of creation and date a document was modified, etc. Substantive Metadata reflects changes to a document, like tracked changes. Embedded metadata is data entered into a document or file but not normally visible, such as formulas in cells in an Excel spreadsheet. All of these have increasingly become targets for attorneys in recent years. Metadata has been used in thousands of cases – medical, financial, patent and trademark law, product liability, civil rights, and many more. Metadata is both discoverable and admissible as evidence. According to one New York court, “General information about the creation of a document, including who authored a document and when it was created, is pedigree information often important for purposes of determining admissibility at trial.” According to legal experts, “from a legal standpoint metadata is evidence… that describes the characteristics, origins, usage, and validity of other electronic evidence.” The biggest metadata-linked payout until now - $10.8 million – occurred in 2017, when a jury awarded a plaintiff $8 million (eventually this was increased to nearly $11 million) after claiming he was fired from a biotechnology company after telling authorities about potential bribery in China. The key piece of evidence was the metadata timestamp on a performance review that was written after the plaintiff was fired; with that evidence, the court increased the defendant's payout for violating laws against firing whistleblowers. In that case, records claiming that the employee was fired for cause were belied by the metadata in the performance review. That, of course, was a case in which there was clear wrongdoing by an organization. But the same metadata errors could have cropped up in any number of scenarios, even if no laws were broken. The precedent in this case, and others like it, might be enough to convince the court to penalize an organization based on claims of a plaintiff. How can organizations defend themselves from this legal bind of metadata The answer would seem obvious; get control of your metadata and make sure it corresponds to the data it represents. With that kind of control over data, organizations would discover for themselves if something was amiss that could cost them in a settlement later. But execution of that obvious answer is a different story. With reams of data to pore through, it would take an organization's business intelligence team months, or even years, to manually sift through the databases. And because to err is human, there would be no guarantee they hadn't missed something. Clearly Business Intelligence and Data Analysis teams need some help in doing this. One solution would be to hire more staff, expanding teams at least temporarily to make sense of the data and metadata that could prove problematic. There are services that will lend their staff to an organization to do just that, and for companies that prefer the “human touch,” adding that temporary staff may be the best solution. Another idea is to automate the process, with advanced tools that will do a full examination of data, both across systems and within repositories themselves. Such automated tools would examine the data in the various repositories and find where the metadata for the same information is different – pointing BI teams in the right direction and cutting down on the time needed to determine what needs to be fixed. Using automated metadata management tools, companies can ensure that they remain secure. If a company is being sued and discovery has commenced, it will be too late for the organization to fix anything. Honest mistakes or disorganized file keeping can no longer be corrected, and the fate of the organization will be in the hands of a jury or a judge. Automated metadata management tools can help Business Intelligence and Data Analysis teams figure out which metadata entries are not consistent across the repositories, ensuring that things are fixed before discovery takes place. There are a variety of tools on the market, with various strengths and weaknesses. Companies will need to decide whether a data dictionary, a business glossary, or a more all-encompassing product best answers their needs. They’ll also need to make sure the enterprise software they currently use is supported by the metadata management solution they are after. As the market develops, AI will be a huge distinguishing factor between metadata solutions, as machine learning will reduce the cost and manpower investment of solution onboarding significantly. With the success of recent metadata-based lawsuits, you can be sure more attorneys will be using metadata in their discovery processes. Organizations that want to defend themselves need to get their data in order, and ensure that they won't end up losing lots of money because of their own errors. Author Bio Amnon Drori is the Co-Founder and CEO of Octopai and has over 20 years of leadership experience in technology companies. Before co-founding Octopai he led sales efforts at companies like Panaya (Acquired by Infosys), Zend Technologies (Acquired by Rogue Wave Software), ModusNovo and Alvarion. Other interesting news in Tech Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data and Society report. Open AI researchers advance multi-agent competition by training AI agents in a hide and seek environment. France and Germany reaffirm blocking Facebook’s Libra cryptocurrency

0
0
27245

article-image-data-explorationusing-spark-sql

Packt

08 Mar 2018

9 min read

Data Exploration using Spark SQL

Packt

08 Mar 2018

9 min read

In this article, Aurobindo Sarkar, the author of the book, Learning Spark SQL, we will be covering the following points to introduce you to using Spark SQL for exploratory data analysis. What is exploratory Data Analysis (EDA)? Why EDA is important? Using Spark SQL for basic data analysis Visualizing data with Apache Zeppelin Introducing Exploratory Data Analysis (EDA) Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other more sophisticated analysis. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that a more sophisticated model-based analysis is not justified. Typically, the graphical techniques used in EDA are simple, consisting of plotting the raw data and simple statistics. The focus is on the structures and models revealed by the data or best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. In most EDA techniques, we use all of the data, without making any underlying assumptions. The analyst builds intuition, or gets a “feel”, for the data set as a result of such exploration. More specifically, the graphical techniques allow us to, efficiently, select and validate appropriate models, test our assumptions, identify relationships, select estimators, and detect outliers. EDA involves a lot of trial and error, and several iterations.The best way is to start simple and then build in complexity as you go along. There is a major trade-off in modeling between the simple and the more accurate ones. Simple models may be much easier to interpret and understand. These models can get you to 90% accuracy very quickly, versus a more complex model that might take weeks or months to get you an additional 2% improvement.For example, you should plot simple histograms and scatterplots to quickly start developing an intuition for your data. Using Spark SQL for basic data analysis Interactively, processing and visualizing large data is challenging as the queries can take long to execute and the visual interface cannot accommodate as many pixels as data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java, and R APIs, and libraries for distributed statistics and machine learning. For data that fits into a single computer there are many good tools available such as R, MATLAB, and others. However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data then this section will offer some good tools and techniques data exploration. Here, we will do some basic data exploration exercises to understand a sample dataset. We will use a dataset that contains data related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The marketing campaigns were based on phone calls to customers. We use the bank-additional-full.csv file that contains 41188 records and 20 input fields, ordered by date (from May 2008 to November 2010). As a first step, let’s define a schema and read in the CSV file to create a DataFrame. You can use :paste to paste-in the initial set of statements in the Spark shell, as shown in the following figure: After the DataFrame is created, we first verify the number of records. We can also define a case class called Call for our input records, and then create a strongly-typed Dataset as follows: In the next section, we will begin our data exploration by identifying missing data in our dataset. Identifying Missing Data Missing data can occur in datasets due to reasons ranging from negligence to a refusal on part of respondants to provide a specific data point. However, in all cases missing data is a common occurrence in real-world datasets. Missing data can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies for dealing with it. Here, weanalyzethe numbers of records with missing data fields in our sample dataset. In order to simulate missing data, we will edit our sample dataset by replacing fields containing “unknown” values with empty strings. First, we created a DataFrame / Dataset from our edited file, as shown in the following figure: The following two statements give us a count of rows with certain fields having missing data. Later, we will look at effective ways of dealing with missing data and compute some basic statistics for sample dataset to improve our understanding of the data. Computing basic statistics Computing basic statistics is essential for a good preliminary understanding of our data. First, for convenience, we create a case class and a dataset containing a subset of fields from our original DataFrame. In our following example, we choose some of the numeric fields and the outcome field that is the “term deposit subscribed” field. Next, we use describe to quickly compute the count, mean, standard deviation, min and max values for the numeric columns in our dataset. Further, we use the stat package to compute additional statistics like covariance, correlation, creating crosstabs, examining items that occur most frequently in data columns, and computing quantiles. These computations are shown in the following figure: Next, we use the typed aggregation functions to summarize our data to understand our data better. In the following statement, we aggregate the results by whether a term deposit was subscribed along with total customers contacted, average number of calls made per customer, the average duration of the calls and the average number of previous calls made to such customers. The results are rounded to two decimal points. Similarly, executing the following statement givessimilar results by customers’ age. After getting a better understanding of our data by computing basic statistics, we shift our focus to identifying outliers in our data. Identifying data outliers An outlier or an anomalyis an observation of the data that deviates significantly from other observations in the dataset. Erroneous outliers are observations thatare distorted due to possible errors in the data-collection process. These outliers may exert undue influence on the results of statistical analysis, sothey should be identified using reliable detection methods prior to performing data analysis. Many algorithms find outliers as a side-product of clustering algorithms. However these techniques define outliers as points, which do not lie in clusters. The user has to model the data points using a statistical distributions, and the outliers identified depending on how they appear in relation to the underlying model. The main problem with these approaches is that during EDA, the user typically doesnot have enough knowledge about the underlying data distribution. EDA, using a modeling and visualizing approach, is a good way of achieving a deeper intuition of our data. SparkMLlib supports a large (and growing) set of distributed machine learning algorithms to make this task simpler. In the following example, we use the k-means clustering algorithm to compute two clusters in our data. Other distributed algorithms useful for EDA includeclassification, regression, dimensionality reduction, correlation and hypothesis testing. Visualizing data with Apache Zeppelin Typically, we will generate many graphs to verify our hunches about the data.A lot of thesequick and dirty graphs used during EDA are,ultimately, discarded. Exploratory data visualization is critical for data analysis and modeling. However, we often skip exploratory visualization with large data because it is hard. For instance, browsers, typically, cannot handle millions of data points.Hence we have to summarize, sample or model our data before we can effectively visualize it. Traditionally, BI tools provided extensive aggregation and pivoting features to visualize the data. However, these tools typically used nightly jobs to summarize large volumes of data. The summarized data wassubsequently downloaded and visualizedon the practitioner’s workstations. Spark can eliminate many of these batch jobs to support interactive data visualization. Here, we will explore some basic data visualization techniques using Apache Zeppelin. Apache Zeppelin is a web-based tool that supports interactive data analysis and visualization. It supports several language interpreters and comes with built-in Spark integration. Hence, it is quick and easy to get started with exploratory data analysis using Apache Zeppelin. You can download Appache Zeppelin from https://zeppelin.apache.org/. Unzip the package on your hard drive and start Zeppelin using the following command: Aurobindos-MacBook-Pro-2:zeppelin-0.6.2-bin-allaurobindosarkar$ bin/zeppelin-daemon.sh start You should see the following message: Zeppelin start [ OK] You should be able to see the Zeppelin home page at: http://localhost:8080/ Click on Create new note link, and specify a path and name for your notebook, as shown in the following figure: In the next step, we paste the same code as in the beginning of this article to create a DataFrame for our sample dataset. We can execute typical DataFrameoperations as shown in the following figure: Next, we create a table from our DataFrame and execute some SQL on it. The results of the SQL statements execution can be charted by clicking on the appropriatechart-type required. Here, we create bar charts as an illustrative example of summarizing and visualizing data: We can also plot a scatter plot, and read the coordinate values of each of the points plotted, as shown in the following two figures. Additionally, we can create a textbox that accepts input values to make experience interactive. In the following figure we create a textbox that can accept different values for the age parameter and the bar chart is updated, accordingly. Similarly, we can also create dropdown lists where the user can select the appropriate option, and the table of values or chart, automatically gets updated. Summary In this article, we demonstrated using Spark SQL for exploring datasets, performing basic data quality checks, generating samples and pivot tables, and visualizing data with Apache Zeppelin.

0
0
27233

article-image-how-to-build-a-cold-start-friendly-content-based-recommender-using-apache-spark-sql

Sunith Shetty

14 Dec 2017

12 min read

How to build a cold-start friendly content-based recommender using Apache Spark SQL

Sunith Shetty

14 Dec 2017

12 min read

[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Big Data Analytics with Java written by Rajat Mehta. In this book, you will learn how to perform analytics on big data using Java, machine learning and other big data tools. [/box] From the below given article, you will learn how to create the content-based recommendation system using movielens dataset. Getting started with content-based recommendation systems In content-based recommendations, the recommendation systems check for similarity between the items based on their attributes or content and then propose those items to the end users. For example, if there is a movie and the recommendation system has to show similar movies to the users, then it might check for the attributes of the movie such as the director name, the actors in the movie, the genre of the movie, and so on or if there is a news website and the recommendation system has to show similar news then it might check for the presence of certain words within the news articles to build the similarity criteria. As such the recommendations are based on actual content whether in the form of tags, metadata, or content from the item itself (as in the case of news articles). Dataset For this article, we are using the very popular movielens dataset, which was collected by the GroupLens Research Project at the University of Minnesota. This dataset contains a list of movies that are rated by their customers. It can be downloaded from the site https:/grouplens.org/datasets/movielens. There are few files in this dataset, they are: u.item:This contains the information about the movies in the dataset. The attributes in this file are: Movie Id This is the ID of the movie Movie title This is the title of the movie Video release date This is the release date of the movie Genre (isAction, isComedy, isHorror) This is the genre of the movie. This is represented by multiple flats such as isAction, isComedy, isHorror, and so on and as such it is just a Boolean flag with 1 representing users like this genre and 0 representing user do not like this genre. u.data: This contains the data about the users rating for the movies that they have watched. As such there are three main attributes in this file and they are: Userid This is the ID of the user rating the movie Item id This is the ID of the movie Rating This is the rating given to the movie by the user u.user: This contains the demographic information about the users. The main attributes from this file are: User id This is the ID of the user rating the movie Age This is the age of the user Gender This is the rating given to the movie by the user ZipCode This is the zip code of the place of the user As the data is pretty clean and we are only dealing with a few parameters, we are not doing any extensive data exploration in this article for this dataset. However, we urge the users to try and practice data exploration on this dataset and try creating bar charts using the ratings given, and find patterns such as top-rated movies or top-rated action movies, or top comedy movies, and so on. Content-based recommender on MovieLens dataset We will build a simple content-based recommender using Apache Spark SQL and the MovieLens dataset. For figuring out the similarity between movies, we will use the Euclidean Distance. This Euclidean Distance would be run using the genre properties, that is, action, comedy, horror, and so on of the movie items and also run on the average rating for each movie. First is the usual boiler plate code to create the SparkSession object. For this we first build the Spark configuration object and provide the master which in our case is local since we are running this locally in our computer. Next using this Spark configuration object we build the SparkSession object and also provide the name of the application here and that is ContentBasedRecommender. SparkConf sc = new SparkConf().setMaster("local"); SparkSession spark = SparkSession .builder() .config(sc) .appName("ContentBasedRecommender") .getOrCreate(); Once the SparkSession is created, load the rating data from the u.data file. We are using the Spark RDD API for this, the user can feel free to change this code and use the dataset API if they prefer to use that. On this Java RDD of the data, we invoke a map function and within the map function code we extract the data from the dataset and populate the data into a Java POJO called RatingVO: JavaRDD<RatingVO> ratingsRDD = spark.read().textFile("data/movie/u.data").javaRDD() .map(row -> { RatingVO rvo = RatingVO.parseRating(row); return rvo; }); As you can see, the data from each row of the u.data dataset is passed to the RatingVO.parseRating method in the lambda function. This parseRating method is declared in the POJO RatingVO. Let's look at the RatingVO POJO first. For maintaining brevity of the code, we are just showing the first few lines of the POJO here: public class RatingVO implements Serializable { private int userId; private int movieId; private float rating; private long timestamp; private int like; As you can see in the preceding code we are storing movieId, rating given by the user and the userId attribute in this POJO. This class RatingVO contains one useful method parseRating as shown next. In this method, we parse the actual data row (that is tab separated) and extract the values of rating, movieId, and so on from it. In the method, do note the if...else block, it is here that we check whether the rating given is greater than three. If it is, then we take it as a 1 or like and populate it in the like attribute of RatingVO, else we mark it as 0 or dislike for the user: public static RatingVO parseRating(String str) { String[] fields = str.split("t"); ... int userId = Integer.parseInt(fields[0]); int movieId = Integer.parseInt(fields[1]); float rating = Float.parseFloat(fields[2]); if(rating > 3) return new RatingVO(userId, movieId, rating, timestamp,1); return new RatingVO(userId, movieId, rating, timestamp,0); } Once we have the RDD object built with the POJO's RatingVO contained per row, we are now ready to build our beautiful dataset object. We will use the createDataFrame method on the spark session and provide the RDD containing data and the corresponding POJO class, that is, RatingVO. We also register this dataset as a temporary view called ratings so that we can fire SQL queries on it: Dataset<Row> ratings = spark.createDataFrame(ratingsRDD, RatingVO.class); ratings.createOrReplaceTempView("ratings"); ratings.show(); The show method in the preceding code would print the first few rows of this dataset as shown next: Next we will fire a Spark SQL query on this view (ratings) and in this SQL query we will find the average rating in this dataset for each movie. For this, we will group by on the movieId in this query and invoke an average function on the rating column as shown here: Dataset<Row> moviesLikeCntDS = spark.sql("select movieId,avg(rating) likesCount from ratings group by movieId"); The moviesLikeCntDS dataset now contains the results of our group by query. Next we load the data for the movies from the u.item data file. As we did for users, we store this data for movies in a MovieVO POJO. This MovieVO POJO contains the data for the movies such as the MovieId, MovieTitle and it also stores the information about the movie genre such as action, comedy, animation, and so on. The genre information is stored as 1 or 0. For maintaining the brevity of the code, we are not showing the full code of the lambda function here: JavaRDD<MovieVO> movieRdd = spark.read().textFile("data/movie/u.item").javaRDD() .map(row -> { String[] strs = row.split("|"); MovieVO mvo = new MovieVO(); mvo.setMovieId(strs[0]); mvo.setMovieTitle(strs[1]); ... mvo.setAction(strs[6]); mvo.setAdventure(strs[7]); ... return mvo; }); As you can see in the preceding code we split the data row and from the split result, which is an array of strings, we extract our individual values and store in the MovieVO object. The results of this operation are stored in the movieRdd object, which is a Spark RDD. Next, we convert this RDD into a Spark dataset. To do so, we invoke the Spark createDataFrame function and provide our movieRdd to it and also supply the MovieVO POJO here. After creating the dataset for the movie RDD, we perform an important step here of combining our movie dataset with the moviesLikeCntDS dataset we created earlier (recall that moviesLikeCntDS dataset contains our movie ID and the average rating for that review in this movie dataset). We also register this new dataset as a temporary view so that we can fire Spark SQL queries on it: Dataset<Row> movieDS = spark.createDataFrame(movieRdd.rdd(), MovieVO.class).join(moviesLikeCntDS, "movieId"); movieDS.createOrReplaceTempView("movies"); Before we move further on this program, we will print the results of this new dataset. We will invoke the show method on this new dataset: movieDS.show(); This would print the results as (for brevity we are not showing the full columns in the screenshot): Now comes the turn for the meat of this content recommender program. Here we see the power of Spark SQL, we will now do a self join within a Spark SQL query to the temporary view movie. Here, we will make a combination of every movie to every other movie in the dataset except to itself. So if you have say three movies in the set as (movie1, movie2, movie3), this would result in the combinations (movie1, movie2), (movie1, movie3), (movie2, movie3), (movie2, movie1), (movie3, movie1), (movie3, movie2). You must have noticed by now that this query would produce duplicates as (movie1, movie2) is same as (movie2, movie1), so we will have to write separate code to remove those duplicates. But for now the code for fetching these combinations is shown as follows, for maintaining brevity we have not shown the full code: Dataset<Row> movieDataDS = spark.sql("select m.movieId movieId1,m.movieTitle movieTitle1,m.action action1,m.adventure adventure1, ... " + "m2.movieId movieId2,m2.movieTitle movieTitle2,m2.action action2,m2.adventure adventure2, ..." If you invoke show on this dataset and print the result, it would print a lot of columns and their first few values. We will show some of the values next (note that we show values for movie1 on the left-hand side and the corresponding movie2 on the right-hand side): Did you realize by now that on a big data dataset this operation is massive and requires extensive computing resources. Thankfully on Spark you can distribute this operation to run on multiple computer notes. For this, you can tweak the spark-submit job with the following parameters so as to extract the last bit of juice from all the clustered nodes: spark-submit executor-cores <Number of Cores> --num-executor <Number of Executors> The next step is the most important one in this content management program. We will run over the results of the previous query and from each row we will pull out the data for movie1 and movie2 and cross compare the attributes of both the movies and find the Euclidean Distance between them. This would show how similar both the movies are; the greater the distance, the less similar the movies are. As expected, we convert our dataset to an RDD and invoke a map function and within the lambda function for that map we go over all the dataset rows and from each row we extract the data and find the Euclidean Distance. For maintaining conciseness, we depict only a portion of the following code. Also note that we pull the information for both movies and store the calculated Euclidean Distance in a EuclidVO Java POJO object: JavaRDD<EuclidVO> euclidRdd = movieDataDS.javaRDD().map( row -> { EuclidVO evo = new EuclidVO(); evo.setMovieId1(row.getString(0)); evo.setMovieTitle1(row.getString(1)); evo.setMovieTitle2(row.getString(22)); evo.setMovieId2(row.getString(21)); int action = Math.abs(Integer.parseInt(row.getString(2)) – Integer.parseInt(row.getString(23)) ); ... double likesCnt = Math.abs(row.getDouble(20) - row.getDouble(41)); double euclid = Math.sqrt(action * action + ... + likesCnt * likesCnt); evo.setEuclidDist(euclid); return evo; }); As you can see in the bold text in the preceding code, we are calculating the Euclid Distance and storing it in a variable. Next, we convert our RDD to a dataset object and again register it in a temporary view movieEuclids and now are ready to fire queries for our predictions: Dataset<Row> results = spark.createDataFrame(euclidRdd.rdd(), EuclidVO.class); results.createOrReplaceTempView("movieEuclids"); Finally, we are ready to make our predictions using this dataset. Let's see our first prediction; let's find the top 10 movies that are closer to Toy Story, and this movie has movieId of 1. We will fire a simple query on the view movieEuclids to find this: spark.sql("select * from movieEuclids where movieId1 = 1 order by euclidDist asc").show(20); As you can see in the preceding query, we order by Euclidean Distance in ascending order as lesser distance means more similarity. This would print the first 20 rows of the result as follows: As you can see in the preceding results, they are not bad for content recommender system with little to no historical data. As you can see the first few movies returned are also famous animation movies like Toy Story and they are Aladdin, Winnie the Pooh, Pinocchio, and so on. So our little content management system we saw earlier is relatively okay. You can do many more things to make it even better by trying out more properties to compare with and using a different similarity coefficient such as Pearson Coefficient, Jaccard Distance, and so on. From the article, we learned how content recommenders can be built on zero to no historical data. These recommenders are based on the attributes present on the item itself using which we figure out the similarity with other items and recommend them. To know more about data analysis, data visualization and machine learning techniques you can refer to the book Big Data Analytics with Java.

0
0
27156

article-image-implement-neural-network-single-layer-perceptron

Pravin Dhandre

28 Dec 2017

10 min read

How to Implement a Neural Network with Single-Layer Perceptron

Pravin Dhandre

28 Dec 2017

10 min read

0
0
27029

Packt

10 Feb 2017

13 min read

Bitcoin

Packt

10 Feb 2017

13 min read

In this article by Imran Bashir, the author of the book Mastering Blockchain, will see about bitcoin and it's importance in electronic cash system. (For more resources related to this topic, see here.) Bitcoin is the first application of blockchain technology. In this article readers will be introduced to the bitcoin technology in detail. Bitcoin has started a revolution with the introduction of very first fully decentralized digital currency that has been proven to be extremely secure and stable. This has also sparked great interest in academic and industrial research and introduced many new research areas. Since its introduction in 2008 bitcoin has gained much popularity and currently is the most successful digital currency in the world with billions of dollars invested in it. It is built on decades of research in the field of cryptography, digital cash and distributed computing. In the following section brief history is presented in order to provide background required to understand the foundations behind the invention of bitcoin. Digital currencies have always been an active area of research for many decades. Early proposals to create digital cash goes as far back as the early 1980s. In 1982 David Chaum proposed a scheme that used blind signatures to build untraceable digital currency. In this scheme a bank would issue digital money by signing a blinded and random serial number presented to it by the user. The user can then use the digital token signed by the bank as currency. The limitation in this scheme is that the bank has to keep track of all used serial numbers. This is a central system by design and requires to be trusted by the users. Later on in 1990 David Chaum proposed a refined version named ecash that not only used blinded signature but also some private identification data to craft a message that was then sent to the bank. This scheme allowed detection of double spending but did not prevent it. If the same token is used at two different location then the identity of the double spender would be revealed. ecash could only represent fixed amount of money. Adam Back's hashcash introduced in 1997 was originally proposed to thwart the email spam. The idea behind hashcash is to solve a computational puzzle that is easy to verify but is comparatively difficult to compute. The idea is that for a single user and single email extra computational effort it not noticeable but someone sending large number of spam emails would be discouraged as the time and resources required to run the spam campaign will increase substantially. B-money was proposed by Wei Dai in 1998 which introduced the idea of using proof of work to create money. Major weakness in the system was that some adversary with higher computational power could generate unsolicited money without giving the chance to the network to adjust to an appropriate difficulty level. The system was lacking details on the consensus mechanism between nodes and some security issues like Sybil attacks were also not addressed. At the same time Nick Szabo introduced the concept of bit gold which was also based on proof of work mechanism but had same problems as b-money had with one exception that network difficulty level was adjustable. Tomas Sander and Ammon TaShama introduced an ecash scheme in 1999 that for the first time used merkle trees to represent coins and zero knowledge proofs to prove possession of coins. In the scheme a central bank was required who kept record of all used serial numbers. This scheme allowed users to be fully anonymous albeit at some computational cost. RPOW (Reusable Proof of Work) was introduced by Hal Finney in 2004 that used hash cash scheme by Adam Back as a proof of computational resources spent to create the money. This was also a central system that kept a central database to keep track of all used PoW tokens. This was an online system that used remote attestation made possible by trusted computing platform (TPM hardware). All the above mentioned schemes are intelligently designed but were weak from one aspect or another. Especially all these schemes rely on a central server which is required to be trusted by the users. Bitcoin In 2008 bitcoin paper Bitcoin: A Peer-to-Peer Electronic Cash System was written by Satoshi Nakamoto. First key idea introduced in the paper is that it is a purely peer to peer electronic cash that does need an intermediary bank to transfer payments between peers. Bitcoin is built on decades of Cryptographic research like merkle trees, hash functions, public key cryptography and digital signatures. Moreover ideas like bit gold, b-money, hashcash and cryptographic time stamping have provided the foundations for bitcoin invention. All these technologies are cleverly combined in bitcoin to create world's first decentralized currency. Key issue that has been addressed in bitcoin is an elegant solution to Byzantine Generals problem along with a practical solution of double spend problem. Value of bitcoin has increased significantly since 2011 as shown in the graph below: Bitcoin price and volume since 2012 (on logarithmic scale) Regulation of bitcoin is a controversial subject and as much as it is a libertarian's dream law enforcement agencies and governments are proposing various regulations to control it such as bitlicense issued by NewYorks state department of financial services. This is a license issued to businesses which perform activities related to virtual currencies. Growth of bitcoin is also due to so called Network Effect. Also called demand-side economies of scale, it is a concept which basically means that more users who use the network the more valuable it becomes. Over time exponential increase has been seen in bitcoin network growth. Even though the price of bitcoin is quite volatile it has increased significantly over a period of last few years. Currently (at the time of writing) bitcoin price is 815 GBP. Bitcoin definition Bitcoin can be defined in various ways, it's a protocol, a digital currency and a platform. It is a combination of peer to peer network, protocols and software that facilitate the creation and usage of digital currency named bitcoin. Note that Bitcoin with capital B is used to refer to Bitcoin protocol whereas bitcoin with lower case b is used to refer to bitcoin, the currency. Nodes in this peer to peer to network talk to each other using the Bitcoin protocol. Decentralization of currency was made possible for the first time with the invention of Bitcoin. Moreover double spending problem was solved in an elegant and ingenious way in bitcoin. Double spending problem arises when for example a user sends coins to two different users at the same time and they will be verified independently as valid transactions. Keys and addresses Elliptic curve cryptography is used to generate public and private key pairs in the Bitcoin network. The Bitcoin address is created by taking the corresponding public key of a private key and hashing it twice, first with SHA256 algorithm and then with RIPEMD160. The resultant 160-bit hash is then prefixed with a version number and finally encoded with Base58Check encoding scheme. The bitcoin addresses are 26 to 35 characters long and begin with digit 1 or 3. A typical bitcoin address looks like a string shown as follows: 1ANAguGG8bikEv2fYsTBnRUmx7QUcK58wt This is also commonly encoded in a QR code for easy sharing. The QR code of the preceding shown address is as follows: QR code of a bitcoin address 1ANAguGG8bikEv2fYsTBnRUmx7QUcK58wt There are currently two types of addresses, commonly used P2PKH and another P2SH type starting with 1 and 3 respectively. In early days bitcoin used direct Pay-to-Pubkey which is now superseded by P2PKH. However direct Pay-to-Pubkey is still used in bitcoin for coinbase addresses. Addresses should not be used for more than once otherwise privacy and security issues can arise. Avoiding address reuse circumvents anonymity issues to some extent, bitcoin has some other security issues also, such as transaction malleability which requires different approach to resolve. from bitaddress.org private key and bitcoin address in a paper wallet Public keys in bitcoin In Public key cryptography, public keys are generated from private keys. Bitcoin uses ECC based on SECP256K1 standard. A private key is randomly selected and is 256-bit in length. Public keys can be presented in uncompressed or compressed format. Public keys are basically x and y coordinates on an elliptic curve and in uncompressed format are presented with a prefix of 04 in hexadecimal format. X and Y co-ordinates are both 32-bit in length. In total the compressed public key is 33 bytes long as compared to 65 bytes in uncompressed format. Compressed version of public keys basically include only X part, since Y part can be derived from it. The reason why compressed version of public keys works is that bitcoin client initially used uncompressed keys, but starting from bitcoin core client 0.6 compressed keys are used as standard. Keys are identified by various prefixes described as follows: Uncompressed public keys used 0x04 as prefix. Compressed public key starts with 0x03 if the y 32-bit part of the public key is odd. Compressed public key starts with 0x02 if the y 32-bit part of the public key is even. More mathematical description and reason why it works is described later. If the ECC graph is visualized it reveals that the y co-ordinate can be either below the x-axis or above the x-axis and as the curve is symmetric only the location in the prime field is required to be stored. Private keys in bitcoin Private keys are basically 256-bit numbers chosen in the range specified by SECP256K1 ECDSA recommendation. Any randomly chosen 256-bit number from 0x1 to 0xFFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFE BAAE DCE6 AF48 A03B BFD2 5E8C D036 4140 is a valid private key. Private keys are usually encoded using Wallet Import Format (WIF) in order to make them easier to copy and use. WIF can be converted to private key and vice versa. Steps are described later. Also Mini Private Key Format is used sometimes to encode the key in under 30 characters to allow storage where physical space is limited, for example, etching on physical coins or damage resistant QR codes. Bitcoin core client also allows encryption of the wallet which contains the private keys. Bitcoin currency units Bitcoin currency units are described as follows. Smallest bitcoin denomination is Satoshi. Base58Check encoding This encoding is used to limit the confusion between various characters such as 0OIl as they can look same in different fonts. The encoding basically takes the binary byte arrays and converts them into human readable string. This string is composed by utilizing a set of 58 alphanumeric symbols. More explanation and logic can be found in base58.h source file in bitcoin source code. Explanation from bitcoin source code Bitcoin addresses are encoded using Base58check encoding. Vanity addresses As the bitcoin addresses are based on base 58 encoding, it is possible to generate addresses that contain human readable messages. An example is shown as follows: Public address encoded in QR Vanity addresses are generated by using a purely brute force method. An example is shown as follows: Vanity address generated from https://bitcoinvanitygen.com/ Transactions Transactions are at the core of bitcoin ecosystem. Transactions can be as simple as just sending some bitcoins to a bitcoin address or it can be quite complex depending on the requirements. Each transaction is composed of at least one input and output. Inputs can be thought of as coins being spent that have been created in a previous transaction and outputs as coins being created. If a transaction is minting new coins then there is no input and therefore no signature is needed. If a transaction is to send coins to some other user (a bitcoin address), then it needs to be signed by the sender with their private key and also a reference is required to the previous transaction to show the origin of the coins. Coins are in fact unspent transactions outputs represented in Satoshis. Transactions are not encrypted and are publicly visible in the blockchain. Blocks are made up of transactions and these can be viewed by using any online blockchain explorer. Transaction life cycle A user/sender sends a transaction using wallet software or some other interface. Wallet software signs the transaction using the sender's private key. Transaction is broadcasted to the Bitcoin network using a flooding algorithm. Mining nodes include this transaction in the next block to be mined. Mining starts and once a miner who solves the Proof of Work problem broadcasts the newly mined block to the network. Proof of Work is explained in detail later. The nodes verify the block and propagate the block further and confirmation start to generate. Finally the confirmations start to appear in the receiver's wallet and after approximately 6 confirmations the transaction is considered finalized and confirmed. However 6 is just a recommended number , the transaction can be considered final even after first confirmation. The key idea behind waiting for six confirmations is that the probability of double spending virtually eliminates after 6 confirmations. Transaction structure A transaction at a high level contains metadata, inputs and outputs. Transactions are combined together to create a block. The transaction structure is shown in the following table: Field Size Description Version Number 4 bytes Used to specify rules to be used by the miners and nodes for transaction processing. Input counter 1 to 9 bytes Number of inputs included in the transaction. list of inputs variable Each input is composed of several fields including Previous Transaction hash, Previous Txout-index, Txin-script length, Txin-script and optional sequence number. The first transaction in a block is also called coinbase transaction. Specifies on or more transaction inputs. Out-counter 1 to 9 bytes positive integer representing the number of outputs. list of outputs variable Outputs included in the transaction. lock_time 4 bytes It defines the earliest time when a transaction becomes valid. It is either a Unix timestamp or block number. MetaData: This part of the transaction contains some values like size of transaction, number of inputs and outputs, hash of the transaction and a lock_time field. Every transaction has a prefix specifying the version number. Inputs: Generally each input spends a previous output. Each output is considered a UTXO, Unspent transaction output until an input consumes it. Outputs: Outputs have only two fields and it contains instructions for sending bitcoins. First field contains the amount of Satoshis where as second field is a locking script which contains the conditions that needs to be met in order for the output to be spent. More information about transaction spending by using locking and unlocking scripts and producing outputs is discussed later. Verification: Verification is performed by using Bitcoin's scripting language Summary In this article, we learned the importance of bitcoin in digital currency and how bitcoins are encoded using various private keys and encoding techniques. Resources for Article: Further resources on this subject: Bitcoins – Pools and Mining [article] Protecting Your Bitcoins [article] FPGA Mining [article]

0
0
26924

article-image-convolutional-neural-networks-reinforcement-learning

Packt

06 Apr 2017

9 min read

Convolutional Neural Networks with Reinforcement Learning

Packt

06 Apr 2017

9 min read

In this article by Antonio Gulli, Sujit Pal, the authors of the book Deep Learning with Keras, we will learn about reinforcement learning, or more specifically deep reinforcement learning, that is, the application of deep neural networks to reinforcement learning. We will also see how convolutional neural networks leverage spatial information and they are therefore very well suited for classifying images. (For more resources related to this topic, see here.) Deep convolutional neural network A Deep Convolutional Neural Network (DCCN) consists of many neural network layers. Two different types of layers, convolutional and pooling, are typically alternated. The depth of each filter increases from left to right in the network. The last stage is typically made of one or more fully connected layers as shown here: There are three key intuitions beyond ConvNets: Local receptive fields Shared weights Pooling Let's review them together. Local receptive fields If we want to preserve the spatial information, then it is convenient to represent each image with a matrix of pixels. Then, a simple way to encode the local structure is to connect a submatrix of adjacent input neurons into one single hidden neuron belonging to the next layer. That single hidden neuron represents one local receptive field. Note that this operation is named convolution and it gives the name to this type of networks. Of course we can encode more information by having overlapping submatrices. For instance let's suppose that the size of each single submatrix is 5 x 5 and that those submatrices are used with MNIST images of 28 x 28 pixels, then we will be able to generate 23 x 23 local receptive field neurons in the next hidden layer. In fact it is possible to slide the submatrices by only 23 positions before touching the borders of the images. In Keras, the size of each single submatrix is called stride-length and this is an hyper-parameter which can be fine-tuned during the construction of our nets. Let's define the feature map from one layer to another layer. Of course we can have multiple feature maps which learn independently from each hidden layer. For instance we can start with 28 x 28 input neurons for processing MINST images, and then recall k feature maps of size 23 x 23 neurons each (again with stride of 5 x 5) in the next hidden layer. Shared weights and bias Let's suppose that we want to move away from the pixel representation in a row by gaining the ability of detecting the same feature independently from the location where it is placed in the input image. A simple intuition is to use the same set of weights and bias for all the neurons in the hidden layers. In this way each layer will learn a set of position-independent latent features derived from the image. Assuming that the input image has shape (256, 256) on 3 channels with tf (Tensorflow) ordering, this is represented as (256, 256, 3). Note that with th (Theano) mode the channels dimension (the depth) is at index 1, in tf mode is it at index 3. In Keras if we want to add a convolutional layer with dimensionality of the output 32 and extension of each filter 3 x 3 we will write: model = Sequential() model.add(Convolution2D(32, 3, 3, input_shape=(256, 256, 3)) This means that we are applying a 3 x 3 convolution on 256 x 256 image with 3 input channels (or input filters) resulting in 32 output channels (or output filters). An example of convolution is provided in the following diagram: Pooling layers Let's suppose that we want to summarize the output of a feature map. Again, we can use the spatial contiguity of the output produced from a single feature map and aggregate the values of a submatrix into one single output value synthetically describing the meaning associated with that physical region. Max pooling One easy and common choice is the so-called max pooling operator which simply outputs the maximum activation as observed in the region. In Keras, if we want to define a max pooling layer of size 2 x 2 we will write: model.add(MaxPooling2D(pool_size = (2, 2))) An example of max pooling operation is given in the following diagram: Average pooling Another choice is the average pooling which simply aggregates a region into the average values of the activations observed in that region. Keras implements a large number of pooling layers and a complete list is available online. In short, all the pooling operations are nothing more than a summary operation on a given region. Reinforcement learning Our objective is to build a neural network to play the game of catch. Each game starts with a ball being dropped from a random position from the top of the screen. The objective is to move a paddle at the bottom of the screen using the left and right arrow keys to catch the ball by the time it reaches the bottom. As games go, this is quite simple. At any point in time, the state of this game is given by the (x, y) coordinates of the ball and paddle. Most arcade games tend to have many more moving parts, so a general solution is to provide the entire current game screen image as the state. The following diagram shows four consecutive screenshots of our catch game: Astute readers might note that our problem could be modeled as a classification problem, where the input to the network are the game screen images and the output is one of three actions - move left, stay, or move right. However, this would require us to provide the network with training examples, possibly from recordings of games played by experts. An alternative and simpler approach might be to build a network and have it play the game repeatedly, giving it feedback based on whether it succeeds in catching the ball or not. This approach is also more intuitive and is closer to the way humans and animals learn. The most common way to represent such a problem is through a Markov Decision Process (MDP). Our game is the environment within which the agent is trying to learn. The state of the environment at time step t is given by st (and contains the location of the ball and paddle). The agent can perform certain actions at (such as moving the paddle left or right). These actions can sometimes result in a reward rt, which can be positive or negative (such as an increase or decrease in the score). Actions change the environment and can lead to a new state st+1, where the agent can perform another action at+1, and so on. The set of states, actions and rewards, together with the rules for transitioning from one state to the other, make up a Markov decision process. A single game is one episode of this process, and is represented by a finite sequence of states, actions and rewards: Since this is a Markov decision process, the probability of state st+1 depends only on current state st and action at. Maximizing future rewards As an agent, our objective is to maximize the total reward from each game. The total reward can be represented as follows: In order to maximize the total reward, the agent should try to maximize the total reward from any time point t in the game. The total reward at time step t is given by Rt and is represented as: However, it is harder to predict the value of the rewards the further we go into the future. In order to take this into consideration, our agent should try to maximize the total discounted future reward at time t instead. This is done by discounting the reward at each future time step by a factor γ over the previous time step. If γ is 0, then our network does not consider future rewards at all, and if γ is 1, then our network is completely deterministic. A good value for γ is around 0.9. Factoring the equation allows us to express the total discounted future reward at a given time step recursively as the sum of the current reward and the total discounted future reward at the next time step: Q-learning Deep reinforcement learning utilizes a model-free reinforcement learning technique called Q-learning. Q-learning can be used to find an optimal action for any given state in a finite Markov decision process. Q-learning tries to maximize the value of the Q-function which represents the maximum discounted future reward when we perform action a in state s: Once we know the Q-function, the optimal action a at a state s is the one with the highest Q-value. We can then define a policy π(s) that gives us the optimal action at any state: We can define the Q-function for a transition point (st, at, rt, st+1) in terms of the Q-function at the next point (st+1, at+1, rt+1, st+2) similar to how we did with the total discounted future reward. This equation is known as the Bellmann equation. The Q-function can be approximated using the Bellman equation. You can think of the Q-function as a lookup table (called a Q-table) where the states (denoted by s) are rows and actions (denoted by a) are columns, and the elements (denoted by Q(s, a)) are the rewards that you get if you are in the state given by the row and take the action given by the column. The best action to take at any state is the one with the highest reward. We start by randomly initializing the Q-table, then carry out random actions and observe the rewards to update the Q-table iteratively according to the following algorithm: initialize Q-table Q observe initial state s repeat select and carry out action a observe reward r and move to new state s' Q(s, a) = Q(s, a) + α(r + γ maxa' Q(s', a') - Q(s, a)) s = s' until game over You will realize that the algorithm is basically doing stochastic gradient descent on the Bellman equation, backpropagating the reward through the state space (or episode) and averaging over many trials (or epochs). Here α is the learning rate that determines how much of the difference between the previous Q-value and the discounted new maximum Q-value should be incorporated. Summary We have seen the application of deep neural networks, reinforcement learning. We have also seen convolutional neural networks and how they are well suited for classifying images. Resources for Article: Further resources on this subject: Deep learning and regression analysis [article] Training neural networks efficiently using Keras [article] Implementing Artificial Neural Networks with TensorFlow [article]

0
0
26890

article-image-google-ai-releases-cirq-and-open-fermion-cirq-to-boost-quantum-computation

Savia Lobo

19 Jul 2018

3 min read

Google AI releases Cirq and Open Fermion-Cirq to boost Quantum computation

Savia Lobo

19 Jul 2018

3 min read

Google AI Quantum team announced two releases at the First International Workshop on Quantum Software and Quantum Machine Learning(QSML) yesterday. Firstly the public alpha release of Cirq, an open source framework for NISQ computers. The second release is OpenFermion-Cirq, an example of a Cirq-based application enabling near-term algorithms. Noisy Intermediate Scale Quantum (NISQ) computers are devices including ~50 - 100 qubits and high fidelity quantum gates enhance the quantum algorithms such that they can understand the power that these machines uphold. However, quantum algorithms for the quantum computers have their limitations such as A poor mapping between the algorithms and the machines Also, some quantum processors have complex geometric constraints These and other nuances inevitably lead to wasted resources and faulty computations. Cirq comes as a great help for researchers here. It is focussed on near-term questions, which help researchers to understand whether NISQ quantum computers are capable of solving computational problems of practical importance. It is licensed under Apache 2 and is free to be either embedded or modified within any commercial or open source package. Cirq highlights With Cirq, researchers can write quantum algorithms for specific quantum processors. It provides a fine-tuned user control over quantum circuits by, specifying gate behavior using native gates, placing these gates appropriately on the device, and scheduling the timing of these gates within the constraints of the quantum hardware. Other features of Cirq include: Allows users to leverage the most out of NISQ architectures with optimized data structures to write and compile the quantum circuits. Supports running of the algorithms locally on a simulator Designed to easily integrate with future quantum hardware or larger simulators via the cloud. OpenFermion-Cirq highlights Google AI Quantum team also released OpenFermion-Cirq, which is an example of a CIrq-based application that enables the near-term algorithms. OpenFermion is a platform for developing quantum algorithms for chemistry problems. OpenFermion-Cirq extends the functionality of OpenFermion by providing routines and tools for using Cirq for compiling and composing circuits for quantum simulation algorithms. An instance of the OpenFermion-Cirq is, it can be used to easily build quantum variational algorithms for simulating properties of molecules and complex materials. While building Cirq, the Google AI Quantum team worked with early testers to gain feedback and insight into algorithm design for NISQ computers. Following are some instances of Cirq work resulting from the early adopters: Zapata Computing: simulation of a quantum autoencoder (example code, video tutorial) QC Ware: QAOA implementation and integration into QC Ware’s AQUA platform (example code, video tutorial) Quantum Benchmark: integration of True-Q software tools for assessing and extending hardware capabilities (video tutorial) Heisenberg Quantum Simulations: simulating the Anderson Model Cambridge Quantum Computing: integration of proprietary quantum compiler t|ket> (video tutorial) NASA: architecture-aware compiler based on temporal-planning for QAOA (slides) and simulator of quantum computers (slides) The team also announced that it is using Cirq to create circuits that run on Google’s Bristlecone processor. Their future plans include making the Bristlecone processor available in cloud with Cirq as the interface for users to write programs for this processor. To know more about both the releases, check out the GitHub repositories of each Cirq and OpenFermion-Cirq. Q# 101: Getting to know the basics of Microsoft’s new quantum computing language Google Bristlecone: A New Quantum processor by Google’s Quantum AI lab Quantum A.I. : An intelligent mix of Quantum+A.I.

0
0
26786

article-image-how-to-create-multithreaded-applications-in-qt

Gebin George

02 May 2018

10 min read

How to create multithreaded applications in Qt

Gebin George

02 May 2018

10 min read

In today’s tutorial, we will learn how to use QThread and its affiliate classes to create multithreaded applications. We will go through this by creating an example project, which processes and displays the input and output frames from a video source using a separate thread. This helps leave the GUI thread (main thread) free and responsive while more intensive processes are handled with the second thread. As it was mentioned earlier, we will focus mostly on the use cases common to computer vision and GUI development; however, the same (or a very similar) approach can be applied to any multithreading problem. [box type="note" align="" class="" width=""]This article is an excerpt from the book, Computer Vision with OpenCV 3 and Qt5 written by Amin Ahmadi Tazehkandi. This book will help you blend the power of Qt with OpenCV to build cross-platform computer vision applications.[/box] We will use this example project to implement multithreading using two different approaches available in Qt for working with QThread classes. First, subclassing and overriding the run method, and second, using the moveToThread function available in all Qt objects, or, in other words, QObject subclasses. Subclassing QThread Let's start by creating an example Qt Widgets application in the Qt Creator named MultithreadedCV. To start with, add an OpenCV framework to this project: win32: { include("c:/dev/opencv/opencv.pri") } unix: !macx{ CONFIG += link_pkgconfig PKGCONFIG += opencv } unix: macx{ INCLUDEPATH += /usr/local/include LIBS += -L"/usr/local/lib" -lopencv_world } Then, add two label widgets to your mainwindow.ui file, shown as follows. We will use these labels to display the original and processed video from the default webcam on the computer: Make sure to set the objectName property of the label on the left to inVideo and the one on the right to outVideo. Also, set their alignment/Horizontal property to AlignHCenter. Now, create a new class called VideoProcessorThread by right-clicking on the project PRO file and selecting Add New from the menu. Then, choose C++ Class and make sure the combo boxes and checkboxes in the new class wizard look like the following screenshot: After your class is created, you'll have two new files in your project called videoprocessorthread.h and videoprocessor.cpp, in which you'll implement a video processor that works in a thread separate from the mainwindow files and GUI threads. First, make sure that this class inherits QThread by adding the relevant include line and class inheritance, as seen here (just replace QObject with QThread in the header file). Also, make sure you include OpenCV headers: #include <QThread> #include "opencv2/opencv.hpp" class VideoProcessorThread : public QThread You need to similarly update the videoprocessor.cpp file so that it calls the correct constructor: VideoProcessorThread::VideoProcessorThread(QObject *parent) : QThread(parent) Now, we need to add some required declarations to the videoprocessor.h file. Add the following line to the private members area of your class: void run() override; And then, add the following to the signals section: void inDisplay(QPixmap pixmap); void outDisplay(QPixmap pixmap); And finally, add the following code block to the videoprocessorthread.cpp file: void VideoProcessorThread::run() { using namespace cv; VideoCapture camera(0); Mat inFrame, outFrame; while(camera.isOpened() && !isInterruptionRequested()) { camera >> inFrame; if(inFrame.empty()) continue; bitwise_not(inFrame, outFrame); emit inDisplay( QPixmap::fromImage( QImage( inFrame.data, inFrame.cols, inFrame.rows, inFrame.step, QImage::Format_RGB888) .rgbSwapped())); emit outDisplay( QPixmap::fromImage( QImage( outFrame.data, outFrame.cols, outFrame.rows, Multithreading Chapter 8 [ 309 ] outFrame.step, QImage::Format_RGB888) .rgbSwapped())); } } The run function is overridden and it's implemented to do the required video processing task. If you try to do the same inside the mainwindow.cpp code in a loop, you'll notice that your program becomes unresponsive, and, eventually, you have to terminate it. However, with this approach, the same code is now in a separate thread. You just need to make sure you start this thread by calling the start function, not run! Note that the run function is meant to be called internally, so you only need to reimplement it as seen in this example; however, to control the thread and its execution behavior, you need to use the following functions: start: This can be used to start a thread if it is not already started. This function starts the execution by calling the run function we implemented. You can pass one of the following values to the start function to control the priority of the thread: QThread::IdlePriority (this is scheduled when no other thread is running) QThread::LowestPriority QThread::LowPriority QThread::NormalPriority QThread::HighPriority QThread::HighestPriority QThread::TimeCriticalPriority (this is scheduled as much as possible) QThread::InheritPriority (this is the default value, which simply inherits priority from the parent) terminate: This function, which should only be used in extreme cases (means never, hopefully), forces a thread to terminate. setTerminationEnabled: This can be used to allow or disallow the terminate Function. wait: This function can be used to block a thread (force waiting) until the thread is finished or the timeout value (in milliseconds) is reached. requestInterruption and isRequestInterrupted: These functions can be used to set and get the interruption request status. Using these functions is a useful approach to make sure the thread is stopped safely in the middle of a process that can go on forever. isRunning and isFinished: These functions can be used to request the execution status of the thread. Apart from the functions we mentioned here, QThread contains other functions useful for dealing with multithreading, such as quit, exit, idealThreadCount, and so on. It is a good idea to check them out for yourself and think about use cases for each one of them. QThread is a powerful class that can help maximize the efficiency of your applications. Let's continue with our example. In the run function, we used an OpenCV VideoCapture class to read the video frames (forever) and apply a simple bitwise_not operator to the Mat frame (we can do any other image processing at this point, so bitwise_not is only an example and a fairly simple one to explain our point), then converted that to QPixmap via QImage, and then sent the original and modified frames using two signals. Notice that in our loop that will go on forever, we will always check if the camera is still open and also check if there is an interruption request to this thread. Now, let's use our thread in MainWindow. Start by including its header file in the mainwindow.h file: #include "videoprocessorthread.h" Then, add the following line to the private members' section of MainWindow, in the mainwindow.h file: VideoProcessorThread processor; Now, add the following code to the MainWindow constructor, right after the setupUi line: connect(&processor, SIGNAL(inDisplay(QPixmap)), ui->inVideo, SLOT(setPixmap(QPixmap))); connect(&processor, SIGNAL(outDisplay(QPixmap)), ui->outVideo, SLOT(setPixmap(QPixmap))); processor.start(); And then add the following lines to the MainWindow destructor, right before the delete ui; line: processor.requestInterruption(); processor.wait(); We simply connected the two signals from the VideoProcessorThread class to two labels we had added to the MainWindow GUI, and then started the thread as soon as the program started. We also request the thread to stop as soon as MainWindow is closed and right before the GUI is deleted. The wait function call makes sure to wait for the thread to clean up and safely finish executing before continuing with the delete instruction. Try running this code to check it for yourself. You should see something similar to the following image as soon as the program starts: The video from your default webcam on the computer should start as soon as the program starts, and it will be stopped as soon as you close the program. Try extending the VideoProcessorThread class by passing a camera index number or a video file path into it. You can instantiate as many VideoProcessorThread classes as you want. You just need to make sure to connect the signals to correct widgets on the GUI, and this way you can have multiple videos or cameras processed and displayed dynamically at runtime. Using the moveToThread function As we mentioned earlier, you can also use the moveToThread function of any QObject subclass to make sure it is running in a separate thread. To see exactly how this works, let's repeat the same example by creating exactly the same GUI, and then creating a new C++ class (same as before), but, this time, naming it VideoProcessor. This time though, after the class is created, you don't need to inherit it from QThread, leave it as QObject (as it is). Just add the following members to the videoprocessor.h file: signals: void inDisplay(QPixmap pixmap); void outDisplay(QPixmap pixmap); public slots: void startVideo(); void stopVideo(); private: bool stopped; The signals are exactly the same as before. The stopped is a flag we'll use to help us stop the video so that it does not go on forever. The startVideo and stopVideo are functions we'll use to start and stop the processing of the video from the default webcam. Now, we can switch to the videoprocessor.cpp file and add the following code blocks. Quite similar to what we had before, with the obvious difference that we don't need to implement the run function since it's not a QThread subclass, and we have named our functions as we like: void VideoProcessor::startVideo() { using namespace cv; VideoCapture camera(0); Mat inFrame, outFrame; stopped = false; while(camera.isOpened() && !stopped) { camera >> inFrame; if(inFrame.empty()) continue; bitwise_not(inFrame, outFrame); emit inDisplay( QPixmap::fromImage( QImage( inFrame.data, inFrame.cols, inFrame.rows, inFrame.step, QImage::Format_RGB888) .rgbSwapped())); emit outDisplay( QPixmap::fromImage( QImage( outFrame.data, outFrame.cols, outFrame.rows, outFrame.step, QImage::Format_RGB888) .rgbSwapped())); } } void VideoProcessor::stopVideo() { stopped = true; } Now we can use it in our MainWindow class. Make sure to add the include file for the VideoProcessor class, and then add the following to the private members' section of MainWindow: VideoProcessor *processor; Now, add the following piece of code to the MainWindow constructor in the mainwindow.cpp file: processor = new VideoProcessor(); processor->moveToThread(new QThread(this)); connect(processor->thread(), SIGNAL(started()), processor, SLOT(startVideo())); connect(processor->thread(), SIGNAL(finished()), processor, SLOT(deleteLater())); connect(processor, SIGNAL(inDisplay(QPixmap)), ui->inVideo, SLOT(setPixmap(QPixmap))); connect(processor, SIGNAL(outDisplay(QPixmap)), ui->outVideo, SLOT(setPixmap(QPixmap))); processor->thread()->start(); In the preceding code snippet, first, we created an instance of VideoProcessor. Note that we didn't assign any parents in the constructor, and we also made sure to define it as a pointer. This is extremely important when we intend to use the moveToThread function. An object that has a parent cannot be moved into a new thread. The second highly important lesson in this code snippet is the fact that we should not directly call the startVideo function of VideoProcessor, and it should only be invoked by connecting a proper signal to it. In this case, we have used its own thread's started signal; however, you can use any other signal that has the same signature. The rest is all about connections. In the MainWindow destructor function, add the following lines: processor->stopVideo(); processor->thread()->quit(); processor->thread()->wait(); If you found our post useful, do check out this book Computer Vision with OpenCV 3 and Qt5 which will help you understand how to build a multithreaded computer vision applications with the help of OpenCV 3 and Qt5. Top 10 Tools for Computer Vision OpenCV Primer: What can you do with Computer Vision and how to get started? Image filtering techniques in OpenCV

0
0
26553

article-image-what-can-you-expect-at-neurips-2019

Sugandha Lahoti

06 Sep 2019

5 min read

What can you expect at NeurIPS 2019?

Sugandha Lahoti

06 Sep 2019

5 min read

0
0
26494

article-image-implement-memory-oltp-sql-server-linux

Fatema Patrawala

17 Feb 2018

11 min read

How to implement In-Memory OLTP on SQL Server in Linux

Fatema Patrawala

17 Feb 2018

11 min read

0
0
26486

article-image-highest-paying-data-science-jobs-2017

Amey Varangaonkar

27 Nov 2017

10 min read

Highest Paying Data Science Jobs in 2017

Amey Varangaonkar

27 Nov 2017

10 min read

It is no secret that this is the age of data. More data has been created in the last 2 years than ever before. Within the dumps of data created every second, businesses are looking for useful, action worthy insights which they can use to enhance their processes and thereby increase their revenue and profitability. As a result, the demand for data professionals, who sift through terabytes of data for accurate analysis and extract valuable business insights from it, is now higher than ever before. Think of Data Science as a large tree from which all things related to data branch out - from plain old data management and analysis to Big Data, and more. Even the recently booming trends in Artificial Intelligence such as machine learning and deep learning are applied in many ways within data science. Data science continues to be a lucrative and growing job market in the recent years, as evidenced by the graph below: Source: Indeed.com In this article, we look at some of the high paying, high trending job roles in the data science domain that you should definitely look out for if you’re considering data science as a serious career opportunity. Let’s get started with the obvious and the most popular role. Data Scientist Dubbed as the sexiest job of the 21st century, data scientists utilize their knowledge of statistics and programming to turn raw data into actionable insights. From identifying the right dataset to cleaning and readying the data for analysis, to gleaning insights from said analysis, data scientists communicate the results of their findings to the decision makers. They also act as advisors to executives and managers by explaining how the data affects a particular product or process within the business so that appropriate actions can be taken by them. Per Salary.com, the median annual salary for the role of a data scientist today is $122,258, with a range between $106,529 to $137,037. The salary is also accompanied by a whole host of benefits and perks which vary from one organization to the other, making this job one of the best and the most in-demand, in the job market today. This is a clear testament to the fact that an increasing number of businesses are now taking the value of data seriously, and want the best talent to help them extract that value. There are over 20,000 jobs listed for the role of data scientist, and the demand is only growing. Source: Indeed.com To become a data scientist, you require a bachelor’s or a master’s degree in mathematics or statistics and work experience of more than 5 years in a related field. You will need to possess a unique combination of technical and analytical skills to understand the problem statement and propose the best solution, good programming skills to develop effective data models, and visualization skills to communicate your findings with the decision makers. Interesting in becoming a data scientist? Here are some resources to help you get started: Principles of Data Science Data Science Algorithms in a Week Getting Started with R for Data Science [Video] For a more comprehensive learning experience, check out our skill plan for becoming a data scientist on Mapt, our premium skills development platform. Data Analyst Probably a term you are quite familiar with, Data Analysts are responsible for crunching large amounts of data and analyze it to come to appropriate logical conclusions. Whether it’s related to pure research or working with domain-specific data, a data analyst’s job is to help the decision-makers’ job easier by giving them useful insights. Effective data management, analyzing data, and reporting results are some of the common tasks associated with this role. How is this role different than a data scientist, you might ask. While data scientists specialize in maths, statistics and predictive analytics for better decision making, data analysts specialize in the tools and components of data architecture for better analysis. Per Salary.com, the median annual salary for an entry-level data analyst is $55,804, and the range usually falls between $50,063 to $63,364 excluding bonuses and benefits. For more experienced data analysts, this figure rises to around a mean annual salary of $88,532. With over 83,000 jobs listed on Indeed.com, this is one of the most popular job roles in the data science community today. This profile requires a pretty low starting point, and is justified by the low starting salary packages. As you gain more experience, you can move up the ladder and look at becoming a data scientist or a data engineer. Source: Indeed.com You may also come across terms such as business data analyst or simply business analyst which are sometimes interchangeably used with the role of a data analyst. While their primary responsibilities are centered around data crunching, business analysts model company infrastructure, while data analysts model business data structures. You can find more information related to the differences in this interesting article. If becoming a data analyst is something that interests you, here are some very good starting points: Data Analysis with R Python Data Analysis, Second Edition Learning Python Data Analysis [Video] Data Architect Data architects are responsible for creating a solid data management blueprint for an organization. They are primarily responsible for designing the data architecture and defining how data is stored, consumed and managed by different applications and departments within the organization. Because of these critical responsibilities, a data architect’s job is a very well-paid one. Per Salary.com, the median annual salary for an entry-level data architect is $74,809, with a range between $57,964 to $91,685. For senior-level data architects, the median annual salary rises up to $136,856, with a range usually between $121,969 to $159,212. These high figures are justified by the critical nature of the role of a data architect - planning and designing the right data infrastructure after understanding the business considerations to get the most value out of the data. At present, there are over 23,000 jobs for the role listed on Indeed.com, with a stable trend in job seeker interest, as shown: Source: Indeed.com To become a data architect, you need a bachelor’s degree in computer science, mathematics, statistics or a related field, and loads of real-world skills to qualify for even the entry-level positions. Technical skills such as statistical modeling, knowledge of languages such as Python and R, database architectures, Hadoop-based skills, knowledge of NoSQL databases, and some machine learning and data mining are required to become a data architect. You also need strong collaborative skills, problem-solving, creativity and the ability to think on your feet, to solve the trickiest of problems on the go. Suffice to say it’s not an easy job, but it is definitely a lucrative one! Get ahead of the curve, and start your journey to becoming a data architect now: Big Data Analytics Hadoop Blueprints PostgreSQL for Data Architects Data Engineer Data engineers or Big Data engineers are a crucial part of the organizational workforce and work in tandem with data architects and data scientists to ensure appropriate data management systems are deployed and the right kind of data is being used for analysis. They deal with messy, unstructured Big Data and strive to provide clean, usable data to the other teams within the organization. They build high-performance analytics pipelines and develop set of processes for efficient data mining. In many companies, the role of a data engineer is closely associated with that of a data architect. While an architect is responsible for the planning and designing stages of the data infrastructure project, a data engineer looks after the construction, testing and maintenance of the infrastructure. As such data engineers tend to have a more in-depth understanding of different data tools and languages than data architects. There are over 90,000 jobs listed on Indeed.com, suggesting there is a very high demand in the organizations for this kind of a role. An entry level data engineer has a median annual salary of $90,083 per Payscale.com, with a range of $60,857 to $131,851. For Senior Data Engineers, the average salary shoots up to $123,749 as per Glassdoor estimates. Source: Indeed.com With the unimaginable rise in the sheer volume of data, the onus is on the data engineers to build the right systems that empower the data analysts and data scientists to sift through the messy data and derive actionable insights from it. If becoming a data engineer is something that interests you, here are some of our products you might want to look at: Real-Time Big Data Analytics Apache Spark 2.x Cookbook Big Data Visualization You can also check out our detailed skill plan on becoming a Big Data Engineer on Mapt. Chief Data Officer There is a countless number of organizations that build their businesses on data, but don’t manage it that well. This is where a senior executive popularly known as the Chief Data Officer (CDO) comes into play - bearing the responsibility for implementing the organization’s data and information governance and assisting with data-driven business strategies. They are primarily responsible for ensuring that their organization gets the most value out of their data and put appropriate plans in place for effective data quality and its life-cycle management. The role of a CDO is one of the most lucrative and highest paying jobs in the data science frontier. An average median annual pay for a CDO per Payscale.com is around $192,812. Indeed.com lists just over 8000 job postings too - this is not a very large number, but understandable considering the recent emergence of the role and because it’s a high profile, C-suite job. Source: Indeed.com According to a Gartner research, almost 50% companies in a variety of regulated industries will have a CDO in place, by 2017. Considering the demand for the role and the fact that it is only going to rise in the future, the role of a CDO is one worth vying for. To become a CDO, you will obviously need a solid understanding of statistical, mathematical and analytical concepts. Not just that, extensive and high-value experience in managing technical teams and information management solutions is also a prerequisite. Along with a thorough understanding of the various Big Data tools and technologies, you will need strong communication skills and deep understanding of the business. If you’re planning to know more about how you can become a Chief Data Officer, you can browse through our piece on the role of CDO. Why demand for data science professionals will rise It’s hard to imagine an organization which doesn’t have to deal with data, but it’s harder to imagine the state of an organization with petabytes of data and not knowing what to do with it. With the vast amounts of data, organizations deal with these days, the need for experts who know how to handle the data and derive relevant and timely insights from it is higher than ever. In fact, IBM predicts there’s going to be a severe shortage of data science professionals, and thereby, a tremendous growth in terms of job offers and advertised openings, by 2020. Not everyone is equipped with the technical skills and know-how associated with tasks such as data mining, machine learning and more. This is slowly creating a massive void in terms of talent that organizations are looking to fill quickly, by offering lucrative salaries and added benefits. Without the professional expertise to turn data into actionable insights, Big Data becomes all but useless.

0
0
26415

How-To Tutorials - Data

iReport in NetBeans

Putting Your Database at the Heart of Azure Solutions

We must change how we think about AI, urge AI founding fathers

Build a custom Admin Home page in Salesforce CRM Lightning Experience

Bad Metadata can get you in legal hot water

Data Exploration using Spark SQL

How to build a cold-start friendly content-based recommender using Apache Spark SQL

How to Implement a Neural Network with Single-Layer Perceptron

Bitcoin

Convolutional Neural Networks with Reinforcement Learning

Trending Topics

Google AI releases Cirq and Open Fermion-Cirq to boost Quantum computation

How to create multithreaded applications in Qt

What can you expect at NeurIPS 2019?

How to implement In-Memory OLTP on SQL Server in Linux

Highest Paying Data Science Jobs in 2017

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access