
Tech Guides - Big Data


15 ways to make Blockchains scalable, secure and safe!

Aaron Lazar
27 Dec 2017
14 min read
Note: This article is a book extract from Mastering Blockchain, written by Imran Bashir. In the book, he discusses distributed ledgers, decentralization, and smart contracts for implementation in the real world.

Today we will look at various challenges that need to be addressed before blockchain becomes a mainstream technology. Even though various use cases and proof-of-concept systems have been developed, and the technology works well for most scenarios, some fundamental limitations continue to act as barriers to mainstream adoption of blockchains in key industries. We'll start by listing the main challenges faced while working with blockchains and propose ways to overcome each of them.

Scalability

This problem has been the focus of intense debate, rigorous research, and media attention for the last few years. It is the single most important problem, one that could mean the difference between wider adoption of blockchains and limited private use by consortiums only. As a result of substantial research in this area, many solutions have been proposed, which are discussed here:

1. Block size increase: This is the most debated proposal for increasing blockchain performance (transaction processing throughput). Currently, bitcoin can process only about three to seven transactions per second, which is a major inhibiting factor in adapting the bitcoin blockchain for processing micro-transactions. Block size in bitcoin is hardcoded to 1 MB; if the block size is increased, each block can hold more transactions, resulting in higher throughput. Several Bitcoin Improvement Proposals (BIPs) have been made in favor of a block size increase, including BIP 100, BIP 101, BIP 102, BIP 103, and BIP 109. In Ethereum, the block size is not limited by hardcoding; instead, it is controlled by the gas limit.
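The roughly three-to-seven transactions per second figure follows directly from block size, average transaction size, and block interval. A back-of-the-envelope sketch (all figures are illustrative approximations):

```python
# Back-of-the-envelope throughput arithmetic for the block-size trade-off
# described above. All figures are illustrative approximations.

AVG_TX_SIZE_BYTES = 250          # rough average size of a bitcoin transaction
BLOCK_SIZE_BYTES = 1_000_000     # bitcoin's hardcoded 1 MB block size
BLOCK_INTERVAL_SECONDS = 600     # bitcoin's ~10 minute target interval

def throughput_tps(block_size, tx_size, interval):
    """Transactions per second = transactions per block / seconds per block."""
    return (block_size / tx_size) / interval

btc_tps = throughput_tps(BLOCK_SIZE_BYTES, AVG_TX_SIZE_BYTES, BLOCK_INTERVAL_SECONDS)
print(f"bitcoin-like chain: ~{btc_tps:.1f} tx/s")   # ~6.7 tx/s

# Doubling the block size or halving the interval both double throughput,
# but each has side effects (propagation delay, more orphaned blocks).
print(f"2 MB blocks:  ~{throughput_tps(2 * BLOCK_SIZE_BYTES, AVG_TX_SIZE_BYTES, BLOCK_INTERVAL_SECONDS):.1f} tx/s")
print(f"5 min blocks: ~{throughput_tps(BLOCK_SIZE_BYTES, AVG_TX_SIZE_BYTES, 300):.1f} tx/s")
```

The sketch also shows why interval reduction and block size increase are two sides of the same trade-off: both raise transactions per second at the cost of more orphaned blocks or slower propagation.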
In theory, there is no limit on the size of a block in Ethereum, because it depends on the amount of gas, which can increase over time. This is possible because miners are allowed to increase the gas limit for subsequent blocks if the limit was reached in the previous block.

2. Block interval reduction: Another proposal is to reduce the time between block generations. The time between blocks can be decreased to achieve faster finalization of blocks, but this may result in less security due to an increased number of forks. Ethereum has achieved a block time of approximately 14 seconds, although at times it can increase. This is a significant improvement over the bitcoin blockchain, which takes 10 minutes to generate a new block. In Ethereum, the issue of a high number of orphaned blocks resulting from shorter times between blocks is mitigated by the Greedy Heaviest Observed Subtree (GHOST) protocol, whereby orphaned blocks (uncles) are also included in determining the valid chain. Once Ethereum moves to Proof of Stake, this will become irrelevant, as no mining will be required and almost immediate finality of transactions can be achieved.

3. Invertible Bloom lookup tables: This is another approach proposed to reduce the amount of data that needs to be transferred between bitcoin nodes. Invertible Bloom lookup tables (IBLTs) were originally proposed by Gavin Andresen, and the key attraction of this approach is that it does not result in a hard fork of bitcoin if implemented. The key idea is that there is no need to transfer all transactions between nodes; instead, only those that are not already available in the transaction pool of the syncing node are transferred. This allows quicker transaction pool synchronization between nodes, thus increasing the overall scalability and speed of the bitcoin network.

4. Sharding: Sharding is not a new technique; it has long been used for scalability in distributed databases such as MongoDB and MySQL.
The key idea behind sharding is to split tasks into multiple chunks that are then processed by multiple nodes. This results in improved throughput and reduced storage requirements. In blockchains, a similar scheme is employed whereby the state of the network is partitioned into multiple shards. The state usually includes balances, code, nonces, and storage. Shards are loosely coupled partitions of a blockchain that run on the same network. There are a few open challenges related to inter-shard communication and consensus on the history of each shard; this remains an active area of research.

5. State channels: This is another approach proposed for speeding up transactions on a blockchain network. The basic idea is to use side channels for state updating and to process transactions off the main chain; once the state is finalized, it is written back to the main chain, thus offloading time-consuming operations from the main blockchain. State channels work in three steps. First, a part of the blockchain state is locked under a smart contract, which encodes the agreement and business logic between the participants. Second, off-chain transaction processing begins between the participants, who update the state only among themselves for now; in this step, almost any number of transactions can be performed without touching the blockchain, which is what makes the process fast and a strong candidate for solving blockchain scalability issues. It could be argued that this is not a true on-blockchain solution like, for example, sharding, but the end result is a faster, lighter, and more robust network, which can prove very useful in micropayment networks, IoT networks, and many other applications. Finally, once the final state is achieved, the state channel is closed and the final state is written back to the main blockchain. At this stage, the locked part of the blockchain is also unlocked.
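The three-step channel lifecycle above can be sketched as a toy simulation (class and method names are purely illustrative; a real channel involves smart contracts, signatures, and dispute periods):

```python
# Toy state-channel lifecycle: lock funds on-chain, update state off-chain,
# then settle the final state back on-chain. Names are illustrative only.

class MainChain:
    def __init__(self):
        self.balances = {"alice": 100, "bob": 100}
        self.locked = {}
        self.tx_count = 0                  # every on-chain operation is a transaction

    def lock(self, party, amount):
        self.balances[party] -= amount
        self.locked[party] = amount
        self.tx_count += 1

    def settle(self, final_state):
        for party, amount in final_state.items():
            self.balances[party] += amount
            self.locked[party] = 0
        self.tx_count += 1

class StateChannel:
    def __init__(self, chain, deposits):
        self.chain = chain
        self.state = dict(deposits)
        for party, amount in deposits.items():
            chain.lock(party, amount)      # step 1: lock state on-chain

    def pay(self, sender, receiver, amount):
        self.state[sender] -= amount       # step 2: off-chain updates,
        self.state[receiver] += amount     # no blockchain involved

    def close(self):
        self.chain.settle(self.state)      # step 3: write final state back

chain = MainChain()
channel = StateChannel(chain, {"alice": 50, "bob": 50})
for _ in range(10):                        # ten payments, zero on-chain txs
    channel.pay("alice", "bob", 1)
channel.close()
print(chain.balances)   # {'alice': 90, 'bob': 110}
print(chain.tx_count)   # 3 -- two locks plus one settlement
```

Ten payments cost only three on-chain transactions here; a thousand payments would cost the same three, which is the entire appeal of the approach.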
This technique has been used in the bitcoin Lightning Network and Ethereum's Raiden.

6. Subchains: This is a relatively new technique proposed by Peter R. Rizun, based on the idea of weak blocks that are created in layers until a strong block is found. Weak blocks are blocks that have not met the standard network difficulty criteria but have done enough work to meet a weaker difficulty target. Miners can build subchains by layering weak blocks on top of each other until a block is found that meets the standard difficulty target. At this point, the subchain is closed and becomes the strong block. Advantages of this approach include reduced waiting time for the first verification of a transaction. The technique also reduces the chance of orphaned blocks and speeds up transaction processing. This is an indirect way of addressing the scalability issue. Subchains do not require a soft fork or hard fork to implement, but they do need acceptance by the community.

7. Tree chains: There are also other proposals to increase bitcoin scalability, such as tree chains, which change the blockchain layout from a linearly sequential model to a tree. The tree is basically a binary tree that descends from the main bitcoin chain. This approach is similar to a sidechain implementation and eliminates the need for a major protocol change or block size increase, while allowing improved transaction throughput. In this scheme, the blockchains themselves are fragmented and distributed across the network in order to achieve scalability. Moreover, mining is not required to validate the blocks on the tree chains; instead, users can independently verify the block headers. However, this idea is not ready for production yet, and further research is required to make it practical.

Privacy

Privacy of transactions is a much desired property of blockchains.
However, due to their very nature, especially in public blockchains, everything is transparent, which inhibits their use in industries where privacy is of paramount importance, such as finance, health, and many others. Different proposals have been made to address the privacy issue, and some progress has already been made with several techniques, such as indistinguishability obfuscation, homomorphic encryption, zero-knowledge proofs, and ring signatures. All of these techniques have their merits and demerits and are discussed in the following sections.

Indistinguishability obfuscation: This cryptographic technique may serve as a silver bullet for all privacy and confidentiality issues in blockchains, but the technology is not yet ready for production deployment. Indistinguishability obfuscation (IO) allows for code obfuscation, a very ripe research topic in cryptography, and, if applied to blockchains, could serve as an unbreakable obfuscation mechanism that turns smart contracts into black boxes. The key idea behind IO is what researchers call a multilinear jigsaw puzzle, which obfuscates program code by mixing it with random elements; if the program is run as intended, it produces the expected output, but any other way of executing it makes the program look like random garbage. This idea was first proposed by Sahai and others in their research paper Candidate Indistinguishability Obfuscation and Functional Encryption for All Circuits.

Homomorphic encryption: This type of encryption allows operations to be performed on encrypted data. Imagine a scenario where data is sent to a cloud server for processing: the server processes it and returns the output without knowing anything about the data it has processed.
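As a minimal illustration of computing on encrypted data, textbook RSA happens to be multiplicatively homomorphic. The sketch below uses tiny, insecure parameters and is not fully homomorphic encryption; it only shows the principle:

```python
# Textbook RSA is multiplicatively homomorphic: E(a) * E(b) mod n == E(a * b).
# A toy illustration of computing on encrypted data. The parameters are tiny
# and textbook RSA is NOT secure for real use.

p, q = 61, 53
n = p * q                           # 3233, the public modulus
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (Python 3.8+ modular inverse)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 7, 9
# The "server" multiplies ciphertexts without ever seeing a or b...
product_of_ciphertexts = (encrypt(a) * encrypt(b)) % n
# ...yet decrypting the result yields the product of the plaintexts.
print(decrypt(product_of_ciphertexts))   # 63
```

Fully homomorphic schemes extend this idea to arbitrary computations (both addition and multiplication), which is what makes them attractive for blockchains and what remains hard to deploy in production.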
This is also an area ripe for research; fully homomorphic encryption, which allows all operations on encrypted data, is still not deployable in production, although major progress in this field has already been made. Once implemented on blockchains, it would allow computation on ciphertext, providing privacy and confidentiality of transactions inherently. For example, data stored on the blockchain can be encrypted using homomorphic encryption, and computations can be performed on that data without the need for decryption, thus providing a privacy service on blockchains. This concept has been implemented in a project named Enigma by MIT's Media Lab. Enigma is a peer-to-peer network that allows multiple parties to perform computations on encrypted data without revealing anything about the data.

Zero-knowledge proofs: Zero-knowledge proofs have recently been implemented successfully in Zcash, as seen in previous chapters. More specifically, SNARKs have been implemented in order to ensure privacy on the blockchain. The same idea can be implemented in Ethereum and other blockchains as well. Integrating Zcash on Ethereum is already a very active research project run by the Ethereum R&D team and the Zcash Company.

State channels: Privacy using state channels is also possible, simply because all transactions are run off-chain and the main blockchain does not see them at all, except for the final state output, thus ensuring privacy and confidentiality.

Secure multiparty computation: The concept of secure multiparty computation is not new. It is based on the notion that data is split into multiple partitions, shared between the participating parties under a secret sharing mechanism; the parties then do the actual processing on the data without needing to reconstruct it on a single machine. The output produced after processing is also shared between the parties.
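The secret-sharing step behind secure multiparty computation can be sketched with additive sharing modulo a prime (an honest-but-curious toy; real protocols add authentication and robustness against malicious parties):

```python
import random

# Additive secret sharing: a value is split into random shares that sum to it
# modulo a prime. No single share reveals anything about the secret, yet the
# parties can jointly compute a sum by adding their shares locally.

PRIME = 2**61 - 1   # field modulus
PARTIES = 3

def share(secret, n=PARTIES):
    """Split a secret into n random shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

salary_a, salary_b = 55_000, 72_000
shares_a, shares_b = share(salary_a), share(salary_b)

# Each party adds the two shares it holds; it never sees either salary.
local_sums = [(sa + sb) % PRIME for sa, sb in zip(shares_a, shares_b)]

print(reconstruct(local_sums))   # 127000: the sum, computed entirely on shares
```

Only the final reconstruction reveals anything, and it reveals exactly the agreed-upon output (the sum), mirroring the description above where the processed output is shared between the parties.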
MimbleWimble: The MimbleWimble scheme was proposed somewhat mysteriously on the bitcoin IRC channel and has since gained a lot of popularity. MimbleWimble extends the ideas of confidential transactions and Coinjoin, allowing aggregation of transactions without requiring any interactivity. However, it does not support the bitcoin scripting language, along with various other features of the standard bitcoin protocol, which makes it incompatible with the existing bitcoin protocol. Therefore, it can be implemented either as a sidechain to bitcoin or on its own as an alternative cryptocurrency. This scheme can address privacy and scalability issues at once. Blocks created using the MimbleWimble technique do not contain transactions as in a traditional bitcoin blockchain; instead, they are composed of three lists: an input list, an output list, and something called excesses, which is a list of signatures and the differences between outputs and inputs. The input list is basically a set of references to old outputs, and the output list contains confidential transaction outputs. Nodes can verify these blocks using the signatures, inputs, and outputs to ensure the legitimacy of the block. In contrast to bitcoin, MimbleWimble transaction outputs contain only pubkeys, and the difference between old and new outputs is signed by all participants involved in the transactions.

Coinjoin: Coinjoin is a technique used to anonymize bitcoin transactions by mixing them interactively. The idea is based on forming a single transaction from multiple entities without causing any change in inputs and outputs. It removes the direct link between senders and receivers, which means that a single address can no longer be associated with transactions in a way that could lead to identification of the users. Coinjoin needs cooperation between multiple parties that are willing to create a single transaction by mixing payments.
It should therefore be noted that if any single participant in the Coinjoin scheme does not keep the commitment to cooperate in creating a single transaction, by not signing the transactions as required, the result can be a denial-of-service attack. In this protocol, there is no need for a single trusted third party. This concept differs from a mixing service, which acts as a trusted third party or intermediary between bitcoin users and shuffles transactions. That shuffling prevents tracing and the linking of payments to particular users.

Security

Even though blockchains are generally secure and make use of asymmetric and symmetric cryptography as required throughout the network, there are still a few caveats that can result in compromising the security of a blockchain. Two kinds of contract verification can help address the issue of security.

Why3 formal verification: Formal verification of Solidity code is now available as a feature in the Solidity browser. First, the code is converted into the Why3 language, which the verifier can understand. Consider a simple Solidity contract that assigns the variable z the maximum value of uint: when the code runs, it returns 0, because uint z overruns and starts again from 0. This can be verified using Why3. Once the Solidity code is compiled and available in the formal verification tab, it can be copied into the Why3 online IDE available at http://why3.lri.fr/try/, which successfully checks for and reports integer overflow errors. This tool is under heavy development but is already quite useful. Also, neither this tool nor any similar tool is a silver bullet.
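The overflow that Why3 reports can be reproduced outside Solidity by emulating uint256 wrap-around arithmetic (a sketch; the helper name is illustrative):

```python
# Emulating Solidity's unchecked uint256 wrap-around to show the overflow
# that a verifier like Why3 flags: the maximum uint value plus one is zero.

UINT256_MAX = 2**256 - 1

def uint256_add(a, b):
    """Unchecked 256-bit addition: results wrap modulo 2**256."""
    return (a + b) % 2**256

z = UINT256_MAX           # z holds the maximum limit of uint
print(uint256_add(z, 1))  # 0 -- the value overruns and starts again from 0
```

This is exactly the class of silent arithmetic bug that formal verification is meant to surface before deployment.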
Even formal verification should not generally be considered a panacea, because the specifications themselves must be defined appropriately in the first place.

Oyente tool: Currently, Oyente is available as a Docker image for easy testing and installation. It is available at https://github.com/ethereum/oyente and can be quickly downloaded and tested. As an example, a simple contract taken from the Solidity documentation that contains a re-entrancy bug can be analyzed with Oyente, which successfully finds the bug. A re-entrancy bug basically means that when a contract is interacting with another contract or transferring ether, it is effectively handing over control to that other contract. This allows the called contract to call back into a function of the calling contract before the first invocation has completed. For example, this bug can allow calling back into the withdraw function again and again, resulting in withdrawing Ether multiple times. This is possible because the share value is not set to 0 until the end of the function, which means that any later invocations will succeed, allowing repeated withdrawals. When Oyente is run against such a contract, its analysis successfully reports the re-entrancy bug. The bug can be addressed by applying the Checks-Effects-Interactions pattern described in the Solidity documentation.

We discussed the scalability, security, confidentiality, and privacy aspects of blockchain technology in this article. If you found it useful, go ahead and buy the book, Mastering Blockchain by Imran Bashir, to become a professional Blockchain developer.

13 reasons why Exit Polls get it wrong sometimes

Sugandha Lahoti
13 Nov 2017
7 min read
An exit poll, as the name suggests, is a poll taken immediately after voters exit the polling booth. Private companies working for popular newspapers or media organizations conduct these exit polls and are popularly known as pollsters. Once the data is collected, data analysis and estimation are used to predict the winning party and the number of seats captured. Turnout models, built using logistic regression or random forest techniques, are used to predict turnouts in the exit poll results. Exit polls depend on sampling, hence a margin of error does exist. The margin of error describes how close the pollsters' estimate is expected to be to the true population value. Normally, a margin of error of plus or minus 3 percentage points is acceptable. However, in recent times there have been instances where the poll average was off by a larger percentage. Let us analyze some of the reasons why exit polls can get their predictions wrong.

1. Sampling inaccuracy/quality

Exit polls depend on the sample size, i.e. the number of respondents or the number of precincts chosen. Incorrect estimation of this may lead to larger error margins. The quality of the sample data also matters. This includes factors such as whether the selected precincts are representative of the state and whether the polled audience in each precinct represents the whole.

2. Model did not consider multiple turnout scenarios

Voter turnout refers to the percentage of eligible voters who cast a vote during an election. Pollsters may misestimate the number of people who actually vote out of the total population eligible to vote. They also often base their turnout prediction on past trends. However, voter turnout depends on many factors. For example, some voters might not turn up due to indifference or a perception that their vote might not count, which is not true.
In such cases, pollsters adjust the weighting to reflect high or low turnout conditions, keeping the total turnout count in mind. Observations taken during a low turnout are also considered, and the weights are adjusted accordingly. In short, pollsters try their best to stay true to the original data.

3. Model did not consider past patterns

Pollsters may make the mistake of not delving into the past. They can gauge current turnout rates by taking into account the presidential turnout votes or the previous midterm elections. Although one may assume that the turnout percentage has been stable over the years, a check on past voter turnout is a must.

4. Model was not recalibrated for the year and time of election, such as odd-year midterms

Timing is a crucial factor in getting the right traction for people to vote. At times, some social issues may be much more hyped and talked about than the elections. For instance, the news of the Ebola virus outbreak in Texas was more prominent than news about the candidates standing in the 2014 midterm elections. Another example would be an election day set on a Friday versus any other weekday.

5. Number of contestants

Everyone has a personal favorite. In cases where there are just two contestants, it is straightforward to arrive at a clear winner. For pollsters, it is easier to predict votes when the whole world is talking about the race and they know which candidate is most talked about. As the number of candidates increases, carrying out an accurate survey becomes more challenging for pollsters: they have to reach out to more respondents to conduct the survey effectively.

6. Swing voters/undecided respondents

Another possible explanation for discrepancies between poll predictions and outcomes is a large proportion of undecided voters in the poll samples.
Possible solutions could be asking relative questions instead of absolute ones, and allotting undecided voters in proportion to party support levels while making estimates.

7. Number of down-ballot races

Sometimes a popular party leader helps attract votes to another, less popular candidate of the same party. This is the down-ballot effect. At times, down-ballot candidates may receive more votes than party leader candidates, even when third-party candidates are included. Also, down-ballot outcomes tend to be influenced by the turnout for the polls at the top of the ballot, so the number of down-ballot races needs to be taken into account.

8. The cost incurred to commission a quality poll

A huge capital investment is required to commission a quality poll. The cost of a poll depends on contributing factors such as the sample size (the number of people interviewed), the length of the questionnaire (the longer the interview, the more expensive it becomes), and the time within which interviews must be conducted. Also, hiring a polling firm or including cell phones in a survey will add to the expense.

9. Over-relying on historical precedence

Historical precedence is an estimate of the type of people who have shown up previously in a similar type of election. This precedent should be taken into consideration for better estimation of election results. However, care should be taken not to over-rely on it.

10. Effect of statewide ballot measures

Poll estimates also depend on state and local governments. Certain issues are pushed by local ballot measures. However, some voters feel that power over specific issues should belong exclusively to state governments, which causes opposition to local ballot measures in some states. These issues should be taken into account during estimation for better result prediction.
11. Oversampling due to factors such as faulty survey design or respondents' willingness/unwillingness to participate

Exit polls may also sometimes oversample voters, for many reasons. One example relates to people in the US with cultural ties to Latin America. Although more than one-fourth of Latino voters prefer speaking Spanish to English, exit polls are almost never offered in Spanish. This may oversample English-speaking Latinos.

12. Social desirability bias in respondents

People may not always tell the truth about whom they voted for. In other words, when asked by pollsters, they are likely to place themselves on the safer side, as exit polling is a sensitive topic. Voters may tell pollsters that they voted for a minority candidate when they actually voted against the minority candidate. Social desirability bias is not necessarily tied to issues of race or gender; people simply like to be liked and like to be seen as doing what everyone else is doing, or what the "right" thing to do is. In other words, they play safe. Brexit polling, for instance, showed strong signs of social desirability bias.

13. The spiral of silence theory

People may not reveal their true thoughts to news reporters, as they may believe the media has an inherent bias. Voters may not come out to declare their stand publicly for fear of reprisal or isolation; they choose to remain silent. This, too, can hinder pollsters' estimates.

The above is a shortlist of the many reasons why exit poll results must be taken with a pinch of salt. However, even with all its shortcomings, the striking feature of an exit poll is that rather than predicting a future action, it records an action that has just happened, so it relies on present indicators rather than ambiguous historical data. Exit polls are also cost-effective at obtaining very large samples.
If exit polls are conducted properly, keeping in mind the points described above, they can predict election results with greater reliability.
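The plus-or-minus-3-point margin of error mentioned at the start follows from a standard formula. A sketch at 95% confidence with the worst-case proportion p = 0.5:

```python
import math

# Margin of error for a sampled proportion at 95% confidence:
#   MoE = z * sqrt(p * (1 - p) / n)
# with worst case p = 0.5 and z ~= 1.96 for the 95% level.

def margin_of_error(sample_size, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / sample_size)

# Roughly 1,000 respondents yields the "plus or minus 3 percentage points"
# commonly quoted for polls; quadrupling the sample only halves the error.
print(f"n=1000: {margin_of_error(1000):.1%}")   # ~3.1%
print(f"n=4000: {margin_of_error(4000):.1%}")
```

The square-root relationship is why shrinking the margin of error is so expensive, which connects directly to the polling-cost point above.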


Black Friday Special: 17 ways in 2017 that online retailers use machine learning

Sugandha Lahoti
24 Nov 2017
10 min read
Black Friday sales are just around the corner. Both online and traditional retailers have geared up to race past each other in the ultimate shopping frenzy of the year. Although both brick-and-mortar retailers and online platforms will generate high sales, online retailers will sweep past the offline platforms. Why? When shopping online, customers don't have to deal with pushy crowds, traffic, salespeople, and long queues. Online shoppers also have access to a much larger array of products, and they can switch between stores just by switching between tabs on their smart devices. Considering the surge of shoppers expected in such peak seasons, Big Data analytics is a helpful tool for online retailers. With the advances in machine learning, Big Data analytics is no longer confined to the technology landscape; it also represents a way for retailers to connect with consumers purposefully. For retailers both big and small, adopting the right ML-powered Big Data analytics strategy can help increase sales, retain customers, and generate high revenues. Here are 17 ways data is an important asset for retailers, especially on the 24th of this month.

A. Improving site infrastructure

The first thing a customer sees when landing on an e-commerce website is the UI: ease of access, product classification, number of filters, and so on. Hence, building an easy-to-use website is paramount. Here's how ML-powered Big Data analytics can help:

1. E-commerce site analysis

A complete site analysis is one of the ways to increase sales and retain customers. By analyzing page views, actual purchases, bounce rates, and the least popular products, the e-commerce website can be altered for better usability. Data mining techniques can also be used to enhance website features.
This includes web mining, which is used to extract information from the web, and log files, which contain information about the user. For time-bound sales like Black Friday and Cyber Monday, this is quite helpful for better product placement, removing unnecessary products, and showcasing products that cater to a particular user base.

2. Generating test data

Generating test data enables a deeper analysis, which helps in increasing sales. Big Data analytics can lend a helping hand here by organizing products based on type, shopper gender and age group, brand, pricing, number of views of each product page, and the information provided for each product. During peak seasons such as Black Friday, ML-powered data analytics can analyze the most visited pages and shopper traffic flow for better product placement and personalized recommendations.

B. Enhancing products and categories

Every retailer in the world is looking for ways to reduce costs without sacrificing the quality of their products. Big Data analytics in combination with machine learning is of great help here.

3. Category development

Big Data analytics can help in building new product categories, or in eliminating or enhancing old ones. This is possible by using machine learning techniques to analyze patterns in the marketing data as well as external factors such as product niches. ML-powered assortment planning can help in selecting and planning products for a specified period of time, such as the Thanksgiving week, so as to maximize sales and profit. Data analytics can also help in defining category roles, in order to clearly define the purpose of each category in the total business lifecycle. This is done to ensure that efforts made around a particular category actually contribute to its development. It also helps to identify key categories, which are the featured products that specifically meet an objective, for example,
healthy food items, cheap electronics, and so on.

4. Range selection

An optimum and dynamic product range is essential to retain customers. Big Data analytics can utilize sales data and shopper history to tune a product range for maximum profitability. This is especially important for Black Friday and Cyber Monday deals, where products are sold at heavily discounted rates.

5. Inventory management

Data analytics can give an overview of best-selling products, non-performing or slow-moving products, seasonal products, and so on. These data points can help retailers manage their inventory and reduce the associated costs. Machine-learning-powered Big Data analytics is also helpful in making product localization strategies, i.e. determining which products sell well in which areas. To localize for China, Amazon changed its China branding to Amazon.cn, and to make it easy for Chinese customers to pay, Amazon China introduced portable POS devices so users can pay the delivery agent by credit card at their doorstep.

6. Waste reduction

Big Data analytics can analyze sales and reviews to identify products that don't do well, and either eliminate the product or combine it with a better-performing companion product to increase its sales. Analyzing data can also help in listing products that were returned due to damage and defects. Generating insights from this data using machine learning models can help retailers in many ways: for example, they can modify their stocking methods and improve their packaging and logistics support for those kinds of products.

7. Supply chain optimization

Big Data analytics also has a role to play in supply chain optimization. This includes using sales and forecast data to plan and manage goods from retailers to warehouses, to transport, and onto the doorstep of customers. Top retailers like Amazon are offering deals under the Black Friday banner for the entire week.
Expanding the sale window is a great supply chain optimization technique for more manageable selling.

C. Upgrading the customer experience

Customers are the most important assets for any retailer. Big Data analytics is here to help you retain, acquire, and attract customers.

8. Shopper segmentation

Machine learning techniques can link and analyze granular data, such as behavioral, transactional, and interaction data, to identify and classify customers who behave in similar ways. This eliminates the associated guesswork and helps in creating rich and highly dynamic consumer profiles. According to a report by Research Methodology, Walmart uses a mono-segment type of positioning targeted at a single customer segment. Walmart also pays attention to young consumers due to the strategic importance of achieving their loyalty for the long term.

9. Promotional analytics

An important factor for better sales is analyzing how customers respond to promotions and discounts. Analyzing data on an hour-to-hour basis on special days such as Black Friday or Cyber Monday, which see high customer traffic, can help retailers plan better promotions and improve brand penetration. The Boston Consulting Group uses data analytics to accurately gauge the performance of promotions and to predict promotion performance in advance.

10. Product affinity models

By analyzing a shopper's past transaction history, product affinity models can identify customers with the highest propensity to buy a particular product. Retailers can then use this for attracting more customers or providing existing ones with better personalization. Product affinity models can also cluster products that are frequently bought together, which can be used to improve recommendation systems.

11. Customer churn prediction

The massive quantity of customer data being collected can be used for predicting the customer churn rate.
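A minimal churn classifier can be sketched as a from-scratch logistic regression; the two features and the toy data below are entirely illustrative, and real churn models draw on far richer behavioral data:

```python
import math

# Minimal logistic-regression churn classifier trained with stochastic
# gradient descent. Features: (months since last purchase, purchases in the
# last year); label 1 = churned. All data is made up for illustration.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(rows, labels, lr=0.1, epochs=2000):
    w = [0.0] * (len(rows[0]) + 1)            # bias plus one weight per feature
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            pred = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            err = pred - y                     # gradient of the logistic loss
            w[0] -= lr * err
            for i, xi in enumerate(x):
                w[i + 1] -= lr * err * xi
    return w

def predict(w, x):
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

data = [(0.5, 12), (1, 8), (2, 6), (9, 1), (12, 0), (10, 2), (1, 10), (11, 1)]
labels = [0, 0, 0, 1, 1, 1, 0, 1]

w = train(data, labels)
print(predict(w, (10, 1)) > 0.5)   # True: long-inactive customer likely churns
print(predict(w, (1, 9)) > 0.5)    # False: recently active customer likely stays
```

In practice one would reach for a library implementation (scikit-learn, or the Azure template mentioned below), but the mechanics are the same: learn weights that separate churners from stayers.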
Customer churn prediction is helpful in retaining customers, attracting new ones, and acquiring the right type of customers in the first place. Classification models such as logistic regression can be used to predict the customers most likely to churn. As part of its Azure Machine Learning offering, Microsoft has a Retail Customer Churn Prediction Template to help retail companies predict customer churn.

D. Formulating and aligning business strategies

Every retailer needs tools and strategies for a product or service to reach and influence consumers, generate profits, and contribute to the long-term success of the business. Below are some pointers on how ML powered Big Data analytics can help retailers do just that.

12. Building dynamic pricing models

Pricing models can be designed by looking at the customer's purchasing habits and surfing history. These descriptive analytics can be fed into a predictive model to obtain optimal pricing inputs such as price sensitivity scores and price-to-demand elasticity. For example, Amazon uses a dynamic price optimization technique, offering its biggest discounts on its most popular products while making profits on less popular ones. IBM's Predictive Customer Intelligence can dynamically adjust the price of a product based on a customer's purchase decision.

13. Time series analysis

Time series analysis can be used to identify patterns and trends in customer purchases, or in a product's lifecycle, by observing information in a sequential fashion. It can also be used to predict future values based on the sequence so generated. For online retailers, this means using historical sales data to forecast future sales, and analyzing time-dependent patterns to decide when to list new arrivals or to mark prices up or down depending on events such as Black Friday or Cyber Monday sales.

14.
Demand forecasting

Machine learning powered Big Data analytics can learn demand levels from a wide array of factors such as product nature, characteristics, seasonality, relationships with other associated products, relationships with other market factors, and so on. It can then forecast the demand for a particular product using a simulation model. Such predictive analytics are highly accurate and also reduce costs, especially for events like Black Friday where there is a high surge of shoppers.

15. Strategy adjustment

Predictive Big Data analytics can help shorten the go-to-market time for product launches, allowing marketers to adjust their strategy midcourse if needed. For Black Friday or Cyber Monday deals, an online retailer can predict the demand for a particular product and amend strategies along the way, such as increasing the discount or keeping a product at the discounted rate for a longer time.

16. Reporting and sales analysis

Big Data analytics tools can analyze large quantities of retail data quickly. Most such tools also have a simple UI dashboard that gives retailers detailed answers to their queries in a single click. This saves a lot of time that was previously spent creating reports and sales summaries. Reports generated from a data analytics tool are quick and easy to understand.

17. Marketing mix spend optimization

Forecasting sales and proving the ROI of marketing activities are two pain points faced by most retailers. Marketing mix modelling is a Big Data statistical analysis which uses historical data to show the impact of marketing activities on sales and then forecasts the impact of future marketing tactics. Insights derived from such tools can be used to enhance marketing strategies and optimize costs.

By adopting the strategies mentioned above, retailers can maximize their gains this holiday season, starting with Black Friday, which begins as the clock chimes 12 today.
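As a rough illustration of the demand forecasting idea in point 14, here is a toy simple-exponential-smoothing forecast in plain Python. The weekly sales numbers and the promotion uplift factor are invented for the example; a real model would learn demand from the many factors listed above.

```python
# Sketch of a naive demand forecast: simple exponential smoothing
# over weekly unit sales, with a multiplicative bump for a known
# promotion week (e.g. Black Friday). All numbers are made up.

def exponential_smoothing(series, alpha=0.5):
    """Return the smoothed demand level after seeing the whole series."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

weekly_units = [100, 110, 105, 120, 115, 125]
baseline = exponential_smoothing(weekly_units)

PROMO_LIFT = 3.0   # assumed uplift factor for a Black Friday week
print(f"baseline forecast: {baseline:.1f} units")
print(f"promo-week forecast: {baseline * PROMO_LIFT:.1f} units")
```

In practice the smoothing constant and the promotion uplift would themselves be fitted from historical data rather than hard-coded.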
Machine learning powered Big Data analytics is there to help retailers attract new shoppers, retain them, enhance product lines, define new categories, and formulate and align business strategies. Gear up for a Big Data Black Friday this 2017!
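As a parting sketch of the churn prediction mentioned in point 11, here is a minimal logistic regression trained with gradient descent in plain Python. The two features and all the numbers are invented for illustration; a production model (e.g. scikit-learn, or the Azure template mentioned earlier) would use far richer data.

```python
# Minimal churn classifier: logistic regression fitted by stochastic
# gradient descent on two toy features. Illustrative only.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# (months_inactive, support_tickets) -> churned (1) or stayed (0)
X = [(0, 0), (1, 0), (0, 1), (5, 4), (6, 3), (7, 5)]
y = [0, 0, 0, 1, 1, 1]

w = [0.0, 0.0]
b = 0.0
lr = 0.1
for _ in range(2000):                      # gradient-descent epochs
    for (x1, x2), target in zip(X, y):
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        err = p - target                   # gradient of the log-loss
        w[0] -= lr * err * x1
        w[1] -= lr * err * x2
        b    -= lr * err

def churn_probability(months_inactive, tickets):
    return sigmoid(w[0] * months_inactive + w[1] * tickets + b)

print(f"quiet, happy customer: {churn_probability(0, 0):.2f}")
print(f"inactive, frustrated:  {churn_probability(6, 4):.2f}")
```

The fitted probabilities separate the two toy groups cleanly, which is the whole point of the model: rank customers by churn risk so retention offers can be targeted.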

Savia Lobo
21 Nov 2017
8 min read

Data science folks have 12 reasons to be thankful for this Thanksgiving

We are nearing the end of 2017. But with each ending chapter, we have remarkable achievements to be thankful for. Similarly, for the data science community, this year was filled with a number of new technologies, tools, version updates, etc. 2017 saw blockbuster releases such as PyTorch, TensorFlow 1.0, and Caffe 2, among many others. We invite data scientists, machine learning experts, and other data science professionals to come together on this Thanksgiving Day and thank the organizations which made our interactions with AI easier, faster, better, and generally more fun. Let us recall our blessings in 2017, one month at a time...

[dropcap]Jan[/dropcap] Thank you, Facebook and friends for handing us PyTorch

Hola 2017! While the world was still in the New Year mood, a brand new deep learning framework was released. Facebook, along with a few other partners, launched PyTorch. PyTorch came as an improvement to the popular Torch framework: it now supported the Python language over the less popular Lua. As PyTorch worked just like Python, it was easier to debug and create unique extensions. Another notable change was the adoption of a dynamic computational graph, used to create graphs on the fly with high speed and flexibility.

[dropcap]Feb[/dropcap] Thanks Google for TensorFlow 1.0

The month of February brought data scientists a Valentine's gift with the release of TensorFlow 1.0. Announced at the first annual TensorFlow Developer Summit, TensorFlow 1.0 was faster, more flexible, and production-ready. Here's what the TensorFlow box of chocolates contained:

- Full compatibility with Keras
- Experimental APIs for Java and Go
- New Android demos for object and image detection, localization, and stylization
- A brand new TensorFlow debugger
- An introductory glance at XLA--a domain-specific compiler for TensorFlow graphs

[dropcap]Mar[/dropcap] We thank Francois Chollet for making Keras 2 a production ready API

Congratulations! Keras 2 is here.
This was great news for data science developers, as Keras 2, a high-level neural network API, allowed faster prototyping. It supported both CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks). Keras has an API designed specifically for humans: a user-friendly API. It also allowed easy creation of modules, which makes it perfect for carrying out advanced research. Developers can now code in Python, a compact, easy-to-debug language.

[dropcap]Apr[/dropcap] We like Facebook for brewing us Caffe 2

Data scientists were greeted by a fresh aroma of coffee this April, as Facebook released the second version of its popular deep learning framework, Caffe. Caffe 2 came as an easy-to-use deep learning framework to build DL applications and leverage community contributions of new models and algorithms. Caffe 2 arrived fresh with first-class support for large-scale distributed training, new hardware support, mobile deployment, and the flexibility for future high-level computational approaches. It also provided easy methods to convert DL models built in the original Caffe to the new Caffe version. Caffe 2 also came with over 400 different operators--the basic units of computation in Caffe 2.

[dropcap]May[/dropcap] Thank you, Amazon for supporting Apache MXNet on AWS and Google for your TPU

The month of May brought in some exciting launches from two tech giants, Amazon and Google: Amazon Web Services brought Apache MXNet on board, and Google's second-generation TPU chips were announced. Apache MXNet, now available on AWS, allows developers to build machine learning applications that train quickly and run anywhere--a scalable approach for developers. Next up were Google's second-generation TPU (Tensor Processing Unit) chips, designed to speed up machine learning tasks. These chips were supposed to be (and are) more capable than CPUs and even GPUs.
[dropcap]Jun[/dropcap] We thank Microsoft for CNTK v2

The middle of the month arrived with Microsoft's announcement of version 2 of its Cognitive Toolkit. The new Cognitive Toolkit was enterprise-ready, had production-grade AI, and allowed users to create, train, and evaluate their own neural networks, scalable across multiple GPUs. It also included Keras API support, faster model compression, Java bindings, and Spark support, and featured a number of new tools to run trained models on low-powered devices such as smartphones.

[dropcap]Jul[/dropcap] Thank you, Elastic.co for bringing ML to Elastic Stack

July made machine learning generally available to Elastic Stack users with version 5.5. With ML, anomaly detection on Elasticsearch time series data was made possible. This allows users to analyze the root cause of problems in the workflow and thus reduce false positives. To know about the changes and highlights of this version, visit here.

[dropcap]Aug[/dropcap] Thank you, Google for your Deeplearn.js

August announced the arrival of Google's Deeplearn.js, an initiative that allowed machine learning models to run entirely in a browser. Deeplearn.js is an open source, WebGL-accelerated JS library. It offers an interactive client-side platform that helps developers carry out rapid prototyping and visualizations. Developers are now able to use hardware accelerators such as the GPU via WebGL and perform faster computations with 2D and 3D graphics. Deeplearn.js also allows TensorFlow models' capabilities to be imported into the browser. Surely something to be thankful for!

[dropcap]Sep[/dropcap] Thanks, Splunk and SQL for your upgrades

September's surprises came with the release of Splunk 7.0, which helps in getting machine learning to the masses with an added Machine Learning Toolkit that is scalable, extensible, and accessible. It includes added native support for metrics, which speeds up query processing performance by 200x.
Other features include seamless event annotations, improved visualization, faster data model acceleration, and a cloud-based self-service application. September also brought along the release of MySQL 8.0, which included first-class support for Unicode 9.0. Other features included:

- Extended support for native JSON data
- Window functions and recursive SQL syntax for queries that were previously impossible or difficult to write
- Added document-store functionality

So, big thanks for the Splunk and SQL upgrades.

[dropcap]Oct[/dropcap] Thank you, Oracle for the Autonomous Database Cloud and Microsoft for SQL Server 2017

As fall arrived, Oracle unveiled the world's first Autonomous Database Cloud. It provides full automation of tuning, patching, updating, and maintaining the database. It is self-scaling, i.e., it instantly resizes compute and storage without downtime, with low manual administration costs. It is also self-repairing, with guaranteed 99.995 percent reliability and availability. That's a lot of reduction in workload! Next, developers were greeted with the release of SQL Server 2017, a major step towards making SQL Server a platform. It included multiple enhancements to the Database Engine, such as adaptive query processing, automatic database tuning, graph database capabilities, new availability groups, and the Database Tuning Advisor (DTA). It also had a new Scale Out feature in SQL Server 2017 Integration Services (SSIS), and SQL Server Machine Learning Services now reflects support for the Python language.

[dropcap]Nov[/dropcap] A humble thank you to Google for TensorFlow Lite and Elastic.co for Elasticsearch 6.0

Just a month more for the year to end!! The data science community has had a busy November, with too many releases to keep an eye on and Microsoft Connect(); to spill the beans. So, November, thank you for TensorFlow Lite and Elastic 6.
Talking about TensorFlow Lite, a lightweight product for mobile and embedded devices, it is designed to be:

- Lightweight: It allows inference of on-device machine learning models with a small binary size, allowing faster initialization/startup.
- Speedy: Model loading time is dramatically improved, with accelerated hardware support.
- Cross-platform: It includes a runtime tailor-made to run on various platforms, starting with Android and iOS.

And now for Elasticsearch 6.0, which is made generally available, with features such as easy upgrades, index sorting, better shard recovery, and support for sparse doc values. There are other new features spread out across the Elastic Stack, comprising Kibana, Beats, and Logstash--Elastic's solutions for visualization and dashboards, data ingestion, and log storage.

[dropcap]Dec[/dropcap] Thanks in advance Apache for Hadoop 3.0

Christmas gifts may arrive for data scientists in the form of the general availability of Hadoop 3.0. The new version is expected to include support for erasure coding in HDFS, version 2 of the YARN Timeline Service, shaded client jars, support for more than two NameNodes, MapReduce task-level native optimization, and support for opportunistic containers and distributed scheduling, to name a few. It will also include a rewritten version of the Hadoop shell scripts with bug fixes, improved compatibility, and many changes to some existing installation procedures. Phew! That was a large list of tools for data scientists and developers to be thankful for this year. Whether it be new frameworks, libraries, or a new set of software, each one of them is unique and helpful for creating data-driven applications. Hopefully, you have used some of them in your projects. If not, be sure to give them a try, because 2018 is all set to overload you with new, and even more amazing, tools, frameworks, libraries, and releases.

Aaron Lazar
23 Oct 2017
7 min read

"My Favorite Tools to Build a Blockchain App" - Ed, The Engineer

Hey! It's great seeing you here. I am Ed, the Engineer, and today I'm going to open up my secret toolbox and share some great tools I use to build Blockchains. If you're a Blockchain developer or a developer-to-be, you've come to the right place! If you are not one, maybe you should consider becoming one. "There are only 5,000 developers dedicated to writing software for cryptocurrencies, Bitcoin, and blockchain in general. And perhaps another 20,000 had dabbled with the technology, or have written front end applications that connect with the blockchain." - William Mougayar, The Business Blockchain

Decentralized apps, or dapps as they are fondly called, are serverless applications that can be run on the client side, within a blockchain-based distributed network. We're going to learn what the best tools are to build dapps, and over the next few minutes we'll take these tools apart one by one. For a better understanding of where they fit into our development cycle, we'll group them into stages - just like the buildings we build. So, shall we begin? Yes, we can!! ;)

The Foundation: Platforms

The first and foremost element for any structure to stand tall and strong is its foundation. The same goes for Blockchain apps. Here, in place of all the mortar and other things, we've got decentralized and public blockchains. There are several existing networks like Bitcoin, Ethereum, or Hyperledger that can be used to build dapps. Ethereum and Bitcoin are both decentralized, public chains that are open source, while Hyperledger is private and also open source. Bitcoin may not be a good choice to build dapps on, as it was originally designed for peer-to-peer transactions and not for building smart contracts.

The Pillars of Concrete: Languages

Now, once you've got your foundation in place, you need to start raising pillars that will act as the skeleton for your applications. How do we do this? Well, we've got two great languages specifically for building dapps.
Solidity

It's an object-oriented language that you can use for writing smart contracts. The best part of Solidity is that you can use it across all platforms, making it the number one choice for many developers. It's a lot like JavaScript and way more robust than other languages. Along with Solidity, you might want to use Solc, the compiler for Solidity. At the moment, Solidity is the language that's getting the most support and has the best documentation.

Serpent

Before the dawn of Solidity, Serpent was the reigning language for building dapps. Something like how bricks replaced stone to build massive structures. Serpent, though, is still being used in many places to build dapps, and it has great real-time garbage collection.

The Transit Mixers: Frameworks

After you choose your language to build dapps, you need a framework to simplify the mixing of concrete to build your pillars. I find these frameworks interesting:

Embark

This is a framework for Ethereum you can use to quicken development and to streamline the process by using tools or functionalities. It allows you to develop and deploy dapps easily, or even build a serverless HTML5 application that uses decentralized technology. It equips you with tools to create new smart contracts which can be made available in JavaScript code.

Truffle

Here is another great framework for Ethereum, which boasts of taking on the task of managing your contract artifacts for you. It includes support for the library that links complex Ethereum apps and provides custom deployments.

The Contractors: Integrated Development Environments

Maybe you are not the kind that likes to build things from scratch. You just need a one-stop place where you can tell what kind of building you want and everything else just falls in place. Hire a contractor. If you're looking for the complete package to build dapps, there are two great tools you can use: Ethereum Studio and Remix (Browser-Solidity).
The IDE takes care of everything - right from emulating the live network to testing and deploying your dapps.

Ethereum Studio

This is an adapted version of Cloud9, built for Ethereum with some additional tools. It has a blockchain emulator called the sandbox, which is great for writing automated tests. Fair warning: you must pay for this tool as it's not open source, and you must use Azure Cloud to access it.

Remix

This can pretty much do the same things that Ethereum Studio can. You can run Remix from your local computer and allow it to communicate with an Ethereum node client that's on your local machine. This will let you execute smart contracts while connected to your local blockchain. Remix is still under development at the time of writing this article.

The Rebound Hammer: Testing tools

Nothing goes live until it's tried and tested. Just like the rebound hammer you may use to check the quality of concrete, we have a great tool that helps you test dapps.

Blockchain Testnet

For testing purposes, use the testnet, an alternative blockchain. Whether you want to create a new dapp using Ethereum or any other chain, I recommend that you use the related testnet, which ideally works as a substitute in place of the true blockchain that you will be using for the real dapp. Testnet coins are different from actual bitcoins, and do not hold any value, allowing you as a developer or tester to experiment without needing to use real bitcoins or having to worry about breaking the primary bitcoin chain.

The Wallpaper: dapp Browsers

Once you've developed your dapp, it needs to look pretty for consumers to use. Dapp browsers are mostly the user interfaces for the decentralized Web. Two popular tools that help you bring dapps to your browser are Mist and Metamask.

Mist

It is a popular browser for decentralized web apps. Just as Firefox or Chrome are for the Web 2.0, the Mist browser will be for the decentralized Web 3.0.
Ethereum developers would be able to use Mist not only to store Ether or send transactions, but also to deploy smart contracts.

Metamask

With Metamask, you can comfortably run dapps in your browser without having to run a full Ethereum node. It includes a secure identity vault that provides a UI to manage your identities on various sites, as well as sign blockchain contracts.

There! Now you can build a Blockchain!

Now you have all the tools you need to make amazing and reliable dapps. I know you're always hungry for more - this Github repo created by Christopher Allen has a great listing of tools and resources you can use to begin or improve your Blockchain development skills. If you're one of those lazy-but-smart folks who want to get things done at the click of a mouse button, then BaaS or Blockchain as a Service is something you might be interested in. There are several big players in this market at the moment, like IBM, Azure, SAP, and AWS. BaaS is basically for organizations and enterprises that need blockchain networks that are open, trusted, and ready for business. If you go the BaaS way, let me warn you - you're probably going to miss out on all the fun of building your very own blockchain from scratch. With so many banks and financial entities beginning to set up their blockchains for recording transactions and transfer of assets, and investors betting billions on distributed ledger-related startups, there are hardly a handful of developers out there who have the required skills. This leaves you with a strong enough reason to develop great blockchains and sharpen your skills in the area. Our Building Blockchain Projects book should help you put some of these tools to use in building reliable and robust dapps. So what are you waiting for? Go grab it now and have fun building blockchains!
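Before you pick up those tools, it helps to have a feel for the data structure they all build on. Here is a toy, purely educational sketch in Python of a chain of hash-linked blocks; real chains add consensus, signatures, and networking on top, and every detail here (field names, the "transactions") is invented for illustration.

```python
# Toy hash-linked chain: each block commits to its predecessor via a
# SHA-256 hash, so tampering with any block breaks the whole chain.
import hashlib
import json

def make_block(data, prev_hash):
    block = {"data": data, "prev_hash": prev_hash}
    payload = json.dumps({"data": data, "prev_hash": prev_hash},
                         sort_keys=True).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

def chain_is_valid(chain):
    for i, block in enumerate(chain):
        # Each block's stored hash must match its recomputed hash...
        if make_block(block["data"], block["prev_hash"])["hash"] != block["hash"]:
            return False
        # ...and each block must point at its predecessor's hash.
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

genesis = make_block("genesis", prev_hash="0" * 64)
block1 = make_block("Alice pays Bob 5", genesis["hash"])
block2 = make_block("Bob pays Carol 2", block1["hash"])
chain = [genesis, block1, block2]

print(chain_is_valid(chain))              # True
block1["data"] = "Alice pays Bob 500"     # tamper with history
print(chain_is_valid(chain))              # False
```

Everything the tools above provide - smart contract languages, IDEs, testnets, browsers - sits on top of this simple tamper-evident structure.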

Savia Lobo
20 Nov 2017
6 min read

Looking at the different types of Lookup cache

[box type="note" align="" class="" width=""]The following is an excerpt from a book by Rahul Malewar titled Learning Informatica PowerCenter 10.x. We walk through the various types of lookup cache based on how a cache is defined in this article.[/box]

Cache is the temporary memory that is created when you execute a process. It is created automatically when a process starts and is deleted automatically once the process is complete. The amount of cache memory is decided based on the property you define at the transformation level or session level. You usually leave the property at its default value so that the cache can grow in size as required. If the size required for caching the data is more than the cache size defined, the process fails with an overflow error. There are different types of caches available.

Building the Lookup Cache - Sequential or Concurrent

You can define the session property to create the cache either sequentially or concurrently.

Sequential cache

When you select to create the cache sequentially, the Integration Service caches the data in a row-wise manner as the records enter the lookup transformation. When the first record enters the lookup transformation, the lookup cache gets created and stores the matching record from the lookup table or file in the cache. This way, the cache stores only the matching data. It helps in saving the cache space by not storing unnecessary data.

Concurrent cache

When you select to create the cache concurrently, the Integration Service does not wait for the data to flow from the source; it first caches the complete data. Once the caching is complete, it allows the data to flow from the source. When you select a concurrent cache, performance is enhanced compared to a sequential cache, since the scanning happens internally using the data stored in the cache.

Persistent cache - the permanent one

You can configure the cache to permanently save the data.
By default, the cache is created as non-persistent; that is, the cache will be deleted once the session run is complete. If the lookup table or file does not change across session runs, you can reuse the existing persistent cache. Suppose you have a process that is scheduled to run every day, and you are using a lookup transformation to look up a reference table which is not supposed to change for six months. If you use a non-persistent cache, the same data will be cached every day, wasting time and space. If you select to create a persistent cache, the Integration Service makes the cache permanent in the form of a file in the $PMCacheDir location, so you save the time spent every day creating and deleting the cache memory. When the data in the lookup table changes, you need to rebuild the cache. You can define the condition in the session task to rebuild the cache by overwriting the existing cache. To rebuild the cache, you need to check the rebuild option on the session property.

Sharing the cache - named or unnamed

You can enhance the performance and save cache memory by sharing the cache if there are multiple lookup transformations used in a mapping. If both lookup transformations have the same structure, sharing the cache helps enhance performance by creating the cache only once. This way, we avoid creating the cache multiple times, which in turn enhances performance. You can share the cache either named or unnamed.

Sharing unnamed cache

If you have multiple lookup transformations used in a single mapping, you can share the unnamed cache. Since the lookup transformations are present in the same mapping, naming the cache is not mandatory. The Integration Service creates the cache while processing the first record in the first lookup transformation and shares the cache with the other lookups in the mapping.
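The persistent cache behaviour described above - build once, reuse across session runs, rebuild only on demand - can be sketched in a few lines of Python. This is a stand-in illustration only: the temp directory plays the role of $PMCacheDir, and the dict plays the role of the lookup table.

```python
# Sketch of a persistent lookup cache: serialize the cache to a file
# so a later "session run" can reload it instead of rebuilding, and
# rebuild only when explicitly asked.
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp()             # stand-in for $PMCacheDir

def build_cache(lookup_table, name, rebuild=False):
    path = os.path.join(CACHE_DIR, name + ".cache")
    if not rebuild and os.path.exists(path):
        with open(path, "rb") as f:        # reuse the persisted cache
            return pickle.load(f), "reused"
    cache = dict(lookup_table)             # the "expensive" build step
    with open(path, "wb") as f:
        pickle.dump(cache, f)
    return cache, "built"

table = {"IN": "India", "US": "United States"}
_, first = build_cache(table, "countries")                # builds
_, second = build_cache(table, "countries")               # reuses
_, third = build_cache(table, "countries", rebuild=True)  # forced rebuild
print(first, second, third)                # built reused built
```

The `rebuild=True` path mirrors checking the rebuild option on the session property when the underlying lookup table has changed.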
Sharing named cache

You can share a named cache with multiple lookup transformations in the same mapping or in another mapping. Since the cache is named, you can assign the same cache using its name in the other mapping. When you process the first mapping with the lookup transformation, it saves the cache in the defined cache directory with a defined cache file name. When you process the second mapping, it searches for the same location and cache file and uses the data. If the Integration Service does not find the mentioned cache file, it creates a new cache. If you run multiple sessions simultaneously that use the same cache file, the Integration Service processes both sessions successfully only if the lookup transformation is configured to read only from the cache. If both lookup transformations try to update the cache file, or one lookup tries to read the cache file while the other tries to update it, the session will fail as there is a conflict in the processing. Sharing the cache helps in enhancing the performance by utilizing the cache already created. This way we save processing time and repository space by not storing the same data multiple times for lookup transformations.

Modifying cache - static or dynamic

When you create a cache, you can configure it to be static or dynamic.

Static cache

A cache is said to be static if it does not change with the changes happening in the lookup table. The static cache is not synchronized with the lookup table. By default, the Integration Service creates a static cache. The lookup cache is created as soon as the first record enters the lookup transformation. The Integration Service does not update the cache while it is processing the data.

Dynamic cache

A cache is said to be dynamic if it changes with the changes happening in the lookup table. The dynamic cache is synchronized with the lookup table.
You can choose from the lookup transformation properties to make the cache dynamic. The lookup cache is created as soon as the first record enters the lookup transformation. The Integration Service keeps updating the cache while it is processing the data. The Integration Service marks the record as an insert for a new row inserted in the dynamic cache. For a record that is updated, it marks the record as an update in the cache. For every record that doesn't change, the Integration Service marks it as unchanged. You use the dynamic cache while you process slowly changing dimension tables. For every record inserted in the target, the record will be inserted in the cache. For every record updated in the target, the record will be updated in the cache. A similar process happens for the deleted and rejected records.
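The insert/update/unchanged flagging just described can be sketched with a plain Python dict standing in for the dynamic cache; the row keys and values are invented for the example.

```python
# Sketch of a dynamic lookup cache: as each source row arrives, the
# cache is updated in step with the target and the row is flagged as
# an insert, an update, or unchanged.

cache = {}          # key -> value, kept in sync with the target

def process_row(key, value):
    if key not in cache:
        cache[key] = value        # new row: insert into the cache
        return "insert"
    if cache[key] != value:
        cache[key] = value        # changed row: update the cache
        return "update"
    return "unchanged"            # identical row: leave cache as-is

rows = [("C1", "Delhi"), ("C2", "Pune"), ("C1", "Delhi"), ("C1", "Mumbai")]
flags = [process_row(k, v) for k, v in rows]
print(flags)   # ['insert', 'insert', 'unchanged', 'update']
```

This is exactly the bookkeeping a slowly changing dimension load needs: the flag tells the downstream target whether to insert, update, or skip each row.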
Timothy Chen
23 Dec 2015
5 min read

Level Up Your Company's Big Data with Mesos

In my last post I talked about how using a resource management platform can allow your Big Data workloads to be more efficient with fewer resources. In this post I want to continue the discussion with a specific resource management platform: Mesos.

Introduction to Mesos

Mesos is an Apache top-level project that provides an abstraction over your datacenter resources and an API to program against these resources to launch and manage your workloads. Mesos is able to manage your CPU, memory, disk, ports, and other resources that the user can custom-define. Every application that wants to use datacenter resources to run tasks talks with Mesos and is called a scheduler. It uses the scheduler API to receive resource offers, and each scheduler can decide to use the offer, decline the offer and wait for future ones, or hold on to the offer for a period of time to combine resources. Mesos ensures fairness amongst multiple schedulers, so no one scheduler can take over all the resources. So how do your Big Data frameworks benefit specifically by using Mesos in your datacenter?

Autopilot your Big Data frameworks

The first benefit of running your Big Data frameworks on top of Mesos is that, by abstracting away resources and providing an API to program against your datacenter, it allows each Big Data framework to manage itself with minimal human intervention. How does the Mesos scheduler API provide self-management to frameworks? First we should understand a little bit more about what the scheduler API allows you to do. The Mesos scheduler API provides a set of callbacks whenever the following events occur: new resources available, task status changed, slave lost, executor lost, scheduler registered/disconnected, etc. By reacting to each event with its own specific logic, a Big Data framework can deploy itself, handle failures, scale, and more.
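The callback-driven pattern can be sketched as a toy scheduler in Python. The method names loosely mirror the Mesos scheduler callbacks described above, but this is a self-contained simulation with invented node and task names, not the real Mesos bindings.

```python
# Toy sketch of a callback-driven scheduler: it reacts to resource
# offers and task status changes, self-healing by relaunching lost
# tasks on whatever resources are offered next.

class ToyScheduler:
    def __init__(self, tasks_needed):
        self.pending = list(tasks_needed)   # tasks still to place
        self.running = {}                   # task -> node

    def resource_offers(self, offers):
        """Called when new resources are available: accept or decline."""
        for node, cpus in offers:
            if self.pending and cpus >= 1:
                task = self.pending.pop(0)
                self.running[task] = node   # "launch" task on the node
            # else: implicitly decline and wait for future offers

    def status_update(self, task, state):
        """Called when a task's status changes: react, e.g. relaunch."""
        if state == "TASK_LOST":
            del self.running[task]
            self.pending.append(task)       # self-heal: retry the task

sched = ToyScheduler(["map-1", "map-2"])
sched.resource_offers([("node-a", 2), ("node-b", 1)])
sched.status_update("map-2", "TASK_LOST")   # simulate a node crash
sched.resource_offers([("node-c", 4)])      # task relaunched elsewhere
print(sched.running)   # {'map-1': 'node-a', 'map-2': 'node-c'}
```

Everything a real framework does - data locality, rack awareness, executor management - is layered on top of exactly this offer/status event loop.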
Using Spark as an example: when a new Spark job is launched, it launches a new scheduler that waits for resources from Mesos. When new resources are available, it deploys Spark executors to those nodes automatically, provides Spark task information to those executors, and communicates the results back to the scheduler. When a task is terminated unexpectedly for some reason, the Spark scheduler receives the notification and can automatically relaunch that task on another node and attempt to resume the job. When a machine crashes, the Spark scheduler is also notified and can relaunch all the executors on that node on other available resources. Moreover, since the Spark scheduler can choose where to launch tasks, it can choose the nodes that provide the most data locality to the data it is going to process. It can also choose to deploy the Spark executors in different racks for higher availability if it's a long-running Spark streaming job. As you can see, programming against an API allows lots of flexibility and self-management for Big Data frameworks, and saves a lot of manual scripting and automation that would otherwise need to happen.

Manage your resources among frameworks and users

When there are multiple Big Data frameworks sharing the same cluster, and each framework is shared by multiple users, providing a good policy to ensure that the important users and jobs get executed becomes very important. Mesos allows you to specify roles, where multiple frameworks can belong to a role. Mesos then allows operators to specify weights among these roles, so that fair share is enforced by Mesos and resources are provided according to the weights specified. For example, one might provide 70% of resources to Spark and 30% of resources to general tasks with weighted roles in Mesos. Mesos also allows reserving a fixed amount of resources per agent to a specific role.
This ensures that your important workload is guaranteed enough resources to complete. More multi-tenancy features are coming to Mesos as well. One, called quota, reserves a certain amount of resources across the whole cluster rather than per agent. Another, called dynamic reservation, lets frameworks and operators reserve a certain amount of resources at runtime and unreserve them once they are no longer needed.

Optimize your resources among frameworks

Using Mesos also boosts utilization, by letting tasks from different frameworks share the same cluster instead of running in separate, siloed clusters. A number of features currently being worked on will push utilization even further. The first, called oversubscription, uses tasks' runtime statistics to estimate the amount of resources they are not actually using, and offers those resources to other schedulers so that more of the cluster is utilized. The oversubscription controller also monitors the tasks, and if a task starts suffering from the shared resources, it kills the borrowing tasks so the original workload is no longer affected. Another feature, called optimistic offers, allows multiple frameworks to compete for the same resources. This helps utilization through faster scheduling, and gives Mesos more input for deciding how best to schedule its resources in the future. As you can see, Mesos allows your Big Data frameworks to be self-managed and more efficient, and enables optimizations that are only possible when resource management is shared. If you're curious how to get started, you can follow the Mesos website, or the Mesosphere website, which provides even simpler tools for using your Mesos cluster. Want more Big Data tutorials and insight? Both our Spark and Hadoop pages have got you covered.
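To make the oversubscription idea concrete, here is a toy slack estimator. The function, the 10% safety margin, and the numbers are all invented for illustration; Mesos's real resource estimator is considerably more sophisticated:

```python
# Toy estimate of oversubscribable ("revocable") CPU resources:
# per-agent slack = allocated - observed usage - a safety margin.
# All numbers are made up for illustration.

def revocable_resources(allocations, usage, safety_margin=0.1):
    """Return per-agent CPU slack that could be re-offered as revocable."""
    slack = {}
    for agent, allocated in allocations.items():
        used = usage.get(agent, 0.0)
        free = allocated - used - safety_margin * allocated
        slack[agent] = round(max(free, 0.0), 2)
    return slack

allocations = {"node-1": 8.0, "node-2": 8.0}   # CPUs allocated to tasks
usage       = {"node-1": 2.5, "node-2": 7.8}   # CPUs actually in use
print(revocable_resources(allocations, usage))
# node-1 has idle capacity that could be offered to other schedulers;
# node-2 is nearly saturated, so nothing is offered from it.
```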
About the author

Timothy Chen is a distributed systems engineer and entrepreneur. He works at Mesosphere and can be found on GitHub @tnachen.

The Mysteries of Big Data and the Orient … DB

Julian Ursell
30 Jun 2014
4 min read
Mapping the world of big data must be a lot like demystifying the antiquated concept of the Orient, trying to decipher a mass of unknowns. With the ever multiplying expanse of data and the natural desire of humans to simultaneously understand it—as soon as possible and in real time—technology is continually evolving to allow us to make sense of it, make connections between it, turn it into actionable insight, and act upon it physically in the real world. It’s a huge enterprise, and you’ve got to imagine with the masses of data collated years before on legacy database systems, without the capacity for the technological insight and analysis we have now, there are relationships within the data that remain undefined—the known unknowns, the unknown knowns, and the known knowns (that Rumsfeld guy was making sense you see?). It's fascinating to think what we might learn from the data we have already collected. There is a burning need these days to break down the mysteries of big data and developers out there are continually thinking of ways we can interpret it, mapping data so that it is intuitive and understandable. The major way developers have reconceptualized data in order to make sense of it is as a network connected tightly together by relationships. The obvious examples are Facebook or LinkedIn, which map out vast networks of people connected by various shared properties, such as education, location, interest, or profession. One way of mapping highly connectable data is by structuring data in the form of a graph, a design that has emerged in recent years as databases have evolved. The main progenitor of this data structure is Neo4j, which is far and away the leader in the field of graph databases, mobilized by a huge number of enterprises working with big data. Neo4j has cornered the market, and it's not hard to see why—it offers a powerful solution with heavy commercial support for enterprise deployments. 
In truth there aren't many alternatives out there, but alternatives exist. OrientDB is a hybrid graph document database that offers the unique flexibility of modeling data in the form of either documents, or graphs, while incorporating object-oriented programming as a way of encapsulating relationships. Again, it's a great example of developers imagining ways in which we can accommodate the myriad of different data types, and relationships that connect it all together. The real mystery of the Orient(DB) however, is the relatively low (visible) adoption of a database that offers both innovation, and reputedly staggering levels of performance (claims are that it can store up to 150,000 records a second). The question isn't just why it hasn't managed to dent a market essentially owned by Neo4j, but why, on its own merits, haven’t more developers opted for the database? The answer may in the end be vaguely related to the commercial drivers—outside of Europe it seems as if OrientDB has struggled to create the kind of traction that would push greater levels of adoption, or perhaps it is related to the considerable development and tuning of the project for use in production. Related to that, maybe OrientDB still has a way to go in terms of enterprise grade support for production. For sure it's hard to say what the deciding factor is here. In many ways it’s a simple reiteration of the level of difficulty facing startups and new technologies endeavoring to acquire adoption, and that the road to this goal is typically a long one. Regardless, what both Neo4j and OrientDB are valuable for is adapting both familiar and unfamiliar programming concepts in order to reimagine the way we represent, model, and interpret connections in data, mapping the information of the world.

Know Your Customer: Envisaging customer sentiments using Behavioral Analytics

Sugandha Lahoti
13 Nov 2017
6 min read
“All the world’s a stage, and all the men and women merely players.” Shakespeare may have considered men and women mere players, but as a large number of users connect through smart devices and the online world, these men and women, your customers, become your most important assets. Knowing your customer and envisaging their sentiments using behavioral analytics has therefore become paramount.

Behavioral analytics: Tracking user events

Say you order a pizza through an app on your phone. After customizing the crust size, type, and ingredients, you land in the payment section. Suppose, instead of paying, you abandon the order altogether. Immediately you get an SMS and an email alerting you that you are just a step away from buying your choice of pizza. So how does this happen? Behavioral analytics is running in the background: by tracking user navigation, it prompts the user to complete an order, or offers a suggestion. The rise of smart devices has enabled almost everything to transmit data. Most of this data is captured during sessions of user activity and is in raw form. By user activity we mean social media interactions, time spent on a site, navigation paths, click activity, responses to changes in the market, purchase history, and much more. Some form of understanding is therefore required to make sense of this raw, scrambled data and extract definite patterns. Here’s where behavioral analytics steps in. It goes through a user's entire e-commerce journey and focuses on understanding the what and how of their activities. Based on this, it predicts their future moves, which in turn generates opportunities for businesses to become more customer-centric.

Why behavioral analytics over traditional analytics?

Traditional analytics tools lacked a single architecture and a simple workflow. Although they helped with tracking clicks and page loads, they required a separate data warehouse and visualization tools.
The result was an unstructured workflow. Behavioral analytics goes a step beyond standard analytics by combining rule-based models with deep machine learning: where the former tells you what users do, the latter reveals the how and why of their actions. It keeps track of where customers click, which pages they view, how many continue down the process, who abandons the site at which step, and more. Unlike traditional analytics, behavioral analytics aggregates data from diverse sources (websites, mobile apps, CRM, email marketing campaigns, and so on) collected across sessions. Cloud-based behavioral analytics platforms can intelligently integrate and unify all these sources of digital communication into a complete picture, offering a seamless, structured view of the entire customer journey. Such platforms typically capture real-time data in raw format, then automatically filter and aggregate it into a structured dataset. They also provide visualization tools to observe this data while predicting trends. The data is aggregated in a way that lets the business query it in an unlimited number of ways, which makes these platforms helpful for analyzing retention and churn trends, tracing anomalies, performing multidimensional funnel analysis, and much more. Let’s look at some specific use cases across industries where behavioral analytics is heavily used.

Analysing customer behavior in e-commerce

E-commerce platforms are at the top of the list of sectors that can benefit from mapping their digital customer journey. Analytics can track whether a customer spends more time on product page X than on product page Y by displaying views and data points of customer activity in a structured format. This enables businesses to resolve issues that may hinder a page’s popularity, such as slow-loading pages or overpriced products.
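As an aside, the funnel analysis described above can be sketched with a few lines of code. This is a toy example with invented users and event names, not any platform's actual implementation:

```python
# A minimal multi-step funnel computed from raw event logs.
# Each tuple is (user_id, event_name), in chronological order.
from collections import defaultdict

events = [
    ("u1", "view_product"), ("u1", "add_to_cart"), ("u1", "checkout"),
    ("u2", "view_product"), ("u2", "add_to_cart"),
    ("u3", "view_product"),
    ("u4", "view_product"), ("u4", "add_to_cart"), ("u4", "checkout"),
]

def funnel(events, steps):
    """Count users who reached each step of the funnel, in order."""
    seen = defaultdict(set)   # step name -> set of users who reached it
    progress = {}             # user -> index of the next expected step
    for user, event in events:
        idx = progress.get(user, 0)
        if idx < len(steps) and event == steps[idx]:
            seen[steps[idx]].add(user)
            progress[user] = idx + 1
    return [(step, len(seen[step])) for step in steps]

steps = ["view_product", "add_to_cart", "checkout"]
for step, count in funnel(events, steps):
    print(f"{step}: {count}")
# view_product: 4, add_to_cart: 3, checkout: 2
```

From the step counts it is a one-line calculation to see where users drop off, for example that half of the viewers never reach checkout in this invented data.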
By tracking a user session from the moment they enter the platform to the point a sale is made, behavioral analytics predicts future customer behavior and business trends. Parameters considered include the number of customers viewing reviews and ratings before adding an item to their cart, which similar products a customer sees, how often items in the cart are added or removed, and so on. Behavioral analytics can also identify top-performing products and help build powerful recommendation engines by analyzing changes in customer behavior across demographic and regional differences. This helps achieve customer-to-customer personalization. KISSmetrics is a powerful analytics tool that provides detailed customer behavior reports for businesses to slice through and find meaningful insights. RetentionGrid provides color-coded visualizations along with multiple strategies tailor-made for customers, based on customer segmentation and demographics.

How can online gaming benefit from behavioral analysis?

Online gaming is a surging community with millions of daily active users. Marketers are always looking for ways to acquire and retain users. Monetization is another important focal point: not only getting more users to play, but also to pay. Behavioral analytics keeps track of a user’s gaming sessions, including skill levels, time spent at different stages, favorite features and activities within gameplay, and drop-off points. At an overall level, it tracks active users, game logs, demographic data, and social interactions between players across community channels. From this data, a visualization graph is generated that can drive marketing strategies, such as identifying features that work, how to attract additional players, or how to keep existing players engaged.
This helps increase player retention and assists game developers and marketers in shipping new versions based on players’ reactions. Behavioral analytics can also identify common characteristics of users: what keeps a user playing longer, and which groups of users, based on shared characteristics, are most likely to pay. All of this helps gaming companies serve the right advertising and content placement to their users. Mr Green’s casino launched a Green Gaming tool to predict a person’s playing behavior; based on a gamer’s risk-taking behavior, it generates personalized insights about their gaming. Nektan PLC has partnered with machine-learning customer insights firm Newlette, whose models analyze player behavior based on individual playing styles, helping increase player engagement and reduce bonus costs by giving players optimal offers and bonuses. The applications of behavioral analytics are not limited to e-commerce or gaming alone. The security and surveillance domain uses behavioral analytics to conduct risk assessments of organizational resources and to flag individual entities that pose a potential threat, by sifting through large amounts of company data and identifying patterns that signal irregularity or change. End-to-end customer monitoring also helps app developers track customer adoption of new features, and can report the exact point where customers drop off, helping avoid expensive technical issues. All these benefits highlight how tracking customers and knowing user behavior is an essential tool for driving a business forward. As Leo Burnett, the founder of a prominent advertising agency, said: “What helps people, helps business.”

Cyber Monday Special: Can Walmart beat Amazon in its own game? Probably, if they go full-throttle AI

Savia Lobo
24 Nov 2017
7 min read
Long gone are the days when people would explore one store after another to buy that beautiful pink dress or a particular pair of shoes. The e-commerce revolution has drastically boosted online shopping. How many times have we heard that physical stores are dying? And yet they seem to have a cat-like hold on their lifespan. Wonder why? Because not everyone likes shopping online. A group of people still prefer to buy from brick-and-mortar structures; like doubting Thomas, they follow the touch-and-believe principle. For customers who love shopping physically in a store, retailers strive to create a fascinating shopping experience in a way that an online platform cannot offer. This is especially important for increasing sales and generating profits during peak festive seasons such as Black Friday or New Year’s. A lot has been said about the wonders retail analytics can do for e-commerce sites (read this article, for instance), but not as much about traditional stores. So here we have listed ten retail analytics options for offline retailers to capture maximum customer attention and retention.

1. Location analytics and proximity marketing

A large number of retail stores collect data to analyze the volume of customers buying online and offline. They use this data for internal retail analytics, which helps them track merchandise, adjust staffing levels, monitor promotions, and so on. Retailers benefit from location analytics by detecting sections of the store with high customer traffic. Proximity marketing uses location-based technology to communicate with customers through their smartphones. Customers receive targeted offers and discounts based on their proximity to a product, for instance, 20% off the floral dress to your right. Such on-the-go deals have a higher likelihood of resulting in a sale.
Euclid Analytics provides solutions to track the buying experience of every visitor in the store. This helps retailers retarget and rethink their strategies to influence sales at the individual customer level.

2. Music systems

Nowadays, most large retail formats have music systems set up for customers. A playlist with a mix of genres is an ideal fit for the variety of customers visiting the store. Retailers who get the tactic right, with the correct tempo, volume, and genre, uplift the customer’s mood and encourage purchases. They also have to keep the choice of playlist in mind, as music preferences differ with generation, behavior, and the mood of the customer. Store owners opt for music services with customized playlists to create an influential buying factor. Atmos select provides a customized music service to retailers: it takes customer demographics into consideration to draft an audio branding strategy, which is then used to design audio solutions for retail outlets and stores.

3. Guest WiFi

Guest WiFi benefits customers by giving them a free internet connection while they shop. Who would not want that? But it is not only customers who benefit: an in-store WiFi network provides retailers with detailed customer analytics and enables them to track various shopping patterns. Cloud4Wi offers Volare guest Wi-Fi, which provides free WiFi to customers within the retail store, with a faster and easier login flow. It also collects customer data so retailers can build unique, selective marketing lists.

4. Workforce tools

Unison among staff members creates positivity in the store. To improve communication between staff, workforce tools are put to use: messaging applications and work-planner platforms that help maintain rapport among staff members.
These tools empower employees to manage their work lives and check overtime details, attendance, and more. Branch, a tool for improving workforce productivity, supports internal messaging networks and notifies employees about their shift timings and other details.

5. Omnichannel retail analytics

Omnichannel retail gives customers an interactive, seamless shopping experience across platforms. Additionally, with data collected from different digital channels, retailers get an overview of the customer’s shopping journey and the choices they made over time. Omnichannel analytics also lets them show personalized shopping ads based on a customer’s social media habits. Intel offers omnichannel analytics solutions that help retailers increase customer loyalty and generate substantial revenue growth.

6. Dressing room technology

The mirror in the trial room knows it all! Retailers can attract more customer traffic with mirror technology: an interactive, touch-screen mirror that allows customers to request new items and adjust the lights in the trial room. The mirror can also sense products the customer brings in, using RFID technology, and recommend similar products. It assists customers in saving products to their online accounts, in case they decide to purchase them later, or in digitally requesting assistance from a store associate. Oak Labs has created one such mirror, transforming the trial-room experience while bridging the gap between technology and retail.

7. Pop-ups and kiosks

Pop-ups are mini-outlets for large retail formats, set up to sell seasonal products, whereas kiosks are temporary alternatives for retailers to attract higher footfall in store. Both pop-ups and kiosks give shoppers the option of self-service: customers can shop from the store’s physical as well as online product offerings, purchase securely, and have orders delivered to their doorstep.
Such techniques encourage customers to choose retail shopping over online shopping. Withme is a startup that offers a platform for setting up pop-ups for retail outlets and brands.

8. Inventory management

Managing inventory is a major task for a store manager: placing the right product in the right place at the right time. Predictive analytics helps optimize inventory management for proper allocation and replenishment. It also equips retailers to mark down inventory for clearance in order to make room for a new batch. Celect, an inventory management startup, helps retailers analyze customer preferences and simultaneously map future demand for a product. It also helps extract meaningful insights from existing inventory data, which can then be used to sell inventory faster and produce detailed, retail-analytics-based sales reports.

9. Smart receipts and ratings

Retailers continuously aim to provide better service, and receiving a five-star rating in return is the cherry on the cake. For higher customer engagement, retailers offer smart receipts, which let them collect customer email addresses to send promotional offers or festive discounts, along with personalized offerings and incentives to encourage customers to revisit. To learn how well they have fared, retailers set up digital kiosks at the checkout area, where in-store customers can rate their shopping experience. Startups such as TruRating provide retailers with a rating mechanism for shoppers at checkout, while FlexReceipts helps retailers set up smart receipt applications for customers.

10. Shopping cart tech

Retailers can now provide next-gen shopping carts to their customers: a tablet-equipped shopping cart that guides the customer’s in-store shopping journey.
The tablet uses machine vision to keep track of the shelves as the cart moves through the store. It also displays digital ads promoting each product the cart passes. Focal Systems builds powerful technical assistance for retailers that can give their online counterparts tough competition. Online shopping is convenient, but more often than not we still crave the look and feel of a product and an immersive shopping experience, especially during holidays and festive occasions. That’s the USP of a brick-and-mortar shop. Offline retailers who know their data, and know how to leverage retail analytics using advances in machine learning and retail tech, stand a chance of providing their customers with a shopping experience superior to their online counterparts.
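As a footnote to point 8, the predictive replenishment logic described there can be illustrated with a classic back-of-the-envelope reorder-point formula. The function, the numbers, and the 95% service-level assumption below are a textbook sketch, not any vendor's actual method:

```python
# Reorder point = expected demand during supplier lead time + safety stock.
# All inputs are invented example numbers.
import math

def reorder_point(daily_demand, lead_time_days, demand_std, z=1.65):
    """z = 1.65 targets roughly a 95% service level under a normal
    demand assumption; safety stock buffers demand variability."""
    safety_stock = z * demand_std * math.sqrt(lead_time_days)
    return round(daily_demand * lead_time_days + safety_stock)

# Example: 40 units/day average demand, 5-day supplier lead time,
# standard deviation of daily demand = 8 units.
print(reorder_point(40, 5, 8))  # 230 -> reorder when stock falls to 230
```

With zero demand variability the safety stock term vanishes and the reorder point is just lead-time demand, which is a useful sanity check on the formula.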

Handpicked for your Weekend Reading - 15th Dec, 2017

Aarthi Kumaraswamy
16 Dec 2017
2 min read
As you gear up for the holiday season and the year-end celebrations, make a resolution to spend a fraction of your weekends in self-reflection and in honing your skills for the coming year. Here is the best of the DataHub for your reading this weekend. Watch out for our year-end special edition in the last week of 2017!

NIPS Special Coverage

- A deep dive into Deep Bayesian and Bayesian Deep Learning with Yee Whye Teh
- How machine learning for genomics is bridging the gap between research and clinical trial success by Brendan Frey
- 6 Key Challenges in Deep Learning for Robotics by Pieter Abbeel

For the complete coverage, visit here.

Experts in Focus

- Ganapati Hegde and Kaushik Solanki, Qlik experts from Predoole Analytics, on how Qlik Sense is driving self-service Business Intelligence

3 things you should know that happened this week

- Generative Adversarial Networks: Google open sources TensorFlow-GAN (TFGAN)
- “The future is quantum” — Are you excited to write your first quantum computing code using Microsoft’s Q#?
- “The Blockchain to Fix All Blockchains” – Overledger, the meta blockchain, will connect all existing blockchains

Try learning/exploring these tutorials this weekend

- Implementing a simple Generative Adversarial Network (GANs)
- How Google’s MapReduce works and why it matters for Big Data projects
- How to write effective Stored Procedures in PostgreSQL
- How to build a cold-start friendly content-based recommender using Apache Spark SQL

Do you agree with these insights/opinions?

- Deep Learning is all set to revolutionize the music industry
- 5 reasons to learn Generative Adversarial Networks (GANs) in 2018
- CapsNet: Are Capsule networks the antidote for CNNs kryptonite?
- How AI is transforming the manufacturing Industry

Stitch Fix: Full Stack Data Science and other winning strategies

Aaron Lazar
05 Dec 2017
8 min read
Last week, a company in San Francisco was popping bottles of champagne to celebrate its achievements. And trust me, they’re not small at all. Barely a couple of weeks after being listed on the stock market, its stock has soared over 50%. Stitch Fix is an apparel company run by co-founder and CEO Katrina Lake. In a span of just six years, she has built a company with an annual revenue of roughly $977 million. The company has been disrupting traditional retail and aims to bridge the gap of personalized shopping that the former can’t accomplish. Stitch Fix is more of a personalized stylist than a traditional apparel company. It works in three basic steps:

Filling a style profile: Clients fill out a style profile, where they share their style, price, and size preferences.

Setting a delivery date: Clients set a delivery date as per their availability. Stitch Fix mixes and matches clothes from its warehouses and comes up with the top five items it feels would best suit the client, based on the initial style profile as well as years of styling experience.

Keep or send back: The clothes reach the customer on the selected date; the customer tries them on, keeps what they like, and sends back what they don’t.

The aim of Stitch Fix is to bring a personal touch to clothes shopping. According to Lake, “There are millions and millions of products out there. You can look at eBay and Amazon. You can look at every product on the planet, but trying to figure out which one is best for you is really the challenge”, and that’s the tear Stitch Fix aims to sew up. In an interview with eMarketer, Julie Bornstein, COO of Stitch Fix, said, “Over a third of our customers now spend more than half of their apparel wallet share with Stitch Fix. They are replacing their former shopping habits with our service.” So what makes Stitch Fix stand out among its competitors? How do they do it?
You see, Stitch Fix is not just any apparel company. It has created the perfect formula by blending human expertise with just the right amount of data science. The kind of data science Stitch Fix practices involves a relatively new and exciting term that's on the rise: Full Stack Data Science.

Hello Full Stack Data Science!

For those of you who’ve heard of this before, cheers! I hope you’ve had the opportunity to experience its benefits. For those who haven’t, Full Stack Data Science basically means a single data scientist does the work end to end: mining data, cleaning it, writing an algorithm to model it, and visualizing the results, while also stepping into the shoes of an engineer (implementing the model) and a project manager (tracking the entire process and keeping it on track). While this might sound like a lot for one person to do, it's quite possible and practical. It's practical because when these roles are performed by different individuals, they introduce a lot of latency into the project. Moreover, synchronizing the priorities of each individual is close to impossible, which creates friction within the team. The data (science) team at Stitch Fix is broadly categorized by the area each member works on. Because most of the team works full stack, there are over 80 data scientists on board. That's a lot of smart people in one company! On a serious note, although unique, this kind of team structure has been doing well for them, mainly because it gives each person the freedom to work independently.

Tech Treasure Trove

When you open up Stitch Fix's tech toolbox, you won't find Aladdin's lamp glowing before you. Their magic lies in having a simple tech stack that works wonders when implemented the right way. They work with Ruby on Rails and Bootstrap for their web applications, which are hosted on Heroku.
Their data platform relies on a robust Postgres implementation. Among programming languages, Python, Go, Java, and JavaScript are also in use. As for an ML framework, we're pretty sure they're playing with TensorFlow. But just working with these tools isn't enough to reach the level they're at; there's something more under the hood. And believe it or not, it's not some gigantic artificially intelligent system running on a zillion cores. Rather, it's all about the smaller, simpler things. For example, if you have three different kinds of data and you need to find a relationship between them, instead of bringing in the big guns (read: deep learning frameworks), a simple tensor decomposition using word vectors can do the deed quite well.

Advantages galore: Food for the algorithms

One of the main advantages Stitch Fix has is almost five years' worth of client data, obtained in several ways: through the client profile, after-delivery feedback, Pinterest photos, and so on. All this data is put through algorithms that learn more about the likes and dislikes of clients. Interesting algorithms feeding on this sumptuous data include collaborative filtering recommenders to group clients by their likes, mixed-effects models to learn about a client's interests over time, neural networks to derive vector descriptions of Pinterest images and compare them with in-house designs, NLP to process customer feedback, and Markov chain models to predict demand, among several others.

A human touch: When science meets art

While the machines do all the calculations and come up with recommendations on which designs customers would appreciate, they still lack the human touch. Stitch Fix employs over 3,000 stylists. Each client is assigned a stylist who can see the client's entire set of preferences at a glance in a custom-built interface.
The stylist finalizes the selections from the inventory list, adding a personal note that describes how the client can accessorize the purchased items for a particular occasion and how to pair them with other pieces in their closet. This truly advocates the idea that “humans are much better with the machines, and the machines are much better with the humans”. Cool, ain't it?

Data Platform

Apart from the Heroku platform, Stitch Fix appears to have internal SaaS platforms where the data scientists carry out analysis, write algorithms, and put them into production. These platforms provide data distribution, parallelization, auto-scaling, failover, and so on, letting the data scientists focus on the science while still enjoying the benefits of a scalable system.

The good, the bad and the ugly: Microservices, monoliths and scalability

Scalability is one of the most important aspects a new company needs to consider before taking the plunge. A microservice architecture helps here, by allowing small, independent services/mini applications to run on their own. Stitch Fix uses this architecture to improve scalability, although their database is a monolith; they are now breaking that monolith up into microservices. This is a takeaway for any entrepreneur just starting out with their app.

Data-driven applications

Data-driven applications ensure that the right solutions are built for customers. If you're a customer-centric organisation, there's something you can learn from Stitch Fix: data-driven apps seamlessly combine the operational and analytic capabilities of the organisation, breaking down the traditional silos.

TDD + CD = DevOps simplified

Test-driven development and continuous delivery go hand in hand, and it's always better to imbibe this culture right from the very start. In the end, it's really great to see such creative and technologically driven start-ups succeed and sail to the top.
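As a side note, the collaborative filtering recommenders mentioned earlier can be sketched in a few lines. This is a toy cosine-similarity example with invented clients, items, and ratings, and is certainly not Stitch Fix's actual implementation:

```python
# Tiny collaborative filtering sketch: find the client whose tastes are
# closest to a target client's, via cosine similarity over shared items.
# All names and ratings are invented for illustration.
from math import sqrt

ratings = {  # client -> {item: rating}
    "ana":  {"floral_dress": 5, "denim_jacket": 4, "silk_scarf": 1},
    "ben":  {"floral_dress": 4, "denim_jacket": 5},
    "cara": {"silk_scarf": 5, "denim_jacket": 1},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (sqrt(sum(x * x for x in u.values()))
                  * sqrt(sum(x * x for x in v.values())))

def most_similar_client(target):
    """Return the other client with the highest taste similarity."""
    others = [(c, cosine(ratings[target], r))
              for c, r in ratings.items() if c != target]
    return max(others, key=lambda pair: pair[1])[0]

print(most_similar_client("ana"))  # ben
```

Once a similar client is found, items they rated highly but the target hasn't seen become recommendation candidates; production systems add normalization, implicit feedback, and far larger matrices on top of this same core idea.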
If you're on the journey to building that dream startup of yours and you need resources for your team, here are a few books you'll want to pick up to get started:

Hands-On Data Science and Python Machine Learning by Frank Kane
Data Science Algorithms in a Week by Dávid Natingga
Continuous Delivery and DevOps: A Quickstart Guide - Second Edition by Paul Swartout
Practical DevOps by Joakim Verona
Akram Hussein
30 Dec 2014
5 min read
What Did Big Data Deliver In 2014?

Big Data has always been a hot topic, and in 2014 it came into play. 'Big data' has developed, evolved, and matured to give significant value to business intelligence. However, there is so much more to big data than meets the eye. Understanding enormous amounts of unstructured data is not easy by any means; yet once that data is analysed and understood, organisations have started to value its importance and need. 'Big data' has helped create a number of opportunities, ranging from new platforms, tools, and technologies, to improved economic performance in different industries, through the development of specialist skills, job creation, and business growth. Let's do a quick recap of 2014 and what big data has offered the tech world from the perspective of a tech publisher.

Data Science

The term 'data science' has admittedly been around for some time, yet in 2014 it received a lot more attention thanks to the demands created by big data. Looking at data science from a tech publisher's point of view, it's a concept which has been rapidly adopted, with potential for greater levels of investment and growth. To address the needs of big data, data science has been split into four key categories: data mining, data analysis, data visualization, and machine learning. Equally, we have important topics which fit in between those, such as data cleaning (munging), which I believe takes up the majority of a data scientist's time. The number of jobs for data scientists has exploded in recent times and will continue to do so; according to global management firm McKinsey & Company, there will be a shortage of 140,000 to 190,000 data scientists due to the continued rise of big data. Data science has also been described as the 'sexiest job of the 21st century'.

Real-time Analytics

The competitive battle in big data throughout 2014 focused on how fast data could be streamed to achieve real-time performance.
Real-time analytics' most important feature is instant access: the ability to query data as soon as it comes through. The concept is applicable to different industries and supports the growth of new technologies and ideas. Live analytics are especially valuable to social media sites and marketers, providing actionable intelligence. Likewise, real-time data is becoming increasingly important with the phenomenon known as 'the Internet of Things'. The ability to make decisions instantly and plan outcomes in real time is more feasible now than ever before, thanks to the development of technologies like Spark and Storm, and NoSQL databases like Apache Cassandra, which enable organisations to retrieve data rapidly and deliver fault-tolerant performance.

Deep Learning

Machine learning (ML) became the new black and is in constant demand by many organisations, especially new startups. However, even though machine learning is gaining adoption and an improved appreciation of its value, deep learning is the concept that really pushed on in 2014. Now, granted, both ML and deep learning might have been around for some time; we are looking at the topics in terms of current popularity levels and adoption in tech publishing. Deep learning is a subset of machine learning which refers to the use of artificial neural networks composed of many layers. The idea is based around a set of techniques for extracting information to generate greater accuracy in data and results. The value gained from deep learning is that its hierarchical data models help AI systems move towards greater efficiency and accuracy, learning to recognize and extract information by themselves, unsupervised! The popularity around deep learning has seen large organisations invest heavily, such as Google's acquisition of DeepMind for $400 million and Twitter's purchase of Madbits; these are just a few of the high-profile investments amongst many. Watch this space in 2015!
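To make "composed of many layers" concrete, here is a minimal sketch of a forward pass through a small multi-layer network using NumPy. The layer sizes and random weights are arbitrary illustrations; a real network would be trained with backpropagation rather than left with random weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    """Simple nonlinearity applied between layers."""
    return np.maximum(0, x)

# A deep network is just several (linear transform + nonlinearity) layers
# stacked on top of each other; each layer builds a higher-level representation.
layer_sizes = [8, 16, 16, 4]          # input -> two hidden layers -> output
weights = [rng.normal(0, 0.1, (m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, weights):
    """Forward pass through the stack of layers."""
    for w in weights[:-1]:
        x = relu(x @ w)
    return x @ weights[-1]            # final layer left linear

out = forward(rng.normal(size=(3, 8)), weights)
print(out.shape)                      # (3, 4): one 4-dim output per input row
```

Adding more hidden layers to `layer_sizes` is all it takes to make the network "deeper"; the training procedure, not shown here, is where the real complexity lives.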
New Hadoop and Data platforms

Hadoop, the technology best associated with big data, changed its batch-processing engine from MapReduce to what's better known as YARN towards the end of 2013 with Hadoop v2. MapReduce demonstrated the value and benefits of large-scale, distributed processing. However, as big data demands increased and flexibility, multiple data models, and visual tools became requirements, Hadoop introduced YARN to address these problems. YARN stands for 'Yet Another Resource Negotiator'. In 2014, the emergence and adoption of YARN allowed users to carry out multiple workloads, such as streaming, real-time, and generic distributed applications of any kind (YARN handles and supervises their execution!), alongside the MapReduce model. The biggest trend I've seen with the change in Hadoop in 2014 would be the transition from MapReduce to YARN. The real value in big data and data platforms is the analytics, and in my opinion that will be the primary point of focus and improvement in 2015.

Rise of NoSQL

NoSQL, also interpreted as 'Not Only SQL', exploded in 2014, with a wide variety of databases coming to maturity. NoSQL databases have grown in popularity thanks to big data. There are many ways to look at stored data, but it is very difficult to process, manage, store, and query huge sets of messy, complex, and unstructured data. Traditional SQL systems just wouldn't allow that, so NoSQL was created to offer a way to look at data with no restrictive schemas. The emergence of 'graph', 'document', 'wide column', and 'key-value store' databases has shown no slowdown, and the growth continues to attract a higher level of adoption. However, NoSQL seems to be taking shape and settling on a few major players such as Neo4j, MongoDB, and Cassandra. Whatever 2015 brings, I am sure it will be faster, bigger, and better!
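To see what "no restrictive schemas" means in practice, here's a toy in-memory document store in Python. The documents, field names, and `find` helper are invented for illustration; a real document database such as MongoDB offers a far richer query language, but the schemaless idea is the same: documents in one collection need not share the same fields.

```python
# Toy in-memory document store: unlike rows in a relational table,
# these documents do not share a fixed schema.
docs = [
    {"_id": 1, "name": "Alice", "interests": ["graphs", "ml"]},
    {"_id": 2, "name": "Bob", "city": "Leeds"},                # no interests field
    {"_id": 3, "name": "Cara", "interests": ["ml"], "age": 31},
]

def find(docs, **criteria):
    """Return documents whose fields match all criteria;
    a document missing a field simply never matches it."""
    def matches(doc):
        for key, want in criteria.items():
            have = doc.get(key)
            if isinstance(have, list):
                if want not in have:       # list fields match by membership
                    return False
            elif have != want:
                return False
        return True
    return [d for d in docs if matches(d)]

print([d["name"] for d in find(docs, interests="ml")])   # ['Alice', 'Cara']
```

Adding a new attribute to one document requires no migration of the others, which is exactly the flexibility that made document stores attractive for messy, evolving data.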
Timothy Chen
24 Dec 2015
4 min read
Level Up Your Company's Big Data With Resource Management

Big data was once one of the biggest technology hypes: tons of presentations and posts talked about how new systems and tools allowed large and complex data to be processed in ways that traditional tools couldn't manage. While big data was at the peak of its hype, most companies were still getting familiar with new data-processing frameworks such as Hadoop, and new databases such as HBase and Cassandra. Fast forward to now: big data is still a popular topic, lots of companies have already jumped on the big data bandwagon, and many are moving past first-generation Hadoop to evaluate newer tools such as Spark and newer databases such as Firebase, NuoDB, or MemSQL. But most companies have also learned from running all of these tools that deploying, operating, and capacity planning for them is very hard and complicated. Although over time lots of these tools have become more mature, they usually still run in their own independent clusters. It's also not rare to find multiple Hadoop clusters in the same company, since multi-tenancy isn't built into many of these tools and you run the risk of a few non-critical big data jobs overloading the cluster.

Problems running independent big data clusters

There are a lot of problems when you run many of these independent clusters. One of them is monitoring and visibility: all of these clusters have their own management tools, and integrating them with the company's shared monitoring and management tools is a huge challenge, especially when onboarding yet another framework with another cluster. Another problem is multi-tenancy. Having independent clusters keeps one organization's job from overtaking everyone else's, but it still doesn't solve the problem of a bug in a Hadoop application simply using all the available resources, and the pain of debugging this is horrific.
Another problem is utilization: a cluster is usually not 100% utilized, and all of those instances running in Amazon or in your datacenter are just racking up bills while doing no work. There are more major pain points that I don't have time to get into.

Hadoop v2

The Hadoop developers and operators saw this problem, and in the second generation of Hadoop they developed a separate resource management tool called YARN: a single management framework that manages all of the resources in the cluster for Hadoop, enforces the resource limits of jobs, integrates security into the workload, and even optimizes the workload by automatically placing jobs closer to the data. This solves a huge problem in operating a Hadoop cluster, and it also consolidates multiple Hadoop clusters into one, since it allows finer-grained control over the workload and improves the efficiency of the cluster.

Beyond Hadoop

Now, with the vast number of big data technologies growing in the ecosystem, there is a need for a common resource management layer across all of the tools; without a single resource management system spanning all the frameworks, we run back into the same problems mentioned before. Also, when all these frameworks run under one resource management platform, a lot of options for optimization and resource scheduling become possible. Here are some examples of what could be possible with one resource management platform: The platform can understand the entire cluster workload and the available resources, and can automatically resize and scale up and down based on workloads across all these tools. It can also resize jobs according to priority. The cluster can detect underutilization from other jobs and offer the slack resources to Spark batch jobs without impacting your very important workloads from other frameworks, maintaining the same business deadlines and saving a lot more cost.
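The slack-resources idea above can be sketched as a toy two-tier scheduler. The cluster size, service reservations, and job queue below are made-up numbers, and real systems such as YARN or Mesos negotiate resources dynamically, but the principle of handing leftover capacity to lower-priority batch work is the same.

```python
# Toy two-tier scheduler: high-priority services reserve capacity up front;
# leftover ("slack") capacity is offered to batch jobs. Purely illustrative.
CLUSTER_CPUS = 64

services = {"web": 20, "db": 16}                     # reserved, high priority
batch_queue = [("spark-etl", 24), ("report", 12), ("backfill", 8)]

def schedule(cluster_cpus, services, batch_queue):
    """Place batch jobs greedily into whatever the services leave unused."""
    slack = cluster_cpus - sum(services.values())
    placed = []
    for job, cpus in batch_queue:
        if cpus <= slack:                            # batch work only uses slack
            placed.append(job)
            slack -= cpus
    return placed, slack

placed, remaining = schedule(CLUSTER_CPUS, services, batch_queue)
print(placed, remaining)                             # ['spark-etl'] 4
```

Because batch jobs never dip into the services' reservation, the important workloads keep their guarantees while otherwise-idle CPUs stop racking up bills for no work.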
In the next post I'll cover Mesos, one such resource management system, and how upcoming features in Mesos make the optimizations I mentioned possible. For more big data tutorials and analysis, visit our dedicated Hadoop and Spark pages.

About the author

Timothy Chen is a distributed systems engineer and entrepreneur. He works at Mesosphere and can be found on GitHub as @tnachen.
Akram Hussain
16 Dec 2014
4 min read
Big Data Is More Than Just a Buzz Word!

We all agree big data sounds cool (well, I think it does!), but what is it? Put simply, big data is the term used to describe massive volumes of data. We are thinking of data along the lines of "terabytes," "petabytes," and "exabytes" in size. In my opinion, that's as simple as it gets when thinking about the term "big data." Despite this simplicity, big data has been one of the hottest, and most misunderstood, terms in the business intelligence industry recently; every manager, CEO, and director is demanding it. However, once the realization sets in on just how difficult big data is to implement, they may be scared off! The real reason behind the "buzz" was all the new benefits that organizations could gain from big data. Yet many overlooked the difficulties involved, such as:

How do you get that level of data?
If you do, what do you do with it?
Cultural change is involved, and most decisions would be driven by data. No decision would be made without it.
The cost and skills required to implement and benefit from big data.

The concept was misunderstood initially; organisations wanted data but failed to understand what they wanted it for and why, even though they were happy to go on the chase.

Where did the buzz start?

I truly believe Hadoop is what gave big data its fame. Initially developed at Yahoo for in-house use and later open sourced as an Apache project, Hadoop served a true market need for large-scale storage and analytics. Hadoop is so well linked to big data that it's become natural to think of the two together. The graphic above demonstrates the similarities in how often people searched for the two terms. There's a visible correlation (if not causation). I would argue that "buzz words" in general (or trends) don't take off before the technology that allows them to exist does. If we consider buzz words like "responsive web design," they needed the correct CSS rules; "IoT" needed Arduino and Raspberry Pi; and likewise, "big data" needed Hadoop.
Hadoop was on the rise before big data had taken off, which supports my theory. Platforms like Hadoop allowed businesses to collect more data than they could have conceived of a few years ago. Big data grew as a buzz word because the technology supported it.

After the data comes the analysis

However, the issue still remains of collecting data with no real purpose, which ultimately yields very little in return; in short, you need to know what you want and what your end goal is. This is something organisations are slowly starting to realize and appreciate, represented well by Gartner's 2014 Hype Cycle. Big data is currently in the "Trough of Disillusionment," which I like to describe as "the morning after the night before." This basically means that realisation is setting in: the excitement and buzz of big data have come down to something akin to shame and regret. The true value of big data can be categorised into three areas: data types, speed, and reliance. By this we mean: the larger the data, the more difficult it becomes to manage the types of data collected; that is, it will be messy, unstructured, and complex. The speed of analytics is crucial to growth and on-demand expectations. Likewise, a reliable infrastructure is at the core of sustainable efficiency. Big data's actual value lies in processing and analyzing complex data to help discover, identify, and make better-informed, data-driven decisions. Likewise, big data can offer clear insight into strengths, weaknesses, and areas of improvement by discovering crucial patterns for success and growth. However, this comes at a cost, as mentioned earlier.

What does this mean for big data?

I envisage that the invisible hand of big data will be ever present. Even though devices are getting smaller, data is increasing at a rapid rate. When the true meaning of big data is appreciated, it will genuinely turn from a buzz word into a commitment that smaller organisations might become reluctant to take on.
In order to implement big data, they will need to appreciate the need for structural change, the costs involved, the skill levels required, and an overall shift towards a data-driven culture. To gain maximum efficiency from big data, and to appreciate that it's more than a buzz word, organizations will have to be very agile and accept the risks that come with that level of change.