Data Analysis | 0 articles | Tech News, Tutorials & Expert Insights

article-image-data-science-for-non-techies-how-i-got-started

20 Jul 2018

7 min read

Data science for non-techies: How I got started (Part 1)

20 Jul 2018

As a category manager, I manage the data science portfolio of product ideas for Packt Publishing, a leading tech publisher. In simple terms, I place informed bets on where to invest, what topics to publish on etc. While I have a decent idea of where the industry is heading and what data professionals are looking forward to learn and why etc, it is high time I walked in their shoes for a couple of reasons. Basically, I want to understand the reason behind Data Science being the ‘Sexiest job of the 21st century’, and if the role is really worth all the fame and fortune. In the process, I also wanted to explore the underlying difficulties, challenges and obstacles that every data scientist has had to endure at some point in his/her journey, or still does, maybe. The cherry on top, is that I get to use the skills I develop, to supercharge my success in my current role that is primarily insight-driven. This is the first of a series of posts on how I got started with Data Science. Today, I’m sharing my experience with devising a learning path and then gathering appropriate learning resources. Devising a learning path To understand the concepts of data science, I had to research a lot. There are tons and tons of resources out there, many of which are very good. Once you seperate the good from the rest, it can be quite intimidating to pick the options that suit you the best. Some of the primary questions that clouded my mind were: What should be my programming language of choice? R or Python? Or something else? What tools and frameworks do I need to learn? What about the statistics and mathematical aspects of machine learning? How essential are they? Two videos really helped me find the answers to the questions above: If you don’t want to spend a lot of your time mastering the art of data science, there’s a beautiful video on how to become a data scientist in six months What are the questions asked in a data science interview? What are the in-demand skills that you need to master in order to get a data science job? This video on 5 Tips For Getting a Data Science Job really is helpful. After a lot of research that included reading countless articles and blogs and discussions with experts, here is my learning plan: Learn Python Per the recently conducted Stack Overflow Developer Survey 2018, Python stood out as the most-wanted programming language, meaning the developers who do not use it yet want to learn it the most. As one of the most widely used general-purpose programming languages, Python finds large applications when it comes to data science. Naturally, you get attracted to the best option available, and Python was the one for me. The major reasons why I chose to learn Python over the other programming languages: Very easy to learn: Python is one of the easiest programming languages to learn. Not only is the syntax clean and easy to understand, even the most complex of data science tasks can be done in a few lines of Python code. Efficient libraries for Data Science: Python has a vast array of libraries suited for various data science tasks, from scraping data to visualizing and manipulating it. NumPy, SciPy, pandas, matplotlib, Seaborn are some of the libraries worth mentioning here. Python has terrific libraries for machine learning: Learning a framework or a library which makes machine learning easier to perform is very important. Python has libraries such as scikit-learn and Tensorflow that makes machine learning easier and a fun-to-do activity. To make the most of these libraries, it is important to understand the fundamentals of Python. My colleague and good friend Aaron has put out a list of top 7 Python programming books which helped as a brilliant starting point to understand the different resources out there to learn Python. The one book that stood out for me was Learn Python Programming - Second Edition - This is a very good book to start Python programming from scratch. There is also a neat skill-map present on Mapt, where you can progressively build up your knowledge of Python - right from the absolute basics to the most complex concepts. Another handy resource to learn the A-Z of Python is Complete Python Masterclass. This is a slightly long course, but it will take you from the absolute fundamentals to the most advanced aspects of Python programming. Task Status: Ongoing Learn the fundamentals of data manipulation After learning the fundamentals of Python programming, the plan is to head straight to the Python-based libraries for data manipulation, analysis and visualization. Some of the major ones are what we already discussed above, and the plan to learn them is in the following order: NumPy - Used primarily for numerical computing pandas - One of the most popular Python packages for data manipulation and analysis matplotlib - The go-to Python library for data visualization, rivaling the likes of R’s ggplot2 Seaborn - A data visualization library that runs on top of matplotlib used for creating visually appealing charts, plots and histograms Some very good resources to learn about all these libraries: Python Data Analysis Python for Data Science and Machine Learning - This is a very good course with a detailed coverage on the machine learning concepts. Something to learn later. The aim is to learn these libraries upto a fairly intermediate level, and be able to manipulate, analyze and visualize any kind of data, including missing, unstructured data and time-series data. Understand the fundamentals of statistics, linear algebra and probability In order to take a step further and enter into the foray of machine learning, the general consensus is to first understand the maths and statistics behind the concepts of machine learning. Implementing them in Python is relatively easier once you get the math right, and that is what I plan to do. I shortlisted some very good resources for this as well: Statistics for Machine Learning Stanford University - Machine Learning Course at Coursera Task Status: Ongoing Learn Machine Learning (Sounds odd I know) After understanding the math behind machine learning, the next step is to learn how to perform predictive modeling using popular machine learning algorithms such as linear regression, logistic regression, clustering, and more. Using real-world datasets, the plan is to learn the art of building state-of-the-art machine learning models using Python’s very own scikit-learn library, as well as the popular Tensorflow package. To learn how to do this, the courses I mentioned above should come in handy: Stanford University - Machine Learning Course at Coursera Python for Data Science and Machine Learning Python Machine Learning, Second Edition Task Status: To be started [box type="shadow" align="" class="" width=""]During the course of this journey, websites like Stack Overflow and Stack Exchange will be my best friends, along with the popular resources such as YouTube.[/box] As I start this journey, I plan to share my experiences and knowledge with you all. Do you think the learning path looks good? Is there anything else that I should include in my learning path? I would really love to hear your comments, suggestions and experiences. Stay tuned for the next post where I seek answers to questions such as ‘How much of Python should I learn in order to be comfortable with Data Science?’, ‘How much time should I devote per day or week to learn the concepts in Data Science?’ and much more.. Read more Why is data science important? 9 Data Science Myths Debunked 30 common data science terms explained

0
0
25655

article-image-4-misconceptions-about-data-wrangling

Sugandha Lahoti

17 Oct 2018

4 min read

4 misconceptions about data wrangling

Sugandha Lahoti

17 Oct 2018

4 min read

Around 80% of the time in data analysis is spent on cleaning and preparing data for analysis. This is, however, an important task, and is a prerequisite to the rest of the data analysis workflow, including visualization, analysis, and reporting. Although, being an important task given its nature, there are certain myths associated with data wrangling which developers should be cautious of. In this post, we will discuss four such misconceptions. Myth #1: Data wrangling is all about writing SQL query There was a time when data processing needed data to be presented in a relational manner so that SQL queries could be written. Today, there are many other types of data sources in addition to the classic static SQL databases, which can be analyzed. Often, an engineer has to pull data from diverse sources such as web portals, Twitter feeds, sensor fusion streams, police or hospital records. Static SQL query can help only so much in those diverse domains. A programmatic approach, which is flexible enough to interface with myriad sources and is able to parse the raw data through clever algorithmic techniques and use of fundamental data structures (trees, graphs, hash tables, heaps), will be the winner. Myth #2: Knowledge of statistics is not required for data wrangling Quick statistical tests and visualizations are always invaluable to check the ‘quality’ of the data you sourced. These tests can help detect outliers and wrong data entry, without running complex scripts. For effective data wrangling, you don’t need to have knowledge of advanced statistics. However, you must understand basic descriptive statistics and know how to execute them using built-in Python libraries. Myth #3: You have to be a machine learning expert to do great data wrangling Deep knowledge of machine learning is certainly not a pre-requisite for data wrangling. It is true that the end goal of data wrangling is often to prepare the data so that it can be used in a machine learning task downstream. As a data wrangler, you do not have to know all the nitty-gritties of your project’s machine learning pipeline. However, it is always a good idea to talk to the machine learning expert who will use your data and understand the data structure interface and format he/she needs to run the model fast and accurately. Myth #4: Deep knowledge of programming is not required for data wrangling As explained above, the diversity and complexity of data sources require that you are comfortable with deep notions of fundamental data structures and how a programming language paradigm handles them. Increasing deep knowledge of the programming framework (Python for example) will surely help you to come up with innovative methods for dealing with data source interfacing and data cleaning issues. The speed and efficiency of your data processing pipeline can often be benefited from using advanced knowledge of basic algorithms e.g. search, sort, graph traversal, hash table building, etc. Although built-in methods in standard libraries are optimized, having this knowledge gives you an edge for any situation. You read a guest post from Tirthajyoti Sarkar and Shubhadeep Roychowdhury, the authors of Data Wrangling with Python. We hope that these misconceptions would help you realize that data wrangling is not as difficult as it seems. Have fun wrangling data! About the authors Dr. Tirthajyoti Sarkar works as a Sr. Principal Engineer in the semiconductor technology domain where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. Shubhadeep Roychowdhury works as a Sr. Software Engineer at a Paris based Cyber Security startup. He holds a Master Degree in Computer Science from West Bengal University Of Technology and certifications in Machine Learning from Stanford. Don’t forget to check out Data Wrangling with Python to learn the essential basics of data wrangling using Python. 30 common data science terms explained Python, Tensorflow, Excel and more – Data professionals reveal their top tools How to create a strong data science project portfolio that lands you a job

0
0
24998

article-image-top-five-questions-to-ask-when-evaluating-a-data-monitoring-solution

Guest Contributor

27 Oct 2018

6 min read

Messaging app Telegram's updated Privacy Policy is an open challenge

Amarabha Banerjee

08 Sep 2018

7 min read

Social media companies are facing a lot of heat presently because of their privacy issues. One of them is Facebook. The Cambridge analytica scandal had even prompted a senate hearing for Mark Zuckerberg. On the other end of this spectrum, there is another messaging app known as Telegram, registered in London, United Kingdom, founded by the Russian entrepreneur Pavel Durov. Telegram has been in the news for an absolutely opposite situation. It’s often touted as one of the most secure and secretive messaging apps. The end to end encryption ensures that security agencies across the world have a tough time getting access to any suspicious piece of information. For this reason Russia has banned the use of Telegram app on April 2018. Telegram updated their privacy policies on . These updates have further ensured that Telegram will retain the title of the most secure messaging application in the planet. It’s imperative for any messaging app to get access to our data. But how they choose to use it makes you either vulnerable or secure. Telegram in their latest update have stated that they process personal data on the grounds that such processing caters to the following two goals: Providing effective and innovative Services to our users To detect, prevent or otherwise address fraud or security issues in respect of their provision of Services. The caveat for the second point being the security interests shall not override the space of fundamental rights and freedoms that require protection of personal data. This clause is an excellent example on how applications can prove to be a torchbearer for human rights and basic human privacy amidst glaring loopholes. Telegram have listed the the kind of user data accessed by the app. They are as follows: Basic Account Data Telegram stores basic account user data that includes mobile number, profile name, profile picture and about information, which are needed to create a Telegram account. The most interesting part of this is Telegram allows you to only keep your username (if you choose to) public. The people who have you in their contact list will see you as you want them to - for example you might be a John Doe in public, but your mom will still see you as ‘Dear Son’ in their contacts. Telegram doesn’t require your real name, gender, age or even your screen name to be your real name. E-mail Address When you enable 2-step-verification for your account or store documents using the Telegram Passport feature, you can opt to set up a password recovery email. This address will only be used to send you a password recovery code if you forget it. They are particular about not sending any unsolicited marketing emails to you. Personal Messages Cloud Chats Telegram stores messages, photos, videos and documents from your cloud chats on their servers so that you can access your data from any of your devices anytime without having to rely on third-party backups. All data is stored heavily encrypted and the encryption keys in each case are stored in several other data centers in different jurisdictions. This way local engineers or physical intruders cannot get access to user data. Secret Chats Telegram has a feature called Secret chats that uses end-to-end encryption. This means that all data is encrypted with a key that only the sender and the recipients know. There is no way for us or anybody else without direct access to your device to learn what content is being sent in those messages. Telegram does not store ‘secret chats’ on their servers. They also do not keep any logs for messages in secret chats, so after a short period of time there is no way of determining who or when you messaged via secret chats. Secret chats are not available in the cloud — you can only access those messages from the device they were sent to or from. Media in Secret Chats When you send photos, videos or files via secret chats, before being uploaded, each item is encrypted with a separate key, not known to the server. This key and the file’s location are then encrypted again, this time with the secret chat’s key — and sent to your recipient. They can then download and decipher the file. This means that the file is technically on one of Telegram’s servers, but it looks like a piece of random indecipherable garbage to everyone except for you and the recipient. This complete process is random and there random data packets are periodically purged from the storage disks too. Public Chats In addition to private messages, Telegram also supports public channels and public groups. All public chats are cloud chats. Like everything else on Telegram, the data you post in public communities is encrypted, both in storage and in transit — but everything you post in public will be accessible to everyone. Phone Number and Contacts Telegram uses phone numbers as unique identifiers so that it is easy for you to switch from SMS and other messaging apps and retain your social graph. But the most important thing is that permissions from the users are a must before the cookies are allowed into your browser. Cookies Telegram promises that the only cookies they use are those to operate and provide their Services on the web. They clearly state that they don’t use cookies for profiling or advertising. Their cookies are small text files that allow them to provide and customize their Services, and provide an enhanced user experience. Also, whether or not to use these cookies is a choice made by the users. So, how does Telegram remain in business? The Telegram business model doesn’t match that of a revenue generating service. The founder Pavel Durov is also the founder of the popular Russian social networking site VK. Telegram doesn’t charge for any messaging services, it doesn’t show ads yet. Some new in app purchase features might be included in the new version. As of now, the main source of revenue for Telegram are donations and mainly the earnings of Pavel Durov himself (from the social networking site VK). What can social networks learn from Telegram? Telegram’s policies elevate privacy standards that many are asking from other social messaging apps. The clamour for stopping the exploitation of user data, using their location details for targeted marketing and advertising campaigns is increasing now. Telegram shows that privacy can be achieved, if intended, in today’s overexposed social media world. But there is are also costs to this level of user privacy and secrecy, that are sometimes not discussed enough. The ISIS members behind the 2015 Paris attacks used Telegram to spread propaganda. ISIS also used the app to recruit the perpetrators of the Christmas market attack in Berlin last year and claimed credit for the massacre. More recently, a Turkish prosecutor found that the shooter behind the New Year’s Eve attack at the Reina nightclub in Istanbul used Telegram to receive directions for it from an ISIS leader in Raqqa. While these incidents can never negate the need for a secure and less intrusive social media platform like Telegram, there should be workarounds and escape routes designed for stopping extremists and terrorist activities. Telegram have assured that all ISIS messaging channels are deleted from their network which is a great way to start. Content moderation, proactive sentiment and pattern recognition and content/account isolation are the next challenges for Telegram. One thing is for sure, Telegram’s continual pursuance of user secrey and user data privacy is throwing an open challenge to others to follow suite. Whether others will oblige or not, only time will tell. To read about Telegram’s updated privacy policies in detail, you can check out the official Telegram Privacy Settings. How to stay safe while using Social Media Time for Facebook, Twitter and other social media to take responsibility or face regulation What RESTful APIs can do for Cloud, IoT, social media and other emerging technologies

0
0
21171

article-image-can-cryptocurrency-establish-a-new-economic-world-order

Amarabha Banerjee

22 Jul 2018

5 min read

Can Cryptocurrency establish a new economic world order?

Amarabha Banerjee

22 Jul 2018

5 min read

Cryptocurrency has already established one thing - there is a viable alternative to dollars and gold as a measure of wealth. Our present economic system is flawed. Cryptocurrencies, if utilized properly, can change the way the world deals with money and wealth. But can it completely overthrow the present system and create a new economic world order? To know the answer to this we will have to understand the concept of cryptocurrencies and the premise for their creation. Money - The weapon to control the world Money is a measure of wealth, which translates into power. The power centers have largely remained the same throughout history, be it a monarchy, or autocracy or democracy. Power has shifted from one king to one dictator, to a few elected/selected individuals. To remain in power, they had to control the source and distribution of money. That’s why till date, only the government can print money and distribute it among citizens. We can earn money in exchange for our time and skills or loan money in exchange for our future time. But there’s only so much of time that we can give away and hence the present day economy always runs on the philosophy of scarcity and demand. The money distribution follows a trickle down approach in a pyramid structure. Source: Credit Suisse Inception of Cryptocurrency - Delocalization of money It’s abundantly clear from the image above that while printing of money is under the control of the powerful and the wealth creators, the pyramidal distribution mechanism also has ensured very less money flows to the bottom most segments of the population. The money creators have been ensuring their safety and prosperity throughout history, by accumulating chunks of money for themselves. Subsequently, the global wealth gap has increased staggeringly. This could have possibly triggered the rise of cryptocurrencies, as a form of an alternative economic system, that theoretically, doesn’t just accumulate at the top, but also rewards anyone who is interested in mining these currencies and spending their time and resources. The main concept that made this possible was the distributed computing mechanism which has gained tremendous interest in recent times. Distributed Computing, Blockchain & the possibilities The foundation of our present economic system is a central power, be it government or a ruler or dictator. The alternative of this central system is a distributed system, where every single node of communication contains the power of decision making and is equally important for the system. So if one node is cut-off, the system will not fall apart, it will keep on functioning. That’s what makes distributed computing terrifying for the centralized economic systems. Because they can’t just attack the creator of the system or use a violent hack to bring down the entire system. Source: Medium.com When the white paper on Cryptocurrencies was first published by the anonymous Satoshi Nakamoto, there was this hope of constituting a parallel economy, where any individual with an access to a mobile phone and internet might be able to mine bitcoins and create wealth, for not just himself/herself, but for the system also. Satoshi himself invented the concept of Blockchain, an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable and permanent way. Blockchain was the technology on top of which the first unit of Cryptocurrency, Bitcoins, were created. The concept of Bitcoin mining seemed revolutionary at that time. The more people that joined the system, the more enriched the system would become. The hope was that it would make the mainstream economic system take note and cause a major overhaul of the wealth distribution system. But sadly, none of that seems to have taken place yet. The phase of Disillusionment The reality is that bitcoin mining capabilities were controlled by system resources. The creators also had accumulated enough bitcoins for themselves similar to the traditional wealth creation system. Satoshi’s Bitcoin holdings were valued at $19.4 Billion during the Dec 2017 peak, making him the 44th richest person in the world during that time. This basically meant that the wealth distribution system was at fault again, very few could get their hands onto Bitcoins as their prices in traditional currencies had climbed. The government then duly played their part in declaring that trading in Bitcoins was illegal, cracking down on several cryptocurrency top guns. Recently different countries have joined the bandwagon to ban Cryptocurrency. Hence the value is much less now. The major concern is that the skepticism in public minds might kill the hype earlier than anticipated. Source: Bitcoin.com The Future and Hope for a better Alternative What we must keep in mind is that Bitcoins are just a derivative of the concept of Cryptocurrencies. The primary concept of distributed systems and the resulting technology - Blockchain, is still a very viable and novel one. The problem in the current Bitcoin system is the distribution mechanism. Whether we would be able to tap into the distributed system concept and create a better version of the Bitcoin model, only time will tell. But for the sake of better wealth propagation and wealth balance, we can only hope that this realignment of economic system happens sooner than later. Blockchain can solve tech’s trust issues – Imran Bashir A brief history of Blockchain Crypto-ML, a machine learning powered cryptocurrency platform

0
0
20157

article-image-computing-technology-at-a-tipping-point-says-wef-davos-panel

Melisha Dsouza

30 Jan 2019

9 min read

‘Computing technology at a tipping point’, says WEF Davos Panel

Melisha Dsouza

30 Jan 2019

9 min read

The ongoing World Economic Forum meeting 2019 has seen a vast array of discussions on political, technological and other industrial agendas. The meeting brings together the world’s foremost CEOs, government officials, policy-makers, experts and academics, international organizations, youth, technology innovators and representatives of civil society with an aim to drive positive change in the world on multiple facets. This article will focus on the talk ‘Computing Technology at a Tipping Point’ that was moderated by Nicholas Carlson from Business Insider with a panel consisting of Antonio Neri, president and Chief Executive Officer of Hewlett Packard Enterprise, Jeremy O’Brien, CEO of PsiQuantum and Amy Webb, Adjunct Assistant Professor of NYU Stern School of Business. Their discussion explored questions of today's age, ranging from- why this is an important time for technology, the role of governments in encouraging a technological revolution, role of the community and business in optimizing tech and the challenges faced as we set out to utilize the next generation computing technologies like quantum computing and AI. Quantum Computing - The necessity of the future The discussion kicked off with the importance of Quantum computing at the present as well as the future. O’Brien defined Quantum computing as “Nothing short of a necessary tool that humans need to build their future”. According to him, QC is a “genuinely exponentially powerful technology”, due to the varied applications that quantum computing can impact if put to use in the correct way - from human health, energy, to molecular chemistry among others. Webb calls the year 2019 as the year of divergence, where we will move from the classic Von Neumann architecture to a more diversified Quantum age. Neri believes we are now at the end of Moore’s law that states overall processing power for computers will double every two years. He says that two years from now we will generate twice the amount of data as generated today and there will be a major divergence between the data generated and the computation power. This is why we need to focus on solving architectural problems of processing algorithms and computing data rather than focussing on the amount of data. Why is this an exciting time for tech? O’Brien: Quantum Computing, Molecular simulation for Techno-Optimism O’Brien expresses his excitement in the Quantum Computing and molecular simulation field where developers are just touching the waters with both these concepts. He has been in the QC field for the past 20 years and says that he has faith in Quantum computing and even though it's the next big thing to watch out for, he assures developers that it will not replace conventional computing. Quantum computers can be used in fact to improve the performance of classical computing systems to handle the huge amounts of data and information that we are faced with today. In addition to QC, another concept he believes that ‘will transform lives’ is molecular simulation. Molecular simulation will design new pharmaceuticals, new chemicals and help build really sophisticated computers to solve exponentially large problems. Webb: The beginning of the end of smartphones “We are in the midst of a great transformation. This is an explosion happening in slow motion”. Based on data-driven models she says this is the beginning of the end of smartphones. 10 years from now, as our phones retrieve biometric information to information derived from what we wear and we use, the computing environments will look different. Citing an example of MagicLeap who creates spatial glasses, she mentions how computable devices we wear will turn our environment into a computable space to visualize data in a whole different way. She advises business' to rethink how they function; even between the current cloud V/s edge and computer architectures change. Companies should start thinking in terms of 10 years rather than short term, since decisions made today will have long term consequences. While this is the positive side, Webb is pessimistic that there is no global alignment on the use of data. On the basis of GDPR and other data laws, systems have to be trained. Neri: continuous re-skilling to stay relevant Humans should continuously re-skill themselves with changing times and technologies to avoid an exclusion from new jobs as and when they arrive. He further states that, in the field of Artificial intelligence, there should not be a concentration of power in a few entities like Baidu, Alibaba, Tencent Google, Microsoft, Facebook, Apple and others. While these companies are at the foremost while deciding the future of AI, innovation should happen at all levels. We need guidelines and policy for the same- not to regulate but to guide the revolution. Business, community and Government should start thinking about ethical and moral codes. Government’s role in Technological Optimism The speakers emphasized on the importance of the government's’ involvement in these ‘exciting times’ and how they can work towards making citizens feel safe against the possible abuse of technology. Webb: Regulation of AI doesn't make sense We need to have conversations on optimizing Artificial Intelligence using available data. She expresses her opinion that the regulation of AI doesn't make sense. This is because we shift from a group of people understanding and implementing optimization to lawmakers who do not understand technical know-how. Nowadays, people focus on regulating tech instead of optimizing it because most don’t understand the nitty-gritties of a system, nor do they understand a system’s limitations. Governments play a huge role in this optimization or regulation decision making. She emphasizes on the need to get hold of the right people to come to an agreement ,“ where companies are a hero to their shareholders and the government to their citizens” . Governments should start talking about and exploring Quantum computing such that its benefits are distributed equitably in a shortest amount of time. Neri: Human centered future of computing He adds that for a human centered future of computing, it is we who need to decide what is good or bad for us. He agrees with Webb’s point that since technology evolves in a way we cannot think of, we need to come to reasonable conclusions before a crisis arrives. Further, he adds that governments should inculcate moral ethics while adopting and implementing technology and innovation. Role of Politicians in technology During the discussion, a member of the European Parliament stated that people have a common notion that politicians do not understand technology and cannot keep up with changing times. Stating that many companies do not think about governance, human rights, democracy and possible abuse of their products; the questioner says that we need a minimum threshold to protect human rights and safeguard humans against abuse. Her question was centered around ways to invite politicians to understand tech better before it's too late. Expressing her gratitude that the European Parliament is asking such a thoughtful question, Webb suggested that creating some kind of framework that the key people on all sides of the spectrum can agree to and a mechanism that incentivises everyone to play fairly- will help parliaments and other law making bodies to feel inclusive in understanding technology. Neri also suggested a guiding principle to think ethically before using any technology without stopping innovation. Technological progress in China and its implications on the U.S. Another question that caught our attention was the progress of technology in China and its implications on the US. Webb says that the development of tools, technologies, frameworks and data gathering mechanisms to mine, refine and monetize data have different approaches in US and China. In China, the activities related to AI and activities of Baidu, Alibaba and Tencent are under the leadership of the Chinese communist Party. She says that it is hard to overlook what is happening in Chain with the BRI (Belt to Road Initiative), 5G, digital transformation, expansion in fibre and expansion in e-commerce and a new world order is being formed because of the same. She is worried that the US and its allies will be locked out economically from the BRI countries and AI will be one of the factors propelling the same . Role of the Military in technology The last question pointed out that some of the worst abuses of technology can be done by governments and the military has the potential to misuse technology. We need to have conversations on the ethical use of technology and how to design technology to fit ethical morals. Neri says that corporations do have a point of view on the military using technology for various reasons and the governments are consulting them on the impacts of technology on the world as well. This is a hard topic and the debate is ongoing even though it is not visible to the people. Webb says that the US always had ties with the government. We live in a world of social media where conversations spiral out of control because of the same. She advises companies to meet quarterly to have conversations along this line and understanding how their work with the military/ government align with the core values of their company. Sustainability and Technology Neri states that 6% of the global power is used to power data centers. It is important to determine how to address this problem. The solutions proposed for the same are: Innovate in different ways. Be mindful the entire supply chain--->from the time you procure minerals to build the system and recycle it. We need to think of a circular economy. Consider if systems can be re-used by other companies, check parts to be re-cycled and reused. We can use synthetic DNA to back up data - this could potentially use less energy. To sustain human life on this planet, we need to optimise how we ruse resources- physical and virtual, QC tool will invent the future. Materials can be built using QC. You can listen to the entire talk at the World Economic Forum’s official page. What the US-China tech and AI arms race means for the world – Frederick Kempe at Davos 2019 Microsoft’s Bing ‘back to normal’ in China Facebook’s outgoing Head of communications and policy takes the blame for hiring PR firm ‘Definers’ and reveals more

0
0
18977

article-image-ubers-kepler-gl-an-open-source-toolbox-for-geospatial-analysis

Pravin Dhandre

28 Jun 2018

4 min read

Uber's kepler.gl, an open source toolbox for GeoSpatial Analysis

Pravin Dhandre

28 Jun 2018

4 min read

Geography Visualization, also called as Geovisualization plays a pivotal role in areas like cartography, geographic information systems, remote sensing and global positioning systems. Uber, a peer-to-peer transportation network company headquartered at California believes in data-driven decision making and hence keeps developing smart frameworks like deck.gl for exploring and visualizing advanced geospatial data at scale. Uber strives to make the data web-based and shareable in real-time across their teams and customers. Early this month, Uber surprised the geospatial market with its newly open-source toolbox, kepler.gl, a geoanalytics tool to gain quick insights from geospatial data with amazing and intuitive visualizations. What’s exactly Kepler.gl is? kepler.gl is a visualization-rich web platform, developed on top of deck.gl, a WebGL-powered data visualization library providing real-time visual analytics of millions of geolocation points. The platform provides visual exploration of geographical data sets along with spatial aggregation of all data points collected. The platform is said to be data-agnostic with a single interface to convert your data into insightful visualizations. https://www.youtube.com/watch?v=i2fRN4e2s0A The platform is very user-friendly where one can just drag the CSV or the GeoJSON files and drop them into the browser to visualize the dataset more intuitively. The platform is supported with different map layers, filtering option, aggregation feature through which you can get the final visualization in an animated format or like a video. The usability of features is so high that you can apply all the metrics available to your data points without much of a hassle. The web platform exhibits high performance where you can get insights from your spatial data in less than 10 minutes and that too in a single window. Another advantage of this framework is it does not involve any sort of coding and hence non-technical users can also reap the benefits by churn valuable insights from the data points. The platform is also equipped with some advanced, complex features such as 2D cartographic plane,a separate dimension for altitude, visibility of height of hexagon and grids. The users seem happy with the new height feature which helps them detect abnormalities and illicit traits in an aggregated map. With the filtering menu, the analysts and engineers can compare their data and have a granular look at their data points. This option also helps in reading the histogram well and one can easily detect outliers and make their dataset more reliable. It has a feature to add playback to time series data points which makes getting useful information of real time location systems easy. The team at Uber looks at this toolbox with a long-term vision where they are planning to keep adding new features and enhancements to make it highly functional and a single-click visualization dashboard. The team has already announced that they would be powering it up with two major enhancements to the current functionality in next couple of months. They would add support on, More robust exploration: There will be interlinkage between charts and maps, and support for custom charts, maps and widgets like the renowned BI tool Tableau through which it will facilitate analytics teams to unveil deeper insights. Addition of newer geo-analytical capabilities: To support massive datasets, there will be added features on data operations such as polygon aggregation, union of data points, operations like joining and buffering. Companies across different verticals such as Airbnb, Atkins Global, Cityswifter, Mapbox have found great value in kepler.gl offerings and are looking towards engineering their products to leverage this framework. The visualization specialists at these companies have already praised Uber for building such a simple yet fast platform with remarkable capabilities. To get started with kepler.gl, read the documentation available at Github and start creating visualizations and enhance your geospatial data analysis. Top 7 libraries for geospatial analysis Using R to implement Kriging – A Spatial Interpolation technique for Geostatistics data Data Visualization with ggplot2

0
0
18680

article-image-why-data-science-needs-great-communicators

Erik Kappelman

16 Jan 2018

4 min read

Why data science needs great communicators

Erik Kappelman

16 Jan 2018

4 min read

One of the biggest problems facing data science (and many other technical industries) today is communication. This is true on both an individual level, but at a much bigger organizational and cultural level. On the one hand, we can all be better communicators, but at the same time organizations and businesses can do a lot more to facilitate knowledge and information sharing. At an individual level, it’s important to recognize that some people find communication very difficult. Obviously it’s a cliché that many of these people find themselves in technical industries, and while we shouldn’t get stuck on stereotypes, there is certainly an element of truth in it. The reasons why this might be the case is incredibly complex, but it may be true that part of the problem is how technology has been viewed within institutions and other organizations. This is the sort of attitude that says “those smart people just have bad social skills. We should let them do what they're good at and leave them alone.” There are lots of problems with this and it isn’t doing anyone any favors, from the people that struggle with communication to the organizations who encourage this attitude. Statistics and communicating insights Let’s take a field like statistics. There is a notion that you do not need to be good at communicating to be good at statistics; it is often viewed as a primarily numerical and technical skill. However, when you think about what statistics really is, it becomes clear that that is nonsensical. The primary purpose of the field is to tease out information and insights from noisy data and then communicate those insights. If you don’t do that you’re not doing statistics. Some forms of communication are inherent to statistical research; graphs and charts communicate the meaning of data and most statisticians or data scientists have a well worn skill of chart making. But there’s more than just charts – great visualizations, great presentations can all be the work of talented statisticians. Of course, there are some data-related roles where communication is less important. If you’re working on data processing and storage, for example, being a great communicator may not be quite as valuable. But consider this: if you can’t properly discuss and present why you’re doing what you’re doing to the people that you work with and the people that matter in your organization you’re immediately putting up a barrier to success. The data explosion makes communication even more important There is an even bigger reason data science needs great communicators and it has nothing to do with individual success. We have entered what I like to call the Data Century. Computing power and tools using computers, like the Internet, hit a sweet spot somewhere around the new millennium and the data and analysis now available to the world is unprecedented. Who knows what kind of answers this mass of data holds? Data scientists are at the frontier of the greatest human exploration since the settling of the New World. This exploration is faced inward, as we try to understand how and why human beings do various things by examining the ever growing piles of data. If data scientists cannot relay their findings, we all miss out on this wonderful time of exploration and discovery. People need data scientists to tell them about the whole new world of data that we are just entering. It would be a real shame if the data scientists didn’t know how. Erik Kappelman wears many hats including blogger, developer, data consultant, economist, and transportation planner. He lives in Helena, Montana and works for the Department of Transportation as a transportation demand modeler.

0
0
16683

article-image-containerized-data-science-docker

Darwin Corn

03 Jul 2016

4 min read

Containerized Data Science with Docker

Darwin Corn

03 Jul 2016

4 min read

So, you're itching to begin your journey into data science but you aren't sure where to start. Well, I'm glad you’ve found this post since I will give the details in a step-by-step fashion as to how I circumvented the unnecessarily large technological barrier to entry and got my feet wet, so to speak. Containerization in general and Docker in particular have taken the IT world by storm in the last couple of years by making LXC containers more than just VM alternatives for the enterprising sysadmin. Even if you're coming at this post from a world devoid of IT, the odds are good that you've heard of Docker and their cute whale mascot. Of course, now that Microsoft is on board, the containerization bandwagon and a consortium of bickering stakeholders have formed, so you know that container tech is here to stay. I know, FreeBSD has had the concept of 'jails' for almost two decades now. But thanks to Docker, container tech is now usable across the big three of Linux, Windows and Mac (if a bit hack-y in the case of the latter two), and today we're going to use its positives in an exploration into the world of data science. Now that I have your interest piqued, you're wondering where the two intersect. Well, if you're like me, you've looked at the footprint of R-studio and the nightmare maze of dependencies of IPython and “noped” right out of there. Thanks to containers, these problems are solved! With Docker, you can limit the amount of memory available to the container, and the way containers are constructed ensures that you never have to deal with troubleshooting broken dependencies on update ever again. So let's install Docker, which is as straightforward as using your package manager in Linux, or downloading Docker Toolbox if you're using a Mac or Windows PC, and running the installer. The instructions that follow will be tailored to a Linux installation, but are easily adapted to Windows or Mac as well. On those two platforms, you can even bypass these CLI commands and use Kitematic, or so I hear. Now that you have Docker installed, let's look at some use cases for how to use it to facilitate our journey into data science. First, we are going to pull the Jupyter Notebook container so that you can work with that language-agnostic tool. # docker run --rm -it -p 8888:8888 -v "$(pwd):/notebooks" jupyter/notebook The -v "$(pwd):/notebooks" flag will mount the current directory to the /notebooks directory in the container, allowing you to save your work outside the container. This will be important because you’ll be using the container as a temporary working environment. The --rm flag ensures that the container is destroyed when it exits. If you rerun that command to get back to work after turning off your computer for instance, the container will be replaced with an entirely new one. That flag allows it access to the folder on the local filesystem, ensuring that your work survives the casually disposable nature of development containers. Now go ahead and navigate to http://localhost:8888, and let's get to work. You did bring a dataset to analyze in a notebook, right? The actual nuts and bolts of data science are beyond the scope of this post, but for a quick intro to data and learning materials, I've found Kaggle to be a great resource. While we're at it, you should look at that other issue I mentioned previously—that of the application footprint. Recently a friend of mine convinced me to use R, and I was enjoying working with the language until I got my hands on some real data and immediately felt the pain of an application not designed for endpoint use. I ran a regression and it locked up my computer for minutes! Fortunately, you can use a container to isolate it and only feed it limited resources to keep the rest of the computer happy. # docker run -m 1g -ti --rm r-base This command will drop you into an interactive R CLI that should keep even the leanest of modern computers humming along without a hiccup. Of course, you can also use the -c and --blkio-weight flags to restrict access to the CPU and HDD resources respectively, if limiting it to the GB of RAM wasn't enough. So, a program installation and a command or two (or a couple of clicks in the Kitematic GUI), and we're off and running using data science with none of the typical headaches. About the Author Darwin Corn is a systems analyst for the Consumer Direct Care Network. He is a mid-level professional with diverse experience in the information technology world.

0
0
11956

article-image-data-science-getting-easier

Erik Kappelman

10 Sep 2017

5 min read

Is data science getting easier?

Erik Kappelman

10 Sep 2017

5 min read

The answer is yes, and no. This is a question that could've easily been applied to textile manufacturing in the 1890s, and could've received a similar answer. By this I mean, textile manufacturing improved leaps and bounds throughout the industrial revolution, however, despite their productivity, textile mills were some of the most dangerous places to work. Before I further explain my answer, let’s agree on a definition for data science. Wikipedia defines data science as, “an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured.” I see this as the process of acquiring, managing, and analyzing data. Advances in data science First, let's discuss why data science is definitely getting easier. Advances in technology and data collection have made data science easier. For one thing, data science as we know it wasn’t even possible 40 years ago, but due to advanced technology we can now analyze, gather, and manage data in completely new ways. Scripting languages like R and Python have mostly replaced more convoluted languages like Haskell and Fortran in the realm of data analysis. Tools like Hadoop bring together a lot of different functionality to expedite every element of data science. Smartphones and wearable tech collect data more effectively and efficiently than older data collection methods, which gives data scientists more data of higher quality to work with. Perhaps most importantly, the utility of data science has become more and more recognized throughout the broader world. This helps provide data scientists the support they need to be truly effective. These are just some of the reasons why data science is getting easier. Unintended consequences While many of these tools make data science easier in some respects, there are also some unintended consequences that actually might make data science harder. Improved data collection has been a boon for the data science industry, but using the data that is streaming in is similar to drinking out of a firehose. Data scientists are continually required to come up with more complicated ways of taking data in, because the stream of data has become incredibly strong. While R and Python are definitely easier to learn than older alternatives, neither language is usually accused of being parsimonious. What a skilled Haskell programming might be able to do in 100 lines, might take a less skilled Python scripter 500 lines. Hadoop, and tools like it, simplify the data science process, but it seems like there are 50 new tools like Hadoop a day. While these tools are powerful and useful, sometimes data scientists spend more time learning about tools and less time doing data science, just to keep up with the industry’s landscape. So, like many other fields related to computer science and programming, new tech is simultaneously making things easier and harder. Golden age of data science Let me rephrase the title question in an effort to provide even more illumination: is now the best time to be a data scientist or to become one? The answer to this question is a resounding yes. While all of the current drawbacks I brought up remain true, I believe that we are in a golden age of data science, for all of the reasons already mentioned, and more. We have more data than ever before and our data collection abilities are improving at an exponential rate. The current situation has gone so far as to create the necessity for a whole new field of data analysis, Big Data. Data science is one of the most vast and quickly expanding human frontiers at present. Part of the reason for this is what data science can be used for. Data science can effectively answer questions that were previously unanswered. Of course this makes for an attractive field of study from a research standpoint. One final note on whether or not data science is getting easier. If you are a person who actually creates new methods or techniques in data science, especially if you need to support these methods and techniques with formal mathematical and scientific reasoning, data science is definitely not getting easier for you. As I just mentioned, Big Data is a whole new field of data science created to deal with new problems caused by the efficacy of new data collection techniques. If you are a researcher or academic, all of this means a lot of work. Bootstrapped standard errors were used in data analysis before a formal proof of their legitimacy was created. Data science techniques might move at the speed of light, but formalizing and proving these techniques can literally take lifetimes. So if you are a researcher or academic, things will only get harder. If you are more of a practical data scientist, it may be slightly easier for now, but there’s always something! About the Author Erik Kappelman wears many hats including blogger, developer, data consultant, economist, and transportation planner. He lives in Helena, Montana and works for theDepartment of Transportation as a transportation demand modeler.

0
0
11603

article-image-war-data-science-python-versus-r

Akram Hussain

30 Jun 2014

7 min read

The War on Data Science: Python versus R

Akram Hussain

30 Jun 2014

7 min read

Data science The relatively new field of data science has taken the world of big data by storm. Data science gives valuable meaning to large sets of complex and unstructured data. The focus is around concepts like data analysis and visualization. However, in the field of artificial intelligence, a valuable concept known as Machine Learning has now been adopted by organizations and is becoming a core area for many data scientists to explore and implement. In order to fully appreciate and carry out these tasks, data scientists are required to use powerful languages. R and Python currently dominate this field, but which is better and why? The power of R R offers a broad, flexible approach to data science. As a programming language, R focuses on allowing users to write algorithms and computational statistics for data analysis. R can be very rewarding to those who are comfortable using it. One of the greatest benefits R brings is its ability to integrate with other languages like C++, Java, C, and tools such as SPSS, Stata, Matlab, and so on. The rise to prominence as the most powerful language for data science was supported by R’s strong community and over 5600 packages available. However, R is very different to other languages; it’s not as easily applicable to general programming (not to say it can’t be done). R’s strength and its ability to communicate with every data analysis platform also limit its ability outside this category. Game dev, Web dev, and so on are all achievable, but there’s just no benefit of using R in these domains. As a language, R is difficult to adopt with a steep learning curve, even for those who have experience in using statistical tools like SPSS and SAS. The violent Python Python is a high level, multi-paradigm programming language. Python has emerged as one of the more promising languages of recent times thanks to its easy syntax and operability with a wide variety of different eco-systems. More interestingly, Python has caught the attention of data scientists over the years, and thanks to its object-oriented features and very powerful libraries, Python has become the go-to language for data science, many arguing it’s taken over R. However, like R, Python has its flaws too. One of the drawbacks in using Python is its speed. Python is a slow language and one of the fundamentals of data science is speed! As mentioned, Python is very good as a programming language, but it’s a bit like a jack of all trades and master of none. Unlike R, it doesn’t purely focus on data analysis but has impressive libraries to carry out such tasks. The great battle begins While comparing the two languages, we will go over four fundamental areas of data science and discuss which is better. The topics we will explore are data mining, data analysis, data visualization, and machine learning. Data mining: As mentioned, one of the key components to data science is data mining. R seems to win this battle; in the 2013 Data Miners Survey, 70% of data miners (from the 1200 who participated in the survey) use R for data mining. However, it could be argued that you wouldn’t really use Python to “mine” data but rather use the language and its libraries for data analysis and development of data models. Data analysis: R and Python boast impressive packages and libraries. Python, NumPy, Pandas, and SciPy’s libraries are very powerful for data analysis and scientific computing. R, on the other hand, is different in that it doesn’t offer just a few packages; the whole language is formed around analysis and computational statistics. An argument could be made for Python being faster than R for analysis, and it is cleaner to code sets of data. However, I noticed that Python excels at the programming side of analysis, whereas for statistical and mathematical programming R is a lot stronger thanks to its array-orientated syntax. The winner of this is debatable; for mathematical analysis, R wins. But for general analysis and programming clean statistical codes more related to machine learning, I would say Python wins. Data visualization: the “cool” part of data science. The phrase “A picture paints a thousand words” has never been truer than in this field. R boasts its GGplot2 package which allows you to write impressively concise code that produces stunning visualizations. However. Python has Matplotlib, a 2D plotting library that is equally as impressive, where you can create anything from bar charts and pie charts, to error charts and scatter plots. The overall concession of the two is that R’s GGplot2 offers a more professional feel and look to data models. Another one for R. Machine learning: it knows the things you like before you do. Machine learning is one of the hottest things to hit the world of data science. Companies such as Netflix, Amazon, and Facebook have all adopted this concept. Machine learning is about using complex algorithms and data patterns to predict user likes and dislikes. It is possible to generate recommendations based on a user’s behaviour. Python has a very impressive library, Scikit-learn, to support machine learning. It covers everything from clustering and classification to building your very own recommendation systems. However, R has a whole eco system of packages specifically created to carry out machine learning tasks. Which is better for machine learning? I would say Python’s strong libraries and OOP syntax might have the edge here. One to rule them all From the surface of both languages, they seem equally matched on the majority of data science tasks. Where they really differentiate is dependent on an individual’s needs and what they want to achieve. There is nothing stopping data scientists using both languages. One of the benefits of using R is that it is compatible with other languages and tools as R’s rich packagescan be used within a Python program using RPy (R from Python). An example of such a situation would include using the Ipython environment to carry out data analysis tasks with NumPy and SciPy, yet to visually represent the data we could decide to use the R GGplot2 package: the best of both worlds. An interesting theory that has been floating around for some time is to integrate R into Python as a data science library; the benefits of such an approach would mean data scientists have one awesome place that would provide R’s strong data analysis and statistical packages with all of Python’s OOP benefits, but whether this will happen remains to be seen. The dark horse We have explored both Python and R and discussed their individual strengths and flaws in data science. As mentioned earlier, they are the two most popular and dominant languages available in this field. However a new emerging language called Julia might challenge both in the future. Julia is a high performance language. The language is essentially trying to solve the problem of speed for large scale scientific computation. Julia is expressive and dynamic, it’s fast as C, it can be used for general programming (its focus is on scientific computing) and the language is easy and clean to use. Sounds too good to be true, right?

0
0
10670

article-image-5-things-that-matter-data-science-2018

Richard Gall

11 Dec 2017

4 min read

5 things that will matter in data science in 2018

Richard Gall

11 Dec 2017

4 min read

The world of data science is now starting to change quickly. This was arguably the year when discussions around AI and automation started to escalate, taking on more and more importance in the public sphere. But as interesting as all that is, there are nevertheless real people - like you - actually working with data not to rig elections or steal someone’s jobs but simply to make things better. Arguably, data science and analysis has never been under the spotlight to the extent it is today. Whereas a decade ago there was a whole lotta hope stored in the big data revolution, today there’s anxiety that we’re not doing enough with data, that we don’t have the right data. That makes it a challenging but important time to be working in the world of data. With that in mind, here are our 5 things that will matter in data science in 2018… Find out what 5 data science tools we think will matter most in 2018 here. 1. The ethical considerations in machine learning and artificial intelligence This is huge, but it can’t be ignored. At its heart, this is important because it highlights that there’s human agency at the heart of modern data science, that algorithms are things created and designed by the engineers behind them. But even more important than that, these ethical considerations will be important in 2018 because it will end up defining everyone’s relationship to data for decades to come. And yes, although legislative bodies may play a part in that, it’s also up to people actually working with data to contribute to that discussion about what data does, who uses it and why. That might sound like a lot of responsibility, but it makes things pretty exciting, no? 2. Greater alignment between data projects and business goals This has long been a challenge for just about every business and, indeed, anyone who works in data - architect, analyst or scientist. But as the data hype curve flattens out with more organizations taking advantage of the opportunities it offers, with budgets getting tighter and expectations higher than they have ever been, ensuring that data programs are delivering real value will be crucial in 2018. That means there will be more pressure on data pros to deliver; sharpening your commercial instincts will be essential, and could be your route to the next step in your career. 3. Automated machine learning If budgets are getting tighter and management expectations are higher than ever, the emergence of automated machine learning will be a godsend for 2018. Automated machine learning isn’t a threat to anyone’s job - it’s simply a way of making the steps of algorithm selection and optimization much faster. If you’ve ever lamented the time you’ve spent tweaking an algorithm only for it not to work as you wanted it to, only to move to a further iteration to find a similar problem, automated machine learning will automate away all those iterations. What this means is that you’ll be able to spend more time on value-adding activities that will never be automated away. And in turn this will make you a more valuable data scientist. 4. Taking advantage of cloud Cloud has been a big trend for some years now. But as a word on it’s own it’s always felt a bit abstract and amorphous. However, it’s once you start to see how it can be put into practice that you begin to see how potentially transformative it might be. In the case of machine learning, cloud becomes a vital solution in the battle for resources - it makes machine learning at scale more accessible to more people. The key tool here is Google’s cloud machine learning engine - it’s been built to make building machine learning models as straightforward as possible. When you look at this alongside automated machine learning, it’s possible to suggest that the data science skill set might change somewhat throughout 2018… 5. Better self-service BI 2018 is the year when all employees will need to be empowered by data. The idea that a specific team handles everything relating to data will end; using data will be crucial to a range of different stakeholders. This doesn’t mean the end of the data scientist - as said earlier, no one is going to be losing their jobs. But it does mean that self-service BI tools are going to take on greater importance than ever before in 2018. That means data scientists may have to start thinking more like data architects (especially if there’s no data architect in their organization), and taking into consideration how they make their work accessible and meaningful for stakeholders all around their organization.

0
0
9583

Erol Staveley

18 Jan 2016

7 min read

Data Science Is the New Alchemy

Erol Staveley

18 Jan 2016

7 min read

Every day I come into work and sit opposite Greg. Greg (in my humble opinion) is a complete badass. He directly turns information that we’ve had hanging around for years and years into actual currency. Single handedly, he generates more direct revenue than any one individual in the business. When we were shuffling seating positions not too long ago (we now have room for that standing desk I’ve always wanted ❤), we were afraid to turn off his machine in fear of losing thousands upon thousands of dollars. I remember somebody saying “guys, we can’t unplug Skynet”. Nobody fully knows how it works. Nobody except Greg. We joked that by turning off his equipment, we’d ruin Greg's on-the-side Bitcoin mining gig that he was probably running off the back of the company network. We then all looked at one another in a brief moment of silence. We were all thinking the same thing — it wouldn’t surprise any of us if Greg was actually doing this. We wouldn’t know any better. To many, what Greg does is like modern day alchemy. In reality, Greg is a data scientist — an increasingly crucial role that helps businesses deliver more meaningful, relevant interactions with their customers. I like to think of them more as new-age alchemists, who wield keyboards instead of perfectly choreographed vials and alembics. This week - find out how to become a data alchemist with R. Save 50% on some of our top titles... or pick up any 5 for $50! Find them all here! Content might have been king a few years back. Now, it’s data. Everybody wants more — and the people who can actually make sense of it all. By surveying 20,000 developers, we found out just how valuable these roles are to businesses of all shapes and sizes. Let’s take a look. Every Kingdom Needs an Alchemist Even within quite a technical business, Greg’s work lends a fresh perspective on what it is other developers want from our content. Putting the value of direct revenue generation to one side, the insight we’ve derived from purchasing patterns and user behaviour is incredibly valuable. We’re constantly challenging our own assumptions, and spending more time looking at what our customers are actually doing. We’re not alone in taking this increasingly data-driven approach. In general, the highest data science salaries are paid by large enterprises. This isn’t too surprising considering that’s where the real troves of precious data reside. At such scale, the aggregation and management of data alone can warrant the recruitment of specialised teams. On average though, SMEs are not too far behind when it comes to how much they’re willing to pay for top talent. Average salary by company size. Apache Spark was a particularly important focus going forward for folks in the Enterprise segment. What’s clear is that data science isn’t just for big businesses any more. It’s for everybody. We can see that in the growth of data-related roles for SMEs. We’re paying more attention to data because it represents the actions of our customers, but also because we’ve just got more of it lying around all over the place. Irrespective of company size, the range of industries we captured (and classified) was colossal. Seems like everybody needs an alchemist these days. They Double as Snake Charmers When supply is low and demand is high in a particular job market, we almost always see people move to fill the gap. It’s a key driver of learning. After all, if you’re trying to move to a new role, you’re likely to be developing new skills. It’s no surprise that Python is the go-to choice for data science. It’s an approachable language with some great introductory resources out there on the market like Python for Secret Agents. It also has a fantastic ecosystem of data science libraries and documentation that can help you get up and running quite quickly. Percentage of respondents who said they used a given technology. When looking at roles in more detail, you see strong patterns between technologies used. For example, those using Python were most likely to also be using R. When you dive deeper into the data you start to notice a lot of crossover between certain segments. It was at this point where we were able to also start seeing the relationships between certain technologies in specific segments. For example, the Financial sector was more likely to use R, and also paid (on average) higher salaries to those who had a more diverse technical background. Alchemists Have Many Forms Back at a higher level, what was really interesting is the natural technology groupings that started to emerge between four very distinct ‘types’ of data alchemist. “What are they?”, I hear you ask. The Visualizers Those who bring data to life. They turn what otherwise would be a spreadsheet or a copy-and-paste pie chart into delightful infographics and informative dashboards. Welcome to the realm of D3.js and Tableau. The Wranglers The SME all-stars. They aggregate, clean and process data with Python whilst leveraging the functionality of libraries like pandas to their full potential. A jack of all trades, master of all. The Builders Those who use Hadoop and other OS tools to deploy and maintain large-scale data projects. They keep the world running by building robust, scalable data platforms. The Architects Those who harness the might of the enterprise toolchain. They co-ordinate large scale Oracle and Microsoft deployments, the sheer scale of which would break the minds of mere mortals. Download the Full Report With 20,000 developers taking part overall, our most recent data science survey contains plenty of juicy information about real-world skills, salaries and trends. Packtpub.com In a Land of Data, the Alchemist is King We used to have our reports delivered in Excel. Now we have them as notebooks on Jupyter. If it really is a golden age for developers, data scientists must be having a hard time keeping their inbox clear of all the recruitment spam. What’s really interesting going forward is that the volume of information we have to deal with is only going to increase. Once IoT really kicks off and wearables become more commonly accepted (the sooner the better if you’re Apple), businesses of all sizes will find dealing with data overload to be a key growing pain — regardless of industry. Plenty of web services and platforms are already popping up, promising to deliver ‘actionable insight’ to everybody who can spare the monthly fees. This is fine for standardised reporting and metrics like bounce rate and conversion, but not so helpful if you’re working with a product that’s unique to you. Greg’s work doesn’t just tell us how we can improve our SEO. It shows us how we can make our products better without having to worry about internal confirmation bias. It helps us better serve our customers. That’s why present-day alchemists like Greg, are heroes.

0
0
9070

Akram Hussain

30 Jun 2014

5 min read

The Rise of Data Science

Akram Hussain

30 Jun 2014

5 min read

The rise of big data and business intelligence has been one of the hottest topics to hit the tech world. Everybody who’s anybody has heard of the term business intelligence, yet very few can actually articulate what this means. Nonetheless it’s something all organizations are demanding. But you must be wondering why and how do you develop business intelligence? Enter data scientists! The concept of data science was developed to work with large sets of structured and unstructured data. So what does this mean? Let me explain. Data science was introduced to explore and give meaning to random sets of data floating around (we are talking about huge quantities here, that is, terabytes and petabytes), which are then used to analyze and help identify areas of poor performance, areas of improvement, and areas to capitalize on. The concept was introduced for large data-driven organisations that required consultants and specialists to deal with complex sets of data. However, data science has been adopted very quickly by organizations of all shapes and sizes, so naturally an element of flexibility would be required to fit data scientists in the modern work flow. There seems to be a shortage for data scientists and an increase in the amount of data out there. The modern data scientist is one who would be able to apply analytical skills necessary to any organization with or without large sets of data available. They are required to carry out data mining tasks to discover relevant meaningful data. Yet, smaller organizations wouldn’t have enough capital to invest in paying for a person who is experienced enough to derive such results. Nonetheless, because of the need for information, they might instead turn to a general data analyst and help them move towards data science and provide them with tools/processes/frameworks that allow for the rapid prototyping of models instead. The natural flow of work would suggest data analysis comes after data mining, and in my opinion analysis is at the heart of the data science. Learning languages like R and Python are fundamental to a good data scientist’s tool kit. However would a data scientist with a background in mathematics and statistics and little to no knowledge of R and Python still be as efficient? Now, the way I see it, data science is composed of four key topic areas crucial to achieving business intelligence, which are data mining, data analysis, data visualization, and machine learning. Data analysis can be carried out in many forms; it’s essentially looking at data and understanding it to make a factual conclusion from it (in simple terms). A data scientist may choose to use Microsoft Excel and VBA to analyze their data, but it wouldn’t be as accurate, clean, or as in depth as using Python or R, but it sure would be useful as a quick win with smaller sets of data. The approach here is that starting with something like Excel doesn’t mean it’s not counted as data science, it’s just a different form of it, and more importantly it actually gives a good foundation to progress on to using things like MySQL, R, Julia, and Python, as with time, business needs would grow and so would expectations of the level of analysis. In my opinion, a good data scientist is not one who knows more than one or two languages or tools, but one who is well-versed in the majority of them and knows which language and skill set are best suited to the task in hand. Data visualization is hugely important, as numbers themselves tell a story, but when it comes to representing the data to customers or investors, they're going to want to view all the different aspects of that data as quickly and easily as possible. Graphically representing complex data is one of the most desirable methods, but the way the data is represented varies dependent on the tool used, for example R’s GGplot2 or Python’s Matplotlib. Whether you’re working for a small organization or a huge data-driven company, data visualization is crucial. The world of artificial intelligence introduced the concept of machine learning, which has exploded on the scene and to an extent is now fundamental to large organizations. The opportunity for organizations to move forward by understanding a consumer’s behaviour and equally matching their expectations has never been so valuable. Data scientists are required to learn complex algorithms and core concepts such as classifications, recommenders, neural networks, and supervised and unsupervised learning techniques. This is just touching the edges of this exciting field, which goes into much more depth especially with emerging concepts such as deep learning. To conclude, we covered the basic fundamentals of data science and what it means to be data scientists. For all you R and Python developers (not forgetting any mathematical wizards out there), data science has been described as the ‘Sexiest job of 21st century’ as well as being handsomely rewarding too. The rise in jobs for data scientists has without question exploded and will continue to do so; according to global management firm McKinsey & Company, there will be a shortage of 140,000 to 190,000 data scientists due to the continued rise of ‘big data’.

0
0
8592

article-image-top-5-misconceptions-about-data-science

Erik Kappelman

02 Oct 2017

6 min read

Top 5 misconceptions about data science

Erik Kappelman

02 Oct 2017

6 min read

Data science is a well-defined, serious field of study and work. But the term ‘data science’ has become a bit of a buzzword. Yes, 'data scientists’ have become increasingly important to many different types of organizations, but it has also become a trend term in tech recruitment. The fact that these words are thrown around so casually has led to a lot of confusion about what data science and data scientists actually is and are. I would formerly include myself in this group. When I first heard the word data scientist, I assumed that data science was actually just statistics in a fancy hat. Turns out I was quite wrong. So here are the top 5 misconceptions about data science. Data science is statistics and vice versa I fell prey to this particular misconception myself. What I have come to find out is that statistical methods are used in data science, but conflating the two is really inaccurate. This would be somewhat like saying psychology is statistics because research psychologists use statistical tools in studies and experiments. So what's the difference? I am of the mind that the primary difference lies in the level of understanding of computing required to succeed in each discipline. While many statisticians have an excellent understanding of things like database design, one could be a statistician and actually know nothing about database design. To succeed as a statistician, all the way up to the doctoral level, you really only need to master basic modeling tools like R, Python, and MatLab. A data scientist needs to be able to mine data from the Internet, create machine learning algorithms, design, build and query databases and so on. Data science is really computer science This is the other half of the first misconception. While it is tempting to lump data science in with computer science, the two are quite different. For one thing, computer science is technically a field of mathematics focused on algorithms and optimization, and data science is definitely not that. Data science requires many skills that overlap with those of computer scientists, but data scientists aren’t going to need to know anything about computer hardware, kernels, and the like. A data scientist ought to have some understanding of network protocols, but even here, the level of understanding required for data science is nothing like the understanding held by the average computer scientist. Data scientists are here to replace statisticians In this case, nothing could be further from the truth. One way to keep this straight is that statisticians are in the business of researching existing statistical tools as well as trying to develop new statistical tools. These tools are then turned around and used by data scientists and many others. Data scientists are usually more focused on applied solutions to real problems and less interested in what many might regard as pure research. Data science is primarily focused on big data This is an understandable misconception. Just so we’re clear, Wikipedia defines big data as “a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them.” Then big data is really just the study of how to deal with, well, big datasets. Data science absolutely has a lot to contribute in this area. Data scientists usually have skills that work really well when it comes to analyzing big data. Skills related to databases, machine learning, and how data is transferred around a local network or the internet, are skills most data scientists have, and are very helpful when dealing with big data. But data science is actually very broad in scope. big data is a hot topic right now and receiving a lot of attention. Research into the field is receiving a lot private and public funding. In any situation like this, many different types of people working in a diverse range of areas are going to try to get in on the action. As a result, talking up data science's connection to big data makes sense if you're a data scientist - it's really about effective marketing. So, you might work with big data if you're a data scientist - but data science is also much, much more than just big data. Data scientists can easily find a job I thought I would include this one to add a different perspective. While there are many more misconceptions about what data science is or what data scientists do, I think this is actually a really damaging misconception and should be discussed. I hear a lot of complaints these days from people with some skill set that is sought after not being able to find gainful employment. Data science is like any other field, and there is always going to be a whole bunch of people that are better at it than you. Don’t become a data scientist because you’re sure to get a job - you’re not. The industries related to data science are absolutely growing right now, and will continue to do so for the foreseeable future. But that doesn’t mean people who can call themselves data scientists just automatically get jobs. You have to have the talent, but you also need to network and do all the same things you need to do to get on in any other industry. The point is, it's not easy to get a job no matter what your field is; study and practice data science because it's awesome, don’t do it because you heard it’s a sure way to get a job. Misconceptions abound, but data science is a wonderful field of research, study, and practice. If you are interested in pursuing a career or degree related to data science, I encourage you to do so, however, make sure you have the right idea about what you’re getting yourself into. Erik Kappelman wears many hats including blogger, developer, data consultant, economist, and transportation planner. He lives in Helena, Montana and works for theDepartment of Transportation as a transportation demand modeler.

0
0
8018

Tech Guides - Data Analysis

Data science for non-techies: How I got started (Part 1)

4 misconceptions about data wrangling

Top five questions to ask when evaluating a Data Monitoring solution

Messaging app Telegram's updated Privacy Policy is an open challenge

Can Cryptocurrency establish a new economic world order?

‘Computing technology at a tipping point’, says WEF Davos Panel

Uber's kepler.gl, an open source toolbox for GeoSpatial Analysis

Why data science needs great communicators

Containerized Data Science with Docker

Is data science getting easier?

The War on Data Science: Python versus R

5 things that will matter in data science in 2018

Data Science Is the New Alchemy

The Rise of Data Science

Top 5 misconceptions about data science

Tech Guides - Data Analysis

Data science for non-techies: How I got started (Part 1)

4 misconceptions about data wrangling

Top five questions to ask when evaluating a Data Monitoring solution

Messaging app Telegram's updated Privacy Policy is an open challenge

Can Cryptocurrency establish a new economic world order?

‘Computing technology at a tipping point’, says WEF Davos Panel

Uber's kepler.gl, an open source toolbox for GeoSpatial Analysis

Why data science needs great communicators

Containerized Data Science with Docker

Is data science getting easier?

Trending Topics

The War on Data Science: Python versus R

5 things that will matter in data science in 2018

Data Science Is the New Alchemy

The Rise of Data Science

Top 5 misconceptions about data science

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access