Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-implement-neural-network-single-layer-perceptron

28 Dec 2017

10 min read

How to Implement a Neural Network with Single-Layer Perceptron

28 Dec 2017

0
0
27029

Sunith Shetty

28 Dec 2017

10 min read

Getting started with Spark 2.0

Sunith Shetty

28 Dec 2017

10 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book by Muhammad Asif Abbasi titled Learning Apache Spark 2. In this book, author explains how to perform big data analytics using Spark streaming, machine learning techniques and more.[/box] Today, we will learn about the basics of Spark architecture and its various components. We will also explore how to install Spark for running it in the local mode. Apache Spark architecture overview Apache Spark is an open source distributed data processing engine for clusters, which provides a unified programming model engine across different types data processing workloads and platforms. At the core of the project is a set of APIs for Streaming, SQL, Machine Learning (ML), and Graph. Spark community supports the Spark project by providing connectors to various open source and proprietary data storage engines. Spark also has the ability to run on a variety of cluster managers like YARN and Mesos, in addition to the Standalone cluster manager which comes bundled with Spark for standalone installation. This is thus a marked difference from Hadoop ecosystem where Hadoop provides a complete platform in terms of storage formats, compute engine, cluster manager, and so on. Spark has been designed with the single goal of being an optimized compute engine. This therefore allows you to run Spark on a variety of cluster managers including being run standalone, or being plugged into YARN and Mesos. Similarly, Spark does not have its own storage, but it can connect to a wide number of storage engines. Currently Spark APIs are available in some of the most common languages including Scala, Java, Python, and R. Let's start by going through various API's available in Spark. Spark-core At the heart of the Spark architecture is the core engine of Spark, commonly referred to as spark-core, which forms the foundation of this powerful architecture. Spark-core provides services such as managing the memory pool, scheduling of tasks on the cluster (Spark works as a Massively Parallel Processing (MPP) system when deployed in cluster mode), recovering failed jobs, and providing support to work with a wide variety of storage systems such as HDFS, S3, and so on. Note: Spark-Core provides a full scheduling component for Standalone Scheduling: Code is available at: https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/scheduler Spark-Core abstracts the users of the APIs from lower-level technicalities of working on a cluster. Spark-Core also provides the RDD APIs which are the basis of other higher-level APIs, and are the core programming elements on Spark. Note: MPP systems generally use a large number of processors (on separate hardware or virtualized) to perform a set of operations in parallel. The objective of the MPP systems is to divide work into smaller task pieces and running them in parallel to increase in throughput time. Spark SQL Spark SQL is one of the most popular modules of Spark designed for structured and semistructured data processing. Spark SQL allows users to query structured data inside Spark programs using SQL or the DataFrame and the Dataset API, which is usable in Java, Scala, Python, and R. Because of the fact that the DataFrame API provides a uniform way to access a variety of data sources, including Hive datasets, Avro, Parquet, ORC, JSON, and JDBC, users should be able to connect to any data source the same way, and join across these multiple sources together. The usage of Hive meta store by Spark SQL gives the user full compatibility with existing Hive data, queries, and UDFs. Users can seamlessly run their current Hive workload without modification on Spark. Spark SQL can also be accessed through spark-sql shell, and existing business tools can connect via standard JDBC and ODBC interfaces. Spark streaming More than 50% of users consider Spark Streaming to be the most important component of Apache Spark. Spark Streaming is a module of Spark that enables processing of data arriving in passive or live streams of data. Passive streams can be from static files that you choose to stream to your Spark cluster. This can include all sorts of data ranging from web server logs, social-media activity (following a particular Twitter hashtag), sensor data from your car/phone/home, and so on. Spark-streaming provides a bunch of APIs that help you to create streaming applications in a way similar to how you would create a batch job, with minor tweaks. As of Spark 2.0, the philosophy behind Spark Streaming is not to reason about streaming and building data application as in the case of a traditional data source. This means the data from sources is continuously appended to the existing tables, and all the operations are run on the new window. A single API lets the users create batch or streaming applications, with the only difference being that a table in batch applications is finite, while the table for a streaming job is considered to be infinite. MLlib MLlib is Machine Learning Library for Spark, if you remember from the preface, iterative algorithms are one of the key drivers behind the creation of Spark, and most machine learning algorithms perform iterative processing in one way or another. Note: Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. Spark MLlib allows developers to use Spark API and build machine learning algorithms by tapping into a number of data sources including HDFS, HBase, Cassandra, and so on. Spark is super fast with iterative computing and it performs 100 times better than MapReduce. Spark MLlib contains a number of algorithms and utilities including, but not limited to, logistic regression, Support Vector Machine (SVM), classification and regression trees, random forest and gradient-boosted trees, recommendation via ALS, clustering via K-Means, Principal Component Analysis (PCA), and many others. GraphX GraphX is an API designed to manipulate graphs. The graphs can range from a graph of web pages linked to each other via hyperlinks to a social network graph on Twitter connected by followers or retweets, or a Facebook friends list. Graph theory is a study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph is made up of vertices (nodes/points), which are connected by edges (arcs/lines). – Wikipedia.org Spark provides a built-in library for graph manipulation, which therefore allows the developers to seamlessly work with both graphs and collections by combining ETL, discovery analysis, and iterative graph manipulation in a single workflow. The ability to combine transformations, machine learning, and graph computation in a single system at high speed makes Spark one of the most flexible and powerful frameworks out there. The ability of Spark to retain the speed of computation with the standard features of fault-tolerance makes it especially handy for big data problems. Spark GraphX has a number of built-in graph algorithms including PageRank, Connected components, Label propagation, SVD++, and Triangle counter. Spark deployment Apache Spark runs on both Windows and Unix-like systems (for example, Linux and Mac OS). If you are starting with Spark you can run it locally on a single machine. Spark requires Java 7+, Python 2.6+, and R 3.1+. If you would like to use Scala API (the language in which Spark was written), you need at least Scala version 2.10.x. Spark can also run in a clustered mode, using which Spark can run both by itself, and on several existing cluster managers. You can deploy Spark on any of the following cluster managers, and the list is growing everyday due to active community support: Hadoop YARN. Apache Mesos. Standalone scheduler. Yet Another Resource Negotiator (YARN) is one of the key features including a redesigned resource manager thus splitting out the scheduling and resource management capabilities from original MapReduce in Hadoop. Apache Mesos is an open source cluster manager that was developed at the University of California, Berkeley. It provides efficient resource isolation and sharing across distributed applications, or frameworks. Installing Apache Spark As mentioned in the earlier pages, while Spark can be deployed on a cluster, you can also run it in local mode on a single machine. In this chapter, we are going to download and install Apache Spark on a Linux machine and run it in local mode. Before we do anything we need to download Apache Spark from Apache's web page for the Spark project: Use your recommended browser to navigate to http://spark.apache.org/downloads.html. Choose a Spark release. You'll find all previous Spark releases listed here. We'll go with release 2.0.0 (at the time of writing, only the preview edition was available). You can download Spark source code, which can be built for several versions of Hadoop, or download it for a specific Hadoop version. In this case, we are going to download one that has been pre-built for Hadoop 2.7 or later. You can also choose to download directly or from among a number of different Mirrors. For the purpose of our exercise we'll use direct download and download it to our preferred location. Note: If you are using Windows, please remember to use a pathname without any spaces. 5. The file that you have downloaded is a compressed TAR archive. You need to extract the archive. Note: The TAR utility is generally used to unpack TAR files. If you don't have TAR, you might want to download that from the repository or use 7-ZIP, which is also one of my favorite utilities. 6. Once unpacked, you will see a number of directories/files. Here's what you would typically see when you list the contents of the unpacked directory: The bin folder contains a number of executable shell scripts such as pypark, sparkR, spark-shell, spark-sql, and spark-submit. All of these executables are used to interact with Spark, and we will be using most if not all of these. 7. If you see my particular download of spark you will find a folder called yarn. The example below is a Spark that was built for Hadoop version 2.7 which comes with YARN as a cluster manager. We'll start by running Spark shell, which is a very simple way to get started with Spark and learn the API. Spark shell is a Scala Read-Evaluate-Print-Loop (REPL), and one of the few REPLs available with Spark which also include Python and R. You should change to the Spark download directory and run the Spark shell as follows: /bin/spark-shell. We now have Spark running in standalone mode. We'll discuss the details of the deployment architecture a bit later in this chapter, but now let's kick start some basic Spark programming to appreciate the power and simplicity of the Spark framework. We have gone through the Spark architecture overview and the steps to install Spark on your own personal machine. To know more about Spark SQL, Spark Streaming, and Machine Learning with Spark, you can refer our book Learning Apache Spark 2.

0
0
3236

article-image-18-striking-ai-trends-2018-part-1

Sugandha Lahoti

27 Dec 2017

14 min read

18 striking AI Trends to watch in 2018 - Part 1

Sugandha Lahoti

27 Dec 2017

14 min read

Artificial Intelligence is the talk of the town. It has evolved past merely being a buzzword in 2016, to be used in a more practical manner in 2017. As 2018 rolls out, we will gradually notice AI transitioning into a necessity. We have prepared a detailed report, on what we can expect from AI in the upcoming year. So sit back, relax, and enjoy the ride through the future. (Don’t forget to wear your VR headgear! ) Here are 18 things that will happen in 2018 that are either AI driven or driving AI: Artificial General Intelligence may gain major traction in research. We will turn to AI enabled solution to solve mission-critical problems. Machine Learning adoption in business will see rapid growth. Safety, ethics, and transparency will become an integral part of AI application design conversations. Mainstream adoption of AI on mobile devices Major research on data efficient learning methods AI personal assistants will continue to get smarter Race to conquer the AI optimized hardware market will heat up further We will see closer AI integration into our everyday lives. The cryptocurrency hype will normalize and pave way for AI-powered Blockchain applications. Advancements in AI and Quantum Computing will share a symbiotic relationship Deep learning will continue to play a significant role in AI development progress. AI will be on both sides of the cybersecurity challenge. Augmented reality content will be brought to smartphones. Reinforcement learning will be applied to a large number of real-world situations. Robotics development will be powered by Deep Reinforcement learning and Meta-learning Rise in immersive media experiences enabled by AI. A large number of organizations will use Digital Twin. 1. General AI: AGI may start gaining traction in research. AlphaZero is only the beginning. 2017 saw Google’s AlphaGo Zero (and later AlphaZero) beat human players at Go, Chess, and other games. In addition to this, computers are now able to recognize images, understand speech, drive cars, and diagnose diseases better with time. AGI is an advancement of AI which deals with bringing machine intelligence as close to humans as possible. So, machines can possibly do any intellectual task that a human can! The success of AlphaGo covered one of the crucial aspects of AGI systems—the ability to learn continually, avoiding catastrophic forgetting. However, there is a lot more to achieving human-level general intelligence than the ability to learn continually. For instance, AI systems of today can draw on skills it learned on one game to play another. But they lack the ability to generalize the learned skill. Unlike humans, these systems do not seek solutions from previous experiences. An AI system cannot ponder and reflect on a new task, analyze its capabilities, and work out how best to apply them. In 2018, we expect to see advanced research in the areas of deep reinforcement learning, meta-learning, transfer learning, evolutionary algorithms and other areas that aid in developing AGI systems. Detailed aspects of these ideas are highlighted in later points. We can indeed say, Artificial General Intelligence is inching closer than ever before and 2018 is expected to cover major ground in that direction. 2. Enterprise AI: Machine Learning adoption in enterprises will see rapid growth. 2017 saw a rise in cloud offerings by major tech players, such as the Amazon Sagemaker, Microsoft Azure Cloud, Google Cloud Platform, allowing business professionals and innovators to transfer labor-intensive research and analysis to the cloud. Cloud is a $130 billion industry as of now, and it is projected to grow. Statista carried out a survey to present the level of AI adoption among businesses worldwide, as of 2017. Almost 80% of the participants had incorporated some or other form of AI into their organizations or planned to do so in the future. Source: https://www.statista.com/statistics/747790/worldwide-level-of-ai-adoption-business/ According to a report from Deloitte, medium and large enterprises are set to double their usage of machine learning by the end of 2018. Apart from these, 2018 will see better data visualization techniques, powered by machine learning, which is a critical aspect of every business. Artificial intelligence is going to automate the cycle of report generation and KPI analysis, and also, bring in deeper analysis of consumer behavior. Also with abundant Big data sources coming into the picture, BI tools powered by AI will emerge, which can harness the raw computing power of voluminous big data for data models to become streamlined and efficient. 3. Transformative AI: We will turn to AI enabled solutions to solve mission-critical problems. 2018 will see the involvement of AI in more and more mission-critical problems that can have world-changing consequences: read enabling genetic engineering, solving the energy crisis, space exploration, slowing climate change, smart cities, reducing starvation through precision farming, elder care etc. Recently NASA revealed the discovery of a new exoplanet, using data crunched from Machine learning and AI. With this recent reveal, more AI techniques would be used for space exploration and to find other exoplanets. We will also see the real-world deployment of AI applications. So it will not be only about academic research, but also about industry readiness. 2018 could very well be the year when AI becomes real for medicine. According to Mark Michalski, executive director, Massachusetts General Hospital and Brigham and Women’s Center for Clinical Data Science, “By the end of next year, a large number of leading healthcare systems are predicted to have adopted some form of AI within their diagnostic groups.” We would also see the rise of robot assistants, such as virtual nurses, diagnostic apps in smartphones, and real clinical robots that can monitor patients, take care of the elderly, alert doctors, and send notifications in case of emergency. More research will be done on how AI enabled technology can help in difficult to diagnose areas in health care like mental health, the onset of hereditary diseases among others. Facebook's attempt at detection of potential suicidal messages using AI is a sign of things to come in this direction. As we explore AI enabled solutions to solve problems that have a serious impact on individuals and societies at large, considering the ethical and moral implications of such solutions will become central to developing them, let alone hard to ignore. 4. Safe AI: Safety, Ethics, and Transparency in AI applications will become integral to conversations on AI adoption and app design. The rise of machine learning capabilities has also given rise to forms of bias, stereotyping and unfair determination in such systems. 2017 saw some high profile news stories about gender bias, object recognition datasets like MS COCO, to racial disparities in education AI systems. At NIPS 2017, Kate Crawford talked about bias in machine learning systems which resonated greatly with the community and became pivotal to starting conversations and thinking by other influencers on how to address the problems raised. DeepMind also launched a new unit, the DeepMind Ethics & Society, to help technologists put ethics into practice, and to help society anticipate and direct the impact of AI for the benefit of all. Independent bodies like IEEE also pushed for standards in it’s ethically aligned design paper. As news about the bro culture in Silicon Valley and the lack of diversity in the tech sector continued to stay in the news all of 2017, it hit closer home as the year came to an end, when Kristian Lum, Lead Statistician at HRDAG, described her experiences with harassment as a graduate student at prominent stat conferences. This has had a butterfly effect of sorts with many more women coming forward to raise the issue in the ML/AI community. They talked about the formation of a stronger code of conduct by boards of key conferences such as NIPS among others. Eric Horvitz, a Microsoft research director, called Lum’s post a "powerful and important report." Jeff Dean, head of Google’s Brain AI unit applauded Lum for having the courage to speak about this behavior. Other key influencers from the ML and statisticians community also spoke in support of Lum and added their views on how to tackle the problem. While the road to recovery is long and machines with moral intelligence may be decades away, 2018 is expected to start that journey in the right direction by including safety, ethics, and transparency in AI/ML systems. Instead of just thinking about ML contributing to decision making in say hiring or criminal justice, data scientists would begin to think of the potential role of ML in the harmful representation of human identity. These policies will not only be included in the development of larger AI ecosystems but also in national and international debates in politics, businesses, and education. 5. Ubiquitous AI: AI will start redefining life as we know it, and we may not even know it happened. Artificial Intelligence will gradually integrate into our everyday lives. We will see it in our everyday decisions like what kind of food we eat, the entertainment we consume, the clothes we wear, etc. Artificially intelligent systems will get better at complex tasks that humans still take for granted, like walking around a room and over objects. We’re going to see more and more products that contain some form of AI enter our lives. AI enabled stuff will become more common and available. We will also start seeing it in the background for life-altering decisions we make such as what to learn, where to work, whom to love, who our friends are, whom should we vote for, where should we invest, and where should we live among other things. 6. Embedded AI: Mobile AI means a radically different way of interacting with the world. There is no denying that AI is the power source behind the next generation of smartphones. A large number of organizations are enabling the use of AI in smartphones, whether in the form of deep learning chips, or inbuilt software with AI capabilities. The mobile AI will be a combination of on-device AI and cloud AI. Intelligent phones will have end-to-end capabilities that support coordinated development of chips, devices, and the cloud. The release of iPhone X’s FaceID—which uses a neural network chip to construct a mathematical model of the user’s face— and self-driving cars are only the beginning. As 2018 rolls out we will see vast applications on smartphones and other mobile devices which will run deep neural networks to enable AI. AI going mobile is not just limited to the embedding of neural chips in smartphones. The next generation of mobile networks 5G will soon greet the world. 2018 is going to be a year of closer collaborations and increasing partnerships between telecom service providers, handset makers, chip markers and AI tech enablers/researchers. The Baidu-Huawei partnership—to build an open AI mobile ecosystem, consisting of devices, technology, internet services, and content—is an example of many steps in this direction. We will also see edge computing rapidly becoming a key part of the Industrial Internet of Things (IIoT) to accelerate digital transformation. In combination with cloud computing, other forms of network architectures such as fog and mist would also gain major traction. All of the above will lead to a large-scale implementation of cognitive IoT, which combines traditional IoT implementations with cognitive computing. It will make sensors capable of diagnosing and adapting to their environment without the need for human intervention. Also bringing in the ability to combine multiple data streams that can identify patterns. This means we will be a lot closer to seeing smart cities in action. 7. Data-sparse AI: Research into data efficient learning methods will intensify 2017 saw highly scalable solutions for problems in object detection and recognition, machine translation, text-to-speech, recommender systems, and information retrieval. The second conference on Machine Translation happened in September 2017. The 11th ACM Conference on Recommender Systems in August 2017 witnessed a series of papers presentations, featured keynotes, invited talks, tutorials, and workshops in the field of recommendation system. Google launched the Tacotron 2 for generating human-like speech from text. However, most of these researches and systems attain state-of-the-art performance only when trained with large amounts of data. With GDPR and other data regulatory frameworks coming into play, 2018 is expected to witness machine learning systems which can learn efficiently maintaining performance, but in less time and with less data. A data-efficient learning system allows learning in complex domains without requiring large quantities of data. For this, there would be developments in the field of semi-supervised learning techniques, where we can use generative models to better guide the training of discriminative models. More research would happen in the area of transfer learning (reuse generalize knowledge across domains), active learning, one-shot learning, Bayesian optimization as well as other non-parametric methods. In addition, researchers and organizations will exploit bootstrapping and data augmentation techniques for efficient reuse of available data. Other key trends propelling data efficient learning research are growing in-device/edge computing, advancements in robotics, AGI research, and energy optimization of data centers, among others. 8. Conversational AI: AI personal assistants will continue to get smarter AI-powered virtual assistants are expected to skyrocket in 2018. 2017 was filled to the brim with new releases. Amazon brought out the Echo Look and the Echo Show. Google made its personal assistant more personal by allowing linking of six accounts to the Google Assistant built into the Home via the Home app. Bank of America unveiled Erica, it’s AI-enabled digital assistant. As 2018 rolls out, AI personal assistants will find its way into an increasing number of homes and consumer gadgets. These include increased availability of AI assistants in our smartphones and smart speakers with built-in support for platforms such as Amazon’s Alexa and Google Assistant. With the beginning of the new year, we can see personal assistants integrating into our daily routines. Developers will build voice support into a host of appliances and gadgets by using various voice assistant platforms. More importantly, developers in 2018 will try their hands on conversational technology which will include emotional sensitivity (affective computing) as well as machine translational technology (the ability to communicate seamlessly between languages). Personal assistants would be able to recognize speech patterns, for instance, of those indicative of wanting help. AI bots may also be utilized for psychiatric counseling or providing support for isolated people. And it’s all set to begin with the AI assistant summit in San Francisco scheduled on 25 - 26 January 2018. It will witness talks by world's leading innovators in advances in AI Assistants and artificial intelligence. 9. AI Hardware: Race to conquer the AI optimized hardware market will heat up further Top tech companies (read Google, IBM, Intel, Nvidia) are investing heavily in the development of AI/ML optimized hardware. Research and Markets have predicted the global AI chip market will have a growth rate of about 54% between 2017 and 2021. 2018 will see further hardware designs intended to greatly accelerate the next generation of applications and run AI computational jobs. With the beginning of 2018 chip makers will battle it out to determine who creates the hardware that artificial intelligence lives on. Not only that, there would be a rise in the development of new AI products, both for hardware and software platforms that run deep learning programs and algorithms. Also, chips which move away from the traditional one-size-fits-all approach to application-based AI hardware will grow in popularity. 2018 would see hardware which does not only store data, but also transform it into usable information. The trend for AI will head in the direction of task-optimized hardware. 2018 may also see hardware organizations move to software domains and vice-versa. Nvidia, most famous for their Volta GPUs have come up with NVIDIA DGX-1, a software for AI research, designed to streamline the deep learning workflow. More such transitions are expected at the highly anticipated CES 2018. [dropcap]P[/dropcap]hew, that was a lot of writing! But I hope you found it just as interesting to read as I found writing it. However, we are not done yet. And here is part 2 of our 18 AI trends in ‘18.

0
0
23572

article-image-mine-popular-trends-github-python-part-2

Amey Varangaonkar

27 Dec 2017

1 min read

How to Mine Popular Trends on GitHub using Python - Part 2

Amey Varangaonkar

27 Dec 2017

1 min read

[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk. In this book, you will find widely used social media mining techniques for extracting useful insights to drive your business.[/box] In Part 1 of this series, we gathered the GitHub data for analysis. Here, we will analyze that data as per our requirements, to get interesting insights on the highest trending and popular tools and languages on GitHub. We have seen so far that the GitHub API provides interesting sets of information about the code repositories and metadata around the activity of its users around these repositories. In the following sections, we will analyze this data to find out which are the most popular repositories through the analysis of its descriptions and then drilling down to the watchers, forks, and issues submitted on the emerging technologies. Since, technology is evolving so rapidly, this approach could help us to stay on top of the latest trending technologies. In order to find out what are the trending technologies, we will perform the analysis in a few steps: Identifying top technologies First of all, we will use text analytics techniques to identify what are the most popular phrases related to technologies in repositories from 2017. Our analysis will be focused on the most frequent bigrams. We import a nltk.collocation module which implements n-gram search tools: import nltk from nltk.collocations import * Then, we convert the clean description column into a list of tokens: list_documents = df['clean'].apply(lambda x: x.split()).tolist() As we perform an analysis on documents, we will use the method from_documents instead of a default one from_words. The difference between these two methods lies in then input data format. The one used in our case takes as argument a list of tokens and searches for n-grams document-wise instead of corpus-wise. It protects against detecting bi-grams composed of the last word of one document and the first one of another one: bigram_measures = nltk.collocations.BigramAssocMeasures() bigram_finder = BigramCollocationFinder.from_documents(list_documents) We take into account only bi-grams which appear at least three times in our document set: bigram_finder.apply_freq_filter(3) We can use different association measures to find the best bi-grams, such as raw frequency, pmi, student t, or chi sq. We will mostly be interested in the raw frequency measure, which is the simplest and most convenient indicator in our case. We get top 20 bigrams according to raw_freq measure: bigrams = bigram_finder.nbest(bigram_measures.raw_freq,20) We can also obtain their scores by applying the score_ngrams method: scores = bigram_finder.score_ngrams(bigram_measures.raw_freq) All the other measures are implemented as methods of BigramCollocationFinder. To try them, you can replace raw_freq by, respectively, pmi, student_t, and chi_sq. However, to create a visualization we will need the actual number of occurrences instead of scores. We create a list by using the ngram_fd.items() method and we sort it in descending order. ngram = list(bigram_finder.ngram_fd.items()) ngram.sort(key=lambda item: item[-1], reverse=True) It returns a dictionary of tuples which contain an embedded tuple and its frequency. We transform it into a simple list of tuples where we join bigram tokens: frequency = [(" ".join(k), v) for k,v in ngram] For simplicity reasons we put the frequency list into a dataframe: df=pd.DataFrame(frequency) And then, we plot the top 20 technologies in a bar chart: import matplotlib.pyplot as plt plt.style.use('ggplot') df.set_index([0], inplace = True) df.sort_values(by = [1], ascending = False).head(20).plot(kind = 'barh') plt.title('Trending Technologies') plt.ylabel('Technology') plt.xlabel('Popularity') plt.legend().set_visible(False) plt.axvline(x=14, color='b', label='Average', linestyle='--', linewidth=3) for custom in [0, 10, 14]: plt.text(14.2, custom, "Neural Networks", fontsize = 12, va = 'center', bbox = dict(boxstyle='square', fc='white', ec='none')) plt.show() We've added an additional line which helps us to aggregate all technologies related to neural networks. It is done manually by selecting elements by indices, (0,10,14) in this case. This operation might be useful for interpretation. The preceding analysis provides us with an interesting set of the most popular technologies on GitHub. It includes topics for software engineering, programming languages, and artificial intelligence. An important thing to be noted is that technology around neural networks emerges more than once, notably, deep learning, TensorFlow, and other specific projects. This is not surprising, since neural networks, which are an important component in the field of artificial intelligence, have been spoken about and practiced heavily in the last few years. So, if you're an aspiring programmer interested in AI and machine learning, this is a field to dive into! Programming languages The next step in our analysis is the comparison of popularity between different programming languages. It will be based on samples of the top 1,000 most popular repositories by year. Firstly, we get the data for last three years: queries = ["created:>2017-01-01", "created:2015-01-01..2015-12-31", "created:2016-01-01..2016-12-31"] We reuse the search_repo_paging function to collect the data from the GitHub API and we concatenate the results to a new dataframe. df = pd.DataFrame() for query in queries: data = search_repo_paging(query) data = pd.io.json.json_normalize(data) df = pd.concat([df, data]) We convert the dataframe to a time series based on the create_at column df['created_at'] = df['created_at'].apply(pd.to_datetime) df = df.set_index(['created_at']) Then, we use aggregation method groupby which restructures the data by language and year, and we count the number of occurrences by language: dx = pd.DataFrame(df.groupby(['language', df.index.year])['language'].count()) We represent the results on a bar chart: fig, ax = plt.subplots() dx.unstack().plot(kind='bar', title = 'Programming Languages per Year', ax= ax) ax.legend(['2015', '2016', '2017'], title = 'Year') plt.show() The preceding graph shows a multitude of programming languages from assembly, C, C#, Java, web, and mobile languages, to modern ones like Python, Ruby, and Scala. Comparing over the three years, we see some interesting trends. We notice HTML, which is the bedrock of all web development, has remained very stable over the last three years. This is not something that will not be replaced in a hurry. Once very popular, Ruby now has a decrease in popularity. The popularity of Python, also our language of choice for this book, is going up. Finally, the cross-device programming language, Swift, initially created by Apple but now open source, is getting extremely popular over time. It could be interesting to see in the next few years, if these trends change or hold true for long. Programming languages used in top technologies Now we know what are the top programming languages and technologies quoted in repositories description. In this section we will try to combine this information and find out what are the main programming languages for each technology. We select four technologies from previous section and print corresponding programming languages. We look up the column containing cleaned repository description and create a set of the languages related to the technology. Using a set will assure that we have unique Values. technologies_list = ['software engineering', 'deep learning', 'open source', 'exercise practice'] for tech in technologies_list: print(tech) print(set(df[df['clean'].str.contains(tech)]['language'])) software engineering {'HTML', 'Java'} deep learning {'Jupyter Notebook', None, 'Python'} open source {None, 'PHP', 'Java', 'TypeScript', 'Go', 'JavaScript', 'Ruby', 'C++'} exercise practice {'CSS', 'JavaScript', 'HTML'} Following the text analysis of the descriptions of the top technologies and then extracting the programming languages for them we notice the following: You can do a lot more analysis with this GitHub data such as: Want to know how? You can check out our book Python Social Media Analytics to get a detailed walkthrough of these topics.

0
0
6861

article-image-storing-apache-storm-data-in-elasticsearch

Richa Tripathi

27 Dec 2017

6 min read

Storing Apache Storm data in Elasticsearch

Richa Tripathi

27 Dec 2017

6 min read

0
0
11609

article-image-hitting-the-right-notes-in-2017-ai-song-for-data-scientists

Aarthi Kumaraswamy

26 Dec 2017

3 min read

Hitting the right notes in 2017: AI in a song for Data Scientists

Aarthi Kumaraswamy

26 Dec 2017

3 min read

A lot, I mean lots and lots of great articles have been written already about AI’s epic journey in 2017. They all generally agree that 2017 sets the stage for AI in very real terms. We saw immense progress in academia, research and industry in terms of an explosion of new ideas (like capsNets), questioning of established ideas (like backprop, AI black boxes), new methods (Alpha Zero’s self-learning), tools (PyTorch, Gluon, AWS SageMaker), and hardware (quantum computers, AI chips). New and existing players gearing up to tap into this phenomena even as they struggled to tap into the limited talent pool at various conferences and other community hangouts. While we have accelerated the pace of testing and deploying some of those ideas in the real world with self-driving cars, in media & entertainment, among others, progress in building a supportive and sustainable ecosystem has been slow. We also saw conversations on AI ethics, transparency, interpretability, fairness, go mainstream alongside broader contexts such as national policies, corporate cultural reformation setting the tone of those conversations. While anxiety over losing jobs to robots keeps reaching new heights proportional to the cryptocurrency hype, we saw humanoids gain citizenship, residency and even talk about contesting in an election! It has been nothing short of the stuff, legendary tales are made of: struggle, confusion, magic, awe, love, fear, disgust, inspiring heroes, powerful villains, misunderstood monsters, inner demons and guardian angels. And stories worth telling must have songs written about them! Here’s our ode to AI Highlights in 2017 while paying homage to an all-time favorite: ‘A few of my favorite things’ from Sound of Music. Next year, our AI friends will probably join us behind the scenes in the making of another homage to the extraordinary advances in data science, machine learning, and AI. [box type="shadow" align="" class="" width=""] Stripes on horses and horsetails on zebras Bright funny faces in bowls full of rameN Brown furry bears rolled into pandAs These are a few of my favorite thinGs TensorFlow projects and crisp algo models Libratus’ poker faces, AlphaGo Zero’s gaming caboodles Cars that drive and drones that fly with the moon on their wings These are a few of my favorite things Interpreting AI black boxes, using Python hashes Kaggle frenemies and the ones from ML MOOC classes R white spaces that melt into strings These are a few of my favorite things When models don’t converge, and networks just forget When I am sad I simply remember my favorite things And then I don’t feel so bad[/box] PS: We had to leave out many other significant developments in the above cover as we are limited in our creative repertoire. We invite you to join in and help us write an extended version together! The idea is to make learning about data science easy, accessible, fun and memorable!

0
0
5666

article-image-clean-social-media-data-analysis-python

Amey Varangaonkar

26 Dec 2017

10 min read

How to effectively clean social media data for analysis

Amey Varangaonkar

26 Dec 2017

10 min read

[box type="note" align="" class="" width=""]This article is a book extract from Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk.[/box] Data cleaning and preprocessing is an essential - and often crucial - part of any analytical process. In this excerpt, we explain the different techniques and mechanisms for effective analysis of your social media data. Social media contains different types of data: information about user profiles, statistics (number of likes or number of followers), verbatims, and other media content. Quantitative data is very convenient for an analysis using statistical and numerical methods, but unstructured data such as user comments is much more challenging. To get meaningful information, one has to perform the whole process of information retrieval. It starts with the definition of the data type and data structure. On social media, unstructured data is related to text, images, videos, and sound and we will mostly deal with textual data. Then, the data has to be cleaned and normalized. Only after all these steps can we delve into the analysis. Social media Data type and encoding Comments and conversation are textual data that we retrieve as strings. In brief, a string is a sequence of characters represented by code points. Every string in Python is seen as a Unicode covering the numbers from 0 through 0x10FFFF (1,114,111 decimal). Then, the sequence has to be represented as a set of bytes (values from 0 to 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called encoding. Encoding plays a very important role in natural language processing because people use more and more characters such as emojis or emoticons, which replace whole words and express emotions. Moreover, in many languages, there are accents that go beyond the regular English alphabet. In order to deal with all the processing problems that might be caused by these, we have to use the right encoding, because comparing two strings with different encodings is actually like comparing apples and oranges. The most common one is UTF-8, used by default in Python 3, which can handle any type of character. As a rule of thumb always normalize your data to Unicode UTF-8. Structure of social media data Another question we'll encounter is, What is the right structure for our data? The most natural choice is a list that can store a sequence of data points (verbatims, numbers, and so on). However, the use of lists will not be efficient on large datasets and we'll be constrained to use sequential processing of the data. That is why a much better solution is to store the data in a tabular format in pandas dataframe, which has multiple advantages for further processing. First of all, rows are indexed, so search operations become much faster. There are also many optimized methods for different kinds of processing and above all it allows you to optimize your own processing by using functional programming. Moreover, a row can contain multiple fields with metadata about verbatims, which are very often used in our analysis. It is worth remembering that the dataset in pandas must fit into RAM memory. For bigger datasets, we suggest the use of SFrames. Pre-processing and text normalization Preprocessing is one of the most important parts of the analysis process. It reformats the unstructured data into uniform, standardized form. The characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages. The quality of the preprocessing has a big impact of the final result on the whole process. There are several stages of the process: from simple text cleaning by removing white spaces, punctuation, HTML tags and special characters up to more sophisticated normalization techniques such as tokenization, stemming or lemmatization. In general, the main aim is to keep all the characters and words that are important for the analysis and, at the same time, get rid of all others, and the text corpus should be maintained in one uniform format. We import all necessary libraries. import re, itertools import nltk from nltk.corpus import stopwords When dealing with raw text, we usually have a set of words including many details we are not interested in, such as whitespace, line breaks, and blank lines. Moreover, many words contain capital letters so programming languages misinterpret for example, "go" and "Go" as two different words. In order to handle such distinctions, we can convert all words to lowercase format with the following steps: Perform basic text mining cleaning. Remove all whitespaces: verbatim = verbatim.strip() Many text processing tasks can be done via pattern matching. We can find words containing a character and replace it with another one or just remove it. Regular expressions give us a powerful and flexible method for describing the character patterns we are interested in. They are commonly used in cleaning punctuation, HTML tags, and URLs paths. 3. Remove punctuation: verbatim = re.sub(r'[^ws]','',verbatim) 4. Remove HTML tags: verbatim = re.sub('<[^<]+?>', '', verbatim) 5. Remove URLs: verbatim = re.sub(r'^https?://.*[rn]*', '', verbatim, flags=re.MULTILINE) Depending on the quality of the text corpus, sometimes there is a need to implement some corrections. This refers to the text sources such as Twitter or forums, where emotions can play a role and the comments contain multiple letters words for example, "happpppy" instead of "happy" 6. Standardize words (remove multiple letters): verbatim = ''.join(''.join(s)[:2] for _, s in itertools.groupby(verbatim)) After removal of punctuation or white spaces, words can be attached. This happens especially when deleting the periods at the end of the sentences. The corpus might look like: "the brown dog is lostEverybody is looking for him". So there is a need to split "lostEverybody" into two separate words. 7. Split attached words: verbatim = " ".join(re.findall('[A-Z][^A-Z]*', verbatim)) Stop words are basically a set of commonly used words in any language: mainly determiners, prepositions, and coordinating conjunctions. By removing the words that are very commonly used in a given language, we can focus only on the important words instead, and improve the accuracy of the text processing. 8. Convert text to lowercase, lower(): verbatim = verbatim.lower() 9. Stop word removal: verbatim = ' '.join([word for word in verbatim.split() if word not in (stopwords.words('english'))]) 10. Stemming and lemmatization: The main aim of stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Stemming reduces word forms to so-called stems, whereas lemmatization reduces word forms to linguistically valid lemmas. Some examples of stemming are cars -> car, men -> man, and went -> Go Such text processing can give added value in some domains, and may improve the accuracy of practical information extraction tasks Tokenization: Tokenization is the process of breaking a text corpus up into words (most commonly), phrases, or other meaningful elements, which are then called tokens. The tokens become the basic units for further text processing. tokens = nltk.word_tokenize(verbatim) Other techniques are spelling correction, domain knowledge, and grammar checking. Duplicate removal Depending on data source we might notice multiple duplicates in our dataset. The decision to remove duplicates should be based on the understanding of the domain. In most cases, duplicates come from errors in data collection process and it is recommended to remove them in order to reduce bias in our analysis, with the help of the following: df = df.drop_duplicates(subset=['column_name']) Knowing basic text cleaning techniques, we can now learn how to store the data in an efficient way. For this purpose, we will explain how to use one of the most convenient NoSQL databases—MongoDB. Capture: Once you have made a connection to your API you need to make a special request and receive the data at your end. This step requires you go through the data to be able to understand it. Often the data is received in a special format called JavaScript Object Notation (JSON). JSON was created to enable a lightweight data interchange between programs. The JSON resembles the old XML format and consists of a key-value pair. Normalization: The data received from platforms are not in an ideal format to perform analysis. With textual data there are many different approaches to normalization. One can be stripping whitespaces surrounding verbatims, or converting all verbatims to lowercase, or changing the encoding to UTF-8. The point is that if we do not maintain a standard protocol for normalization, we will introduce many unintended errors. The goal of normalization is to transform all your data in a consistent manner that ensures a uniform standardization of your data. It is recommended that you create wrapper functions for your normalization techniques, and then apply these wrappers on all your data input points so as to ensure that all the data in your analysis go through exactly the same normalization process. In general, one should always perform the following cleaning steps: Normalize the textual content: Normalization generally contains at least the following steps: Stripping surrounding whitespaces. Lowercasing the verbatim. Universal encoding (UTF-8). 2. Remove special characters (example: punctuation). 3. Remove stop words: Irrespective of the language stop words add no additional informative value to the analysis, except in the case of deep parsing where stop words can be bridge connectors between targeted words. 4. Splitting attached words. 5. Removal of URLs and hyperlinks: URLs and hyperlinks can be studied separately, but due to the lack of grammatical structure they are by convention removed from verbatims. 6. Slang lookups: This is a relatively difficult task, because here we would require a predefined vocabulary of slang words and their proper reference words, for example: luv maps to love. Such dictionaries are available on the open web, but there is always a risk of them being outdated. In the case of studying words and not phrases (or n-grams), it is very important to do the following: Tokenize verbatim Stemming and lemmatization (Optional): This is where different written forms of the same word do not hold additional meaning to your study Some advanced cleaning procedures are: Grammar checking: Grammar checking is mostly learning-based, a huge amount of proper text data is learned, and models are created for the purpose of grammar correction. There are many online tools that are available for grammar correction purposes. This is a very tricky cleaning technique because language style and structure can change from source to source (for example language on Twitter will not correspond with the language from published books). Wrongly correcting grammar can have negative effects on the analysis. Spelling correction: In natural language, misspelled errors are encountered. Companies, such as Google and Microsoft have achieved a decent accuracy level in automated spell correction. One can use algorithms such as the Levenshtein Distances, Dictionary Lookup, and so on, or other modules and packages to fix these errors. Again take spell correction with a grain of salt, because false positives can affect the results. Storing: Once the data is received, normalized, and/or cleaned, we need to store the data in an efficient storage database. In this book we have chosen MongoDB as the database as it's a modern and scalable database. It's also relatively easy to use and get started. However, other databases such as Cassandra or HBase could also be used depending on expertise and objectives. Data cleaning and preprocessing, although tedious, can simplify your data analysis work. With the effective Python packages like Numpy, SciPy, Pandas etc these tasks become so much easy and save a lot of your time. If you found this piece of information useful, make sure to check out our book Python Social Media Analytics, which will help you draw actionable insights from mining social media portals such as GitHub, Twitter, YouTube, and more!

0
1
43910

article-image-store-access-social-media-data-mongodb

Amey Varangaonkar

26 Dec 2017

6 min read

How to store and access social media data in MongoDB

Amey Varangaonkar

26 Dec 2017

6 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Python Social Media Analytics, co-authored by Siddhartha Chatterjee and Michal Krystyanczuk.[/box] Our article explains how to effectively perform different operations using MongoDB and Python to effectively access and modify the data. According to the official MongoDB page: MongoDB is free and open-source, distributed database which allows ad hoc queries, indexing, and real time aggregation to access and analyze your data. It is published under the GNU Affero General Public License and stores data in flexible, JSON-like documents, meaning fields can vary from document to document and the data structure can be changed over time. Along with ease of use, MongoDB is recognized for the following advantages: Schema-less design: Unlike traditional relational databases, which require the data to fit its schema, MongoDB provides a flexible schema-less data model. The data model is based on documents and collections. A document is essentially a JSON structure and a collection is a group of documents. One links data within collections using specific identifiers. The document model is quite useful in this subject as most social media APIs provide their data in JSON format. High performance: Indexing and GRIDFS features of MongoDB provide fast access and storage. High availability: Duplication feature that allows us to make various copies of databases in different nodes confirms high availability in the case of node failures. Automatic scaling: The Sharding feature of MongoDB scales large data sets Automatically. You can access information on the implementation of Sharding in the official documentation of MongoDB: https://docs.mongodb.com/v3.0/sharding/ Installing MongoDB MongoDB can be downloaded and installed from the following link: http://www.mongodb.org/downloads?_ga=1.253005644.410512988.1432811016. Setting up the environment MongoDB requires a data directory to store all the data. The directory can be created in your working directory: md datadb Starting MongoDB We need to go to the folder where mongod.exe is stored and and run the following command: cmd binmongod.exe Once the MongoDB server is running in the background, we can switch to our Python environment to connect and start working. MongoDB using Python MongoDB can be used directly from the shell command or through programming languages. For the sake of our book we'll explain how it works using Python. MongoDB is accessed using Python through a driver module named PyMongo. We will not go into the detailed usage of MongoDB, which is beyond the scope of this book. We will see the most common functionalities required for analysis projects. We highly recommend reading the official MongoDB documentation. PyMongo can be installed using the following command: pip install pymongo Then the following command imports it in the Python script from pymongo import MongoClient client = MongoClient('localhost:27017') The database structure of MongoDB is similar to SQL languages, where you have databases, and inside databases you have tables. In MongoDB you have databases, and inside them you have collections. Collections are where you store the data, and databases store multiple collections. As MongoDB is a NoSQL database, your tables do not need to have a predefined structure, you can add documents of any composition as long as they are a JSON object. But by convention is it best practice to have a common general structure for documents in the same collections. To access a database named scrapper we simply have to do the following: db_scrapper = db.scrapper To access a collection named articles in the database scrapper we do this: db_scrapper = db.scrapper collection_articles = db_scrapper.articles Once you have the client object initiated you can access all the databases and the collections very easily. Now, we will see how to perform different operations: Insert: To insert a document into a collection we build a list of new documents to insert into the database: docs = [] for _ in range(0, 10): # each document must be of the python type dict docs.append({ "author": "...", "content": "...", "comment": ["...", ... ] }) Inserting all the docs at once: db.collection.insert_many(docs) Or you can insert them one by one: for doc in docs: db.collection.insert_one(doc) You can find more detailed documentation at: https://docs.mongodb.com/v3.2/tutorial/insert-documents/. Find: To fetch all documents within a collection: # as the find function returns a cursor we will iterate over the cursor to actually fetch # the data from the database docs = [d for d in db.collection.find()] To fetch all documents in batches of 100 documents: batch_size = 100 Iteration = 0 count = db.collection.count() # getting the total number of documents in the collection while iteration * batch_size < count: docs = [d for d in db.collection.find().skip(batch_size * iteration).limit(batch_size)] Iteration += 1 To fetch documents using search queries, where the author is Jean Francois: query = {'author': 'Jean Francois'} docs = [d for d in db.collection.find(query) Where the author field exists and is not null: query = {'author': {'$exists': True, '$ne': None}} docs = [d for d in db.collection.find(query)] There are many other different filtering methods that provide a wide variety of flexibility and precision; we highly recommend taking your time going through the different search operators. You can find more detailed documentation at: https://docs.mongodb.com/v3.2/reference/method/db.collection.find/ Update: To update a document where the author is Jean Francois and set the attribute published as True: query_search = {'author': 'Jean Francois'} query_update = {'$set': {'published': True}} db.collection.update_many(query_search, query_update) Or you can update just the first matching document: db.collection.update_one(query_search, query_update) Find more detailed documentation at: https://docs.mongodb.com/v3.2/reference/method/db.collection.update/ Remove: Remove all documents where the author is Jean Francois: query_search = {'author': 'Jean Francois'} db.collection.delete_many(query_search, query_update) Or remove the first matching document: db.collection.delete_one(query_search, query_update) Find more detailed documentation at: https://docs.mongodb.com/v3.2/tutorial/remove-documents/ Drop: You can drop collections by the following: db.collection.drop() Or you can drop the whole database: db.dropDatabase() We saw how to store and access data from MongoDB. MongoDB has gained a lot of popularity and is the preferred database choice for many, especially when it comes to working with social media data. If you found our post to be useful, do make sure to check out Python Social Media Analytics, which contains useful tips and tricks on leveraging the power of Python for effective data analysis from various social media sites such as YouTube, GitHub, Twitter etc.

0
0
7489

article-image-mine-popular-trends-on-github-using-python-part-1

Amey Varangaonkar

26 Dec 2017

11 min read

Mine Popular Trends on GitHub using Python - Part 1

Amey Varangaonkar

26 Dec 2017

11 min read

[box type="note" align="" class="" width=""]This interesting article is an excerpt from the book Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk. The book contains useful techniques to gain valuable insights from different social media channels using popular Python packages.[/box] In this article, we explore how to leverage the power of Python in order to gather and process data from GitHub and make it analysis-ready. Those who love to code, love GitHub. GitHub has taken the widely used version controlling approach to coding to the highest possible level by implementing social network features to the world of programming. No wonder GitHub is also thought of as Social Coding. We thought a book on Social Network analysis would not be complete without a use case on data from GitHub. GitHub allows you to create code repositories and provides multiple collaborative features, bug tracking, feature requests, task managements, and wikis. It has about 20 million users and 57 million code repositories (source: Wikipedia). These kind of statistics easily demonstrate that this is the most representative platform of programmers. It's also a platform for several open source projects that have contributed greatly to the world of software development. Programming technology is evolving at such a fast pace, especially due to the open source movement, and we have to be able to keep a track of emerging technologies. Assuming that the latest programming tools and technologies are being used with GitHub, analyzing GitHub could help us detect the most popular technologies. The popularity of repositories on GitHub is assessed through the number of commits it receives from its community. We will use the GitHub API in this chapter to gather data around repositories with the most number of commits and then discover the most popular technology within them. For all we know, the results that we get may reveal the next great innovations. Scope and process GitHub API allows us to get information about public code repositories submitted by users. It covers lots of open-source, educational and personal projects. Our focus is to find the trending technologies and programming languages of last few months, and compare with repositories from past years. We will collect all the meta information about the repositories such as: Name: The name of the repository Description: A description of the repository Watchers: People following the repository and getting notified about its activity Forks: Users cloning the repository to their own accounts Open Issues: Issues submitted about the repository We will use this data, a combination of qualitative and quantitative information, to identify the most recent trends and weak signals. The process can be represented by the steps shown in the following figure: Getting the data Before using the API, we need to set the authorization. The API gives you access to all publicly available data, but some endpoints need user permission. You can create a new token with some specific scope access using the application settings. The scope depends on your application's needs, such as accessing user email, updating user profile, and so on. Password authorization is only needed in some cases, like access by user authorized applications. In that case, you need to provide your username or email, and your password. All API access is over HTTPS, and accessed from the https://api.github.com/ domain. All data is sent and received as JSON. Rate Limits The GitHub Search API is designed to help to find specific items (repository, users, and so on). The rate limit policy allows up to 1,000 results for each search. For requests using basic authentication, OAuth, or client ID and secret, you can make up to 30 requests per minute. For unauthenticated requests, the rate limit allows you to make up to 10 requests per minute. Connection to GitHub GitHub offers a search endpoint which returns all the repositories matching a query. As we go along, in different steps of the analysis we will change the value of the variable q (query). In the first part, we will retrieve all the repositories created since January 1, 2017 and then we will compare the results with previous years. Firstly, we initialize an empty list results which stores all data about repositories. Secondly, we build get requests with parameters required by the API. We can only get 100 results per request, so we have to use a pagination technique to build a complete dataset. results = [] q = "created:>2017-01-01" def search_repo_paging(q): url = 'https://api.github.com/search/repositories' params = {'q' : q, 'sort' : 'forks', 'order': 'desc', 'per_page' : 100} while True: res = requests.get(url,params = params) result = res.json() results.extend(result['items']) params = {} try: url = res.links['next']['url'] except: break In the first request we have to pass all the parameters to the GET method in our request. Then, we make a new request for every next page, which can be found in res.links['next']['url']. res. links contains a full link to the resources including all the other parameters. That is why we empty the params dictionary. The operation is repeated until there is no next page key in res.links dictionary. For other datasets we modify the search query in such a way that we retrieve repositories from previous years. For example to get the data from 2015 we define the following query: q = "created:2015-01-01..2015-12-31" In order to find proper repositories, the API provides a wide range of query parameters. It is possible to search for repositories with high precision using the system of qualifiers. Starting with main search parameters q, we have following options: sort: Set to forks as we are interested in finding the repositories having the largest number of forks (you can also sort by number of stars or update time) order: Set to descending order per_page: Set to the maximum amount of returned repositories Naturally, the search parameter q can contain multiple combinations of qualifiers. Data pull The amount of data we collect through GitHub API is such that it fits in memory. We can deal with it directly in a pandas dataframe. If more data is required, we would recommend storing it in a database, such as MongoDB. We use JSON tools to convert the results into a clean JSON and to create a dataframe. from pandas.io.json import json_normalize import json import pandas as pd import bson.json_util as json_util sanitized = json.loads(json_util.dumps(results)) normalized = json_normalize(sanitized) df = pd.DataFrame(normalized) The dataframe df contains columns related to all the results returned by GitHub API. We can list them by typing the following: Df.columns Index(['archive_url', 'assignees_url', 'blobs_url', 'branches_url', 'clone_url', 'collaborators_url', 'comments_url', 'commits_url', 'compare_url', 'contents_url', 'contributors_url', 'default_branch', 'deployments_url', 'description', 'downloads_url', 'events_url', 'Fork', 'forks', 'forks_count', 'forks_url', 'full_name', 'git_commits_url', 'git_refs_url', 'git_tags_url', 'git_url', 'has_downloads', 'has_issues', 'has_pages', 'has_projects', 'has_wiki', 'homepage', 'hooks_url', 'html_url', 'id', 'issue_comment_url', 'Issue_events_url', 'issues_url', 'keys_url', 'labels_url', 'language', 'languages_url', 'merges_url', 'milestones_url', 'mirror_url', 'name', 'notifications_url', 'open_issues', 'open_issues_count', 'owner.avatar_url', 'owner.events_url', 'owner.followers_url', 'owner.following_url', 'owner.gists_url', 'owner.gravatar_id', 'owner.html_url', 'owner.id', 'owner.login', 'Owner.organizations_url', 'owner.received_events_url', 'owner.repos_url', 'owner.site_admin', 'owner.starred_url', 'owner.subscriptions_url', 'owner.type', 'owner.url', 'private', 'pulls_url', 'pushed_at', 'releases_url', 'score', 'size', 'ssh_url', 'stargazers_count', 'stargazers_url', 'statuses_url', 'subscribers_url', 'subscription_url', 'svn_url', 'tags_url', 'teams_url', 'trees_url', 'updated_at', 'url', 'Watchers', 'watchers_count', 'year'], dtype='object') Then, we select a subset of variables which will be used for further analysis. Our choice is based on the meaning of each of them. We skip all the technical variables related to URLs, owner information, or ID. The remaining columns contain information which is very likely to help us identify new technology trends: description: A user description of a repository watchers_count: The number of watchers size: The size of repository in kilobytes forks_count: The number of forks open_issues_count: The number of open issues language: The programming language the repository is written in We have selected watchers_count as the criterion to measure the popularity of repositories. This number indicates how many people are interested in the project. However, we may also use forks_count which gives us slightly different information about the popularity. The latter represents the number of people who actually worked with the code, so it is related to a different group. Data processing In the previous step we structured the raw data which is now ready for further analysis. Our objective is to analyze two types of data: Textual data in description Numerical data in other variables Each of them requires a different pre-processing technique. Let's take a look at each type in Detail. Textual data For the first kind, we have to create a new variable which contains a cleaned string. We will do it in three steps which have already been presented in previous chapters: Selecting English descriptions Tokenization Stopwords removal As we work only on English data, we should remove all the descriptions which are written in other languages. The main reason to do so is that each language requires a different processing and analysis flow. If we left descriptions in Russian or Chinese, we would have very noisy data which we would not be able to interpret. As a consequence, we can say that we are analyzing trends in the English-speaking world. Firstly, we remove all the empty strings in the description column. df = df.dropna(subset=['description']) In order to remove non-English descriptions we have to first detect what language is used in each text. For this purpose we use a library called langdetect which is based on the Google language detection project (https://github.com/shuyo/language-detection). from langdetect import detect df['lang'] = df.apply(lambda x: detect(x['description']),axis=1) We create a new column which contains all the predictions. We see different languages, such as en (English), zh-cn (Chinese), vi (Vietnamese), or ca (Catalan). df['lang'] 0 en 1 en 2 en 3 en 4 en 5 zh-cn In our dataset en represents 78.7% of all the repositories. We will now select only those repositories with a description in English: df = df[df['lang'] == 'en'] In the next step, we will create a new clean column with pre-processed textual data. We execute the following code to perform tokenization and remove stopwords: import nltk from nltk import word_tokenize from nltk.corpus import stopwords def clean(text = '', stopwords = []): #tokenize tokens = word_tokenize(text.strip()) #lowercase clean = [i.lower() for i in tokens] #remove stopwords clean = [i for i in clean if i not in stopwords] #remove punctuation punctuations = list(string.punctuation) clean = [i.strip(''.join(punctuations)) for i in clean if i not in punctuations] return " ".join(clean) df['clean'] = df['description'].apply(str) #make sure description is a string df['clean'] = df['clean'].apply(lambda x: clean(text = x, stopwords = stopwords.words('english'))) Finally, we obtain a clean column which contains cleaned English descriptions, ready for analysis: df['clean'].head(5) 0 roadmap becoming web developer 2017 1 base repository imad v2 course application ple… 2 decrypted content eqgrp-auction-file.tar.xz 3 shadow brokers lost translation leak 4 learn design large-scale systems prep system d... Numerical data For numerical data, we will check statistically both what the distribution of values is and whether there are any missing values: df[['watchers_count','size','forks_count','open_issues']].describe() We see that there are no missing values in all four variables: watchers_count, size, forks_count, and open_issues. The watchers_count varies from 0 to 20,792 while the minimum number of forks is 33 and goes up to 2,589. The first quartile of repositories has no open issues while top 25% have more than 12. It is worth noticing that, in our dataset, there is a repository which has 458 open issues. Once we are done with the pre-processing of the data, our next step would be to analyze it, in order to get actionable insights from it. If you found this article to be useful, stay tuned for Part 2, where we perform analysis on the processed GitHub data and determine the top trending technologies. Alternatively, you can check out the book Python Social Media Analytics, to learn how to get valuable insights from various social media sites such as Facebook, Twitter and more.

0
0
24459

article-image-10-machine-learning-tools-to-look-out-for-in-2018

Amey Varangaonkar

26 Dec 2017

7 min read

10 Machine Learning Tools to watch in 2018

Amey Varangaonkar

26 Dec 2017

7 min read

0
0
22150

article-image-data-science-saved-christmas

Aaron Lazar

22 Dec 2017

9 min read

How Data Science saved Christmas

Aaron Lazar

22 Dec 2017

9 min read

It’s the middle of December and it’s shivery cold in the North Pole at -20°C. A fat old man sits on a big brown chair, beside the fireplace, stroking his long white beard. His face has a frown on it, quite his unusual self. Mr. Claus quips, “Ruddy mailman should have been here by now! He’s never this late to bring in the li'l ones’ letters.” [caption id="attachment_3284" align="alignleft" width="300"] Nervous Santa Claus on Christmas Eve, he is sitting on the armchair and resting head on his hands[/caption] Santa gets up from his chair, his trouser buttons crying for help, thanks to his massive belly. He waddles over to the window and looks out. He’s sad that he might not be able to get the children their gifts in time, this year. Amidst the snow, he can see a glowing red light. “Oh Rudolph!” he chuckles. All across the living room are pictures of little children beaming with joy, holding their presents in their hands. A small smile starts building and then suddenly, Santa gets a new-found determination to get the presents over to the children, come what may! An idea strikes him as he waddles over to his computer room. Now Mr. Claus may be old on the outside, but on the inside, he’s nowhere close! He recently set up a new rig, all by himself. Six Nvidia GTX Titans, coupled with sixteen gigs of RAM, a 40-inch curved monitor that he uses to keep an eye on who’s being naughty or nice, and a 1000 watt home theater system, with surround sound, heavy on the bass. On the inside, he’s got a whole load of software on the likes of the Python language (not the Garden of Eden variety), OpenCV - his all-seeing eye that’s on the kids and well, Tensorflow et al. Now, you might wonder what an old man is doing with such heavy software and hardware. A few months ago, Santa caught wind that there’s a new and upcoming trend that involves working with tonnes of data, cleaning, processing and making sense of it. The idea of crunching data somehow tickled the old man and since then, the jolly good master tinkerer and his army of merry elves have been experimenting away with data. Santa’s pretty much self-taught at whatever he does, be it driving a sleigh or learning something new. A couple of interesting books he picked up from Packt were, Python Data Science Essentials - Second Edition, Hands-On Data Science and Python Machine Learning, and Python Machine Learning - Second Edition. After spending some time on the internet, he put together a list of things he needed to set up his rig and got them from Amazon. [caption id="attachment_3281" align="alignright" width="300"] Santa Claus is using a laptop on the top of a house[/caption] He quickly boots up the computer and starts up Tensorflow. He needs to come up with a list of probable things that each child would have wanted for Christmas this year. Now, there are over 2 billion children in the world and finding each one’s wish is going to be more than a task! But nothing is too difficult for Santa! He gets to work, his big head buried in his keyboard, his long locks falling over his shoulder. So, this was his plan: Considering that the kids might have shared their secret wish with someone, Santa plans to tackle the problem from different angles, to reach a higher probability of getting the right gifts: He plans to gather email and Social Media data from all the kids’ computers - all from the past month It’s a good thing kids have started owning phones at such an early age now - he plans to analyze all incoming and outgoing phone calls that have happened over the course of the past month He taps into every country's local police department’s records to stream all security footage all over the world [caption id="attachment_3288" align="alignleft" width="300"] A young boy wearing a red Christmas hat and red sweater is writing a letter to Santa Claus. The child is sitting at a wooden table in front of a Christmas tree.[/caption] If you’ve reached till here, you’re probably wondering whether this article is about Mr.Claus or Mr.Bond. Yes, the equipment and strategy would have fit an MI6 or a CIA agent’s role. You never know, Santa might just be a retired agent. Do they ever retire? Hmm! Anyway, it takes a while before he can get all the data he needs. He trusts Spark to sort this data in order, which is stored in a massive data center in his basement (he’s a bit cautious after all the news about data breaches). And he’s off to work! He sifts through the emails and messages, snorting from time to time at some of the hilarious ones. Tensorflow rips through the data, picking out keywords for Santa. It takes him a few hours to get done with the emails and social media data alone! By the time he has a list, it’s evening and time for supper. Santa calls it a day and prepares to continue the next day. The next day, Santa gets up early and boots up his equipment as he brushes and flosses. He plonks himself in the huge swivel chair in front of the monitor, munching on freshly baked gingerbread. He starts tapping into all the phone company databases across the world, fetching all the data into his data center. Now, Santa can’t afford to spend the whole time analyzing voices himself, so he lets Tensorflow analyze voices and segregate the keywords it picks up from the voice signals. Every kid’s name to a possible gift. Now there were a lot of unmentionable things that got linked to several kids names. Santa almost fell off his chair when he saw the list. “These kids grow up way too fast, these days!” It’s almost 7 PM in the evening when Santa realizes that there’s way too much data to process in a day. A few days later, Santa returns to his tech abode, to check up on the progress of the call data processing. There’s a huge list waiting in front of him. He thinks to himself, “This will need a lot of cleaning up!” He shakes his head thinking, I should have started with this! He now has to munge through that camera footage! Santa had never worked on so much data before so he started to get a bit worried that he might be unable to analyze it in time. He started pacing around the room trying to think up a workaround. Time was flying by and he still did not know how to speed up the video analyses. Just when he’s about to give up, the door opens and Beatrice walks in. Santa almost trips as he runs to hug his wife! Beatrice is startled for a bit but then breaks into a smile. “What is it dear? Did you miss me so much?” Santa replies, “You can’t imagine how much! I’ve been doing everything on my own and I really need your help!” Beatrice smiles and says, “Well, what are we waiting for? Let’s get down to it!” Santa explains the problem to Beatrice in detail and tells her how far he’s reached in the analysis. Beatrice thinks for a bit and asks Santa, “Did you try using Keras on top of TensorFlow?” Santa, blank for a minute, nods his head. Beatrice continues, “Well from my experience, Keras gives TensorFlow a boost of about 10%, which should help quicken the analysis. Santa just looks like he’s made the best decision marrying Beatrice and hugs her again! “Bea, you’re a genius!” he cries out. “Yeah, and don’t forget to use Matplotlib!” she yells back as Santa hurries back to his abode. He’s off to work again, this time saddling up Keras to work on top of TensorFlow. Hundreds and thousands of terabytes of video data flowing into the machines. He channels the output through OpenCV and ties it with TensorFlow to add a hint of Deep Learning. He quickly types out some Python scripts to integrate both the tools to create the optimal outcome. And then the wait begins. Santa keeps looking at his watch every half hour, hoping that the processing happens fast. The hardware has begun heating up quite a bit and he quickly races over to bring a cooler that’s across the room. While he waits for the videos to finish up, he starts working on sifting out the data from the text and audio. He remembers what Beatrice said and uses Matplotlib to visualize it. Soon he has a beautiful map of the world with all the children’s names and their possible gifts beside. Three days later, the video processing gets done Keras truly worked wonders for TensorFlow! Santa now has another set of data to help him narrow down the gift list. A few hours later he’s got his whole list visualized on Matplotlib. [caption id="attachment_3289" align="alignleft" width="300"] Santa Claus riding on sleigh with gift box against snow falling on fir tree forest[/caption] There’s one last thing left to do! He suits up in red and races out the door to Rudolph and the other reindeer, unties them from the fence and leads them over to the sleigh. Once they’re fastened, he loads up an empty bag onto the sleigh and it magically gets filled up. He quickly checks it to see if all is well and they’re off! It’s Christmas morning and all the kids are racing out of bed to rip their presents open! There are smiles all around and everyone’s got a gift, just as the saying goes! Even the ones who’ve been naughty have gotten gifts. Back in the North Pole, the old man is back in his abode, relaxing in an easy chair with his legs up on the table. The screen in front of him runs real-time video feed of kids all over the world opening up their presents. A big smile on his face, Santa turns to look out the window at the glowing red light amongst the snow, he takes a swig of brandy from a hip flask. Thanks to Data Science, this Christmas is the merriest yet!

0
0
15622

article-image-how-to-stream-and-store-tweets-in-apache-kafka

Fatema Patrawala

22 Dec 2017

8 min read

How to stream and store tweets in Apache Kafka

Fatema Patrawala

22 Dec 2017

8 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Ankit Jain titled Mastering Apache Storm. This book explores various real-time processing functionalities offered by Apache Storm such as parallelism, data partitioning, and more.[/box] Today, we are going to cover how to stream tweets from Twitter using the twitter streaming API. We are also going to explore how we can store fetched tweets in Kafka for later processing through Storm. Setting up a single node Kafka cluster Following are the steps to set up a single node Kafka cluster: Download the Kafka 0.9.x binary distribution named kafka_2.10-0.9.0.1.tar.gz from http://apache.claz.org/kafka/0.9.0. or 1/kafka_2.10-0.9.0.1.tgz. Extract the archive to wherever you want to install Kafka with the following command: tar -xvzf kafka_2.10-0.9.0.1.tgz cd kafka_2.10-0.9.0.1 Change the following properties in the $KAFKA_HOME/config/server.properties file: log.dirs=/var/kafka- logszookeeper.connect=zoo1:2181,zoo2:2181,zoo3:2181 Here, zoo1, zoo2, and zoo3 represent the hostnames of the ZooKeeper nodes. The following are the definitions of the important properties in the server.properties file: broker.id: This is a unique integer ID for each of the brokers in a Kafka cluster. port: This is the port number for a Kafka broker. Its default value is 9092. If you want to run multiple brokers on a single machine, give a unique port to each broker. host.name: The hostname to which the broker should bind and advertise itself. log.dirs: The name of this property is a bit unfortunate as it represents not the log directory for Kafka, but the directory where Kafka stores the actual data sent to it. This can take a single directory or a comma-separated list of directories to store data. Kafka throughput can be increased by attaching multiple physical disks to the broker node and specifying multiple data directories, each lying on a different disk. It is not much use specifying multiple directories on the same physical disk, as all the I/O will still be happening on the same disk. num.partitions: This represents the default number of partitions for newly created topics. This property can be overridden when creating new topics. A greater number of partitions results in greater parallelism at the cost of a larger number of files. log.retention.hours: Kafka does not delete messages immediately after consumers consume them. It retains them for the number of hours defined by this property so that in the event of any issues the consumers can replay the messages from Kafka. The default value is 168 hours, which is 1 week. zookeeper.connect: This is the comma-separated list of ZooKeeper nodes in hostname:port form. Start the Kafka server by running the following command: > ./bin/kafka-server-start.sh config/server.properties [2017-04-23 17:44:36,667] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) [2017-04-23 17:44:36,668] INFO Kafka version : 0.9.0.1 (org.apache.kafka.common.utils.AppInfoParser) [2017-04-23 17:44:36,668] INFO Kafka commitId : a7a17cdec9eaa6c5 (org.apache.kafka.common.utils.AppInfoParser) [2017-04-23 17:44:36,670] INFO [Kafka Server 0], started (kafka.server.KafkaServer) If you get something similar to the preceding three lines on your console, then your Kafka broker is up-and-running and we can proceed to test it. Now we will verify that the Kafka broker is set up correctly by sending and receiving some test messages. First, let's create a verification topic for testing by executing the following command: > bin/kafka-topics.sh --zookeeper zoo1:2181 --replication-factor 1 --partition 1 --topic verification-topic --create Created topic "verification-topic". Now let's verify if the topic creation was successful by listing all the topics: > bin/kafka-topics.sh --zookeeper zoo1:2181 --list verification-topic The topic is created; let's produce some sample messages for the Kafka cluster. Kafka comes with a command-line producer that we can use to produce messages: > bin/kafka-console-producer.sh --broker-list localhost:9092 -- topic verification-topic Write the following messages on your console: Message 1 Test Message 2 Message 3 Let's consume these messages by starting a new console consumer on a new console window: > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic verification-topic --from-beginning Message 1 Test Message 2 Message 3 Now, if we enter any message on the producer console, it will automatically be consumed by this consumer and displayed on the command line. Collecting Tweets We are assuming you already have a twitter account, and that the consumer key and access token are generated for your application. You can refer to: https://bdthemes.com/support/knowledge-base/generate-api-key-consumer-token-acc ess-key-twitter-oauth/ to generate a consumer key and access token. Take the following steps: Create a new maven project with groupId, com.storm advance and artifactId, kafka_producer_twitter. Add the following dependencies to the pom.xml file. We are adding the Kafka and Twitter streaming Maven dependencies to pom.xml to support the Kafka Producer and the streaming tweets from Twitter: <dependencies> <dependency> <groupId>org.apache.kafka</groupId> <artifactId>kafka_2.10</artifactId> <version>0.9.0.1</version> <exclusions> <exclusion> <groupId>com.sun.jdmk</groupId> <artifactId>jmxtools</artifactId> </exclusion> <exclusion> <groupId>com.sun.jmx</groupId> <artifactId>jmxri</artifactId> </exclusion> </exclusions> </dependency> <dependency> <groupId>org.apache.logging.log4j</groupId> <artifactId>log4j-slf4j-impl</artifactId> <version>2.0-beta9</version> </dependency> <dependency> <groupId>org.apache.logging.log4j</groupId> <artifactId>log4j-1.2-api</artifactId> <version>2.0-beta9</version> </dependency>  <dependency> <groupId>org.twitter4j</groupId> <artifactId>twitter4j-stream</artifactId> <version>4.0.6</version> </dependency> </dependencies> 3. Now, we need to create a class, TwitterData, that contains the code to consume/stream data from Twitter and publish it to the Kafka cluster. We are assuming you already have a running Kafka cluster and topic, twitterData, created in the Kafka cluster. for information on the installation of the Kafka cluster and the creation of a Kafka please refer to . The class contains an instance of the twitter4j.conf.ConfigurationBuilder class; we need to set the access token and consumer keys in configuration, as mentioned in the source code.4. The twitter4j.StatusListener class returns the continuous stream of tweets inside the onStatus() method. We are using the Kafka Producer code inside the onStatus() method to publish the tweets in Kafka. The following is the source code for the TwitterData class: public class TwitterData { /** The actual Twitter stream. It's set up to collect raw JSON data */ private TwitterStream twitterStream; static String consumerKeyStr = "r1wFskT3q"; static String consumerSecretStr = "fBbmp71HKbqalpizIwwwkBpKC"; static String accessTokenStr = "298FPfE16frABXMcRIn7aUSSnNneMEPrUuZ"; static String accessTokenSecretStr = "1LMNZZIfrAimpD004QilV1pH3PYTvM"; public void start() { ConfigurationBuilder cb = new ConfigurationBuilder(); cb.setOAuthConsumerKey(consumerKeyStr); cb.setOAuthConsumerSecret(consumerSecretStr); cb.setOAuthAccessToken(accessTokenStr); cb.setOAuthAccessTokenSecret(accessTokenSecretStr); cb.setJSONStoreEnabled(true); cb.setIncludeEntitiesEnabled(true); // instance of TwitterStreamFactory twitterStream = new TwitterStreamFactory(cb.build()).getInstance(); final Producer<String, String> producer = new KafkaProducer<String, String>(getProducerConfig());// topicDetails CreateTopic("127.0.0.1:2181").createTopic("twitterData", 2, 1); /** Twitter listener **/ StatusListener listener = new StatusListener() { public void onStatus(Status status) { ProducerRecord<String, String> data = new ProducerRecord<String, String>("twitterData", DataObjectFactory.getRawJSON(status)); // send the data to kafka producer.send(data); } public void onException(Exception arg0) { System.out.println(arg0); } arg0) { }; public void onDeletionNotice(StatusDeletionNotice } public void onScrubGeo(long arg0, long arg1) { } public void onStallWarning(StallWarning arg0) { } public void onTrackLimitationNotice(int arg0) { } /** Bind the listener **/ twitterStream.addListener(listener); /** GOGOGO **/ twitterStream.sample(); } private Properties getProducerConfig() { Properties props = new Properties(); // List of kafka borkers. Complete list of brokers is not required as // the producer will auto discover the rest of the brokers. props.put("bootstrap.servers", "localhost:9092"); props.put("batch.size", 1); // new sending // Serializer used for sending data to kafka. Since we are // string, // we are using StringSerializer. props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer"); props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer"); props.put("producer.type", "sync"); return props; } public static void main(String[] args) throws InterruptedException { new TwitterData().start(); } Use valid Kafka properties before executing the TwitterData. After executing the preceding class, the user will have a real-time stream of Twitter tweets in Kafka. In the next section, we are going to cover how we can use Storm to calculate the sentiments of the collected tweets. To summarize we covered how to install a single node Apache Kafka cluster and how to collect tweet from Twitter to store in a Kafka cluster If you enjoyed this post, check out the book Mastering Apache Storm to know more about different types of real time processing techniques used to create distributed applications.

0
0
8247

article-image-customizing-deep-learning-models-keras

Amey Varangaonkar

22 Dec 2017

8 min read

2 ways to customize your deep learning models with Keras

Amey Varangaonkar

22 Dec 2017

8 min read

[box type="note" align="" class="" width=""]The following extract is taken from the book Deep Learning with Keras, co-authored by Antonio Gulli and Sujit Pal. [/box] Keras has a lot of built-in functionality for you to build all your deep learning models without much need for customization. In this article, the authors explain how your Keras models can be customized for better and more efficient deep learning. As you will recall, Keras is a high level API that delegates to either a TensorFlow or Theano backend for the computational heavy lifting. Any code you build for your customization will call out to one of these backends. In order to keep your code portable across the two backends, your custom code should use the Keras backend API (https://keras.io/backend/), which provides a set of functions that act like a facade over your chosen backend. Depending on the backend selected, the call to the backend facade will translate to the appropriate TensorFlow or Theano call. The full list of functions available and their detailed descriptions can be found on the Keras backend page. In addition to portability, using the backend API also results in more maintainable code, since Keras code is generally more high-level and compact compared to equivalent TensorFlow or Theano code. In the unlikely case that you do need to switch to using the backend directly, your Keras components can be used directly inside TensorFlow (not Theano though) code as described in this Keras blog (https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html) Customizing Keras typically means writing your own custom layer or custom distance function. In this section, we will demonstrate how to build some simple Keras layers. You will see more examples of using the backend functions to build other custom Keras components, such as objectives (loss functions), in subsequent sections. Keras example — using the lambda layer Keras provides a lambda layer; it can wrap a function of your choosing. For example, if you wanted to build a layer that squares its input tensor element-wise, you can say simply: model.add(lambda(lambda x: x ** 2)) You can also wrap functions within a lambda layer. For example, if you want to build a custom layer that computes the element-wise euclidean distance between two input tensors, you would define the function to compute the value itself, as well as one that returns the output shape from this function, like so: def euclidean_distance(vecs): x, y = vecs return K.sqrt(K.sum(K.square(x - y), axis=1, keepdims=True)) def euclidean_distance_output_shape(shapes): shape1, shape2 = shapes return (shape1[0], 1) You can then call these functions using the lambda layer shown as follows: lhs_input = Input(shape=(VECTOR_SIZE,)) lhs = dense(1024, kernel_initializer="glorot_uniform", activation="relu")(lhs_input) rhs_input = Input(shape=(VECTOR_SIZE,)) rhs = dense(1024, kernel_initializer="glorot_uniform", activation="relu")(rhs_input) sim = lambda(euclidean_distance, output_shape=euclidean_distance_output_shape)([lhs, rhs]) Keras example - building a custom normalization layer While the lambda layer can be very useful, sometimes you need more control. As an example, we will look at the code for a normalization layer that implements a technique called local response normalization. This technique normalizes the input over local input regions, but has since fallen out of favor because it turned out not to be as effective as other regularization methods such as dropout and batch normalization, as well as better initialization methods. Building custom layers typically involves working with the backend functions, so it involves thinking about the code in terms of tensors. As you will recall, working with tensors is a two step process. First, you define the tensors and arrange them in a computation graph, and then you run the graph with actual data. So working at this level is harder than working in the rest of Keras. The Keras documentation has some guidelines for building custom layers (https://keras.io/layers/writing-your-own-keras-layers/), which you should definitely read. One of the ways to make it easier to develop code in the backend API is to have a small test harness that you can run to verify that your code is doing what you want it to do. Here is a small harness I adapted from the Keras source to run your layer against some input and return a result: from keras.models import Sequential from keras.layers.core import Dropout, Reshape def test_layer(layer, x): layer_config = layer.get_config() layer_config["input_shape"] = x.shape layer = layer.__class__.from_config(layer_config) model = Sequential() model.add(layer) model.compile("rmsprop", "mse") x_ = np.expand_dims(x, axis=0) return model.predict(x_)[0] And here are some tests with layer objects provided by Keras to make sure that the harness runs okay: from keras.layers.core import Dropout, Reshape from keras.layers.convolutional import ZeroPadding2D import numpy as np x = np.random.randn(10, 10) layer = Dropout(0.5) y = test_layer(layer, x) assert(x.shape == y.shape) x = np.random.randn(10, 10, 3) layer = ZeroPadding2D(padding=(1,1)) y = test_layer(layer, x) assert(x.shape[0] + 2 == y.shape[0]) assert(x.shape[1] + 2 == y.shape[1]) x = np.random.randn(10, 10) layer = Reshape((5, 20)) y = test_layer(layer, x) assert(y.shape == (5, 20)) Before we begin building our local response normalization layer, we need to take a moment to understand what it really does. This technique was originally used with Caffe, and the Caffe documentation (http://caffe.berkeleyvision.org/tutorial/layers/lrn.html), describes it as a kind of lateral inhibition that works by normalizing over local input regions. In ACROSS_CHANNEL mode, the local regions extend across nearby channels but have no spatial extent. In WITHIN_CHANNEL mode, the local regions extend spatially, but are in separate channels. We will implement the WITHIN_CHANNEL model as follows. The formula for local response normalization in the WITHIN_CHANNEL model is given by: The code for the custom layer follows the standard structure. The __init__ method is used to set the application specific parameters, that is, the hyperparameters associated with the layer. Since our layer only does a forward computation and doesn't have any learnable weights, all we do in the build method is to set the input shape and delegate to the superclass's build method, which takes care of any necessary book-keeping. In layers where learnable weights are involved, this method is where you would set the initial values. The call method does the actual computation. Notice that we need to account for dimension ordering. Another thing to note is that the batch size is usually unknown at design times, so you need to write your operations so that the batch size is not explicitly invoked. The computation itself is fairly straightforward and follows the formula closely. The sum in the denominator can also be thought of as average pooling over the row and column dimension with a padding size of (n, n) and a stride of (1, 1). Because the pooled data is averaged already, we no longer need to divide the sum by n. The last part of the class is the get_output_shape_for method. Since the layer normalizes each element of the input tensor, the output size is identical to the input size: from keras import backend as K from keras.engine.topology import Layer, InputSpec class LocalResponseNormalization(Layer): def __init__(self, n=5, alpha=0.0005, beta=0.75, k=2, **kwargs): self.n = n self.alpha = alpha self.beta = beta self.k = k super(LocalResponseNormalization, self).__init__(**kwargs) def build(self, input_shape): self.shape = input_shape super(LocalResponseNormalization, self).build(input_shape) def call(self, x, mask=None): if K.image_dim_ordering == "th": _, f, r, c = self.shape Else: _, r, c, f = self.shape squared = K.square(x) pooled = K.pool2d(squared, (n, n), strides=(1, 1), padding="same", pool_mode="avg") if K.image_dim_ordering == "th": summed = K.sum(pooled, axis=1, keepdims=True) averaged = self.alpha * K.repeat_elements(summed, f, axis=1) Else: summed = K.sum(pooled, axis=3, keepdims=True) averaged = self.alpha * K.repeat_elements(summed, f, axis=3) denom = K.pow(self.k + averaged, self.beta) return x / denom def get_output_shape_for(self, input_shape): return input_shape You can test this layer during development using the test harness we described here. It is easier to run this instead of trying to build a whole network to put this into, or worse, waiting till you have fully specified the layer before running it: x = np.random.randn(225, 225, 3) layer = LocalResponseNormalization() y = test_layer(layer, x) assert(x.shape == y.shape) Now that you have a good idea of how to build a custom Keras layer, you might find it instructive to look at Keunwoo Choi's melspectogram (https://keunwoochoi.wordpress.com/2016/11/18/for-beginners-writing-a-custom-keras-layer/) Though building custom Keras layers seems to be fairly commonplace for experienced Keras developers, but they may not be widely useful in a general context. Custom layers are usually built to serve a specific narrow purpose, depending on the use-case in question, and Keras gives you enough flexibility to do so with ease. If you found our post useful, make sure to check out our best selling title Deep Learning with Keras, for other intriguing deep learning concepts and their implementation using Keras.

0
1
28927

article-image-exploratory-data-analysis-eda-spark-sql

Amarabha Banerjee

21 Dec 2017

7 min read

How to perform Exploratory Data Analysis (EDA) with Spark SQL

Amarabha Banerjee

21 Dec 2017

7 min read

[box type="note" align="" class="" width=""]Below given post is a book excerpt taken from Learning Spark SQL written by Aurobindo Sarkar. This book will help you design, implement, and deliver successful streaming applications, machine learning pipelines and graph applications using Spark SQL API.[/box] Our article aims to give you an understanding of how exploratory data analysis is performed with Spark SQL. What is Exploratory Data Analysis (EDA) Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other, more sophisticated analysis. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify the possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that using a more sophisticated model-based analysis is not justified. Typically, the graphical techniques used in EDA are simple, consisting of plotting the raw data and simple statistics. The focus is on the structures and models revealed by the data or best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. In most EDA techniques, we use all of the data, without making any underlying assumptions. The analyst builds intuition, or gets a "feel", for the Dataset as a result of such exploration. More specifically, the graphical techniques allow us to efficiently select and validate appropriate models, test our assumptions, identify relationships, select estimators, detect outliers, and so on. EDA involves a lot of trial and error, and several iterations. The best way is to start simple and then build in complexity as you go along. There is a major trade-off in modeling between the simple and the more accurate ones. Simple models may be much easier to interpret and understand. These models can get you to 90% accuracy very quickly, versus a more complex model that might take weeks or months to get you an additional 2% improvement. For example, you should plot simple histograms and scatter plots to quickly start developing an intuition for your data. Using Spark SQL for basic data analysis Interactively, processing and visualizing large data is challenging as the queries can take a long time to execute and the visual interface cannot accommodate as many pixels as data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java and R APIs, and libraries for distributed statistics and machine learning. For data that fits into a single computer, there are many good tools available, such as R, MATLAB, and others. However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data, then this section will offer some good tools and techniques for data exploration. In this section, we will go through some basic data exploration exercises to understand a sample Dataset. We will use a Dataset that contains data related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The marketing campaigns were based on phone calls to customers. We'll use the bank-additional-full.csv file that contains 41,188 records and 20 input fields, ordered by date (from May 2008 to November 2010). The Dataset has been contributed by S. Moro, P. Cortez, and P. Rita, and can be downloaded from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. As a first step, let's define a schema and read in the CSV file to create a DataFrame. You can use :paste command to paste initial set of statements in your Spark shell session (use Ctrl+D to exit the paste mode), as shown: 2. After the DataFrame has been created, we first verify the number of records: We can also define a case class called Call for our input records, and then create a strongly-typed Dataset, as follows: Identifying missing data Missing data can occur in Datasets due to reasons ranging from negligence to a refusal on the part of respondents to provide a specific data point. However, in all cases, missing data is a common occurrence in real-world Datasets. Missing data can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies to deal with it. In this section, we analyze the numbers of records with missing data fields in our sample Dataset. In order to simulate missing data, we will edit our sample Dataset by replacing fields containing "unknown" values with empty strings. First, we created a DataFrame/Dataset from our edited file, as shown: In the next section, we will compute some basic statistics for our sample Dataset to improve our understanding of the data. Computing basic statistics Computing basic statistics is essential for a good preliminary understanding of our data. First, for convenience, we create a case class and a Dataset containing a subset of fields from our original DataFrame. In the following example, we choose some of the numeric fields and the outcome field, that is, the "term deposit subscribed" field: Next, we use describe() compute the count, mean, stdev, min, and max values for the numeric columns in our Dataset. The describe() command gives a way to do a quick sense-check on your data. For example, the counts of rows of each of the columns selected matches the total number records in the DataFrame (no null or invalid rows),whether the average and range of values for the age column matching your expectations, and so on. Based on the values of the means and standard deviations, you can get select certain data elements for deeper analysis. For example, assuming normal distribution, the mean and standard deviation values for age suggest most values of age are in the range 30 to 50 years, for other columns the standard deviation values may be indicative of a skew in the data (as the standard deviation is greater than the mean). Identifying data outliers An outlier or an anomaly is an observation of the data that deviates significantly from other observations in the Dataset. These erroneous outliers can be due to errors in the data collection or variability in measurement. They can impact the results significantly so it is imperative to identify them during the EDA process. However, these techniques define outliers as points, which do not lie in clusters. The user has to model the data points using statistical distributions, and the outliers are identified depending on how they appear in relation to the underlying model. The main problem with these approaches is that during EDA, the user typically does not have enough knowledge about the underlying data distribution. EDA, using a modeling and visualizing approach, is a good way of achieving a deeper intuition of our data. Spark MLlib supports a large (and growing) set of distributed machine learning algorithms to make this task simpler. For example, we can apply clustering algorithms and visualize the results to detect outliers in a combination columns. In the following example, we use the last contact duration, in seconds (duration), number of contacts performed during this campaign, for this client (campaign), number of days that have passed by after the client was last contacted from a previous campaign (pdays) and the previous: number of contacts performed before this campaign and for this client (prev) values to compute two clusters in our data by applying the k-means clustering algorithm: If you liked this article, please be sure to check out Learning Spark SQL which will help you learn more useful techniques on data extraction and data analysis.

0
0
12937

article-image-implementing-row-level-security-in-postgresql

Amey Varangaonkar

21 Dec 2017

7 min read

Implementing Row-level Security in PostgreSQL

Amey Varangaonkar

21 Dec 2017

7 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering PostgreSQL 9.6, authored by Hans-Jürgen Schönig. The book gives a comprehensive primer on different features and capabilities of PostgreSQL 9.6, and how you can leverage them efficiently to administer and manage your PostgreSQL database.[/box] In this article, we discuss the concept of row-level security and how effectively it can be implemented in PostgreSQL using a interesting example. Having the row-level security feature enables allows you to store data for multiple users in a single database and table. At the same time it sets restrictions on the row-level access, based on a particular user’s role or identity. What is Row-level Security? In usual cases, a table is always shown as a whole. When the table contains 1 million rows, it is possible to retrieve 1 million rows from it. If somebody had the rights to read a table, it was all about the entire table. In many cases, this is not enough. Often it is desirable that a user is not allowed to see all the rows. Consider the following real-world example: an accountant is doing accounting work for many people. The table containing tax rates should really be visible to everybody as everybody has to pay the same rates. However, when it comes to the actual transactions, you might want to ensure that everybody is only allowed to see his or her own transactions. Person A should not be allowed to see person B's data. In addition to that, it might also make sense that the boss of a division is allowed to see all the data in his part of the company. Row-level security has been designed to do exactly this and enables you to build multi-tenant systems in a fast and simple way. The way to configure those permissions is to come up with policies. The CREATE POLICY command is here to provide you with a means to write those rules: test=# h CREATE POLICY Command: CREATE POLICY Description: define a new row level security policy for a table Syntax: CREATE POLICY name ON table_name [ FOR { ALL | SELECT | INSERT | UPDATE | DELETE } ] [ TO { role_name | PUBLIC | CURRENT_USER | SESSION_USER } [, ...] ] [ USING ( using_expression ) ] [ WITH CHECK ( check_expression ) ] To show you how a policy can be written, I will first log in as superuser and create a table containing a couple of entries: test=# CREATE TABLE t_person (gender text, name text); CREATE TABLE test=# INSERT INTO t_person VALUES ('male', 'joe'), ('male', 'paul'), ('female', 'sarah'), (NULL, 'R2- D2'); INSERT 0 4 Then access is granted to the joe role: test=# GRANT ALL ON t_person TO joe; GRANT So far, everything is pretty normal and the joe role will be able to actually read the entire table as there is no RLS in place. But what happens if row-level security is enabled for the table? test=# ALTER TABLE t_person ENABLE ROW LEVEL SECURITY; ALTER TABLE There is a deny all default policy in place, so the joe role will actually get an empty table: test=> SELECT * FROM t_person; gender | name --------+------ (0 rows) Actually, the default policy makes a lot of sense as users are forced to explicitly set permissions. Now that the table is under row-level security control, policies can be written (as superuser): test=# CREATE POLICY joe_pol_1 ON t_person FOR SELECT TO joe USING (gender = 'male'); CREATE POLICY Logging in as the joe role and selecting all the data, will return just two rows: test=> SELECT * FROM t_person; gender | name --------+------ male | joe male | paul (2 rows) Let us inspect the policy I have just created in a more detailed way. The first thing you see is that a policy actually has a name. It is also connected to a table and allows for certain operations (in this case, the SELECT clause). Then comes the USING clause. It basically defines what the joe role will be allowed to see. The USING clause is therefore a mandatory filter attached to every query to only select the rows our user is supposed to see. Now suppose that, for some reason, it has been decided that the joe role is also allowed to see robots. There are two choices to achieve our goal. The first option is to simply use the ALTER POLICY clause to change the existing policy: test=> h ALTER POLICY Command: ALTER POLICY Description: change the definition of a row level security policy Syntax: ALTER POLICY name ON table_name RENAME TO new_name ALTER POLICY name ON table_name [ TO { role_name | PUBLIC | CURRENT_USER | SESSION_USER } [, ...] ] [ USING ( using_expression ) ] [ WITH CHECK ( check_expression ) ] The second option is to create a second policy as shown in the next example: test=# CREATE POLICY joe_pol_2 ON t_person FOR SELECT TO joe USING (gender IS NULL); CREATE POLICY The beauty is that those policies are simply connected using an OR condition.Therefore, PostgreSQL will now return three rows instead of two: test=> SELECT * FROM t_person; gender | name --------+------- male | joe male | paul | R2-D2 (3 rows) The R2-D2 role is now also included in the result as it matches the second policy. To show you how PostgreSQL runs the query, I have decided to include an execution plan of the query: test=> explain SELECT * FROM t_person; QUERY PLAN ---------------------------------------------------------- Seq Scan on t_person (cost=0.00..21.00 rows=9 width=64) Filter: ((gender IS NULL) OR (gender = 'male'::text)) (2 rows) As you can see, both the USING clauses have been added as mandatory filters to the query. You might have noticed in the syntax definition that there are two types of clauses: USING: This clause filters rows that already exist. This is relevant to SELECT and UPDATE clauses, and so on. CHECK: This clause filters new rows that are about to be created; so they are relevant to INSERT and UPDATE clauses, and so on. Here is what happens if we try to insert a row: test=> INSERT INTO t_person VALUES ('male', 'kaarel'); ERROR: new row violates row-level security policy for table "t_person" As there is no policy for the INSERT clause, the statement will naturally error out. Here is the policy to allow insertions: test=# CREATE POLICY joe_pol_3 ON t_person FOR INSERT TO joe WITH CHECK (gender IN ('male', 'female')); CREATE POLICY The joe role is allowed to add males and females to the table, which is shown in the next listing: test=> INSERT INTO t_person VALUES ('female', 'maria'); INSERT 0 1 However, there is also a catch; consider the following example: test=> INSERT INTO t_person VALUES ('female', 'maria') RETURNING *; ERROR: new row violates row-level security policy for table "t_person" Remember, there is only a policy to select males. The trouble here is that the statement will return a woman, which is not allowed because joe role is under a male only policy. Only for men, will the RETURNING * clause actually work: test=> INSERT INTO t_person VALUES ('male', 'max') RETURNING *; gender | name --------+------ male | max (1 row) INSERT 0 1 If you don't want this behavior, you have to write a policy that actually contains a proper USING clause. If you liked our post, make sure to check out our book Mastering PostgreSQL 9.6 - a comprehensive PostgreSQL guide covering all database administration and maintenance aspects.

0
0
12536

How-To Tutorials - Data

How to Implement a Neural Network with Single-Layer Perceptron

Getting started with Spark 2.0

18 striking AI Trends to watch in 2018 - Part 1

How to Mine Popular Trends on GitHub using Python - Part 2

Storing Apache Storm data in Elasticsearch

Hitting the right notes in 2017: AI in a song for Data Scientists

How to effectively clean social media data for analysis

How to store and access social media data in MongoDB

Mine Popular Trends on GitHub using Python - Part 1

10 Machine Learning Tools to watch in 2018

Trending Topics

How Data Science saved Christmas

How to stream and store tweets in Apache Kafka

2 ways to customize your deep learning models with Keras

How to perform Exploratory Data Analysis (EDA) with Spark SQL

Implementing Row-level Security in PostgreSQL

Create a Free Account To Continue Reading

SignIn Free Account To Continue Reading