Statistics for Data Science

By James D. Miller

About this book

Data science is an ever-evolving field that is growing rapidly in popularity. It includes techniques and theories drawn from the fields of statistics, computer science, and, most importantly, machine learning, databases, data visualization, and so on.

This book takes you through an entire journey of statistics, from knowing very little to becoming comfortable in using various statistical methods for data science tasks. It starts off with simple statistics and then moves on to statistical methods that are used in data science algorithms. The R programs for statistical computation are clearly explained along with their logic. You will come across various mathematical concepts, such as variance, standard deviation, probability, matrix calculations, and more. You will learn only what is required to implement statistics in data science tasks such as data cleaning, mining, and analysis. You will learn the statistical techniques required to perform tasks such as linear regression, regularization, model assessment, boosting, SVMs, and working with neural networks.

By the end of the book, you will be comfortable with performing various statistical computations for data science programmatically.

Publication date:
November 2017
Publisher
Packt
Pages
286
ISBN
9781788290678

 

Chapter 1. Transitioning from Data Developer to Data Scientist

In this chapter (and throughout all of the chapters of this book), we will chart your course for starting and continuing the journey from thinking like a data developer to thinking like a data scientist.

Using developer terminology and analogies, we will discuss a developer's objectives, what a typical developer mindset might be like, how it differs from a data scientist's mindset, and why there are important differences (as well as similarities) between the two, and we will suggest how to transition yourself into thinking like a data scientist. Finally, we will suggest certain advantages of understanding statistics and data science, taking a data perspective, as well as simply thinking like a data scientist.

In this chapter, we've broken things into the following topics:

  • The objectives of the data developer role
  • How a data developer thinks
  • The differences between a data developer and a data scientist
  • Advantages of thinking like a data scientist
  • The steps for transitioning into a data scientist mindset

So, let's get started!

 

Data developer thinking


Having spent plenty of years wearing the hat of a data developer, I find it makes sense to start out here with a few quick comments about data developers.

In some circles, a database developer is the equivalent of a data developer. But whether data or database, both would usually be labeled as information technology (IT) professionals. Both spend their time working on or with data and database technologies.

Note

We may see a split between those database (data) developers who focus more on support and routine maintenance (such as administrators) and those who focus more on improving, expanding, and otherwise developing access to data (such as developers).

Your typical data developer will primarily be involved with creating and maintaining access to data rather than consuming that data. He or she will have input in, or may make decisions on, choosing programming languages for accessing or manipulating data. He or she will make sure that new data projects adhere to rules on how databases store and handle data, and will create interfaces between data sources.

In addition, some data developers are involved with reviewing and tuning queries written by others and, therefore, must be proficient in the latest tuning techniques, various query languages such as Structured Query Language (SQL), as well as how the data being accessed is stored and structured.

In summary, at least strictly from a data developer's perspective, the focus is all about access to valuable data resources rather than the consumption of those valuable data resources.

 

Objectives of a data developer


Every role, position, or job post will have its own list of objectives, responsibilities, or initiatives.

As such, in the role of a data developer, one may be charged with some of the following responsibilities:

  • Maintaining the integrity of a database and infrastructure
  • Monitoring and optimizing to maintain levels of responsiveness
  • Ensuring quality and integrity of data resources
  • Providing appropriate levels of support to communities of users
  • Enforcing security policies on data resources

As a data scientist, you will note somewhat different objectives. This role will typically include some of the objectives listed here:

  • Mining data from disparate sources
  • Identifying patterns or trending
  • Creating statistical models—modeling
  • Learning and assessing
  • Identifying insights and predicting

Do you perhaps notice a theme beginning here?

Note the keywords:

  • Maintaining
  • Monitoring
  • Ensuring
  • Providing
  • Enforcing

These terms imply different notions than those terms that may be more associated with the role of a data scientist, such as the following:

  • Mining
  • Trending
  • Modeling
  • Learning
  • Predicting

There are also, of course, some activities performed that may seem analogous to both a data developer and a data scientist and will be examined here.

Querying or mining

As a data developer, you will almost always be in the habit of querying data. Indeed, a data scientist will query data as well. So, what is data mining? Well, when one queries data, one expects to ask a specific question. For example, you might ask, "What was the total number of daffodils sold in April?" expecting to receive back a known, relevant answer such as "In April, daffodil sales totaled 269 plants."

With data mining, one is usually more absorbed in the data relationships (or the potential relationships between points of data, sometimes referred to as variables) and cognitive analysis. A simple example might be: how does the average daily temperature during the month affect the total number of daffodils sold in April?

Another important distinction between data querying and data mining is that queries are typically historic in nature in that they are used to report past results (total sales in April), while data mining techniques can be forward thinking in that through the use of appropriate statistical methods, they can infer a future result or provide the probability that a result or event will occur. For example, using our earlier example, we might predict higher daffodil sales when the average temperature rises within the selling area.
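
To make the contrast concrete, here is a minimal sketch in Python. The daily figures are invented for illustration (they are not taken from any real sales system), and the function names total_sold and pearson_r are hypothetical. The first function answers a specific, historic question; the second looks at the relationship between two variables, which is closer in spirit to mining.

```python
# Hypothetical daily records for April:
# (average temperature in degrees Fahrenheit, daffodils sold).
# These numbers are invented for illustration only.
april_sales = [
    (48, 5), (52, 7), (55, 8), (60, 11), (63, 12),
    (58, 9), (66, 14), (70, 16), (61, 10), (68, 15),
]


def total_sold(records):
    """Querying: answer one specific question about past results."""
    return sum(sold for _, sold in records)


def pearson_r(records):
    """Mining: measure how strongly two variables move together."""
    n = len(records)
    temps = [temp for temp, _ in records]
    sales = [sold for _, sold in records]
    mean_t = sum(temps) / n
    mean_s = sum(sales) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(temps, sales))
    var_t = sum((t - mean_t) ** 2 for t in temps)
    var_s = sum((s - mean_s) ** 2 for s in sales)
    return cov / (var_t * var_s) ** 0.5


print("Total daffodils sold:", total_sold(april_sales))
print("Temperature/sales correlation:", round(pearson_r(april_sales), 3))
```

A correlation close to 1 would suggest that warmer days go with higher sales in this sample; it does not by itself establish that temperature causes the sales.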

Data quality or data cleansing

Do you think a data developer is interested in the quality of data in a database? Of course, a data developer needs to care about the level of quality of the data they support or provide access to. For a data developer, the process of data quality assurance (DQA) within an organization is more mechanical in nature, such as ensuring data is current and complete and stored in the correct format.

With data cleansing, you see the data scientist put more emphasis on the concept of statistical data quality. This includes using relationships found within the data to improve the levels of data quality. As an example, an individual whose age is nine should not be labeled or shown as part of a group of legal drivers in the United States; this is incorrectly labeled data.
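
A relationship-based check like this one can be sketched in a few lines of Python. The records, the field names, and the MIN_DRIVING_AGE threshold below are all assumptions made for illustration (actual minimum driving ages vary by state), so treat this as a sketch rather than a definitive rule.

```python
# Assumed threshold for illustration only; real minimums vary by state.
MIN_DRIVING_AGE = 16

# Hypothetical records; the field names are invented for this example.
people = [
    {"name": "A", "age": 34, "group": "legal_driver"},
    {"name": "B", "age": 9, "group": "legal_driver"},
    {"name": "C", "age": 15, "group": "non_driver"},
]


def find_mislabeled(records, min_age=MIN_DRIVING_AGE):
    """Return records whose group label contradicts their age."""
    return [
        record for record in records
        if record["group"] == "legal_driver" and record["age"] < min_age
    ]


for record in find_mislabeled(people):
    print("Suspect label:", record)
```

The point is that the check uses a relationship between two fields (age and group) rather than the format or completeness of either field on its own.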

Note

You may be familiar with the term munging data. Munging may be sometimes defined as the act of tying together systems and interfaces that were not specifically designed to interoperate. Munging can also be defined as the processing or filtering of raw data into another form for a particular use or need.

Data modeling

Data developers create designs (or models) for data by working closely with key stakeholders based on given requirements such as the ability to rapidly enter sales transactions into an organization's online order entry system. During model design, there are three kinds of data models the data developer must be familiar with (conceptual, logical, and physical), each relatively independent of the others.

Data scientists create models with the intention of training with data samples or populations to identify previously unknown insights or validate current assumptions.

Note

Modeling data can become complex, and therefore, it is common to see a distinction between the role of data development and data modeling. In these cases, a data developer concentrates on evaluating the data itself, creating meaningful reports, while data modelers evaluate how to collect, maintain, and use the data.

Issue or insights

A lot of a data developer's time may be spent monitoring data, users, and environments, looking for any indications of emerging issues such as unexpected levels of usage that may cause performance bottlenecks or outages. Other common duties include auditing, application integrations, disaster planning and recovery, capacity planning, change management, database software version updating, load balancing, and so on.

Data scientists spend their time evaluating and analyzing data and information in an effort to discover valuable new insights. Hopefully, once established, insights can then be used to make better business decisions.

Note

There is a related concept to grasp; through the use of analytics, one can identify patterns and trends within data, while an insight is a value obtained through the use of the analytical outputs.

Thought process

Someone's mental procedures or cognitive activity based on interpretations, past experiences, reasoning, problem-solving, imagining, and decision making make up their way of thinking or their thought process.

One can only guess how particular individuals will actually think, or their exact thoughts at a given point of time or during an activity, or what thought process they will use to accomplish their objectives, but in general terms, a data developer may spend more time thinking about data convenience (making the data available as per the requirements), while data scientists are all about data consumption (devising new ways to leverage the data to find insights into existing issues or new opportunities).

To paint a clearer picture, you might use the analogy of the auto mechanic and the school counselor.

An auto mechanic will use his skills along with appropriate tools to keep an automobile available to its owner and running well, or if there has been an issue identified with a vehicle, the mechanic will perform diagnosis for the symptoms presented and rectify the problem. This is much like the activities of a data developer.

With a counselor, he or she might examine a vast amount of information regarding a student's past performance, personality traits, as well as economic statistics to determine what opportunities may exist in a particular student's future. In addition, multiple scenarios may be studied to predict what the best outcomes might be, based on this individual student's resources.

Clearly, both aforementioned individuals provide valuable services but use (maybe very) different approaches and individual thought processes to produce the desired results.

Although there is some overlap, when you are a data developer, your thoughts are normally around maintaining convenient access to appropriate data resources but not particularly around the data's substance. That is, you may care about data types, data volumes, and accessibility paths but not about whether or what cognitive relationships exist or the powerful potential uses for the data.

In the next section, we will explore some simple circumstances in an effort to show various contrasts between the data developer and the data scientist.

Developer versus scientist

To better understand the differences between a data developer and a data scientist, let's take a little time here and consider just a few hypothetical (yet still realistic) situations that may occur during your day.

New data, new source

What happens when new data or a new data source becomes available or is presented?

Here, new data usually means that more current or more up-to-date data has become available. An example of this might be receiving a file each morning of the latest month-to-date sales transactions, usually referred to as an actual update.

Note

In the business world, data can be either real (actual) as in the case of an authenticated sale, or sale transaction entered in an order processing system, or supposed as in the case of an organization forecasting a future (not yet actually occurred) sale or transaction.

You may receive files of data periodically from an online transaction processing system, which provides the daily sales or sales figures from the first of the month to the current date. You'd want your business reports to show the total sales numbers that include the most recent sales transactions.

The idea of a new data source is different. If we use the same sort of analogy as we used previously, an example of this might be a file of sales transactions from a company that a parent company newly acquired. Perhaps another example would be receiving data reporting the results of a recent online survey. This is the information that's collected with a specific purpose in mind and typically is not (but could be) a routine event.

Note

Machine (and otherwise) data is accumulating even as you are reading this, providing new and interesting data sources creating a market for data to be consumed. One interesting example might be Amazon Web Services (https://aws.amazon.com/datasets/). Here, you can find massive resources of public data, including the 1000 Genomes Project (the attempt to build the most comprehensive database of human genetic information) as well as NASA's database of satellite imagery of the Earth.

In the previous scenarios, a data developer would most likely be (should be) expecting updated files and have implemented the Extract, Transform, and Load (ETL) processes to automatically process the data, handle any exceptions, and ensure that all the appropriate reports reflect the latest, correct information. Data developers would also deal with transitioning a sales file from a newly acquired company but probably would not be a primary resource for dealing with survey results (or the 1000 Genomes Project).

Data scientists are not involved in the daily processing of data (such as sales) but will be directly responsible for a survey results project. That is, the data scientist is almost always hands-on with initiatives such as researching and acquiring new sources of information for projects involving surveying. Data scientists most likely would have input even in the designing of surveys as they are the ones who will be using that data in their analysis.

Quality questions

Suppose there are concerns about the quality of the data to be, or being, consumed by the organization. As we alluded to earlier in this chapter, there are different types of data quality concerns such as what we called mechanical issues as well as statistical issues (and there are others).

Note

Current trending examples of the most common statistical quality concerns include duplicate entries and misspellings, misclassification and aggregation, and changing meanings.

If management is questioning the validity of the total sales listed on a daily report, or perhaps doesn't trust it because the majority of your customers appear not to be legally able to drive in the United States, or because the number of the organization's repeat customers is declining, you have a quality issue.

Quality is a concern to both the data developer and the data scientist. A data developer focuses more on timing and formatting (the mechanics of the data), while the data scientist is more interested in the data's statistical quality (with priority given to issues with the data that may potentially impact the reliability of a particular study).

Querying and mining

Historically, the information technology group or department has been beseeched by a variety of business users to produce and provide reports showing information stored in databases and systems that are of interest.

These ad hoc reporting requests have evolved into requests for on-demand raw data extracts (rather than formatted or pretty printed reports) so that business users could then import the extracted data into a tool such as MS Excel (or others), where they could then perform their own formatting and reporting, or perform further analysis and modeling. In today's world, business users demand more self-service (even mobile) abilities to meet their organization's (or an individual's) analytical and reporting needs, expecting to have access to the updated raw data stores, directly or through smaller, focus-oriented data pools.

If business applications cannot supply the necessary reporting on their own, business users often will continue their self-service journey.
-Christina Wong (www.datainformed.com)

Creating ad hoc reports and performing extracts based on specific on-demand needs or providing self-service access to data falls solely to the role of the organization's data developer. However, take note that a data scientist will want to periodically perform his or her own querying and extracting—usually as part of a project they are working on. They may use these query results to determine the viability and availability of the data they need or as part of the process to create a sampling or population for specific statistical projects. This form of querying may be considered to be a form of data mining and goes much deeper into the data than queries might. This work effort is typically performed by a data scientist rather than a data developer.

Performance

You can bet that pretty much everyone is, or will be, concerned with the topic of performance. Some forms (of performance) are perhaps a bit more quantifiable, such as what is an acceptable response time for an ad hoc query or extract to complete? Or perhaps what is the total number of mouse-clicks or keystrokes required to enter a sales order? Others may be a bit more difficult to answer or address, such as why does it appear that there is a downward trend in the number of repeat customers?

It is the responsibility of the data developer to create and support data designs (even be involved with infrastructure configuration options) that consistently produce swift response times and are easy to understand and use.

Note

One area of performance responsibility that may be confusing is in the area of website performance. For example, if an organization's website is underperforming, is it because certain pages are slow to load or uninteresting and/or irrelevant to the targeted audience or customer? In this example, both a data developer and a data scientist may be directed to address the problem.

These individuals—data developers—would not play a part in survey projects. The data scientist, on the other hand, will not be included in day-to-day transactional (or similar) performance concerns but would be the key responsible person to work with the organization's stakeholders by defining and leading a statistical project in an effort to answer a question such as the one concerning repeat-customer counts.

Financial reporting

In every organization, there is a need to produce regular financial statements (such as an Income Statement, Balance Sheet, or Cash Flow statement). Financial reporting (or Fin reporting) is looking to answer key questions regarding the business, such as the following:

  • Are we making a profit or losing money?
  • How do assets compare to liabilities?
  • How much free cash do we have or need?

The process of creating, updating, and validating regular financial statements is a mandatory task for any business—profit or non-profit based—of just about any size, whether public or private. Organizations, still today, are not all using fully automated reporting solutions. This means that even the task of updating a single report with the latest data could be a daunting ordeal.

Financial reporting is one area that is (pretty) clearly defined within the industry as far as responsibilities go. A data developer would be the one to create and support the processing and systems that make the data available, ensure its correctness, and even (in some cases) create and distribute reports.

Over 83 percent of businesses in the world today utilize MS Excel for Month End close and reporting
-https://venasolutions.com/

Typically, a data developer would work to provide and maintain the data to feed these efforts.

Data scientists typically do not support an organization's routine processing and (financial) reporting efforts. A data scientist would, however, perform analysis of the produced financial information (and supporting data) to produce reports and visualizations indicating insights around management performance in profitability, efficiency, and risk (to name a few).

One particularly interesting area of statistics and data science is when a data scientist performs a vertical analysis to identify relationships of variables to a base amount within an organization's financial statement.
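
As a rough illustration of the idea, vertical analysis expresses each line item of a statement as a percentage of a chosen base amount. The sketch below uses Python with invented figures and a hypothetical function name, vertical_analysis; revenue is assumed to be the base, which is a common choice for an income statement.

```python
# Hypothetical income statement figures, invented for illustration.
income_statement = {
    "Revenue": 500000,
    "Cost of goods sold": 300000,
    "Operating expenses": 120000,
    "Net income": 80000,
}


def vertical_analysis(statement, base_item):
    """Express every line item as a percentage of the base amount."""
    base = statement[base_item]
    return {
        item: round(amount / base * 100, 1)
        for item, amount in statement.items()
    }


for item, pct in vertical_analysis(income_statement, "Revenue").items():
    print(f"{item}: {pct}% of revenue")
```

Comparing these percentages across periods (or against other organizations) is what turns the calculation into analysis.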

Visualizing

It is a common practice today to produce visualizations in a dashboard format that can show updated individual key performance indicators (KPIs). Communicating a particular point or simplifying the complexities of mountains of data does not strictly require the use of data visualization techniques, but in some ways, today's world may demand it.

Most would likely agree that scanning numerous worksheets, spreadsheets, or reports is mundane and tedious at best, while looking at charts and graphs (such as a visualization) is typically much easier on the eyes. To that point, both the data developer and the data scientist will equally be found designing, creating, and using data visualizations. The difference will be found in the types of visualizations being created. Data developers usually focus on the visualization of repetitive data points (forecast versus actuals, to name a common example), while data scientists use visualizations to make a point as part of a statistical project.

Again, a data developer most likely will leverage visualizations to illustrate or highlight, for example, sales volumes, month-to-month for the year, while a data scientist may use visualizations to predict potential sales volumes, month-to-month for next year, given seasonality (and other) statistics.

Tools of the trade

The tools and technologies used by individuals to access and consume data can vary significantly depending upon an assortment of factors such as the following:

  • The type of business
  • The type of business problem (or opportunity)
  • Security or legal requirements
  • Hardware and software compatibilities and/or prerequisites
  • The type and use of data
  • The specifics around the user communities
  • Corporate policies
  • Price

In an ever-changing technology climate, the data developer and data scientist have ever more, and perhaps overwhelming, choices including very viable open source options.

Note

Open source software is software developed by and for the user community. The good news is that open source software is used in the vast majority, or 78 percent, of worldwide businesses today (Vaughan-Nichols, http://www.zdnet.com/). Open source is playing an increasingly important role in data science.

When we talk about tools and technologies, both the data developer and the data scientist will be equally involved in choosing the correct tool or technology that best fits their individual likes and dislikes and meets the requirements of the project or objective.

 

Advantages of thinking like a data scientist


So why should you, a data developer, endeavor to think like (or more like) a data scientist? What is the significance of gaining an understanding of the whys and hows of statistics? Specifically, what might be the advantages of thinking like a data scientist?

The following are just a few notions supporting the effort for making the move into data science:

  • Developing a better approach to understanding data
  • Using statistical thinking during the process of program or database designing
  • Adding to your personal toolbox
  • Increased marketability
  • Perpetual learning
  • Seeing the future

Developing a better approach to understanding data

Whether you are a data developer, systems analyst, programmer/developer, or data scientist, or other business or technology professional, you need to be able to develop a comprehensive relationship with the data you are working with or designing an application or database schema for.

Some might rely on the data specifications provided as part of the overall project plan or requirements, and some (usually those with more experience) may supplement their understanding by performing some generic queries on the data. Either way, this is seldom enough.

In fact, in industry case studies, unclear, misunderstood, or incomplete requirements or specifications consistently rank in the top five as reasons for project failure or added risk.

Profiling data is a process, characteristic of data science, aimed at establishing data intimacy (or a clearer, more concise grasp of the data and its inward relationships). Profiling data also establishes context, for which there are several general categories that can be used to augment or increase the value and understanding of data for any purpose or project.

These categories include the following:

  • Definitions and explanations: These help gain additional information or attributes about data points within your data
  • Comparisons: These help add a comparable value to a data point within your data
  • Contrasts: These help add an opposite to a data point to see whether it perhaps reveals a different perspective
  • Tendencies: These are typical mathematical calculations, summaries, or aggregations (such as the mean or median) describing the average of a dataset (or group within the data)
  • Dispersion: This includes mathematical calculations (or summaries) such as range, variance, and standard deviation, describing the spread of a dataset (or group within the data)
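
The last two categories lend themselves to a short sketch. The Python below uses the standard library statistics module to compute a few tendency and dispersion measures; the order_amounts values and the profile function name are invented for illustration.

```python
import statistics

# Hypothetical order amounts, invented for illustration.
order_amounts = [12.0, 15.5, 15.5, 18.0, 22.0, 95.0]


def profile(values):
    """Collect basic tendency and dispersion statistics for a column."""
    return {
        "count": len(values),
        # Tendencies: where the values are centered.
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # Dispersion: how widely the values are spread.
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance
        "std_dev": statistics.stdev(values),      # sample standard deviation
    }


for name, value in profile(order_amounts).items():
    print(f"{name}: {value}")
```

Note how a single large value (95.0) pulls the mean well above the median; spotting that kind of gap is one of the reasons to profile data before designing around it.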

Note

Think of data profiling as the process you may have used for examining data in a data file and collecting statistics and information about that data. Those statistics most likely drove the logic implemented in a program or how you related data in tables of a database.

Using statistical thinking during program or database designing

The process of creating a database design commonly involves several tasks that will be carried out by the database designer (or data developer). Usually, the designer will perform the following:

  1. Identify what data will be kept in the database.
  2. Establish the relationships between the different data points.
  3. Create a logical data structure to be used on the basis of steps 1 and 2.

Even during the act of application program designing, a thorough understanding of how the data works is essential. Without understanding average or default values, relationships between data points and grouping, and so on, the created application is at risk of failing.

One idea for applying statistical thinking to help with data designing is in the case where there is limited real data available. If enough data cannot be collected, one could create sample (test) data by a variety of sampling methods, such as probability sampling.

Note

A probability-based sample is created by constructing a list of the target population values, called a sample frame, and then applying a randomized process for selecting records from the sample frame, which is called a selection procedure. Think of this as creating a script to generate records of sample data based on your knowledge of actual data as well as some statistical logic to be used for testing your designs.
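
A minimal sketch of such a script, in Python, might look like the following. The customer IDs, the frame size, and the function name simple_random_sample are assumptions made for illustration; simple random sampling is only one of several possible selection procedures.

```python
import random

# Sample frame: a list of the target population values.
# These customer IDs are hypothetical.
sample_frame = [f"CUST-{i:04d}" for i in range(1, 501)]


def simple_random_sample(frame, size, seed=None):
    """Selection procedure: draw records at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(frame, size)


test_records = simple_random_sample(sample_frame, size=25, seed=42)
print(test_records[:5])
```

Fixing the seed is a design choice that suits test data: the same records come back on every run, so a failing test can be reproduced.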

Finally, approach any problem with scientific or statistical methods, and odds are you'll produce better results.

Adding to your personal toolbox

In my experience, most data developers tend to lock on to a technology or tool based upon a variety of factors (some of which we mentioned earlier in this chapter) becoming increasingly familiar with and (hopefully) more proficient with the product, tool, or technology—even the continuously released newer versions. One might suspect that (and probably would be correct) the more the developer uses the tool, the higher the skill level that he or she establishes. Data scientists, however, seem to lock onto methodologies, practices, or concepts more than the actual tools and technologies they use to implement them.

This turning of focus (from tool to technique) changes one's mindset to the idea of thinking "what tool best serves my objective" rather than "how this tool serves my objective".

Note

The more tools you are exposed to, the broader your thinking will become as a developer or data scientist. The open source community provides outstanding tools you can download, learn, and use freely. One should adopt a mindset of what's next or new to learn, even if it's in an attempt to compare features and functions of a new tool to your preferred tool. We'll talk more about this in the perpetual learning section of this chapter.

An exciting example of a currently popular data developer or data enabling tool is MarkLogic (http://www.marklogic.com/). This is an operational and transactional enterprise NoSQL database that is designed to integrate, store, manage, and search more data than ever before. MarkLogic received the 2017 DAVIES Award for best Data Development Tools. For data scientists, R and Python seem to be the top options.

Note

It would not be appropriate to end this section without the mention of IBM Watson Analytics (https://www.ibm.com/watson/), currently transforming the way the industry thinks about statistical or cognitive thinking.

Increased marketability

Data science is clearly an ever-evolving field, with exponentially growing popularity. In fact, I'd guess that if you ask a dozen professionals, you'll most likely receive a dozen different definitions of what a data scientist is (and their place within a project or organization), but most likely, all would agree on their level of importance and that vast numbers of opportunities exist within the industry and the world today.

Data scientist face an unprecedented demand for more models, more insights...there's only one way to do that: They have to dramatically speed up the insights to action. In the future data Scientists, must become more productive. That's the only way they're going to get more value from the data.
-Gualtieri, https://www.datanami.com/2015/09/18/the-future-of-data-science/

Data scientists are relatively hard to find today. If you do your research, you will find that today's data scientists may have a mixed background consisting of mathematics, programming, software design, experimental design, engineering, communication, and management skills. In practice, you'll see that most data scientists you find aren't specialists in any one aspect; rather, they possess varying levels of proficiency in several areas or backgrounds.

"The role of the data scientist has unequivocally evolved since the field of statistics of over 1200 years ago. Despite the term only existing since the turn of this century, it has already been labeled The Sexiest Job of the 21st Century, which understandably, has created a queue of applicants stretched around the block."
- Pearson, https://www.linkedin.com/pulse/evolution-data-scientist-chris-pearson

 

Note

Currently, there is no official data scientist job description (or prerequisite list for that matter). This presents you with the opportunity to create your own flavour of the data scientist, delivering value in new ways to your organization.

Perpetual learning

The idea of continued assessment or perpetual learning is an important statistical concept to grasp. Consider a common definition: learning that enhances your skills of perception. For example, in statistics, we can refer to the idea of cross-validation. This is a statistical approach for measuring (assessing) a statistical model's performance. The practice involves setting aside a set of validation values, running the model for a set number of rounds on different sample datasets, and then averaging the results of each round to ultimately see how good a model (or approach) might be in solving a particular problem or meeting an objective.

The expectation here is that, given the performance results, adjustments can be made to tweak the model so that it is able to identify insights when used with a real or full population of data. Not only is this a practice the data developer should use for refining or fine-tuning a data design or data-driven application process, but it is also great life advice in the form of try, learn, adjust, and repeat.
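The rounds-and-averaging idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data is synthetic, the two candidate "models" (a mean-only baseline and a straight-line fit) are stand-ins chosen for brevity, and the helper names (k_fold_indices, fit_mean, fit_line, cross_validate) are hypothetical, not taken from any library.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    m = statistics.fmean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns a prediction function."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def cross_validate(xs, ys, fit, k=5, seed=0):
    """Run k rounds; each round holds out one fold as the validation values.

    Returns the per-round mean squared errors and their average.
    """
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - predict(xs[i])) ** 2 for i in held_out]
        scores.append(statistics.fmean(errors))
    return scores, statistics.fmean(scores)

# Synthetic data (an assumption for illustration): a noisy linear trend.
rng = random.Random(42)
xs = [float(i) for i in range(30)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1.5) for x in xs]

for name, fit in [("mean baseline", fit_mean), ("straight line", fit_line)]:
    _, average_mse = cross_validate(xs, ys, fit, k=5)
    print(f"{name}: average validation MSE = {average_mse:.2f}")
```

Comparing the two averages is the "try, learn, adjust" step: the model with the lower average validation error is the better candidate to refine further.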

Note

The idea of model assessment is not unique to statistics. Data developers might consider this similar to the act of predicting SQL performance or perhaps the practice of an application walkthrough where an application is validated against the intent and purpose stated within its documented requirements.

Seeing the future

Predictive modeling uses the statistics of data science to predict or foresee a result (actually, a probable result). This may sound a lot like fortune telling, but it is more about putting to use cognitive reasoning to interpret information (mined from data) to draw a conclusion. In the way that a scientist might be described as someone who acts in a methodical way, attempting to obtain knowledge or to learn, a data scientist might be thought of as trying to make predictions, using statistics and (machine) learning.

Note

When we talk about predicting a result, it's really all about the probability of seeing a certain result. Probability deals with predicting the likelihood of future events, while statistics involves the analysis of the frequency of past events.
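To make the distinction in this note concrete, here is a small sketch that first summarizes past events (the statistics step) and then uses that summary to reason about future ones (the probability step). The payment history is hypothetical, and treating future invoices as independent events with the same chance of being late is a simplifying assumption.

```python
from math import comb

# Hypothetical history: 1 = invoice paid late, 0 = paid on time.
past = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]

def relative_frequency(events):
    """Statistics: how often did the event occur in the past?"""
    return sum(events) / len(events)

def prob_at_least(k, n, p):
    """Probability: binomial P(X >= k) for n independent future trials."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

p_late = relative_frequency(past)
print(f"Observed late-payment rate: {p_late:.2f}")
print(f"Chance that at least 1 of the next 4 invoices is late: "
      f"{prob_at_least(1, 4, p_late):.3f}")
```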

If you are a data developer who has perhaps worked on projects serving an organization's office of finance, you may understand why a business leader would find it of value to not just report on its financial results (even the most accurate of results are really still historical events) but also to be able to make educated assumptions on future performance.

Perhaps you can see that if you have a background in, and are responsible for, financial reporting, you can now take the step towards adding statistical predictions to those reports!
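As a rough sketch of what adding a prediction to a financial report could look like, the following fits a straight-line trend to past quarterly revenue and extrapolates one quarter ahead. The revenue figures are hypothetical, fit_trend is an illustrative helper rather than a library function, and a linear trend is only one of many possible modeling choices. The spread of the residuals is reported as a crude indication of uncertainty, not a formal prediction interval.

```python
import statistics

# Hypothetical quarterly revenue (in $ thousands) taken from past reports.
revenue = [120.0, 126.0, 131.0, 139.0, 142.0, 150.0, 155.0, 161.0]
quarters = list(range(1, len(revenue) + 1))

def fit_trend(xs, ys):
    """Least-squares straight line through (xs, ys); returns (intercept, slope)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

intercept, slope = fit_trend(quarters, revenue)

# How far the historical results sit from the fitted trend.
residuals = [y - (intercept + slope * x) for x, y in zip(quarters, revenue)]
spread = statistics.stdev(residuals)

next_quarter = quarters[-1] + 1
forecast = intercept + slope * next_quarter
print(f"Q{next_quarter} forecast: {forecast:.1f} (typical miss: +/- {spread:.1f})")
```

The historical figures remain the report; the forecast and its spread are the educated assumption about future performance layered on top.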

Note

Statistical modeling techniques can also be applied to any type of unknown event, regardless of when it occurred, such as in the case of crime detection and suspect identification.

 

Transitioning to a data scientist


Let's start this section by taking a moment to state what I consider to be a few generally accepted facts about transitioning to a data scientist. We'll reaffirm these beliefs as we continue through this book:

  • Academia: Data scientists are not all from one academic background. They are not all computer science or statistics/mathematics majors. They do not all possess an advanced degree (in fact, you can use statistics and data science with a bachelor's degree or even less).
  • It's not magic-based: Data scientists can use machine learning and other accepted statistical methods to identify insights from data, not magic.
  • They are not all tech or computer geeks: You don't need years of programming experience or expensive statistical software to be effective.
  • You don't need to be experienced to get started. You can start today, right now. (Well, you already did when you bought this book!)

Okay, having made the previous declarations, let's also be realistic. As always, there is an entry point for everything in life, and, to give credit where it is due, the more credentials you can acquire to begin with, the better off you will most likely be. Nonetheless (as we'll see later in this chapter), there is absolutely no valid reason why you cannot begin understanding, using, and being productive with data science and statistics immediately.

Note

As with any profession, certifications and degrees carry weight that may open doors, while experience, as always, might be considered the best teacher. There are, however, no fake data scientists, only those who currently have more desire than practical experience.

If you are seriously interested in not only understanding statistics and data science but eventually working as a full-time data scientist, you should consider the following common themes (which you're likely to find in job postings for data scientists) as areas to focus on:

  • Education: Common fields of study are Mathematics and Statistics, followed by Computer Science and Engineering (also Economics and Operations Research). Once more, there is no strict requirement to have an advanced or even related degree. In addition, the idea of a degree or equivalent experience typically applies here.
  • Technology: You will hear SAS and R (actually, you will hear quite a lot about R) as well as Python, Hadoop, and SQL mentioned as key or preferable for a data scientist to be comfortable with. Tools and technologies change all the time, though, so, as mentioned several times throughout this chapter, data developers can begin to be productive as soon as they understand the objectives of data science and various statistical methodologies, without having to learn a new tool or language.

Note

Skills with basic business tools such as Omniture, Google Analytics, SPSS, Excel, or any other Microsoft Office tool are assumed pretty much everywhere and don't really count as an advantage, but experience with programming languages (such as Java, Perl, or C++) or databases (such as MySQL, NoSQL, Oracle, and so on) does help!

  • Data: The ability to understand data and deal with the challenges specific to the various types of data, such as unstructured, machine-generated, and big data (including organizing and structuring large datasets).

Note

Unstructured data is a key area of interest in statistics and for a data scientist. It is usually described as data that has no predefined model or is not organized in a predefined manner. Unstructured information is characteristically text-heavy but may also contain dates, numbers, and various other facts.

  • Intellectual curiosity: I love this. This is perhaps best defined as a character trait that comes in handy (if not required) if you want to be a data scientist. It means that you have a continuing need to know more than the basics or want to go beyond the common knowledge about a topic (you don't need a degree on the wall for this!).
  • Business acumen: To be a data developer or a data scientist you need a deep understanding of the industry you're working in, and you also need to know what business problems your organization needs to unravel. In terms of data science, being able to discern which problems are the most important to solve is critical in addition to identifying new ways the business should be leveraging its data.
  • Communication skills: All companies look for individuals who can clearly and fluently translate their findings to a non-technical team, such as the marketing or sales departments. As a data scientist, one must be able to enable the business to make decisions by arming them with quantified insights in addition to understanding the needs of their non-technical colleagues to add value and be successful.

Let's move ahead

So, let's finish up this chapter with some casual (if not common sense) advice for the data developer who wants to learn statistics and transition into the world of data science.

The following are several resources you should consider for familiarizing yourself with the topic of statistics and data science:

  • Books: Still the best way to learn! You can get very practical and detailed information (with examples) and advice from books. It's great you started with this book, but there is a staggering number of written resources (growing all the time) just waiting for you to consume.
  • Google: I'm a big fan of doing internet research. You will be surprised at the quantity and quality of open source and otherwise free software libraries, utilities, models, sample data, white papers, blogs, and so on that you can find out there. A lot of it can be downloaded and used directly to educate you or even as part of an actual project or deliverable.
  • LinkedIn: A very large percentage of corporate and independent recruiters use social media, and most use LinkedIn. This is an opportunity to see what types of positions are in demand and exactly what skills and experiences they require. When you see something you don't recognize, do the research to educate yourself on the topic. In addition, LinkedIn has an enormous number of groups that focus on statistics and data science. Join them all! Network with the members--even ask them direct questions. For the most part, the community is happy to help you (even if it's only to show how much they know).
  • Volunteer: A great way to build skills, continue learning, and expand your statistics network is to volunteer. Check out http://www.datakind.org/get-involved. If you sign up to volunteer, they will review your skills and keep in touch as projects come up that fit your background or interests.
  • Internship: Experienced professionals may re-enlist as interns to test a new profession or break into a new industry (www.Wetfeet.com). Although perhaps unrealistic for anyone other than a recent college graduate, internships are available if you can afford to cut your pay (or even take no pay) for a period of time to gain some practical experience in statistics and data science. What might be more practical is interning within your own company in a data scientist apprentice role for a short period or for a particular project.
  • Side projects: This is one of my favorites. Look for opportunities within your organization where statistics may be in use, and ask to sit in on meetings or join calls on your own time. If that isn't possible, look for scenarios where statistics and data science might solve a problem or address an issue, and make it a pet project you work on in your spare time. These kinds of projects are low risk as there will be no deadlines, and if they don't work out at first, it's not the end of the world.
  • Data: Probably one of the easiest things you can do to help your transition into statistics and data science is to get your hands on more types of data, especially unstructured data and big data. Additionally, it's always helpful to explore data from other industries or applications.
  • Coursera and Kaggle: Coursera is an online website where you can take massive open online courses (MOOCs) for a fee and earn a certification, while Kaggle hosts data science contests where you can not only evaluate your abilities against other members as you transition but also get access to large, unstructured big data files that may be more like the ones you might use on an actual statistical project.
  • Diversify: Many companies are adopting new tools all the time, so to add credibility to your analytic skills, you will have a significant advantage if you spend time acquiring knowledge of as many tools and technologies as you can, such as R, Python, SAS, Scala, and (of course) SQL. In addition to those mainstream data science tools, you may want to investigate some of the up-and-comers such as Paxata, MATLAB, Trifacta, Google Cloud Prediction API, or Logical Glue.
  • Ask a recruiter: Taking the time to develop a relationship with a recruiter early in your transformation will provide many advantages. In particular, a trusted recruiter can pass on a list of skills that are currently in demand as well as which statistical practices are most popular. In addition, as you gain experience and confidence, a recruiter can help you focus or fine-tune your experiences towards specific opportunities that may be further out on the horizon, potentially giving you an advantage over other candidates.
  • Online videos: Check out webinars and how-to videos on YouTube. There are endless resources from both amateurs and professionals that you can view whenever your schedule allows.
 

Summary


In this chapter, we sketched how a database (or data) developer thinks on a day-to-day, problem-solving basis, comparing the mindsets of a data developer and a data scientist, using various practical examples.

We also listed some of the advantages of thinking as a data scientist and finally discussed common themes for you to focus on as you gain an understanding of statistics and transition into the world of data science.

In the next chapter, we will introduce and explain (again, from a developer's perspective) the basic objectives behind statistics for data science and introduce you to the important terms and key concepts (with easily understood explanations and examples) that are used throughout the book.

 

 

About the Author

  • James D. Miller

    James D. Miller is an IBM certified expert, Master Consultant, and Application/System Architect with 35+ years of applications and system design/development experience across multiple platforms, technologies, and data formats, including Big Data.

    His experience includes IBM Planning Analytics, BI, Web architecture & design, systems analysis, GUI design & testing, Data modeling, design, and development of OLAP, Client/Server, Web & Mainframe applications and systems utilizing: Planning Analytics Workspace (PAW), IBM Watson Analytics, Cognos BI & TM1, Framework Manager, dynaSight/ArcPlan, ASP, DHTML, XML, MS Visual Basic, VBA, PERL, R, SPLUNK, MS SQL Server, ORACLE, etc.

    He has authored numerous books, including Implementing Splunk - Second Edition; Mastering Splunk; Hands-On Machine Learning with IBM Watson; IBM Watson Projects; Statistics for Data Science; Mastering Predictive Analytics with R - Second Edition and others.

    Project areas include those with Data Analytics, Planning Analytics, and FOPM projects, holding various roles from architect, developer, technical and project leader.
