Having laid solid foundations for understanding the two basic linear models for regression and classification, we devote this chapter to the data that feeds the model. In the following pages, we will describe what can routinely be done to prepare the data in the best way and how to deal with more challenging situations, such as when data is missing or outliers are present.
Real-world experiments produce real data, which, in contrast to synthetic or simulated data, is often very varied. Real data is also quite messy, and it frequently proves wrong in ways that are sometimes obvious and sometimes, at least initially, quite subtle. As a data practitioner, you will almost never find your data already prepared in the right form to be immediately analyzed for your purposes.
Writing a compendium of bad data and its remedies is outside the scope of this book, but our intention is to provide you with the basics to help you manage the majority of common data problems...