Problem 6 – using Python to analyze genetic data
Let’s shift focus to a larger dataset. You’re working with laboratory mice and collecting protein expression data for trisomy mice. Because the full public domain file on Kaggle is very large, we’ve truncated it: we focus on only six protein expressions and, again, only the trisomy (Down syndrome) mice in the study. The full file can be found on the Kaggle website at https://www.kaggle.com/ruslankl/mice-protein-expression. The truncated file can be found in this book’s GitHub repository.
Let’s say you don’t know where to start with this data. What should you even be looking at? That question is often the first one we face in data science. We don’t always get to be part of the study design or data collection process. Many times, we receive large data files and need to figure out what to look for...
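One reasonable first step with an unfamiliar file is to check what each column contains: how many values are present, how many are missing, and the range of the numeric columns. The following is a minimal sketch of that idea using only Python’s standard library. The column names (mouse_id, protein_a, protein_b) and the values are placeholders invented for illustration; they are not taken from the Kaggle file or the truncated file, and summarize_columns is a hypothetical helper rather than part of any library.

```python
import csv
import io
import statistics


def to_float(cell):
    """Convert a text cell to a float, or return None if it is not numeric."""
    try:
        return float(cell)
    except ValueError:
        return None


def summarize_columns(rows):
    """Summarize each column of a list of row dicts (as from csv.DictReader).

    Every column gets a count of present and missing cells. Columns whose
    present cells are all numeric also get a mean, minimum, and maximum.
    """
    summary = {}
    columns = rows[0].keys() if rows else []
    for column in columns:
        cells = [(row.get(column) or "").strip() for row in rows]
        present = [cell for cell in cells if cell != ""]
        numbers = [to_float(cell) for cell in present]
        info = {"count": len(present), "missing": len(cells) - len(present)}
        if present and all(number is not None for number in numbers):
            info["mean"] = statistics.mean(numbers)
            info["min"] = min(numbers)
            info["max"] = max(numbers)
        summary[column] = info
    return summary


# Placeholder data: invented column names and values, NOT from the study.
SAMPLE_CSV = """mouse_id,protein_a,protein_b
m1,0.50,1.20
m2,0.70,
m3,0.60,1.00
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = summarize_columns(rows)
for column, info in summary.items():
    print(column, info)
```

To try this on the truncated file, you could replace io.StringIO(SAMPLE_CSV) with an open file handle for your local copy. A summary like this shows right away which columns are identifiers, which are measurements, and where values are missing, which helps narrow down what to look at next.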