Chapter 2 Getting Hands On

library(tidyverse)
library(mlbench)
library(ROCR)
library(DataExplorer)
library(caret)

Here is a high-level overview of an ML workflow. Note that:

  1. It is a cycle, quite likely to be repeated multiple times before arriving at some actionable result
  2. The driving questions / hypotheses are subject to change, redefinition, or abandonment
  3. Multiple people might be involved

To get the ball rolling with a practical case, let’s consider the Pima Indians diabetes data set. Read in a copy as a data frame.

url <- "https://raw.githubusercontent.com/steviep42/bios534_spring_2020/master/data/pima.csv"
pm <- read.csv(url)

head(pm)
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35       0 33.6    0.627  50      pos
## 2        1      85       66      29       0 26.6    0.351  31      neg
## 3        8     183       64       0       0 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
## 5        0     137       40      35     168 43.1    2.288  33      pos
## 6        5     116       74       0       0 25.6    0.201  30      neg

The description of the data set is as follows:

So we now have some data on which we can build a model. Our driving question might be how to predict whether someone will be positive for diabetes based on the other variables in the data. Is glucose, for example, an important variable to consider?

Looking at the data, there is a variable called “diabetes” that indicates the disease status (“pos” or “neg”) of each person. It would be good to come up with a model that we could apply to incoming data to predict whether someone has diabetes.
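
As a first, rough look at that question, one could tabulate the outcome and compare glucose between the two groups. The sketch below is only an illustration: it rebuilds the six rows printed by head(pm) into a hypothetical data frame named pm_sample so that it runs without a network connection. With the full data, the same calls would be applied to pm instead.

```r
# Hypothetical sample: the six rows shown by head(pm) above, not the full data
pm_sample <- data.frame(
  pregnant = c(6, 1, 8, 1, 0, 5),
  glucose  = c(148, 85, 183, 89, 137, 116),
  pressure = c(72, 66, 64, 66, 40, 74),
  triceps  = c(35, 29, 0, 23, 35, 0),
  insulin  = c(0, 0, 0, 94, 168, 0),
  mass     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  pedigree = c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201),
  age      = c(50, 31, 32, 21, 33, 30),
  diabetes = factor(c("pos", "neg", "pos", "neg", "pos", "neg"))
)

# How many cases fall in each outcome class?
class_counts <- table(pm_sample$diabetes)
print(class_counts)

# Mean glucose within each outcome class
glucose_by_class <- tapply(pm_sample$glucose, pm_sample$diabetes, mean)
print(glucose_by_class)
```

In these six rows the positive cases happen to have the higher mean glucose, but six rows are far too few to support any conclusion; the point is only the shape of the check.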

2.1 Important Terminology

In predictive modeling there are some common terms to consider:

2.2 Exploratory Plots

We’ll use some stock plots from the DataExplorer package to get a feel for the data. Start with the correlations between the variables to see whether any are strongly correlated with the variable we wish to predict or with each other.

plot_correlation(pm, type = "continuous")

plot_bar(pm)

plot_histogram(pm)

plot_boxplot(pm, by = "diabetes")
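
The correlation plot can be backed up with a numeric check. Because type = "continuous" appears to restrict plot_correlation to the numeric columns, the diabetes outcome is presumably left out of that plot. One possible workaround, sketched below with base R only, is to recode the outcome as 0/1 and compute the correlations directly. The data frame pm_sample is again a hypothetical stand-in built from the six rows printed earlier, and pm_num, cor_mat, and outcome_cor are names introduced here for illustration.

```r
# Hypothetical sample: the six rows shown by head(pm) earlier, not the full data
pm_sample <- data.frame(
  pregnant = c(6, 1, 8, 1, 0, 5),
  glucose  = c(148, 85, 183, 89, 137, 116),
  pressure = c(72, 66, 64, 66, 40, 74),
  triceps  = c(35, 29, 0, 23, 35, 0),
  insulin  = c(0, 0, 0, 94, 168, 0),
  mass     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  pedigree = c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201),
  age      = c(50, 31, 32, 21, 33, 30),
  diabetes = factor(c("pos", "neg", "pos", "neg", "pos", "neg"))
)

# Recode the outcome as 0/1 so it can be correlated with the numeric predictors
pm_num <- pm_sample
pm_num$diabetes <- as.integer(pm_sample$diabetes == "pos")

# Pearson correlation matrix over all columns
cor_mat <- cor(pm_num)

# Correlation of each predictor with the outcome, largest absolute value first
outcome_cor <- cor_mat[, "diabetes"]
outcome_cor <- outcome_cor[names(outcome_cor) != "diabetes"]
print(round(outcome_cor[order(abs(outcome_cor), decreasing = TRUE)], 2))
```

Run against the full pm data frame, the same steps would give a ranked list of predictors by their linear association with the outcome, which is a reasonable first screen but not a substitute for a fitted model.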