Aggregate models with caretEnsemble
Introduction
Suppose you have a dataset and you have narrowed the candidate machine learning models down to two or three, but you still can’t decide which one you want: will the interpretability of my CART cost me too much compared to a random forest or some boosting?
Well, you don’t necessarily have to choose: just aggregate the models you have to make a better one. Typically, if your models don’t use the same features of the dataset, or give very different answers but are all still good in terms of a pre-selected metric (say RMSE for regression, area under the ROC curve for classification), ensembling them could be a good idea.
If you do your regression with the caret package, which I recommend, you should take a look at the caretEnsemble package. After introducing the dataset we’ll work with, I’ll talk a little more about ensembles.
Tidying the data
Let’s load the cu.summary dataset (it ships with rpart), the caret and caretEnsemble libraries, plus rpart and ranger for the models.
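library(tidyverse); library(magrittr)   # wrangling helpers and the %<>% / %T>% pipes used below
library(caret); library(caretEnsemble)  # model fitting and ensembling
library(rpart); library(ranger)         # the two models; rpart also provides cu.summary
data("cu.summary")
df <- cu.summary
rm(cu.summary)
head(df)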
(row name) | Price | Country | Reliability | Mileage | Type |
---|---|---|---|---|---|
Acura Integra 4 | 11950 | Japan | Much better | NA | Small |
Dodge Colt 4 | 6851 | Japan | NA | NA | Small |
Dodge Omni 4 | 6995 | USA | Much worse | NA | Small |
Eagle Summit 4 | 8895 | USA | better | 33 | Small |
Ford Escort 4 | 7402 | USA | worse | 33 | Small |
Ford Festiva 4 | 6319 | Korea | better | 37 | Small |
This dataset looks a lot like the cars dataset, but has more rows. The first thing to notice here is that the row names contain information. Usually, information should not live in row.names, so let’s tidy this dataset a little:
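# Move the row names into a regular column
df$name <- row.names(df)
row.names(df) <- NULL
df %<>%
  {str_split(.$name, " ", n = 2)} %>%   # split "Acura Integra 4" into brand and model
  {do.call(rbind, .)} %>%               # list of pairs -> two-column matrix
  as.data.frame %>%
  set_names(c("brand", "car")) %>%
  cbind(df) %>%                         # put the original columns back
  set_names(tolower(names(.))) %>%
  as_tibble %>%                         # as.tibble() is deprecated in recent tibble versions
  drop_na %>%                           # keep complete rows only
  select(-name) %T>%
  print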
brand | car | price | country | reliability | mileage | type |
---|---|---|---|---|---|---|
Eagle | Summit 4 | 8895 | USA | better | 33 | Small |
Ford | Escort 4 | 7402 | USA | worse | 33 | Small |
Ford | Festiva 4 | 6319 | Korea | better | 37 | Small |
Honda | Civic 4 | 6635 | Japan/USA | Much better | 32 | Small |
Mazda | Protege 4 | 6599 | Japan | Much better | 32 | Small |
Mercury | Tracer 4 | 8672 | Mexico | better | 26 | Small |
OK, that’s better: we now have two more regressors to play with.
Partitioning the data
We need to partition the dataset into a training set and a testing set to be able to assess the performance of our models. While this is not strictly required in theory, it is good practice to keep some rows of the original data untouched during the analysis.
We’ll use the createDataPartition function from caret, which does exactly that: it sets 20% of the data apart. The function needs to know about the response variable (here, the price) so that it can condition the split on the response levels (if the response is categorical) or on the response quantiles (if it is continuous, which is the case here), so that the training and testing sets look alike.
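# Row indices of the training set (80% of the rows, stratified on price)
inTrain <- createDataPartition(y = df$price, p = .80, list = FALSE)
df.train <- df[ inTrain,]
df.test <- df[-inTrain,]
# Observed test prices, kept aside to hold predictions later
df.pred <- df.test %>% select(price)
rm(inTrain)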
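To make the idea of splitting on response quantiles more concrete, here is a rough base R sketch. It is not caret’s actual implementation: `stratified_split` is a hypothetical helper, the number of groups is an assumption, and the prices are synthetic.

```r
# Hypothetical helper (not part of caret): cut a numeric response into
# quantile groups, then sample a fraction p of the rows inside each group.
stratified_split <- function(y, p = 0.8, groups = 5) {
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  picked <- lapply(split(seq_along(y), bins, drop = TRUE), function(rows) {
    # Index into rows rather than calling sample(rows), which misbehaves
    # when a group contains a single row.
    rows[sample.int(length(rows), size = ceiling(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
toy_price <- round(runif(100, min = 5000, max = 30000))  # synthetic prices
train_rows <- stratified_split(toy_price, p = 0.8)
test_rows <- setdiff(seq_along(toy_price), train_rows)

# Both subsets should cover a similar range of prices.
summary(toy_price[train_rows])
summary(toy_price[test_rows])
```

Because every quantile group contributes to both subsets, neither the training nor the testing set ends up with only cheap or only expensive cars.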
Fitting the different models
First, let’s declare the formula we’ll use for the regression. Then we’ll declare the controls we want for the caret regressions. For the sake of simplicity, I just used 10 bootstrap resamples.
Here, we also have to specify the savePredictions and index parameters: every model must be fitted on the same resamples and keep its predictions, otherwise the models cannot be compared or combined.
formula <- price ~ brand + country + reliability + mileage + type
controls <- trainControl(
  method = "boot",            # fit the model on bootstrap samples of the training set
  number = 10,                # number of bootstrap samples
  savePredictions = "final",  # keep the predictions of the best tuning parameters
  index = createResample(df.train$price, 10),  # same resamples for every model
  verboseIter = TRUE
)
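As an illustration of what the index argument carries, the base R sketch below builds a comparable list of bootstrap row indices by hand. The toy size and the names are assumptions chosen for the example, and the exact output format of createResample may differ in its details.

```r
set.seed(1)
n_rows <- 20   # toy training-set size (assumption for the example)
n_boot <- 10   # number of bootstrap resamples

# Each resample draws n_rows row indices with replacement.
boot_index <- lapply(seq_len(n_boot), function(b) {
  sort(sample.int(n_rows, size = n_rows, replace = TRUE))
})
names(boot_index) <- sprintf("Resample%02d", seq_len(n_boot))

# Rows never drawn in a resample are "out of bag" and can serve as hold-out.
oob_index <- lapply(boot_index, function(rows) setdiff(seq_len(n_rows), rows))

boot_index[["Resample01"]]
oob_index[["Resample01"]]
```

Fixing this list once and handing it to every model means that held-out predictions refer to the same rows for all models, which is what makes them comparable and stackable.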
Then we’ll use caretEnsemble::caretList to fit some models. Here, I chose a CART (rpart) and a simple ranger random forest, but you could also try a glmStepAIC or anything else. The procedure takes around 30 seconds to run.
models <-
  caretEnsemble::caretList(
    formula,
    data = df.train,
    trControl = controls,
    tuneList = list(
      rpart = caretModelSpec(method = "rpart"),
      ranger = caretModelSpec(
        method = "ranger",
        tuneGrid = expand.grid(
          mtry = 1:20,
          splitrule = "variance",
          min.node.size = 1
        ),
        verbose = TRUE,
        importance = 'impurity'
      )
    )
  )
Aggregating models
Now that the models have run, we can look at the correlation between them:
modelCor(resamples(models))
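For intuition about what such a correlation measures: it is computed across resamples, on the performance metric of each model. The base R sketch below mimics that with synthetic per-resample RMSE values (made-up numbers, not results from the models above), where both models share a common "difficulty" per resample.

```r
set.seed(7)
# Synthetic example: some resamples are harder than others for every model.
shared_difficulty <- rnorm(10, mean = 0, sd = 300)
rmse_by_resample <- data.frame(
  rpart  = 3500 + shared_difficulty + rnorm(10, sd = 150),
  ranger = 3000 + shared_difficulty + rnorm(10, sd = 150)
)

# Correlation of the per-resample errors between the two models.
model_cor <- cor(rmse_by_resample)
model_cor
```

A value close to 1 would mean the models struggle on the same resamples, which leaves less room for an ensemble to improve on them.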
OK, the models are a little correlated, which is normal since one is a CART and the other a random forest built on CARTs. So maybe aggregating them will help make a better prediction? Let’s use the main function from the caretEnsemble package, which fits a linear combination of the models, weighted by their quality of prediction:
merge.glm <- caretEnsemble(
  models,
  trControl = trainControl(
    method = "boot",
    number = 10,
    verboseIter = TRUE
  )
)
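The idea of a linear combination of models can be sketched in base R by regressing the observed response on the predictions of each model. This is only an illustration on synthetic data: the two prediction vectors are simulated, and the plain lm call stands in for whatever caretEnsemble fits internally.

```r
set.seed(123)
n <- 200
truth  <- runif(n, 5000, 30000)        # synthetic "observed" prices
pred_a <- truth + rnorm(n, sd = 3000)  # noisier model
pred_b <- truth + rnorm(n, sd = 2000)  # more accurate model

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# The blend is a linear model with the two prediction vectors as inputs.
blend <- lm(truth ~ pred_a + pred_b)
coef(blend)  # the better model should receive the larger weight

scores <- c(
  model_a = rmse(pred_a, truth),
  model_b = rmse(pred_b, truth),
  blend   = rmse(fitted(blend), truth)
)
scores
```

The comparison here is made on the data used to fit the blend, so it is optimistic: on new data, or when the models make very similar errors, the blend can end up no better than the best single model.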
Printing the fitted ensemble gives a summary of it:
merge.glm
Now let’s compare the RMSE of the individual models with that of the merged one:
merge.glm$models %>%
  map("results") %>%
  map_dfr(~filter(.x, RMSE == min(RMSE)) %>% select(RMSE)) %>%
  mutate(name = names(merge.glm$models)) %>%
  add_row(name = "merged", RMSE = merge.glm$ens_model$results$RMSE) %>%
  select(name, RMSE)
OK, we lost a little compared to the ranger model, but that is because we didn’t choose our models wisely enough. Maybe if we try something else, it’ll work? [Not finished…]
If we had chosen better inputs, the result would likely have been better.