Blending and Stacking models
IBM SPSS Modeler (or Modeler for short) is an orchestration-based data mining workbench with ETL capabilities. The tool provides access to advanced models such as Random Forest, Support Vector Machines and Neural Nets without a single line of coding, and without losing too much flexibility or insight into the models.
One of the features of the tool is the ability to build complex pipelines while keeping what is being done entirely transparent. This is mainly achieved by the orchestration-based interface and the ability to document each step of the process in the tool itself.
In this document, I will demonstrate how to build blending/stacking models. Blending or stacking refers to the method(s) where models take the predicted values of other models as predictors. It is often said that monstrous ensembles are only good for winning Kaggle competitions and never make it to production in practice. Peeking ahead, I will end up with a combination of 127 models, which scores in 7ms (non-optimized, on a single 8-core virtual machine) when a scoring web service is created around it (again, without a single line of coding).
The interface of Modeler with the completed blended/stacked model is shown at the top of the post. Modeler has a canvas on which nodes are placed; the resulting flow is called the stream. An end node can be executed, upon which the data flows through the nodes, from start to end, being transformed along the way by the operations of the nodes.
Let’s start with the data. I simulated data with a data generation scheme to create a data set with interesting higher-order interactions. This is also done in Modeler, and the details can be found in the appendix. The purpose of this is that I can generate new data from the ‘population’ at will, and so I can always generate another validation set. The first step after getting data is looking at it. Figure 2 shows how this is achieved in Modeler (read the documentation of the stream) and Figure 3 shows (part of) the data audit output.
The data shows 13 predictors of mixed measurement levels, with the continuous data being non-normal and skewed. The amount of missing data varies across the predictors. The binary target is unbalanced (5.4% in the True category). All in all, this is fairly realistic data for a small learning problem. Of course, I could have generated more data and more variables, but that would not have enhanced the demonstration of ensembles.
The first level models
With a choice of 16 different machine learning models that handle binary classification, it may be hard to choose which one to use. There’s an auto-classifier node that chooses the best set from those available and ensembles the results automatically. Here, we are interested in doing it a bit more explicitly. For the purpose of demonstration, I chose the following 3 models (which I will keep constant for the creation of the ensembles, but this is not necessary).
- A neural network: a multilayer perceptron with a tanh activation function for the hidden layer, softmax for the last layer and cross-entropy loss. The number of hidden nodes can be set, but in this case, I leave it up to the simulated annealing algorithm to determine the optimal architecture. This has the additional advantage that, by adding and removing nodes during the gradient descent, the training is less likely to end up in a local minimum. On top, I bag this model 25 times and use voting to combine the bags. It may sound fairly technical, but the only thing to configure inside the neural net node is to switch on bagging; the rest is the default. The runtime on a laptop is 13 seconds.
- A CHAID decision tree: recursive partitioning, 2+ nodes per split, Pearson Chi-square with Bonferroni correction as split criterion and automatic handling of continuous predictors. I kept the defaults of a tree that is 5 levels deep and has minimum sizes of 2%/1% for parent and child nodes, respectively. On top, I bag this model 25 times and use voting to combine the bags. Again, this sounds technical, but I only switched on bagging inside the node. Runtime: 13 seconds.
- A Support Vector Machine: I used an RBF (0.1) kernel with a regularization parameter of 10. There’s no automated bagging available for this model. Runtime: 0.7 seconds.
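To make "bag the model 25 times and use voting" concrete, here is a minimal sketch in plain Python. It is not what Modeler does internally: the weak learner (`fit_stump`, a one-variable threshold rule) and the toy data are hypothetical stand-ins, chosen only to show bootstrap resampling followed by a majority vote.

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    """Draw len(rows) rows with replacement (one 'bag')."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Hypothetical weak learner: threshold halfway between the class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    if not pos or not neg:
        # Degenerate bag with a single class: always predict that class.
        label = 1 if pos else 0
        return lambda x: label
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos >= mean_neg else -1
    return lambda x: 1 if sign * (x - threshold) > 0 else 0

def bagged_fit(rows, n_bags=25, seed=0):
    """Fit one weak learner per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap(rows, rng)) for _ in range(n_bags)]

def vote(models, x):
    """Combine the bags by majority vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Toy training data: (predictor, target) pairs.
train = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
models = bagged_fit(train)
high, low = vote(models, 0.95), vote(models, 0.05)
```

With an odd number of bags and a binary target, the vote can never tie, which is one practical reason to pick a number like 25.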
For simplicity, I assume that the hyperparameters of the models are more or less fine. I could go into a tuning frenzy here, but the purpose is to demonstrate pipelining of models and blending/stacking.
In order to evaluate the models on new and unseen data, I generate another 10k cases from the population. Normally, this is data that you would set apart from your training data, and if you don’t have enough data, you need to choose a percentage to set apart, or do n-fold cross-validation in order to maximize the use of your data. In this case, that is not needed. I copy the generated models and connect them to the validation data. Since the generated models contain the scoring code, this is all I have to do to score the models on new data.
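For readers who cannot simply generate more data, the n-fold cross-validation mentioned above can be sketched as follows. This is a generic illustration in plain Python, not a Modeler feature; the function name `kfold_indices` and its parameters are my own.

```python
import random

def kfold_indices(n_rows, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for n-fold cross-validation."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

splits = list(kfold_indices(10, n_folds=5))
```

Every row is used for testing exactly once and for training in all other folds, which is how the scheme maximizes the use of a small data set.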
The results show that the Decision tree is by far the best of the individual models (Table 1). In terms of AUC, the other models failed, or are at best poor. The reason is clear: missing data. I could have (automatically) filled it in (with an imputation model), but again, that’s not the purpose of this document. Interestingly, the ensemble outperforms the individual models.
The second level models
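Since the comparison is made in terms of AUC, a short reminder of what that number measures may help: it is the probability that a randomly chosen True case receives a higher score than a randomly chosen False case. The sketch below computes it by direct pairwise comparison; it is a generic illustration, not the routine Modeler uses, and the example labels and scores are made up.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 3 of the 4 positive/negative pairs are ranked correctly.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 0.5 corresponds to random ranking, which is why models close to that level are described as having failed.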
Now that we have scored the models on the unseen data, a natural question arises: can we use that data to build a second level model, that is, a model that takes the predicted values of the three base models and uses those as predictors? Note that this should never happen on the same data that the original models were trained on. If you do so, you will heavily overfit the second level models, with a bad performance on unseen data as a result. In the literature, there are stacking schemes which cross-fit partitions of data and combine that with a model fitted on the whole set. This also results in leakage, with overfitting as a result. Ensuring clean and reliable data is difficult enough. It would be my advice to keep the usage of the data as clean as possible during the modeling to prevent any leakage from happening.
I experimented with the stacking in two ways: using the predicted probabilities only, and blending the first level (original) predictors into the model. The latter yielded much better results, and as such, these are the results reported. The results in Table 2 show that the Decision tree outperforms the other models. Even on the testing data, it is better than the ensemble model.
Table 3 shows that in comparison with the first level models, we have a substantial increase in performance. Apparently, a non-linear combination of the original predictors with the predicted probabilities of the 3 base models results in a better model.
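The two variants differ only in which columns the second level model gets to see. A minimal sketch of building those inputs, assuming the fitted first level models are available as callables that return P(True) for a row; the function name, the `p_` prefix and the three stand-in models are hypothetical, and in Modeler this is done by connecting the model nuggets in the stream.

```python
def second_level_rows(rows, base_models, blend_original=True):
    """Build second level inputs from first level predicted probabilities.

    rows: list of dicts holding the original predictors
    base_models: dict mapping a model name to a callable returning P(True)
    """
    out = []
    for row in rows:
        features = {f"p_{name}": model(row) for name, model in base_models.items()}
        if blend_original:
            # Blending: keep the original predictors next to the predictions.
            features.update(row)
        out.append(features)
    return out

# Hypothetical stand-ins for the three fitted first level models.
base_models = {
    "nn": lambda r: 0.2,
    "chaid": lambda r: 0.9 if r["x1"] > 1 else 0.1,
    "svm": lambda r: 0.5,
}
# These rows must come from data the base models were NOT trained on.
holdout = [{"x1": 2.0, "x2": "a"}, {"x1": 0.5, "x2": "b"}]
stacked = second_level_rows(holdout, base_models, blend_original=False)
blended = second_level_rows(holdout, base_models, blend_original=True)
```

The second level model is then trained on `stacked` or `blended` together with the target of the held-out rows.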
The third level models
We can repeat this trick. Since we evaluated the level 2 models on a set of unseen data, the question arises whether training a model on this data to combine the second level predictions will, again, increase the performance of the ensemble. Note that (in contrast with the first level models), the (simple) ensemble of the second level models did not increase the overall performance. One now wonders if some non-linear combination will do that. The resulting stream is shown at the start of the document. It shows how, in the third line, the three models are trained again (on the second level predicted probabilities + the original predictors). The fourth line in the stream tests this model on unseen data. The results in Table 4 show that the performance of the Decision tree has increased once more. Note that no ensemble node has been included here, as this would make a fourth level model, and that is just the moment to stop (for no other reason than that it would get too repetitive). The demonstration shows that a third level model outperforms all models prior to that. Thus, the final model that will be used for scoring consists of the first two layers having the 3 models each, and the final layer consisting of the Decision tree. That makes (25+25+1) + (25+25+1) + 25 = 127 models for the total stacking model.
Since the models were created using datasets of 10k, one wonders if this is not too small, i.e., whether the performance is consistent when scored on new (or larger) datasets. One argument is that it will be stable, since the number of variables is small. Table 5 shows the results on an (entirely) new set of data, now scored on 100k records. The conclusions remain the same, although the variation is in the order of the second decimal; hence the difference between the second level ensemble and the third level decision tree might not be large enough to justify choosing the more complex level 3 decision tree.
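The model count can be checked with a few lines of Python; the dictionary layout is just my own bookkeeping of the levels described in the text.

```python
BAGS = 25  # each bagged model consists of 25 component models

level_sizes = {
    "level 1": BAGS + BAGS + 1,  # bagged neural net + bagged CHAID + single SVM
    "level 2": BAGS + BAGS + 1,  # the same three models, one level up
    "level 3": BAGS,             # only the bagged CHAID tree is kept
}
total_models = sum(level_sizes.values())
```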
IBM SPSS Modeler works together with the automation and deployment tool IBM SPSS Collaboration and Deployment Services. This is a repository that runs on top of an application server such as WebSphere and allows easy model deployment. The modeling stream can be stored inside the repository (as simple as saving the stream after connecting to the repository). Once it resides inside the repository, the execution of the stream can be triggered by a clock-based schedule or a message-based schedule (JMS/MQ) and can be tied into reporting, email and enterprise BPM steps. This is the batch mode of execution: it takes data from a database (or other input), scores the model and stores the results back into a database (or other output).
It is more interesting to encapsulate the model in a web service call to allow for real-time execution. This is achieved without a single line of coding (though not shown here as a screenshot). Once the web service is in place, I use SoapUI to call the web service:
In SoapUI, it is possible to request a load test. On an 8-core virtual server, the average response time is 7ms per model request (or about 10k scores/minute). This is only a starting point, since the web service can be optimized further (for medium-sized models, 20k scores/second have been observed). In addition, when creating a clustered setup of the real-time scoring engine, the scoring capacity goes up with every server added.
Through simulated data, it was shown how Modeler can be used for model blending and stacking. The main advantages are the code-free way of building advanced model stacks and the clear pipelining of results. Through deployment in Collaboration and Deployment Services, the stacked models can be scored both in batch and in real time.
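The step from a 7ms response time to a throughput per minute is simple arithmetic, sketched below. It assumes a single caller sending requests one after the other; the number of concurrent callers in the load test is not stated in the text, so the `per_minute` helper is a hypothetical, idealized extrapolation.

```python
avg_response_ms = 7

# One caller, one request at a time: 60,000 ms per minute / 7 ms per request.
per_minute_sequential = 60_000 / avg_response_ms

def per_minute(concurrent_callers):
    """Idealized throughput, assuming latency stays at 7 ms under load."""
    return concurrent_callers * per_minute_sequential
```

The sequential figure comes out at roughly 8,600 requests per minute, which is in line with the "about 10k scores/minute" quoted above.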
Appendix: the data generation process
The data is generated (roughly) with the following steps:
- For the continuous variables, data is generated from uniform and normal distributions. The categorical variables are drawn from multinomial distributions. Then, the target was determined by computing the target using some 2, 3 and 4 level interaction terms as
- Then the target was scaled to be between 0 and 1 by exp(.)/(1+exp(.)) and choosing a cutoff value to make the variable binary.
- The next step was to transform the continuous variables using sqrt, log and square transformations.
- Finally, missing data was generated under a Missing Completely At Random (MCAR) scheme.
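The steps above can be sketched in plain Python. The actual generation was done in Modeler and its coefficients and interaction terms are not given here, so everything specific in this sketch is an assumption: the three predictors, the coefficients in `eta`, the way the cutoff is chosen (an empirical quantile that yields roughly 5.4% True) and the single missing rate (the real data has varying rates per predictor).

```python
import math
import random

def generate(n_rows, true_rate=0.054, missing_rate=0.05, seed=1):
    rng = random.Random(seed)
    rows, scores = [], []
    for _ in range(n_rows):
        u = 1.0 - rng.random()   # uniform on (0, 1], safe for log()
        z = rng.gauss(0.0, 1.0)  # standard normal
        c = rng.choices(["a", "b", "c"], weights=[0.5, 0.3, 0.2])[0]
        # Hypothetical linear predictor with 2-way and 3-way interactions.
        eta = 2.0 * u * z + 1.5 * u * z * (c == "a") - 0.5 * (c == "c")
        # Scale to (0, 1) with exp(.)/(1+exp(.)).
        scores.append(math.exp(eta) / (1.0 + math.exp(eta)))
        # Transform the continuous variables after the target is computed.
        rows.append({"x_sqrt": math.sqrt(u), "x_log": math.log(u),
                     "x_sq": z * z, "cat": c})
    # Cutoff chosen so that roughly true_rate of the cases become True.
    cutoff = sorted(scores)[int((1.0 - true_rate) * n_rows)]
    for row, p in zip(rows, scores):
        row["target"] = int(p >= cutoff)
    # MCAR: every predictor value is dropped with the same fixed probability.
    for row in rows:
        for name in ("x_sqrt", "x_log", "x_sq", "cat"):
            if rng.random() < missing_rate:
                row[name] = None
    return rows

data = generate(1000)
true_share = sum(r["target"] for r in data) / len(data)
```

Because the models only see the transformed predictors while the target was built from the untransformed ones, the relationship they have to learn is non-linear on top of the interactions.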