Behavioral profiling: building individual preference models
Contents
Introduction
In many industries it is of interest to understand a customer's individual preferences and traits in order to serve the customer better. Knowing the time at which someone gets hungry may give you a better chance of selling a hamburger, and using someone's favorite color and brand on the front page of an e-commerce website increases turnover. For many of those preferences, it doesn't work to simply ask the customer: they either don't know, or the answer depends on many other characteristics.
In this article I will demonstrate how to build an individual customer preference model using basic techniques such as cluster analysis, PCA and (shallow) neural networks. The example revolves around seat heating in a car, but the approach can be applied in many different contexts. The data is based on a real example, but is simulated in order not to disclose certain details (see the Appendix for the data generation process). Simulated data has the additional advantage that the truth is known, and here and there this will be used to peek at how well the models perform. In general, I find it useful to think of a data generating mechanism as the reverse of modeling: a way to try out certain modeling hypotheses.
Modern luxury cars have seat heating; a heated seat is very comfortable in colder climates. In the context of 'making everything smarter', one wonders whether it is possible to learn from customers switching on the seat heating at certain temperatures, and then use this extracted knowledge to switch on the seat heating automatically. The switching temperature is a personal preference: some like it hot, others don't, so a fixed, non-individual switch-on temperature would not be a good idea. Moreover, switching on the seat heating depends on external characteristics, in this case mostly the environment temperature, and as such it is not a static personal characteristic in itself, such as attention span or memory capacity. (Note that attention span and memory capacity are very interesting characteristics to infer, and they are more or less static.) For the seat heating example, the simplest method could be: derive the switch point as the midpoint between the maximum observed temperature at which the seat heating was switched on and the minimum temperature at which it was not. Yet, the decision may (or may not) also depend on more characteristics; rather than a single temperature, one could think of inside versus outside temperature. Combining the inside and outside temperature without a (machine learning) model amounts to guessing the decision boundary for an individual user rather than learning it from data.
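The simplest midpoint method mentioned above fits in a few lines. This is an illustrative sketch only; the function name and the example values are mine, not from the article:

```python
# Simplest baseline: per car, place the switch point midway between the
# highest temperature at which the heating was observed ON and the lowest
# temperature at which it was observed OFF.

def midpoint_switch_point(temps, heating_on):
    """Estimate an individual switch-on temperature from observations."""
    on_temps = [t for t, on in zip(temps, heating_on) if on]
    off_temps = [t for t, on in zip(temps, heating_on) if not on]
    if not on_temps or not off_temps:
        return None  # not enough evidence for this driver
    return (max(on_temps) + min(off_temps)) / 2.0

# Example: heating switched on at 3 and 8 degrees, off at 12 and 15.
print(midpoint_switch_point([3, 8, 12, 15], [True, True, False, False]))  # 10.0
```

Note that this baseline handles one temperature at a time; it says nothing about how to combine the inside and outside temperature, which is exactly where a model is needed.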
At this moment, pause for a second and think about the following points (for simplicity now, assume there’s a single driver per car):
- Since one tries to learn individual customer behavior, it seems logical to build one model per customer.
- Since one observes a customer using seat heating only after a car is in use, it seems necessary to put the learning algorithm itself in the car (complex, because of iterative optimization), rather than only the scoring code (easy, because it is a fixed ruleset or formula).
So one goes ahead, builds a chip that contains some machine learning algorithm, and applies some online learning scheme. But is this really needed? What does 'learning from a customer's behavior' really mean? What if the model itself contains a collection of decision boundaries, and scoring consists of classifying a customer's past observations to its most suitable decision boundary? Then no online learning is needed, and moreover, one would build a model across customers to learn the collection of decision boundaries, rather than one model per customer. That is where we are heading.
Let’s look at some data:
Figure 1: the data for the seat heating model
In Figure 1, a sample of the data is displayed. It shows how car 1 has 100 days of observations in which the in- and outside temperature of the car (in degrees Celsius) was stored, together with whether or not the owner switched on the seat heating (the target). Since there are only two predictors in this case, the data is easy to visualize. Figure 2 shows the following visualizations from left to right: first, a random sample of the data, in which the overlap between the red ("On") and blue ("Off") dots cannot really be seen. In the next panel, anomaly detection has been applied and only the 'anomalous' cases have been selected, in this case the outer ring per state. As can be seen, the On and Off states heavily overlap. Yet, if the data is displayed per driver, the On and Off states are fairly separable and the decision boundaries differ per car owner. It is of interest to build a model that estimates the decision boundary per car, yet without the need to include the car id in the model. (That would make scoring infeasible.)
Figure 2: visualizing the data
Model 1: a non-individual preference approach
The non-individual preference model takes the in- and outside temperature as predictors and predicts whether or not the seat heating will be switched on. The model used here is a simple neural network: the purpose of the article is to illustrate how to create an individual level model, rather than to compare models. Note that, in the classic regression context, the data are assumed to be independent observations. Clearly, this is not the case here, as a car owner is represented by multiple rows. As such, the data from one car owner is likely to be more alike than the data between car owners, hence the non-independence. To account for this correctly, one could go in the direction of random effect models (or mixed models); however, that would not lead to an individual preference model.
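The non-individual model can be sketched in a few lines. Here scikit-learn stands in for the Modeler neural network node, and the data is simulated with a single shared rule purely so there is something to fit; the rule, the sample sizes and the names are all illustrative:

```python
# One shallow network on the two temperatures only, ignoring which car the
# rows come from: the non-individual model. Data is simulated here with a
# single shared (non-individual) rule, just to make the sketch runnable.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
temp_in = rng.normal(10, 10, n)
temp_out = rng.normal(5, 10, n)
y = ((temp_in < 10) | (temp_out < 5)).astype(int)  # shared switching rule

X = StandardScaler().fit_transform(np.column_stack([temp_in, temp_out]))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(round(auc, 3))  # well above the 0.5 chance level on this clean rule
```

The AUC on this toy setup is not comparable to the numbers in Table 1, since the simulated rule here is noiseless and shared across cars.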
Table 1: model evaluation measure for the non-individual model
| Partition | AUC |
| --- | --- |
| Training | 0.926 |
| Testing | 0.920 |
Table 1 displays the AUC for training and testing, which indicates that the model fits well and is capable of generalizing to new data. The question, however, is what this number means in practice. The left panel of Figure 3 displays the percentage of correct 'guesses' of the model per car owner. As can be seen, most cases fall between 80% and 100%, which means that for most car owners, when they get into their car, 9 out of 10 times the model guesses correctly whether or not the seat heating should be switched on. However, for some car owners, the model guesses right in only 50% of the cases. The right panel shows the generic decision boundary. Using a probability cutoff value of 0.75 (75% is the base rate in the data), the model suggests the following rule:
- Switch on the seat heating when either the inside or the outside temperature is below 10 degrees Celsius.
Figure 3: further understanding of the non-individual model
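The per-owner percentage correct behind the left panel of Figure 3 is a plain groupby. A minimal sketch with pandas, using made-up values and illustrative column names:

```python
# Per-owner accuracy: for each car, the share of days on which the model
# guessed the heating state correctly. Toy data, illustrative names.
import pandas as pd

df = pd.DataFrame({
    "car_id":    [1, 1, 1, 1, 2, 2, 2, 2],
    "target":    [1, 1, 0, 0, 1, 0, 1, 0],
    "predicted": [1, 1, 0, 1, 0, 1, 0, 1],
})
pct_correct = (
    (df["target"] == df["predicted"])
    .groupby(df["car_id"])
    .mean()
    .rename("pct_correct")
)
print(pct_correct)  # car 1: 3 of 4 correct (0.75); car 2: 0 of 4 (0.0)
```

Measures like this, expressed per customer, are what makes an abstract AUC tangible for the business.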
Model 2: going towards an individual model: clustering
In order to derive an individual model, the car id cannot be used: at scoring time, the car id is a meaningless number. The observed temperatures, together with the switching behavior, hide somewhere the decision boundary that we are interested in. (Look at the last three panels of Figure 2: for each of them, you would be able to draw the decision boundary.) Somehow, the data needs to be transformed to one row per car to enable learning this overall behavior. The idea is then the following: group or cluster car owners with similar behavior together and build a model per cluster (or take the cluster membership in as an additional predictor). Now the question is: how to transform the data from the rows to the columns? The following seems intuitive: create deciles from the temperatures (across subjects), then per car owner compute the percentage of times that, in a particular decile, the seat heating was switched on (the base of the percentage being the total number of times a temperature in the decile at hand was observed for that car owner). A quick look at Figure 4 shows that deciles (10 parts) is probably a fair number: with fewer categories there would be less resolution; with more categories many of them would have insufficient observations.
Figure 4: deciles per temperature
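The row-to-column transformation described above can be sketched with pandas. The data here is random and the column names are illustrative; in the article, the same is done for both temperatures, yielding 20 columns:

```python
# Sketch of the histogram-predictor transformation: bin each temperature
# into deciles computed across ALL cars, then per car compute the share of
# observations in each decile where the heating was on.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "car_id": np.repeat(np.arange(20), 100),   # 20 cars, 100 days each
    "temp_in": rng.normal(10, 10, 2000),
    "target": rng.integers(0, 2, 2000),        # random on/off, toy only
})
# Deciles are computed across all subjects, not per car.
df["in_decile"] = pd.qcut(df["temp_in"], 10, labels=False) + 1

hist = (
    df.groupby(["car_id", "in_decile"])["target"].mean()
      .unstack("in_decile")                    # deciles become columns
      .add_prefix("pct_on_in_d")
)
print(hist.shape)  # one row per car, one column per decile: (20, 10)
```

A decile a car never visited yields a missing cell, the same gap that shows up later as the dip in Figure 12.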
The resulting data after the transformation is displayed in Figure 5 (only for the inside temperature; the outside temperature looks the same) and will be referred to as the histogram predictors. For car 1, it shows that when the inside temperature is low, the seat heating is always switched on, and when it becomes warmer, the seat heating is turned on less frequently. Note that since the transformation is applied per variable, we implicitly assume that the resulting decision boundaries are parallel to the temperature axes. Alternatively, one could create categories of the in- and outside temperatures together and bring those to the columns. That decision becomes more relevant when more observations per car owner are available.
Figure 5: transformed data to derive the decision boundaries
The resulting 20 columns are input to a cluster analysis, which resulted in 8 clusters. Since the data is simulated (see the Appendix for details), the true switch points are known. (Normally, this is not the case; finding those switch points is the whole purpose of the exercise.) With the known switch points, we can peek to see whether the cluster analysis did a good job, that is, whether the clusters are concentrated and compact when displayed against the true switching points. As can be seen in the right panel of Figure 6, the clusters are fairly compact. Note that the clusters themselves bear no label as to what the switching point should be; we are just peeking here to see whether the data transformation resulted in something useful. Note that if this were a real problem, by thinking about the data generating mechanism, one could still simulate data and observe whether the proposed data transformations make sense on the simulated data. As such, modeling and simulation go hand in hand.
Figure 6: result of the clustering
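Mechanically, the clustering step looks as follows. KMeans stands in here for the TwoStep method used in the article (see the Appendix), and the matrix is random just to show the shapes:

```python
# Cluster the 20 histogram columns (one row per car) into 8 groups of
# similar switching behavior. KMeans is a stand-in for TwoStep here.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X_hist = rng.random((1000, 20))   # per car: 2 temperatures x 10 deciles

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_hist)
print(km.labels_.shape)           # one cluster label per car: (1000,)
```

The labels produced here are the new per-car variable that the next section feeds back into the seat heating model.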
Figure 7 displays the data for modeling: per car_id, a cluster membership is now available, which becomes an additional variable in the original modeling data. There are two options for modeling: either add the resulting cluster variable to the model as a predictor, or build a model per cluster. Ideally, adding the cluster variable as a predictor is the best thing to do: if the modeling technique is flexible enough (that is, capable of finding and fitting variable interactions), then the data will indicate to what extent the cluster variable needs to be incorporated. If the cluster variable is very important, the model will include it as a full interaction with the other predictors, which mathematically is equivalent to building a separate model per cluster.
Figure 7: modeling data for the cluster seat heating model
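The two options can be made concrete as follows. Logistic regression stands in for the neural network, and the data, cluster assignment and boundary rule are all hypothetical:

```python
# Option (a): one model with the cluster label one-hot encoded next to the
# temperatures. Option (b): a separate model per cluster, which is the
# full-interaction equivalent. Toy data, toy boundary rule.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 2))             # in/out temperature (scaled)
cluster = rng.integers(0, 8, 2000)         # cluster membership per row
y = (X[:, 0] < (cluster - 3.5) / 3).astype(int)  # boundary varies by cluster

# (a) cluster as predictor: one-hot columns appended to the temperatures
Xa = np.hstack([X, np.eye(8)[cluster]])
model_a = LogisticRegression(max_iter=1000).fit(Xa, y)

# (b) model per cluster: a dict of independently fitted models
models_b = {
    c: LogisticRegression(max_iter=1000).fit(X[cluster == c], y[cluster == c])
    for c in range(8)
}
print(len(models_b))  # 8
```

Option (a) keeps one scoring artifact, but a model that cannot fit the full interaction (like a linear model, or a too-small network) will lag option (b), exactly as Table 2 shows.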
The AUCs of the models that include the cluster variable are shown in Table 2. Compared to the non-individual model, the AUC has increased significantly. Also, the model per cluster is significantly stronger than the cluster-as-predictor model. As it turns out, the one-hidden-layer neural network, although capable of automatically fitting interaction effects, does not have enough flexibility to incorporate the full interaction between the clusters and the temperature variables that is available in the data, hence the slightly worse performance. Note that rather than a single 8-cluster solution, a sequence of cluster models was run, ranging from 1 to 15 clusters, and each cluster solution in turn was added to the seat heating model. A cluster model with 8 clusters led to the best results on the hold-out data.
Table 2: model evaluation measure for the 8 cluster seat heating model
| Model | Training AUC | Testing AUC |
| --- | --- | --- |
| Cluster as predictor | 0.959 | 0.957 |
| Model per cluster | 0.982 | 0.978 |
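The selection of the number of clusters by downstream performance, rather than by a clustering criterion, amounts to a simple loop. A sketch, where `fit_and_score` is a placeholder for refitting the seat heating model and returning its hold-out AUC:

```python
# Choose k by the predictive performance of the DOWNSTREAM model: for each
# candidate k, recluster, refit the seat-heating model with the new labels,
# and keep the k with the best hold-out score.
from sklearn.cluster import KMeans

def best_k(X_hist, fit_and_score, k_range=range(1, 16)):
    """fit_and_score(labels) -> hold-out score of the downstream model."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(X_hist)
        scores[k] = fit_and_score(labels)
    return max(scores, key=scores.get), scores
```

In the article this loop over 1 to 15 clusters selected k = 8.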
Figure 8 shows that with the new model, the percentage correct per car owner has increased as well. The AUC already showed this; however, it is always a good idea not to rely solely on abstract measures of model fit, but to translate the resulting model into a meaningful representation for the business.
Figure 8: percentage correct per car ID for unseen cars
Figure 9 shows the observed temperatures for three individual cars along with the true states and their predictions. In addition, the decision boundary has been made visible. As can be seen, the three cars have different decision boundaries. Note that there are only 8 distinct decision boundaries, as there were 8 clusters.
Figure 9: cluster based decision boundaries
Model 3: a full individual model
The previous model used clusters to create decision boundaries for users with similar behavior. One thought might be to increase the number of clusters to make the model more individual; however, this would quickly lead to a situation where there are too few cases per cluster, with a poor model as a result. A cluster model creates latent or unobserved groups in the data. Would it be possible to project the histogram predictors into a continuous space rather than a categorical one? Instead of being a member of a group, this would allow for more fine-grained control: each set of histogram predictors would have its own point in some lower dimensional space, such that histogram predictors that are similar are close together. (Compare this to the cluster technique: histogram predictors that are similar belong to the same cluster.) There are many techniques for this, but an obvious one is PCA (or factor analysis). PCA and cluster analysis are mathematically closely related (both can be described as low rank matrix factorization methods). Practically, this can be seen as follows: if one performs a cluster analysis and a PCA on the same set of input variables, and plots the first two or three factors of the PCA in a scatterplot colored by cluster membership, the clusters almost always appear compact. Apart from enjoying this apparent agreement, it is generally also useful for understanding patterns in data. The scatterplot for the seat heating histogram data can be found in Figure 10. Indeed, the clusters are compact and not overlapping. Note once more that the cluster algorithm was run on the original histogram predictor data, not on the resulting factors. Moreover, in contrast to Figure 6, this plot does not peek at the unknown truth. With the knowledge that the unobserved truth is also two-dimensional, it is a straightforward choice to give the PCA two factors, yet this is not necessary.
In fact, a series of neural networks that included the PCA 4-, 3-, 2- and 1-factor solutions as predictors showed that the 2-factor solution gave the best model (not shown in a table).
Figure 10: the factors from the PCA, colored by the clusters from the cluster analysis
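Mechanically, the projection is a two-component PCA on the same 20 histogram columns that were clustered. A minimal sketch with random data, just to show the shapes:

```python
# Project the 20 histogram columns onto two PCA factors: cars with similar
# switching behavior land close together in this continuous 2-d space.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X_hist = rng.random((1000, 20))        # one row per car, as before

pca = PCA(n_components=2)
factors = pca.fit_transform(X_hist)    # one 2-d point per car
print(factors.shape)                   # (1000, 2)
```

Coloring these points by the cluster labels from the earlier step reproduces the kind of plot shown in Figure 10.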
Table 3 shows that the full individual model gives the best performance so far. Note that this model contains both the two PCA factors and still a separate model per cluster. Since the cluster analysis and the PCA contain much the same information, one wonders why the neural network would not learn enough from the PCA factors alone. I attribute this to the fact that, similar to fitting a model per cluster rather than adding the cluster membership as a predictor, the neural network is not able to fully incorporate the information. Adding the PCA factors per cluster gives the neural network the ability to apply finer-grained control within a cluster and determine the exact shape of the individual decision boundary.
Table 3: Model evaluation of the individual model
| Partition | AUC |
| --- | --- |
| Training | 0.985 |
| Testing | 0.980 |
To confirm that intuition, a small experiment was conducted, again by peeking at the true switching values. Figure 11 shows various decision boundaries while carefully manipulating the true switching variables. The top panel shows a set of decision boundaries for cluster 5, while the middle and bottom panels show decision boundaries for cluster 7. In the top and middle panels, the true inside temperature is kept between 11 and 12 degrees Celsius, while the graph shows the decision boundaries for various outside temperatures, indicated by the colors in the legend. The bottom graph keeps the outside temperature between 11 and 12 degrees Celsius and varies the inside temperature. As can be seen, each combination of in- and outside switching temperatures corresponds to a slightly different decision boundary; hence, this is truly a fully individual model.
Figure 11: individual decision boundaries
The PCA factors cannot be directly interpreted as the true switching temperatures, as the scale is different and the factors are linear transformations of the true switching temperatures (not shown, but confirmed by graphing). Yet, more or less the full information of the true switching temperatures is in there, as can be seen from the high performance of the model.
One may wonder: why not build a model with all the original histogram predictors in it? Although the full information is available there, the neural network is a shallow one (only one hidden layer) and cannot capture the complexity of the required transformation. Basically, this is asking the neural network to perform a PCA or something similar. A neural network with more layers is potentially capable of doing this, yet helping the model to find the right relation by preparing the data with a cluster analysis and a PCA certainly works as well. Table 4 shows the performance of the model in which the full set of 20 histogram predictors is used in addition to the current in- and outside temperature. Clearly, this model is not a winner.
Table 4: model performance with the full histogram predictor set
| Partition | AUC |
| --- | --- |
| Training | 0.830 |
| Testing | 0.830 |
The scoring procedure comes down to a set of simple formulas: first, every new observation is classified into the corresponding decile and updates the percentage there. Next, the factor analysis takes this as input and produces two factor scores, and similarly, the cluster model produces a cluster score. Finally, the current in- and outside temperature, together with the cluster and factor scores, feed into the neural network to produce a probability, which, with a chosen cutoff, can be used to decide whether or not to turn on the seat heating. Thus, there is no online learning, and there is no need for one model per car. In fact, a small demonstration shows that one model per car is likely not a good idea.
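The scoring steps above can be sketched as one function. Everything here is a hypothetical sketch: the fitted objects (`pca`, `kmeans`, `net`, `decile_edges`) are assumed to come out of the training phase, and the histogram is simplified to the inside temperature only:

```python
# One scoring step for one car: update its decile histogram with the newest
# observation, recompute factor and cluster scores, and ask the network for
# the probability that the heating should be on. No learning happens here.
import numpy as np

def score(state, temp_in, temp_out, heating_on,
          decile_edges, pca, kmeans, net):
    """state holds this car's per-decile counts and 'on' counts."""
    d = int(np.searchsorted(decile_edges, temp_in))   # decile of new obs
    state["n"][d] += 1
    state["on"][d] += int(heating_on)
    hist = state["on"] / np.maximum(state["n"], 1)    # pct 'on' per decile
    factors = pca.transform(hist.reshape(1, -1))      # two factor scores
    cluster = kmeans.predict(hist.reshape(1, -1))     # one cluster label
    x = np.hstack([[temp_in, temp_out], factors.ravel(), cluster])
    return net.predict_proba(x.reshape(1, -1))[0, 1]  # P(switch on)
```

Note that new observations change the histogram, and therefore the factor and cluster scores, and therefore the prediction: the model adapts to the driver purely through scoring.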
The non-smooth curves in Figure 12 show the observed percentages per decile for one car. The dark blue line shows a dip at decile 6, as there are no observations in that interval. The dark red line goes up towards decile 10, maybe due to cold inside temperatures that were observed while the outside temperature was in decile 10. This is the quirkiness of real data. The PCA looks at the correlations between the deciles and transforms the data to a lower dimensional space. This can be seen as a smoothing operation: the random variations are captured in the later factors rather than in the first two, and since those later factors are discarded, the result is a smoothed curve. Indeed, this is the case: for the given car, the two-dimensional point can be transformed back to the original decile data, and the lighter blue and orange curves show the smooth result. Also note that, very naturally, the higher the decile, the lower the probability that the seat heating is switched on. As such, it is concluded that placing the data in a lower dimensional space smooths the volatility out of the data, and having many cars together in the analysis helps, as the PCA 'learns' from all the cars together.
Figure 12: Observed and estimated temperature curves
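The smoothing effect is easy to reproduce. In this sketch the per-car curves are synthetic, built from a two-dimensional family of smooth shapes plus per-cell noise (all shapes and parameters are mine, not the article's):

```python
# PCA reconstruction as smoothing: project each noisy per-car decile curve
# onto two factors and map it back; the noise carried by the discarded
# factors disappears, leaving a smoothed curve.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 10)
bases = np.vstack([1 - t, np.cos(np.pi * t)])      # two smooth shapes
smooth = rng.random((500, 2)) @ bases              # 500 cars, 10 deciles
noisy = smooth + rng.normal(0, 0.05, smooth.shape)

pca = PCA(n_components=2).fit(noisy)
reconstructed = pca.inverse_transform(pca.transform(noisy))

err_noisy = np.mean((noisy - smooth) ** 2)
err_recon = np.mean((reconstructed - smooth) ** 2)
print(err_recon < err_noisy)  # reconstruction is closer to the true curve
```

Because the PCA is fitted across all 500 cars, each individual car's reconstructed curve borrows strength from the others, which is exactly the point made above.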
Concluding remarks
The purpose of this article was to show how to derive individual preferences or traits from data in order to serve the customer better. This was illustrated with an application that makes the customer's life easier by learning from his behavior and applying what was learned automatically. The initial idea might have been to build a model per customer and use an online learning method to update that model as data becomes available. Instead, an approach was presented where 1) modeling was done on the whole set of customers, so that individual predictions improve by learning from neighboring data, and 2) newly arriving data changes the predictions as part of the scoring procedure, because the model contains all the decision boundaries once learning is done.
The logic can be applied to a wider range of cases. For example, for pre-paid churn at a telecom operator, one observes usage and charging behavior along with a customer's decision to leave or stay. In a similar fashion as discussed, there may be a decision boundary, personal to every customer, which, once crossed, makes the customer decide to leave. Another example may be found in the retail industry, where browsing behavior in combination with attractiveness and price point makes up a customer's decision to buy. Understanding this unobserved preference can greatly increase turnover.
Apart from demonstrating the method, I hope to have entertained the reader, particularly on the following points:
- Hypothesizing the data generating mechanism of certain processes allows one to simulate data and to see whether analytical methods pick up the desired effects.
- A combination of simple techniques can still give powerful models.
- Rather than focusing on abstract model evaluation measures, it is always a good idea to get deeper insight into the models and methods by visualizing their inner workings and by creating model evaluation measures that explain in business terms how well the model works.
Appendix
Technical model details
Three models were used in the creation of the method:
- A neural network: a multilayer perceptron with a tanh activation function for the hidden layer, a softmax for the output layer, and a cross-entropy loss function. The number of hidden nodes can be set, but in this case I left it to the simulated annealing algorithm to determine the optimal architecture. This has the additional advantage that, by adding nodes during gradient descent, it is less likely to end up in a local minimum.
- PCA/Factor analysis: extraction was done on the correlation matrix; varimax rotation was applied after the number of resulting factors was chosen. The number of factors (although, known to be two), was varied and chosen to maximize the performance of the seat heating model.
- Cluster analysis was performed using the TwoStep cluster method. This is a scalable cluster analysis algorithm designed to handle very large data sets, requiring only one data pass. It has two steps: 1) pre-cluster the cases (or records) into many small sub-clusters; 2) cluster the sub-clusters resulting from the pre-cluster step into the desired number of clusters. It can also select the number of clusters automatically; however, in the models above, the number of clusters was chosen to maximize the predictive performance of the seat heating model on unseen data. The pre-cluster step uses a sequential clustering approach: it scans the data records one by one and decides, based on a log-likelihood distance criterion, whether the current record should be merged with a previously formed cluster or start a new one. The cluster step takes the sub-clusters (non-outlier sub-clusters, if outlier handling is used) resulting from the pre-cluster step as input and groups them into the desired number of clusters using an agglomerative hierarchical clustering method. Since the number of sub-clusters is much smaller than the number of original records, traditional clustering methods can be applied effectively.
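TwoStep itself is specific to SPSS, but scikit-learn's Birch algorithm follows the same two-phase idea: a single-pass tree-based pre-clustering into many sub-clusters, followed by a global clustering of those sub-clusters. A minimal sketch on synthetic blobs, offered as an open-source analog rather than a reimplementation of TwoStep:

```python
# Birch as a two-phase analog of TwoStep: phase 1 builds a CF-tree of
# sub-clusters in one pass; phase 2 agglomerates the sub-clusters into
# the requested number of clusters.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(c, 0.3, (200, 2)) for c in (0, 3, 6)])

birch = Birch(n_clusters=3)      # phase 2: merge sub-clusters into 3
labels = birch.fit_predict(X)    # phase 1: single-pass CF-tree build
print(len(set(labels)))          # 3
```

Note that Birch uses a Euclidean threshold criterion rather than TwoStep's log-likelihood distance, so results will differ in detail.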
The data generating mechanism
The training set is based on a sample of 1,000 cars. For each car, two true switching temperatures are sampled: the outside switching temperature from a Normal(5, 5) and the inside switching temperature from a Normal(10, 5). Then, for each car, 100 days of observed temperatures are sampled, from a Normal(5, 10) for the outside temperature and a Normal(10, 10) for the inside temperature. Finally, a small noise term is sampled from a Normal(0, 1). To determine whether or not the seat heating was switched on, the following logic was applied:
- Target = (temp_in + noise) < temp_in_True OR (temp_out + noise) < temp_out_True
The _True temperatures were not used in the analysis, except where it is mentioned that we peeked. In fact, given the observed temperatures and the switching behavior of the customer, the purpose of the analysis was to uncover the true switching temperatures, along with the choice logic. Note that, for the observed temperatures, no correlation has been modeled between the in- and outside temperature. This is unrealistic; however, any positive correlation would make the analysis easier: imagine a correlation of 1, then the challenge at hand reduces to a one-dimensional one.
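The generator above translates directly into code. This is a sketch of my reading of the appendix (with the observed temperature means matched to the true switch-point means); variable names are mine:

```python
# Data generating mechanism: per car, two true switching temperatures;
# per day, observed in/outside temperatures plus a shared small noise term;
# heating is on when either noisy temperature falls below its switch point.
import numpy as np

rng = np.random.default_rng(42)
n_cars, n_days = 1000, 100

temp_out_true = rng.normal(5, 5, n_cars)
temp_in_true = rng.normal(10, 5, n_cars)

car_id = np.repeat(np.arange(n_cars), n_days)
temp_out = rng.normal(5, 10, n_cars * n_days)
temp_in = rng.normal(10, 10, n_cars * n_days)
noise = rng.normal(0, 1, n_cars * n_days)

target = ((temp_in + noise) < temp_in_true[car_id]) | \
         ((temp_out + noise) < temp_out_true[car_id])
print(target.mean())  # roughly 0.75, matching the base rate mentioned earlier
```

Each comparison is near a coin flip per observation, and the OR of the two gives the roughly 75% base rate quoted in the discussion of the cutoff.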
The tooling
IBM SPSS Modeler (or, for short: Modeler) is an orchestration-based data mining workbench with ETL capabilities. The tool provides access to advanced models such as Random Forests, Support Vector Machines and Neural Nets without a single line of code, and without losing a great deal of flexibility or insight into the models.
One of the features of the tool is the ability to build complex pipelines while keeping it entirely transparent what is being done. This is mainly achieved by the orchestration-based interface and the ability to document each step of the process in the tool itself.
Figure 13: IBM SPSS Modeler
The top part represents the data generation and the non-individual model. The two lines of nodes indicate the training and testing partition. The bottom part represents the data generation followed by the individual models. Again, the two lines of nodes indicate the training partition and the partition that generated new and unseen data to validate the model.