Some months ago I was ingesting my usual load of news related to data science when I saw one of those “frequently asked questions” lists that proliferate nowadays with the boom of interest in analytics topics.

While I could honestly answer most of the questions, I found that some topics required a memory refresh, and in doing so, I thought it would help others refresh theirs, while leaving the door open to visitors’ contributions, in a sort of online notebook for quick reference.

Let’s start the FAQ with the first question: “What are the different stages of building a model?”

In general, building a (predictive) model involves three phases:

 

1. Question

During this phase we explore the data in order to answer a specific question, such as understanding online customer behaviour. We try to gain insight into which particular data we need, and how much of it, to say something meaningful about our subject. This specific question is very often vague, and it is during this phase that we get a feel for what we might be looking for and how we could answer it.

As the amount of data might be overwhelming, we need to identify what is relevant and where we must focus; this is called feature selection.
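A minimal sketch of what feature selection can look like, using scikit-learn's univariate F-test on synthetic data (the dataset and the choice of keeping five features are assumptions made purely for illustration):

# Minimal feature-selection sketch: keep the features that best separate
# the classes according to a univariate ANOVA F-test.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for real customer data: 20 candidate features,
# of which only 5 actually carry information about the target.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(np.flatnonzero(selector.get_support()))    # indices of the retained features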

Once we have some data to focus on, we will try to identify structures or patterns underlying it. This can ultimately lead us to one or more models. However, we will very likely need several Question-Model-Validate iterations to get a better idea of the question… and of the potential answer.

 

2. Model

During this stage we model the phenomenon being studied, which can be periodic and present a clear pattern, or be unusual and non-periodic.

We analyze the data in several ways to build a model of the phenomenon under study.

The different kinds of data analysis can be categorized as follows:

  • Descriptive analysis: describe the data and its main features.
  • Exploratory analysis: find connections in the data (these first two categories are sketched in the short example after this list).
  • Inferential analysis: estimate quantities and the uncertainty about them. This depends on the population being studied and the sample at hand (and the sampling technique!).
  • Predictive analysis: use historical data to predict future events.
  • Causal analysis: explain the relationship between variables, e.g. what happens to one variable when you change another.
  • Mechanistic analysis: determine the exact change in a variable leading to a change in another variable. This is basically a mathematical model, a set of deterministic equations.
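To make the first two categories concrete, here is a minimal sketch on a made-up customer dataset (all column names and values are hypothetical):

# Descriptive analysis: summarize the main features of the data.
# Exploratory analysis: look for connections between variables.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({                              # hypothetical customer data
    "age": rng.integers(18, 70, 200),
    "visits_per_month": rng.poisson(6, 200),
    "basket_value": rng.gamma(2.0, 30.0, 200),
})

print(df.describe())                             # descriptive: counts, means, quartiles, spread
print(df.corr())                                 # exploratory: pairwise correlations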

Modeling techniques may be grouped into two main families:

  • Mathematical modeling: differential/integral calculus, algebra.
  • Statistical modeling: statistics & machine learning, linear algebra, information theory.

In practice we often use a combination of both.
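As a minimal sketch of the statistical side, on synthetic data generated purely for illustration, fitting a linear model is often the first step:

# Minimal statistical-modeling sketch: fit a linear model to synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))              # one explanatory variable
y = 3.0 * x[:, 0] + 2.0 + rng.normal(0, 1.0, 200)  # linear signal plus Gaussian noise

model = LinearRegression().fit(x, y)
print(model.coef_, model.intercept_)               # should recover roughly 3.0 and 2.0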

 

3. Validation

Model validation is where we measure how well our model performs. We can thus compare different models to select the best one.

We partition (split) the data into three sets (a splitting sketch follows the list):

  • Training data: used to train (fit) a particular model.
  • Validation data: used to compare candidate models and select one.
  • Test data: used to measure the performance of the selected model on data it has never seen.
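A minimal splitting sketch, assuming a feature matrix X and a target y (here replaced by placeholders) and a 60/20/20 split, which is just one reasonable choice:

# Split a dataset into training (60%), validation (20%) and test (20%) sets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                  # placeholder feature matrix
y = rng.integers(0, 2, size=1000)               # placeholder binary target

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))    # 600, 200, 200

# Train candidate models on (X_train, y_train), compare them on (X_val, y_val),
# and report the performance of the chosen model on (X_test, y_test) only once.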

Validation is not only the actual selection of a model; it is also the validation of certain assumptions about our data. For example, after feature selection, we ask ourselves whether we have enough data points to recognize a particular pattern.

Key concepts in model validation (a small numeric sketch follows this list):

  • Loss function: estimates how far predicted values are from actual ones. Examples: squared error loss, absolute error loss, Lp loss, Kullback-Leibler loss, etc.
  • Risk: the expected value of the loss function.
  • Cost function: the quantity we minimize when fitting a model, for example the sum of squared residuals, while taking the bias-variance tradeoff into account.
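As a small numeric sketch of these concepts, with made-up actual and predicted values:

# Loss functions measure the error per prediction; the (empirical) risk
# is their average over the sample. All numbers here are made up.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])      # actual values (hypothetical)
y_pred = np.array([2.8, 5.4, 2.0, 8.0])      # predicted values (hypothetical)

squared_loss = (y_true - y_pred) ** 2        # squared error loss per point
absolute_loss = np.abs(y_true - y_pred)      # absolute error loss per point

print("empirical risk (mean squared error):", squared_loss.mean())
print("empirical risk (mean absolute error):", absolute_loss.mean())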

The tradeoff is, in other words, between how closely the model fits the training data (a poor fit corresponds to high bias) and how much its predictions vary from one training sample to another (high variance), which is what limits how well it generalizes to new data.

Let x_1, …, x_n be a set of data points and y_i a value associated with x_i.

Assume also that there is a function f(x) such that y_i = f(x_i) + \epsilon, where \epsilon is a noise term that is normally distributed with zero mean and variance \sigma^2.

When we model, we are looking for a function \hat{f}(x) that approximates y as well as possible, thus minimizing

\left(y - \hat{f}(x)\right)^2 for any data point in our sample, but also for new data that is not in our sample.

It turns out that, for any function \hat{f}(x) that we use through a supervised learning algorithm, the expected error of the function is

E\left[\left(y - \hat{f}(x)\right)^2\right] = \sigma^2 + Bias\left[\hat{f}(x)\right]^2 + Var\left[\hat{f}(x)\right] (Mean Square Error Loss).

Where

Bias\left[\hat{f}(x)\right] = E\left[\hat{f}(x) - f(x)\right] and,

Var\left[\hat{f}(x)\right] = E\left[\left(\hat{f}(x) - E\left[\hat{f}(x)\right]\right)^2\right]
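For completeness, here is a sketch of why this decomposition holds, assuming the noise \epsilon on a new observation is independent of \hat{f}(x), with E\left[\epsilon\right] = 0 and Var\left[\epsilon\right] = \sigma^2. Writing y = f(x) + \epsilon and expanding the square,

E\left[\left(y - \hat{f}(x)\right)^2\right] = E\left[\left(f(x) - \hat{f}(x)\right)^2\right] + 2E\left[\epsilon\right]E\left[f(x) - \hat{f}(x)\right] + E\left[\epsilon^2\right] = E\left[\left(f(x) - \hat{f}(x)\right)^2\right] + \sigma^2.

Adding and subtracting E\left[\hat{f}(x)\right] inside the remaining square then gives

E\left[\left(f(x) - \hat{f}(x)\right)^2\right] = \left(E\left[\hat{f}(x)\right] - f(x)\right)^2 + E\left[\left(\hat{f}(x) - E\left[\hat{f}(x)\right]\right)^2\right] = Bias\left[\hat{f}(x)\right]^2 + Var\left[\hat{f}(x)\right].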

As we see, the mean squared error decomposes into an irreducible error term (inherent to the data), a bias term (the error introduced by the simplifying assumptions made when applying a particular learning method) and a variance term (how much \hat{f}(x) fluctuates around its mean). This is what is known as the bias-variance tradeoff: a low-complexity model (few features) will yield low variance and high bias, whereas a more complex model (more features) will yield higher variance and lower bias.
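To see the tradeoff in numbers, here is a minimal simulation sketch; the sine-shaped true function, the noise level and the polynomial degrees 1 and 5 are all hypothetical choices made just for illustration:

# Estimate bias^2 and variance at a single test point by refitting
# polynomial models of low and high degree on many simulated training sets.
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                       # "true" function (hypothetical choice)
sigma = 0.3                      # noise standard deviation
x0 = 1.5                         # test point where bias and variance are measured

for degree in (1, 5):            # simple model vs more complex model
    preds = []
    for _ in range(500):         # many independent training sets
        x = rng.uniform(0.0, np.pi, 30)
        y = f(x) + rng.normal(0.0, sigma, x.size)
        coefs = np.polyfit(x, y, degree)         # fit a polynomial of this degree
        preds.append(np.polyval(coefs, x0))      # its prediction at x0
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2          # squared bias at x0
    var = preds.var()                            # variance at x0
    # Typically: degree 1 -> higher bias^2, lower variance; degree 5 -> the reverse.
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")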

 

Alexis

Alexis is the founder of Aleph Technologies, a data infrastructure consulting and professional services provider based in Brussels, Belgium.
