The modeling phase consists of four steps:
- Feature engineering and model selection
- Training the model
- Model validation and selection
- Applying the trained model to unseen data
1. Feature engineering and model selection
- Features are the variables a model uses to make predictions; feature engineering is the process of deriving them from a data set.
- In practice, features often have to be gathered independently, scattered across different data sets.
- Features often need to be transformed or combined before they become useful for prediction.
- Interaction variables capture the combined effect of two variables: vinegar and bleach are each fairly harmless on their own, but mixing them produces a significant effect.
- Modeling techniques themselves can be used to derive features, as is common in text mining.
- One common mistake in model construction is availability bias, where features are only those easily available.
- Models with availability bias often fail when validated, as they represent a one-sided truth.
- A classic example is plane fortification in WWII, where engineers initially ignored an important part of the data: damage information from the planes that never returned was simply not available to them.
- Once the initial features are created, a model can be trained on the data.
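As a minimal sketch of feature engineering, the snippet below derives an interaction variable from two existing features; the column names (`price`, `quantity`) are purely illustrative:

```python
# Sketch of deriving an interaction feature; column names are hypothetical.
rows = [
    {"price": 10.0, "quantity": 3},
    {"price": 4.0, "quantity": 7},
]

for row in rows:
    # The interaction variable combines two existing features into one,
    # letting a model capture their joint effect.
    row["price_x_quantity"] = row["price"] * row["quantity"]
```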
2. Training the model
- Training a model means presenting the data to the model and letting it learn, given appropriate predictors and a chosen modeling technique.
- Implementations of common modeling techniques are available in almost every programming language, including Python.
- More advanced data science techniques may require you to handle heavy mathematical calculations yourself.
- Whether the trained model extrapolates to reality is tested later, during model validation.
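To make "presenting the data to the model" concrete, here is a minimal sketch of training a simple linear model (ordinary least squares for y ≈ a·x + b) written out by hand; the data points are made up for illustration:

```python
# Illustrative training data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope and intercept from the normal equations: this is the "learning" step.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b = mean_y - a * mean_x
```

In practice a library would do this fitting for you; the point is only that training reduces to estimating parameters from the presented data.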
3. Model validation and selection
- A good model in data science has two properties: predictive power and generalizability to unseen data.
- Error measures and validation strategies are crucial for model quality.
- Classification error rate and mean squared error are common error measures in machine learning.
- Classification error rate is the percentage of test data that the model mislabeled.
- Mean squared error measures the average of the squared prediction errors.
- Squaring the errors prevents overestimates and underestimates from canceling each other out.
- It also gives bigger errors extra weight, while errors smaller than one shrink.
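Both error measures are straightforward to compute; the labels and values below are made up for illustration:

```python
# Classification error rate: share of mislabeled test observations.
y_true_labels = ["spam", "ham", "spam", "ham"]
y_pred_labels = ["spam", "spam", "spam", "ham"]
error_rate = sum(t != p for t, p in zip(y_true_labels, y_pred_labels)) / len(y_true_labels)

# Mean squared error: average of squared errors, so errors in opposite
# directions can't cancel out, and large errors weigh heavily.
y_true = [3.0, -0.5, 2.0]
y_pred = [2.5, 0.0, 2.0]
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```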
- Many validation strategies exist, including the following common ones:
■ Dividing your data into a training set with X% of the observations and keeping the rest as a holdout data set (a data set that’s never used for model creation)—This is the most common technique.
■ K-folds cross validation—This strategy divides the data set into k parts and uses each part one time as a test data set while using the others as a training data set. This has the advantage that you use all the data available in the data set.
■ Leave-1 out—This approach is the same as k-folds, but with folds of a single observation: you always leave one observation out and train on the rest of the data. This is used only on small data sets, so it’s more valuable to people evaluating laboratory experiments than to big data analysts.
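A k-folds split can be sketched in a few lines over observation indices; libraries such as scikit-learn provide this ready-made, but the logic is just the following (function name and signature are my own):

```python
# Sketch of a k-folds split over observation indices (no external libraries).
def k_fold_splits(n_obs, k):
    """Yield (train_indices, test_indices), using each fold once as test data."""
    indices = list(range(n_obs))
    fold_size, remainder = divmod(n_obs, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds so sizes differ by at most 1.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop
```

Because every observation lands in exactly one test fold, all the available data gets used for both training and testing, as noted above. Setting the fold size to one observation gives the leave-1-out strategy.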
- Regularization in machine learning penalizes the extra variables used in model construction, which helps prevent over-fitting.
- L1 regularization aims for a model with as few predictors as possible, which improves robustness.
- L2 regularization aims to keep the variance between the predictor coefficients as small as possible, which increases interpretability.
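The two penalty terms can be written down directly; the coefficient values and the penalty strength `lam` below are illustrative, not from a real fit:

```python
# Sketch of the L1 and L2 penalty terms added to a model's training loss.
coefs = [0.0, 2.5, -1.5]  # hypothetical model coefficients
lam = 0.1                 # hypothetical penalty strength

l1_penalty = lam * sum(abs(c) for c in coefs)  # pushes coefficients toward zero
l2_penalty = lam * sum(c * c for c in coefs)   # keeps coefficients small and even

# penalized_loss = training_loss + l1_penalty   (L1 / lasso)
# penalized_loss = training_loss + l2_penalty   (L2 / ridge)
```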
- Validation is crucial as it determines the model's effectiveness in real-life conditions.
- Models should be tested on data the model has never seen, and that data should represent what the model would encounter when applied to fresh observations.
- Tools such as the confusion matrix are especially helpful for evaluating classification models.
- Once a model is constructed, it can be used to predict the future.
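For a binary classifier, the confusion matrix mentioned above boils down to four counts; the labels below are made up for illustration:

```python
from collections import Counter

# Sketch of a confusion matrix for a binary classifier (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = Counter(zip(y_true, y_pred))
tp = counts[(1, 1)]  # true positives: correctly predicted positives
tn = counts[(0, 0)]  # true negatives: correctly predicted negatives
fp = counts[(0, 1)]  # false positives: negatives mislabeled as positive
fn = counts[(1, 0)]  # false negatives: positives mislabeled as negative
```

The classification error rate follows directly: (fp + fn) divided by the total number of observations.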
4. Applying the trained model to unseen data
- Successful implementation of the first three steps results in a model that generalizes to unseen data.
- Model scoring is the process of applying the model to new data.
- It involves preparing a new data set with features as defined by the model.
- The model is then applied to this new data set, resulting in a prediction.
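Model scoring can be sketched as applying stored coefficients to a freshly prepared observation; the feature names and coefficient values below are hypothetical:

```python
# Sketch of model scoring: applying an already-trained model to new data.
# The coefficients are illustrative, not the result of a real fit.
model = {"intercept": 0.5, "weights": {"age": 0.02, "income": 0.0001}}

def score(model, observation):
    """Return the model's prediction for one new observation."""
    return model["intercept"] + sum(
        w * observation[name] for name, w in model["weights"].items()
    )

# The new data set must contain the features the model was trained on.
new_observation = {"age": 40, "income": 30000}
prediction = score(model, new_observation)
```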