Monday 12 February 2024

The modeling process - Machine Learning

 The modeling phase consists of four steps:

  1. Feature engineering and model selection
  2. Training the model
  3. Model validation and selection
  4. Applying the trained model to unseen data

1. Feature engineering and model selection

  • Feature engineering means coming up with candidate predictors (features) for the model, usually variables derived from a data set.
  • In practice, you often have to find the features yourself, and they may be scattered among different data sets.
  • Inputs often need to be transformed or combined before they become good predictors.
  • Some features only matter in combination. These are interaction variables: vinegar and bleach are fairly harmless on their own, but mixing them produces poisonous chlorine gas.
  • Modeling techniques can themselves be used to derive features, which is common in text mining: the output of one model becomes an input to another.
  • A common mistake in model construction is availability bias, where the only features used are the ones that were easy to get.
  • Models with availability bias often fail validation because they represent a one-sided truth.
  • A classic example is the fortification of planes in WWII: engineers initially studied only the planes that returned and so ignored the ones that were shot down, which carried the most important information.
  • Once the initial features are created, a model can be trained on the data.
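
The idea of an interaction variable can be sketched in a few lines of Python. This is only an illustration: the feature names `vinegar` and `bleach`, the helper `add_interaction`, and the tiny data set are made up here. The sketch adds a product term so that a model can see the combined effect that neither feature shows on its own.

```python
# Hypothetical observations: 1 means the substance is present, 0 means absent.
rows = [
    {"vinegar": 0, "bleach": 0},
    {"vinegar": 1, "bleach": 0},
    {"vinegar": 0, "bleach": 1},
    {"vinegar": 1, "bleach": 1},
]

def add_interaction(row, a, b):
    """Return a copy of row with an extra feature equal to row[a] * row[b]."""
    new_row = dict(row)
    new_row[f"{a}_x_{b}"] = row[a] * row[b]
    return new_row

# The interaction feature is 1 only when both substances are present.
engineered = [add_interaction(r, "vinegar", "bleach") for r in rows]

for r in engineered:
    print(r)
```

Only the last row gets a non-zero `vinegar_x_bleach` value, which is the signal a model would need in order to learn that the mixture is what matters.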

2. Training the model

  • With the right predictors and a modeling technique in mind, you can move on to training.
  • In this phase you present the model with data from which it can learn.
  • The most common modeling techniques have ready-to-use implementations in almost every programming language, including Python.
  • More advanced data science techniques may require you to carry out heavy mathematical calculations yourself.
  • Once a model is trained, model validation tests whether it extrapolates to reality.
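
To make "presenting the model with data" concrete, here is a minimal sketch of training one of the simplest models, a straight line fitted by ordinary least squares. The function name `fit_line` and the training data are assumptions made for this example; in practice you would normally reach for a library implementation rather than writing the arithmetic yourself.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical training data that follows y = 1 + 2x exactly.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(train_x, train_y)
print(intercept, slope)
```

The learned `intercept` and `slope` are the trained model: everything it knows about the data is stored in those two numbers.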

3. Model validation and selection

  • A good model has two properties: strong predictive power and the ability to generalize to data it hasn't seen.
  • Error measures and validation strategies are how you check both.
  • Classification error rate and mean squared error are two common error measures in machine learning.
  • Classification error rate is the percentage of observations in the test data that the model mislabeled; lower is better.
  • Mean squared error measures how large the average prediction error is, by averaging the squared errors.
  • Squaring has two consequences. First, a wrong prediction in one direction can't cancel out a wrong prediction in the other direction.
  • Second, bigger errors get more weight than they otherwise would, while errors smaller than 1 shrink.
  • Many validation strategies exist, including the following common ones:


■ Dividing your data into a training set with X% of the observations and keeping the rest as a holdout data set (a data set that’s never used for model creation). This is the most common technique.

■ K-folds cross validation: this strategy divides the data set into k parts and uses each part one time as a test data set while using the others as a training data set. This has the advantage that you use all the data available in the data set.

■ Leave-one-out: this approach is the same as k-folds but with k equal to the number of observations. You always leave one observation out and train on the rest of the data. This is used only on small data sets, so it’s more valuable to people evaluating laboratory experiments than to big data analysts.
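
The two error measures and the fold-based strategies above can be sketched in plain Python. The function names and sample values below are assumptions chosen for illustration. The splitter does not shuffle, so in practice you would usually randomize the order of the observations first.

```python
def classification_error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) pairs; each index is tested exactly once."""
    indices = list(range(n))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# One of four labels is wrong.
print(classification_error_rate(["a", "b", "a", "b"], ["a", "a", "a", "b"]))  # 0.25

# A single error of 2 contributes 4 to the sum because it is squared.
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))

# 3-fold split of six observations.
for train, test in k_fold_indices(6, 3):
    print(train, test)

# Leave-one-out: one fold per observation.
leave_one_out = list(k_fold_indices(4, 4))
```

A holdout split is the simplest case: take one `(train, test)` pair and never use the `test` part while building the model.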

 Machine Learning Regularization and Validation

  • Regularization means incurring a penalty for every extra variable used to construct the model.
  • L1 regularization penalizes the sum of the absolute values of the coefficients, which pushes some of them to exactly zero. The result is a model with as few predictors as possible, which helps robustness.
  • L2 regularization penalizes the sum of the squared coefficients, shrinking them toward zero and keeping the differences between them small, which makes the model more stable and easier to interpret.
  • Both forms are used to prevent over-fitting by limiting model complexity: L1 by reducing the number of features used, L2 by shrinking the coefficients.
  • Validation is crucial because it determines whether the model works in real-life conditions.
  • Test the model on data it has never seen, and make sure that data represents what the model will encounter when applied to fresh observations.
  • For classification models, instruments like the confusion matrix are useful: it tabulates predicted labels against actual labels.
  • Once a validated model is in place, it can be used to make predictions on new data.
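
As a rough sketch of the ideas in this list, the snippet below computes the L1 and L2 penalty terms for a set of coefficients and builds a confusion matrix for a binary classifier. All names (`l1_penalty`, `l2_penalty`, `confusion_matrix`, `strength`) and the sample labels are assumptions made for this example. During training, the penalty would be added to the model's error so that extra or large coefficients have to earn their keep.

```python
def l1_penalty(coefficients, strength):
    """L1 (lasso-style) penalty: strength times the sum of absolute coefficients."""
    return strength * sum(abs(c) for c in coefficients)

def l2_penalty(coefficients, strength):
    """L2 (ridge-style) penalty: strength times the sum of squared coefficients."""
    return strength * sum(c ** 2 for c in coefficients)

def confusion_matrix(y_true, y_pred, positive):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["tp" if t == positive else "fp"] += 1
        else:
            counts["fn" if t == positive else "tn"] += 1
    return counts

coefs = [0.0, 2.0, -3.0]
print(l1_penalty(coefs, 0.5))  # 2.5
print(l2_penalty(coefs, 0.5))  # 6.5

# Hypothetical true and predicted labels for five messages.
y_true = ["spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "ham"]
print(confusion_matrix(y_true, y_pred, positive="spam"))
```

Note how the coefficient of -3.0 dominates the L2 penalty (9 of the 13 squared units) but contributes proportionally to the L1 penalty (3 of 5).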

4. Applying the trained model to unseen data

  • Successful implementation of the first three steps results in a model that generalizes to unseen data.
  • Model scoring is the process of applying the model to new data.
  • It involves preparing a new data set with features as defined by the model.
  • The model is then applied to this new data set, resulting in a prediction.
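
Model scoring can be sketched as follows. The "trained model" here is a hypothetical linear model stored as a plain dictionary, and the feature names, the helpers `prepare_features` and `score`, and the new rows are all invented for illustration. The point is the two steps: shape the new data into the features the model expects, then apply the model.

```python
# Hypothetical trained model: an intercept plus one weight per feature.
model = {
    "intercept": 1.0,
    "weights": {"age": 0.5, "visits": 2.0},
}

def prepare_features(raw_row, feature_names):
    """Pick out the features the model expects, in order; fail loudly if one is missing."""
    missing = [name for name in feature_names if name not in raw_row]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return [float(raw_row[name]) for name in feature_names]

def score(model, raw_row):
    """Apply the linear model to one new observation and return its prediction."""
    names = list(model["weights"])
    values = prepare_features(raw_row, names)
    return model["intercept"] + sum(
        model["weights"][name] * value for name, value in zip(names, values)
    )

# New, unseen observations; the extra "city" column is ignored by the model.
new_data = [
    {"age": 30, "visits": 2, "city": "Ghent"},
    {"age": 40, "visits": 0, "city": "Leuven"},
]

predictions = [score(model, row) for row in new_data]
print(predictions)  # [20.0, 21.0]
```

Raising an error on a missing feature is a deliberate choice in this sketch: silently filling in a default would produce a prediction that looks valid but is not.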

Where machine learning is used in the data science process

 Although machine learning is mainly linked to the data-modeling step of the data science process, it can be used at almost every step. The data science process is shown below.

The data modeling phase can’t start until you have good-quality raw data that you understand. But prior to that, the data preparation phase can benefit from the use of machine learning. An example would be cleansing a list of text strings; machine learning can group similar strings together so it becomes easier to correct spelling errors.
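
The string-cleansing example can be approximated with the standard library. The sketch below is a simple similarity-based stand-in rather than a trained model: it greedily groups strings whose `difflib.SequenceMatcher` ratio to a group's first member reaches a threshold. The function name `group_similar`, the threshold of 0.8, and the city names are assumptions made for this example.

```python
from difflib import SequenceMatcher

def group_similar(strings, threshold=0.8):
    """Greedily put each string in the first group whose first member is similar enough."""
    groups = []
    for s in strings:
        for group in groups:
            similarity = SequenceMatcher(None, s.lower(), group[0].lower()).ratio()
            if similarity >= threshold:
                group.append(s)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([s])
    return groups

cities = ["Amsterdam", "Amsterdm", "Brussels", "Brusels", "amsterdam", "Paris"]
for group in group_similar(cities):
    print(group)
```

Each resulting group collects the likely spelling variants of one value, so a person only has to pick the correct spelling once per group.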

Machine learning is also useful when exploring data. Algorithms can root out underlying patterns in the data where they’d be difficult to find with only charts.

Given that machine learning is useful throughout the data science process, it shouldn’t come as a surprise that a considerable number of Python libraries were developed to make your life a bit easier.

Applications of machine learning in data science

 Regression and classification are of primary importance to a data scientist. To achieve these goals, one of the main tools a data scientist uses is machine learning. The uses for regression and automatic classification are wide ranging, such as the following:

  • Finding oil fields, gold mines, or archeological sites based on existing sites (classification and regression)
  • Finding place names or persons in text (classification)
  • Identifying people based on pictures or voice recordings (classification)
  • Recognizing birds based on their whistle (classification)
  • Identifying profitable customers (regression and classification)
  • Proactively identifying car parts that are likely to fail (regression)
  • Identifying tumors and diseases (classification)
  • Predicting the amount of money a person will spend on product X (regression)
  • Predicting the number of eruptions of a volcano in a period (regression)
  • Predicting your company’s yearly revenue (regression)
  • Predicting which team will win the Champions League in soccer (classification)

Occasionally data scientists build a model (an abstraction of reality) that provides insight into the underlying processes of a phenomenon. When the goal of a model isn’t prediction but interpretation, it’s called root cause analysis. Here are a few examples:

  • Understanding and optimizing a business process, such as determining which products add value to a product line
  • Discovering what causes diabetes
  • Determining the causes of traffic jams


This list of machine learning applications is only an appetizer, because machine learning is ubiquitous within data science. Regression and classification are two important techniques, but the repertoire and the applications don’t end there; clustering is another valuable technique.
