Build Classification Models

In classification problems, a machine learning model aims to predict which category a given input data point belongs to. If the column (feature) you choose to predict has a binary or categorical data type, evoML automatically detects the task as a classification problem.

In this guide, we walk through the workflow for building a classification model, from data preparation through model configuration, training, and evaluation to deployment.

1. Prepare Dataset

  1. Upload: Start by uploading or importing your dataset.
  2. Explore: Selecting the uploaded dataset shows additional details that help you understand its structure and quality. Key steps include:
    • Examine Feature Types: evoML automatically detects the type of each feature (numeric, categorical, etc.). Review the detected feature types to ensure proper handling.
    • Analyse Patterns: Investigate trends, seasonality, and patterns. Check tags generated by evoML.
    • Check Feature Distribution: Analyse the distribution of features (including the target variable) to identify skewed distributions or other potential problems.
    • Correlation & Association: Compute correlation and association between features.
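
The checks above happen inside the evoML interface, but it can help to see what two of them compute. The following is a minimal sketch in plain Python, not evoML code: the dataset, the column names, and the helper functions `class_distribution` and `pearson` are all made up for illustration.

```python
from collections import Counter
from math import sqrt

# Toy dataset (made-up values): two numeric features and a binary target.
rows = [
    {"age": 25, "income": 30_000, "churned": "no"},
    {"age": 32, "income": 42_000, "churned": "no"},
    {"age": 47, "income": 61_000, "churned": "yes"},
    {"age": 51, "income": 58_000, "churned": "no"},
    {"age": 38, "income": 45_000, "churned": "no"},
]


def class_distribution(rows, target):
    """Return the share of rows belonging to each class of the target column."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# A heavily skewed target distribution is an early hint of class imbalance.
distribution = class_distribution(rows, "churned")
correlation = pearson([r["age"] for r in rows], [r["income"] for r in rows])
print(distribution, round(correlation, 3))
```

In this toy data, 80% of the rows fall into one class, which is the kind of skew the distribution check is meant to surface.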

2. Create Trial

  1. Initiate Trial: Select New Trial to initiate a new model creation pipeline.
  2. Choose Dataset: Select the dataset you previously uploaded/imported.
  3. Specify Target Column: Choose the column to predict.

3. Task Configuration

  1. Auto-Detect ML Task: evoML automatically identifies the type of machine learning task (e.g., classification) based on the target column.
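
How evoML detects the task internally is not documented here. As a rough intuition only, the sketch below shows one simple heuristic that maps a target column to a task type. The function `infer_task_type` and its `max_classes` cutoff are hypothetical and are not part of evoML.

```python
def infer_task_type(target_values, max_classes=20):
    """Guess the ML task from a target column (hypothetical heuristic, not evoML's)."""
    unique = set(target_values)
    # Boolean and string targets are treated as category labels.
    if all(isinstance(v, (bool, str)) for v in unique):
        return "classification"
    # Integer-coded targets with few distinct values are treated as class labels.
    if all(isinstance(v, int) for v in unique) and len(unique) <= max_classes:
        return "classification"
    # Anything else (for example, continuous floats) is treated as regression.
    return "regression"


print(infer_task_type(["spam", "ham", "spam"]))
print(infer_task_type([0, 1, 1, 0]))
print(infer_task_type([1.5, 2.7, 3.1]))
```

A heuristic like this can misjudge edge cases, such as an integer count column with few distinct values, so it is worth confirming the detected task before continuing.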

4. Multi-Objective Optimization

  1. Select Objective Function: Choose an objective function appropriate for the task. If needed, you can define a custom objective function to optimise for the metrics that matter most in your use case.
  2. Optimise Hyperparameters: evoML supports various optimizers to fine-tune model performance by adjusting hyperparameters.
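
When more than one objective matters, there is often no single best model, only a set of trade-offs. The sketch below illustrates the general idea of a Pareto front in plain Python. It is not evoML's optimiser, and the model names and scores are invented for the example.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one.

    All objectives are expressed so that higher is better.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of the candidates that no other candidate dominates."""
    front = []
    for name, scores in candidates.items():
        dominated = any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Made-up scores: (F1 score, negated prediction latency in milliseconds).
# Latency is negated so that "higher is better" holds for both objectives.
candidates = {
    "model_a": (0.90, -40.0),
    "model_b": (0.88, -10.0),
    "model_c": (0.85, -25.0),
    "model_d": (0.91, -120.0),
}
front = pareto_front(candidates)
print(front)
```

Here `model_c` drops out because `model_b` is both more accurate and faster. The remaining three models each represent a different trade-off between accuracy and latency.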

5. Train/Test Split

  1. Define Split Ratio: evoML provides different splitting strategies for training and testing (e.g., 80/20 split). Make sure to choose the ratio that works best for your data.
  2. Stratified Splitting: If the target column is imbalanced, use stratified splitting to maintain the same class distribution in both the training and testing sets.
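
To make the idea of stratified splitting concrete, here is a minimal sketch in plain Python. It is an illustration of the technique, not evoML's implementation. The function `stratified_split` and the toy labels are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.2, seed=0):
    """Split row indices into train and test sets while preserving class proportions."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        # Shuffle within each class, then take the same fraction from every class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Imbalanced toy target: 90 negative rows and 10 positive rows.
labels = ["neg"] * 90 + ["pos"] * 10
train_idx, test_idx = stratified_split(labels, test_ratio=0.2)
print(len(train_idx), len(test_idx))
```

With an 80/20 split, the test set receives 18 negative and 2 positive rows, so the 9:1 class ratio is the same in both sets. A purely random split could, by chance, leave the test set with no positive rows at all.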

6. Model Validation Strategy

  1. Holdout Validation: This method splits the data into a training set and a separate testing set. It's straightforward and often used for quick evaluation.
  2. K-Fold Cross Validation: This method splits the data into k parts (folds) and trains k times, each time holding out a different fold for evaluation. Every data point is used for both training and evaluation, which gives a more reliable performance estimate than a single holdout split.

For more details, refer to our guide on validation.
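
As an illustration of how k-fold cross validation partitions the data, the sketch below generates the train and validation indices for each fold in plain Python. The helper `k_fold_indices` is hypothetical and is not an evoML function. In practice the rows are usually shuffled first, which this sketch omits for readability.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print(len(train), validation)
```

Each row appears in exactly one validation fold and in the training set of every other fold, so the final score is an average over k evaluations rather than a single one.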

7. Handling Imbalanced Data

  1. Resampling Techniques: If you have an imbalanced dataset, consider one of the following:
    • Oversampling: Increase the number of minority class samples using techniques like SMOTE.
    • Undersampling: Reduce the number of majority class samples to balance the dataset.

For full details on handling imbalanced data, refer to this guide.
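
The following sketch shows the simplest form of each resampling approach in plain Python. Note that it uses random oversampling (duplicating existing minority rows) rather than SMOTE, which synthesises new rows by interpolating between neighbours. The functions `random_oversample` and `random_undersample` are illustrative and are not evoML functions.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for label in counts:
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for row in rng.sample(pool, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 8 majority rows and 2 minority rows.
rows = [[i] for i in range(10)]
labels = ["majority"] * 8 + ["minority"] * 2
over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels), Counter(under_labels))
```

Resampling is normally applied to the training set only. Resampling before the train/test split can leak duplicated rows into the test set and inflate the measured performance.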

8. Feature Engineering

  1. Preprocess: evoML supports automatic feature preprocessing like imputation, scaling, and encoding to prepare your data for modeling.
  2. Feature Selection: evoML can automatically select the most important features and remove irrelevant ones. This can improve model performance by reducing model complexity and the risk of overfitting.
  3. Feature Generation: Consider using evoML's feature generation tool to create new features that may help uncover hidden patterns in your data.
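
For readers unfamiliar with the preprocessing steps named above, the sketch below shows a minimal version of imputation, scaling, and encoding in plain Python. These helpers (`impute_mean`, `standardise`, `one_hot`) are assumptions made for illustration. evoML may use different strategies for each step.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardise(values):
    """Scale values to zero mean and unit variance (assumes the values are not all equal)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode a categorical column as one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


ages = impute_mean([20, None, 40])         # the missing value becomes the mean of 20 and 40
scaled = standardise(ages)                 # centred on zero
encoded = one_hot(["red", "blue", "red"])  # columns are ordered alphabetically: blue, red
print(ages, scaled, encoded)
```

To avoid leaking information from the test set, statistics such as the mean and standard deviation are normally computed on the training data and then reused to transform the test data.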

9. Model Selection

  1. Select Models: evoML offers multiple model options, ranging from traditional machine learning models (e.g., decision trees) to more advanced models (e.g., XGBoost).
  2. Tune Hyperparameters: You can manually adjust the model's hyperparameters, or evoML will tune them for you based on your defined range of values. Visit the model tuning guide for details.
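
Tuning over a defined range of values can be pictured as a search that scores each candidate and keeps the best one. The sketch below uses an exhaustive grid search in plain Python, with a decision threshold standing in for a hyperparameter. It is a toy illustration under those assumptions: the data are made up, and evoML's optimisers may search the space quite differently.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def predict(scores, threshold):
    """Toy 'model': predict class 1 when the score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def grid_search(scores, y_true, thresholds):
    """Evaluate every candidate value and return (best_threshold, best_accuracy)."""
    results = [(accuracy(y_true, predict(scores, t)), t) for t in thresholds]
    best_accuracy, best_threshold = max(results, key=lambda r: r[0])
    return best_threshold, best_accuracy


# Made-up validation data.
scores = [0.1, 0.3, 0.45, 0.6, 0.8, 0.9]
y_true = [0, 0, 0, 1, 1, 1]
best_threshold, best_accuracy = grid_search(
    scores, y_true, thresholds=[0.2, 0.4, 0.5, 0.7]
)
print(best_threshold, best_accuracy)
```

Candidates should be scored on validation data rather than on the final test set, so that the test set remains an unbiased check of the tuned model.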

10. Train the Model

Once all configurations are set, click Start to initiate model training:

  1. evoML will automatically preprocess your data.
  2. The data will be split into training and test sets according to the defined strategy.
  3. The optimizer will tune hyperparameters and train the models on the training set.

11. Evaluate Model Performance

After training, evoML will evaluate models on the test set using the appropriate classification metrics.

Refer to the performance evaluation guide for more in-depth analysis and interpretation of these metrics.
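
This page does not list the specific metrics evoML reports. As a general reference, the sketch below computes four widely used binary classification metrics in plain Python. The function `binary_metrics` and the labels are made up for this example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up test-set labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced data, accuracy alone can be misleading, because a model that always predicts the majority class still scores highly. Precision, recall, and F1 describe how well the minority class is handled.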

12. Deploy Model

  1. Generate Deployment Pipeline: evoML can generate a deployment pipeline to help you integrate your model into production environments.
  2. Deploy to Your Preferred Environment: evoML supports deployment to different environments (e.g., a local machine or a cloud server).