Skip to main content

NLP Sequence Classification/Regression Model

This guide provides a comprehensive walkthrough for creating NLP models using EvoML. It covers the entire workflow, including data preparation, model configuration, training, evaluation, and deployment.

This section focuses on NLP Sequence Classifiers and NLP Sequence Regressors, both of which are transformer-based models. Users can tune model hyperparameters, upload models to the Model Hub, and use them within EvoML.

It's important to distinguish between:

  • NLP Sequence Classification/Regression: Uses transformer-based models on raw text data.
  • Standard Classification/Regression with text features: Uses embeddings of text data alongside other features in traditional ML models.

Key Considerations:

  • Sequence Models can only accept a single text column; all other columns must be dropped.
  • For NLP Sequence Classification and Regression tasks, only the specified text column is used, and a transformer-based model is applied directly to this text data.
  • For standard classification and regression tasks (non-sequence based):
    • EvoML generates text embeddings for string columns using predefined encoders.
    • These embeddings can be used alongside other features in typical classification and regression models (e.g., Random Forest, Gradient Boosting).
    • This approach allows for the incorporation of text data in traditional machine learning tasks without using specialized NLP sequence models.

1. Prepare Dataset

Upload & Explore Data
  1. Upload: Start by uploading or importing your dataset.
  2. Explore: Selecting the uploaded dataset provides additional details on the dataset and allows users to understand the structure and quality of your dataset. Key steps include:
    • Examine Feature Types: evoML automatically detects the types of each feature (numeric, categorical, etc.). Review detected feature types to ensure proper handling.
    • Analyse Patterns: Investigate trends, seasonality, and patterns. Check tags generated by evoML.
    • Check Feature Distribution: analyse the distribution of features (including the target variable). To identify any skewed distribution or potential problems.
    • Correlation & Association: Compute correlation and association between features.

2. Create Trial

  1. Initiate Trial: Select New Trial to initiate a new model creation pipeline.
  2. Choose Dataset: Select the dataset you previously uploaded/imported.
  3. Specify Target Column: Choose the column to predict.

Notice the warning message in the screenshot below. If EvoML identifies an NLP problem, it is recommended to:

  • Disable feature selection and feature generation under feature engineering.
  • Select only the NLP Sequence Classification model from the models tab.

nlp model warning

3. Task Configuration

  1. Auto-Detect ML Task: evoML automatically identifies the type of machine learning task (e.g., classification) based on the target column.

4. Multi-Objective Optimization

  1. Select Objective Function: Choose an appropriate function for the task. If needed, you can define a custom objective function to fine-tune performance metrics.
  2. Optimise Hyperparameters: evoML supports various optimizers to fine-tune model performance by adjusting hyperparameters.

5. Train/Test Split:

  1. Define Split Ratio: evoML provides different splitting strategies for training and testing (e.g., 80/20 split). Make sure to choose the ratio that works best for your data.

6. Model Validation Strategy

  1. Holdout Validation: This method splits the data into a training set and a separate testing set. It's straightforward and often used for quick evaluation.
  2. K-Fold Cross Validation: More reliable as it splits the data into multiple parts, ensuring each data point is used for training and testing, which reduces variance in performance evaluation.

For more details, refer to our guide on validation.

7. Model Selection

Custom Models
  • Users can upload custom models to the Model Hub.
Supported Models

The platform supports two text-based classification and regression models:

Predefined Transformers:
  • Default model: bert-base-uncased is the default pre-trained transformer for NLP tasks.
  • Fine-Tuning: Users can manually adjust the model's hyperparameters, or evoML will tune them for you based on your defined range of values. Visit the model tuning guide for details.

8. Train the Model

Once all configurations are set, click Start to initiate model training:

  1. EvoML will automatically preprocess your data.
  2. The data will be split into training and test sets as per the defined strategy.
  3. Use the optimizer to tune hyperparameters and train the model on the training set.

9. Evaluate Performance

After training, evaluate the model using relevant metrics classification metrics or regression metrics.

For more details, see the performance evaluation guide.

10. Deploy Model

  1. Generate Deployment Pipeline: evoML can generate a deployment pipeline to help you integrate your model into production environments.
  2. Deploy to Your Preferred Environment: evoML supports easy deployment on different environments (e.g., local machine, cloud server)