Encoders, Scalers, and Imputers
EvoML provides a comprehensive suite of preprocessing techniques for various types of features. This documentation outlines the available encoders, scalers, and imputers for different feature types, along with customization options and best practices.
1. Categorical Features
Imputers
- Constant: Fills missing values with a specified constant value.
- Most Frequent: Replaces missing values with the most frequent value in the column.
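A minimal sketch of these two strategies, using scikit-learn's SimpleImputer as a stand-in for EvoML's own implementation (the colour values are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["red"], ["blue"], [np.nan], ["blue"]], dtype=object)

# Most Frequent: missing entries become the modal category ("blue").
print(SimpleImputer(strategy="most_frequent").fit_transform(X))

# Constant: missing entries become a user-chosen placeholder.
print(SimpleImputer(strategy="constant", fill_value="missing").fit_transform(X))
```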
Encoders
- One-Hot Encoder: Creates binary columns for each category.
- Label Encoder: Assigns a unique integer to each category.
- Hash Encoder: Uses a hashing function to encode high-cardinality categorical variables.
- Helmert Encoder: Compares each level with the mean of subsequent levels.
- Target Encoder: Replaces categorical values with the mean target value for each category.
- CatBoost Encoder: A variant of target encoding that computes category statistics over a random permutation of the rows (optionally with added noise) to reduce target leakage and overfitting.
- Backward Difference Encoder: Compares the mean of the dependent variable for a level to the mean of the dependent variable for the prior level.
- Ordinal Encoder: Assigns an integer to each category based on the order specified.
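To make the behaviour concrete, here is a sketch of two of these encoders using scikit-learn and the category_encoders package; EvoML configures the equivalent transforms internally, and the column name and data below are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce

df = pd.DataFrame({"city": ["rome", "oslo", "rome", "kyiv"]})
y = [1, 0, 1, 0]

# One-Hot Encoder: one binary column per category.
onehot = OneHotEncoder(sparse_output=False, handle_unknown="ignore")  # scikit-learn >= 1.2
print(onehot.fit_transform(df[["city"]]))

# Target Encoder: each category becomes a (smoothed) mean of the target.
print(ce.TargetEncoder(cols=["city"]).fit_transform(df["city"], y))
```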
2. Numerical Features
Imputers
- Constant: Fills missing values with a specified constant.
- Most Frequent: Replaces missing values with the most frequent value.
- Mean: Fills missing values with the mean of the column.
- Median: Fills missing values with the median of the column.
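For example, with scikit-learn's SimpleImputer (an illustrative stand-in, not EvoML's actual implementation):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [100.0]])

# The mean is pulled up by the outlier at 100; the median is not.
print(SimpleImputer(strategy="mean").fit_transform(X))    # fills with ~34.33
print(SimpleImputer(strategy="median").fit_transform(X))  # fills with 2.0
```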
Encoders (Float)
- Log Encoder: Applies a logarithmic transformation.
- Power Encoder: Applies a power transformation.
- Square Encoder: Squares the values.
- Quantile Transform Encoder: Transforms features to follow a uniform or normal distribution.
- Reciprocal Encoder: Applies the reciprocal (1/x) transformation.
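Rough scikit-learn/NumPy equivalents of these transforms are sketched below; EvoML's exact parameterisation may differ:

```python
import numpy as np
from sklearn.preprocessing import (FunctionTransformer, PowerTransformer,
                                   QuantileTransformer)

X = np.array([[0.5], [1.0], [4.0], [9.0]])

transforms = {
    "log": FunctionTransformer(np.log1p),                     # log(1 + x)
    "power": PowerTransformer(method="yeo-johnson"),
    "square": FunctionTransformer(np.square),
    "reciprocal": FunctionTransformer(lambda x: 1.0 / x),
    "quantile": QuantileTransformer(n_quantiles=4, output_distribution="normal"),
}

for name, enc in transforms.items():
    print(name, enc.fit_transform(X).ravel())
```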
Encoders (Integer, Currency, Percentage, Unit Number)
- Log Encoder: Applies a logarithmic transformation.
- Power Encoder: Applies a power transformation.
Scalers
- Min-Max Scaler: Scales features to a fixed range, usually between 0 and 1.
- Standard Scaler: Standardizes features by removing the mean and scaling to unit variance.
- Max Abs Scaler: Scales each feature by its maximum absolute value.
- Robust Scaler: Scales features using statistics that are robust to outliers.
- Gauss Rank Scaler: Maps feature ranks through the inverse Gaussian CDF so the transformed values follow a normal distribution.
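These scalers correspond directly to scikit-learn classes, illustrated below on data with a single outlier (Gauss Rank has no single scikit-learn equivalent; QuantileTransformer with a normal output distribution is a close stand-in):

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

X = np.array([[-2.0], [0.0], [1.0], [50.0]])  # note the outlier at 50

print(MinMaxScaler().fit_transform(X).ravel())    # squeezed into [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
print(MaxAbsScaler().fit_transform(X).ravel())    # divided by max |x| = 50
print(RobustScaler().fit_transform(X).ravel())    # centred on median, scaled by IQR
```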
3. Binary Features
Encoders
- Label Encoder: Assigns 0 and 1 to the two categories.
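For a binary feature this reduces to a simple 0/1 mapping, e.g. with scikit-learn's LabelEncoder (shown for illustration only):

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
print(enc.fit_transform(["yes", "no", "no", "yes"]))  # -> [1 0 0 1]
print(enc.classes_)                                   # -> ['no' 'yes']
```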
4. Text Features
EvoML supports various text encoding techniques for natural language processing tasks:
- TF-IDF (Term Frequency-Inverse Document Frequency): A statistical measure of how important a word is to a document relative to the rest of the collection.
- Sentence Transformers:
- sentence-transformers/all-mpnet-base-v2
- sentence-transformers/all-distilroberta-v1
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/all-MiniLM-L12-v2
These models are optimized for semantic similarity tasks and sentence embeddings.
- Other transformer-based models:
- bert-base-uncased: The original BERT model with uncased text
- distilbert-base-uncased: A lighter, faster version of BERT
- albert-base-v2: A Lite BERT for self-supervised learning of language representations
- xlnet-base-cased: An autoregressive pretraining method for language understanding
- google/electra-base-discriminator: ELECTRA model trained as a discriminator
- microsoft/codebert-base: A BERT-style model pre-trained on source code
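The sketch below contrasts the two approaches, using scikit-learn for TF-IDF and the sentence-transformers package to load one of the models listed above; this is an illustration, not EvoML's internal text pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

docs = ["the cat sat on the mat", "dogs chase cats"]

# TF-IDF: sparse, vocabulary-sized vectors weighted by term rarity.
tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.shape)  # (2, vocabulary size)

# Sentence Transformer: dense semantic embeddings (768-d for this model).
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
print(model.encode(docs).shape)  # (2, 768)
```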
5. Customization Options
EvoML allows users to:
- Include or exclude specific features
- Choose imputation strategies for each feature
- Select encoders or scalers for each feature
This level of customization enables users to tailor the preprocessing pipeline to their specific dataset and machine learning task requirements.
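As a point of comparison, a hand-built scikit-learn pipeline expressing the same kind of per-feature configuration might look like the sketch below (the column names "age" and "colour" are hypothetical; EvoML exposes these choices through its interface instead):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    # Numerical column: median imputation, then standard scaling.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age"]),
    # Categorical column: most-frequent imputation, then one-hot encoding.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), ["colour"]),
], remainder="drop")  # excluded features are simply dropped

# preprocess.fit_transform(df) would then apply each branch to its columns.
```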
6. Best Practices
Categorical Features
- Use One-Hot Encoding for low-cardinality features
- Consider Target Encoding or CatBoost Encoding for high-cardinality features
- Use Ordinal Encoding when there's a clear order in categories
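When the order matters, it can be stated explicitly, as in this scikit-learn sketch (the "size" categories are hypothetical):

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(enc.fit_transform([["medium"], ["small"], ["large"]]))
# -> [[1.], [0.], [2.]]
```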
Numerical Features
- Use Standard Scaler when the distribution is close to normal
- Use Min-Max Scaler when you need values in a specific range
- Consider Robust Scaler when dealing with outliers
Imputation
- Use Mean or Median imputation for numerical features when values are missing at random
- Use Most Frequent imputation for categorical features
- Consider advanced imputation techniques for complex datasets, as sketched below
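"Advanced" here could mean model-based strategies such as scikit-learn's KNNImputer or IterativeImputer; a minimal sketch of one possible interpretation:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, np.nan]])

# KNN: copy the missing value from the nearest complete row.
print(KNNImputer(n_neighbors=1).fit_transform(X))

# Iterative: model each feature as a function of the others before filling.
print(IterativeImputer(random_state=0).fit_transform(X))
```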
Text Features
- Use TF-IDF for simple text classification tasks
- Leverage pre-trained models like BERT or Sentence Transformers for more complex NLP tasks