Encoders, Scalers, and Imputers
EvoML provides a comprehensive suite of preprocessing techniques for various types of features. This documentation outlines the available encoders, scalers, and imputers for different feature types, along with customization options and best practices.
1. Categorical Features
Imputers
- Constant: Fills missing values with a specified constant value.
- Most Frequent: Replaces missing values with the most frequent value in the column.
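A minimal sketch of these two strategies, using scikit-learn's SimpleImputer as a stand-in for EvoML's own implementation (the colour values are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["red"], ["blue"], [np.nan], ["blue"]], dtype=object)

# Most Frequent: missing entries become the modal category ("blue").
print(SimpleImputer(strategy="most_frequent").fit_transform(X))

# Constant: missing entries become a user-chosen placeholder.
print(SimpleImputer(strategy="constant", fill_value="missing").fit_transform(X))
```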
Encoders
- One-Hot Encoder: Creates binary columns for each category.
- Label Encoder: Assigns a unique integer to each category.
- Hash Encoder: Uses a hashing function to encode high-cardinality categorical variables.
- Helmert Encoder: Compares each level with the mean of subsequent levels.
- Target Encoder: Replaces categorical values with the mean target value for each category.
- CatBoost Encoder: A variant of target encoding that computes category statistics over a random permutation of the rows (optionally with added noise) to reduce target leakage and overfitting.
- Backward Difference Encoder: Compares the mean of the dependent variable for a level to the mean of the dependent variable for the prior level.
- Ordinal Encoder: Assigns an integer to each category based on the order specified.
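To make the behaviour concrete, here is a sketch of two of these encoders using scikit-learn and the category_encoders package; EvoML configures the equivalent transforms internally, and the column name and data below are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce

df = pd.DataFrame({"city": ["rome", "oslo", "rome", "kyiv"]})
y = [1, 0, 1, 0]

# One-Hot Encoder: one binary column per category.
onehot = OneHotEncoder(sparse_output=False, handle_unknown="ignore")  # scikit-learn >= 1.2
print(onehot.fit_transform(df[["city"]]))

# Target Encoder: each category becomes a (smoothed) mean of the target.
print(ce.TargetEncoder(cols=["city"]).fit_transform(df["city"], y))
```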
2. Numerical Features
Imputers
- Constant: Fills missing values with a specified constant.
- Most Frequent: Replaces missing values with the most frequent value.
- Mean: Fills missing values with the mean of the column.
- Median: Fills missing values with the median of the column.
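For example, with scikit-learn's SimpleImputer (an illustrative stand-in, not EvoML's actual implementation):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [100.0]])

# The mean is pulled up by the outlier at 100; the median is not.
print(SimpleImputer(strategy="mean").fit_transform(X))    # fills with ~34.33
print(SimpleImputer(strategy="median").fit_transform(X))  # fills with 2.0
```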
Encoders (Float)
- Log Encoder: Applies a logarithmic transformation.
- Power Encoder: Applies a power transformation.
- Square Encoder: Squares the values.
- Quantile Transform Encoder: Transforms features to follow a uniform or normal distribution.
- Reciprocal Encoder: Applies the reciprocal (1/x) transformation.
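Rough scikit-learn/NumPy equivalents of these transforms are sketched below; EvoML's exact parameterisation may differ:

```python
import numpy as np
from sklearn.preprocessing import (FunctionTransformer, PowerTransformer,
                                   QuantileTransformer)

X = np.array([[0.5], [1.0], [4.0], [9.0]])

transforms = {
    "log": FunctionTransformer(np.log1p),                     # log(1 + x)
    "power": PowerTransformer(method="yeo-johnson"),
    "square": FunctionTransformer(np.square),
    "reciprocal": FunctionTransformer(lambda x: 1.0 / x),
    "quantile": QuantileTransformer(n_quantiles=4, output_distribution="normal"),
}

for name, enc in transforms.items():
    print(name, enc.fit_transform(X).ravel())
```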
Encoders (Integer, Currency, Percentage, Unit Number)
- Log Encoder: Applies a logarithmic transformation.
- Power Encoder: Applies a power transformation.
Scalers
- Min-Max Scaler: Scales features to a fixed range, usually between 0 and 1.
- Standard Scaler: Standardizes features by removing the mean and scaling to unit variance.
- Max Abs Scaler: Scales each feature by its maximum absolute value.
- Robust Scaler: Scales features using statistics that are robust to outliers.
- Gauss Rank Scaler: Maps feature ranks through the inverse Gaussian CDF so the transformed values follow a normal distribution.
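These scalers correspond directly to scikit-learn classes, illustrated below on data with a single outlier (Gauss Rank has no single scikit-learn equivalent; QuantileTransformer with a normal output distribution is a close stand-in):

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

X = np.array([[-2.0], [0.0], [1.0], [50.0]])  # note the outlier at 50

print(MinMaxScaler().fit_transform(X).ravel())    # squeezed into [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
print(MaxAbsScaler().fit_transform(X).ravel())    # divided by max |x| = 50
print(RobustScaler().fit_transform(X).ravel())    # centred on median, scaled by IQR
```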
3. Binary Features
Encoders
- Label Encoder: Assigns 0 and 1 to the two categories.
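For a binary feature this reduces to a simple 0/1 mapping, e.g. with scikit-learn's LabelEncoder (shown for illustration only):

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
print(enc.fit_transform(["yes", "no", "no", "yes"]))  # -> [1 0 0 1]
print(enc.classes_)                                   # -> ['no' 'yes']
```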
4. Text Features
EvoML supports various text encoding techniques for natural language processing tasks:
- TF-IDF (Term Frequency-Inverse Document Frequency): A statistical measure of how important a word is to a document relative to the rest of the collection.
- Sentence Transformers:
- sentence-transformers/all-mpnet-base-v2
- sentence-transformers/all-distilroberta-v1
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/all-MiniLM-L12-v2
These models are optimized for semantic similarity tasks and sentence embeddings.
- Other transformer-based models:
- bert-base-uncased: The original BERT model with uncased text
- distilbert-base-uncased: A lighter, faster version of BERT
- albert-base-v2: A Lite BERT for self-supervised learning of language representations
- xlnet-base-cased: An autoregressive pretraining method for language understanding
- google/electra-base-discriminator: ELECTRA model trained as a discriminator
- microsoft/codebert-base: A BERT-style model pre-trained on source code
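The sketch below contrasts the two approaches, using scikit-learn for TF-IDF and the sentence-transformers package to load one of the models listed above; this is an illustration, not EvoML's internal text pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

docs = ["the cat sat on the mat", "dogs chase cats"]

# TF-IDF: sparse, vocabulary-sized vectors weighted by term rarity.
tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.shape)  # (2, vocabulary size)

# Sentence Transformer: dense semantic embeddings (768-d for this model).
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
print(model.encode(docs).shape)  # (2, 768)
```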
5. Customization Options
EvoML allows users to:
- Include or exclude specific features
- Choose imputation strategies for each feature
- Select encoders or scalers for each feature
This level of customization enables users to tailor the preprocessing pipeline to their specific dataset and machine learning task requirements.
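As a point of comparison, a hand-built scikit-learn pipeline expressing the same kind of per-feature configuration might look like the sketch below (the column names "age" and "colour" are hypothetical; EvoML exposes these choices through its interface instead):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    # Numerical column: median imputation, then standard scaling.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age"]),
    # Categorical column: most-frequent imputation, then one-hot encoding.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), ["colour"]),
], remainder="drop")  # excluded features are simply dropped

# preprocess.fit_transform(df) would then apply each branch to its columns.
```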
6. Best Practices
Categorical Features
- Use One-Hot Encoding for low-cardinality features
- Consider Target Encoding or CatBoost Encoding for high-cardinality features
- Use Ordinal Encoding when there's a clear order in categories
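When the order matters, it can be stated explicitly, as in this scikit-learn sketch (the "size" categories are hypothetical):

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(enc.fit_transform([["medium"], ["small"], ["large"]]))
# -> [[1.], [0.], [2.]]
```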
Numerical Features
- Use Standard Scaler when the distribution is close to normal
- Use Min-Max Scaler when you need values in a specific range
- Consider Robust Scaler when dealing with outliers
Imputation
- Use Mean or Median imputation for numerical features when values are missing at random
- Use Most Frequent imputation for categorical features
- Consider advanced imputation techniques for complex datasets, as sketched below
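"Advanced" here could mean model-based strategies such as scikit-learn's KNNImputer or IterativeImputer; a minimal sketch of one possible interpretation:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, np.nan]])

# KNN: copy the missing value from the nearest complete row.
print(KNNImputer(n_neighbors=1).fit_transform(X))

# Iterative: model each feature as a function of the others before filling.
print(IterativeImputer(random_state=0).fit_transform(X))
```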
Text Features
- Use TF-IDF for simple text classification tasks
- Leverage pre-trained models like BERT or Sentence Transformers for more complex NLP tasks