Kaggle
Data-cleaning
- Outliers (inspect and drop or cap extreme values)
- Remove id and other non-informative columns
- Log-transform the target to remove skew
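A minimal sketch of the target log-transform, assuming a right-skewed target like a house price column (names here are illustrative, not from any specific competition):

```python
import numpy as np
import pandas as pd

# Hypothetical skewed target (e.g. sale prices); log1p pulls the long
# right tail in and makes the distribution more symmetric.
df = pd.DataFrame({"SalePrice": [120000, 150000, 180000, 250000, 900000]})
df["SalePrice_log"] = np.log1p(df["SalePrice"])  # log(1 + x), safe at 0

# Predictions made on the log scale are mapped back with expm1.
recovered = np.expm1(df["SalePrice_log"])
print(np.allclose(recovered, df["SalePrice"]))  # exact round trip
```

Train on `SalePrice_log` and apply `np.expm1` to model predictions before submitting.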
Feature engineering
- Percent of missing data by feature
- Correlation heatmap
- Transforming numerical variables that are really categorical (year, class etc.)
- Label encoding
- Summarize multiple related features into a new feature
- Box Cox transformation of (highly) skewed features
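The feature-engineering steps above can be sketched on a toy frame (all column names are made up; `lmbda=0.15` for Box Cox is a common fixed choice in Kaggle kernels, not a fitted optimum):

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for a train set.
df = pd.DataFrame({
    "MSSubClass": [20, 60, 20, 70],          # numeric code that is really a class
    "YrSold":     [2006, 2007, 2008, 2006],  # year: also really categorical
    "GarageArea": [400.0, np.nan, 600.0, 550.0],
    "Qual":       ["TA", "Gd", "TA", "Ex"],
    "1stFlrSF":   [856, 1262, 920, 961],
    "2ndFlrSF":   [854, 0, 866, 756],
})

# Percent of missing data by feature, sorted descending.
missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
# (Correlation heatmap: sns.heatmap(df.corr()) with seaborn, not run here.)

# Numerical variables that are really categorical -> strings.
for col in ["MSSubClass", "YrSold"]:
    df[col] = df[col].astype(str)

# Label-encode an ordinal-ish categorical.
df["Qual"] = LabelEncoder().fit_transform(df["Qual"])

# Summarize related features into a new feature.
df["TotalSF"] = df["1stFlrSF"] + df["2ndFlrSF"]

# Box Cox transform highly skewed numeric features.
numeric = df.select_dtypes(include=[np.number]).columns
skewness = df[numeric].apply(lambda s: skew(s.dropna()))
for col in skewness[skewness.abs() > 0.75].index:
    df[col] = boxcox1p(df[col], 0.15)
```

`boxcox1p(x, lmbda)` works on zero-valued entries (it transforms `1 + x`), which plain Box Cox would reject.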
pd.get_dummies is basically one-hot encoding
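Tiny demonstration (column name is illustrative):

```python
import pandas as pd

# One categorical column -> one indicator column per category.
df = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
dummies = pd.get_dummies(df, columns=["Street"])
print(dummies.columns.tolist())  # -> ['Street_Grvl', 'Street_Pave']
```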
Regressions:
- Lasso is sensitive to outliers, so it might be useful to add a RobustScaler before it in a pipeline
- Simplest stacking is averaging
- Meta-model stacking: add a meta-model on top of the base models and use out-of-fold predictions of the base models to train the meta-model
- Ensembling: combine the stacked model with other models' predictions (e.g. a weighted average)
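The RobustScaler + Lasso pipeline note above, sketched on synthetic data (the outlier rows and `alpha` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:5] *= 30.0  # a few extreme rows

# Linear target depending on two of the five features.
y = X[:, 0] * 2.0 - X[:, 2] + rng.normal(scale=0.1, size=200)

# RobustScaler centers/scales by median and IQR, so the extreme rows
# do not dominate the scaling the way they would with StandardScaler.
model = make_pipeline(RobustScaler(), Lasso(alpha=0.01))
model.fit(X, y)
print(round(model.score(X, y), 3))
```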
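Both stacking flavours from the notes above in one sketch, on synthetic data: the "simplest" version averages base-model predictions, while scikit-learn's `StackingRegressor` trains the `final_estimator` (meta-model) on out-of-fold base predictions via internal cross-validation (the base/meta model choices here are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [("ridge", Ridge()),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0))]

# Simplest stacking: average the base models' predictions.
preds = [m.fit(X_tr, y_tr).predict(X_te) for _, m in base]
avg_pred = np.mean(preds, axis=0)

# Meta-model stacking: the meta-model is fit on out-of-fold (cv=5)
# predictions of the base models, so it never sees leaked fits.
stack = StackingRegressor(estimators=base,
                          final_estimator=LinearRegression(), cv=5)
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 3))
```

Training the meta-model on out-of-fold predictions is what prevents it from simply learning to trust an overfit base model.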