Kaggle

Data-cleaning

  1. Handle outliers
  2. Remove id and other non-informative columns
  3. Log-transform the target to remove skew
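The log-transform step above can be sketched as follows; the `SalePrice` column and its values are hypothetical stand-ins for a right-skewed Kaggle target:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a right-skewed target such as house prices.
df = pd.DataFrame({"SalePrice": [100_000, 120_000, 150_000, 200_000, 900_000]})

# log1p handles zeros safely; np.expm1 inverts it after predicting.
df["SalePrice_log"] = np.log1p(df["SalePrice"])

print(df["SalePrice"].skew())      # skew before
print(df["SalePrice_log"].skew())  # skew after (closer to 0)
```

Remember to apply `np.expm1` to the model's predictions before submitting, since the model is trained on the log scale.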

Feature engineering

  1. Percent of missing data by feature
  2. Correlation heatmap
  3. Transforming numerical variables that are really categorical (year, class etc.)
  4. Label encoding
  5. Summarize multiple related features into a new feature
  6. Box Cox transformation of (highly) skewed features
  7. pd.get_dummies is basically one-hot encoding
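Several of the steps above can be sketched in one pass; the frame, column names, and the Box Cox lambda of 0.15 are illustrative assumptions, not values from any particular competition:

```python
import pandas as pd
from scipy.special import boxcox1p
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for a Kaggle training set.
df = pd.DataFrame({
    "YrSold": [2008, 2007, 2008, 2009],     # numeric but really categorical
    "LotArea": [8450, 9600, 11250, 95000],  # skewed numeric feature
    "Street": ["Pave", "Grvl", "Pave", None],
})

# 1. Percent of missing data by feature
missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct)

# 3. Treat a year column as categorical, not numeric
df["YrSold"] = df["YrSold"].astype(str)

# 4. Label encoding after filling missing values
df["Street"] = df["Street"].fillna("None")
df["Street_le"] = LabelEncoder().fit_transform(df["Street"])

# 6. Box Cox transform of a skewed feature (lambda is a tuning choice)
df["LotArea_bc"] = boxcox1p(df["LotArea"], 0.15)

# 7. pd.get_dummies = one-hot encoding
dummies = pd.get_dummies(df[["YrSold"]])
print(dummies.columns.tolist())
```

The correlation heatmap (step 2) is usually just `df.corr()` passed to `seaborn.heatmap`.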

Regressions:

  • Lasso is sensitive to outliers, so it can help to put a RobustScaler before it in a pipeline
  • The simplest form of stacking is averaging the base models' predictions
  • Meta-model stacking: add a meta-model on top of the averaged base models and train it on the base models' out-of-fold predictions
  • Ensemble stacking
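A minimal sketch of the first three bullets, using synthetic data and arbitrary alpha values for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# RobustScaler centers on the median and scales by the IQR,
# so outliers distort Lasso's inputs less than with StandardScaler.
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.1, max_iter=10_000))
ridge = make_pipeline(RobustScaler(), Ridge(alpha=1.0))

# Simplest stacking: average the base models' predictions.
lasso.fit(X, y)
ridge.fit(X, y)
avg_pred = (lasso.predict(X) + ridge.predict(X)) / 2

# Meta-model stacking: out-of-fold predictions become the meta-model's
# features, so it never trains on predictions a base model made
# for samples that base model was fitted on.
oof = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in (lasso, ridge)])
meta = Ridge(alpha=1.0).fit(oof, y)
print(meta.coef_)
```

scikit-learn also ships `StackingRegressor`, which handles the out-of-fold bookkeeping internally.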