Technology Talks | 19 May 2022

Applied Artificial Intelligence (AI) in sales and marketing – Part 2 of 3

In the previous article, we discussed the TVS Motor way of data science for identifying business problems that can be solved using AI/ML. We also explained how a lead classification model can drive follow-up prioritisation and personalisation of marketing communications. In this article, we will talk about exploratory data analysis, feature engineering, and the model development and selection process.

Exploratory data analysis and feature engineering

We used open-source automated tools such as pandas-profiling, DataPrep and Sweetviz to study data quality and the distribution of all features. At this stage, we were looking for gaps in the available data to arrive at the right imputation strategy, and for bi-variate views to understand the linkage between the underlying features and the event rate. Since each tool has its own advantages and limitations, we used all of them depending on the requirement. The relative capabilities of these tools are shown in the table below.

| Considerations | Pandas Profiling | Sweetviz | DataPrep |
| --- | --- | --- | --- |
| Number of data frames that can be visualised together (useful to see data distribution across training and validation sets) | One | Two | Multiple |
| Backend graphing library | Matplotlib (static) | Matplotlib (static) | Bokeh (interactive) |
| Auto EDA report generation | Slow | Fast | Fast |
| Univariate analysis | Yes | Yes | Yes |
| Bi-variate analysis | No | Yes | Yes |
| Open source | Yes | Yes | Yes |
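The kind of checks these tools automate, missing-value shares and bi-variate event rates, can be sketched with plain pandas; the lead-style columns below are hypothetical, not our actual schema:

```python
import numpy as np
import pandas as pd

# Toy lead dataset; column names are illustrative only
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "enquiry_source": rng.choice(["walk-in", "web", "tele-call"], size=200),
    "distance_km": rng.exponential(scale=10, size=200),
    "converted": rng.integers(0, 2, size=200),
})
# Knock out 10% of the distance values to mimic missing data
df.loc[rng.choice(200, size=20, replace=False), "distance_km"] = np.nan

# Data-quality view: share of missing values per column
missing_share = df.isna().mean()

# Bi-variate view: event (conversion) rate per category level
event_rate = df.groupby("enquiry_source")["converted"].mean()

print(missing_share.round(2))
print(event_rate.round(2))
```

The automated reports produce the same summaries (plus correlations and distributions) across every column in one pass.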

Some of the tasks we performed to get the final dataframe for model training are:

  1. Imputation strategy for missing data
    1. Handled missing values in categorical features
    2. Imputed missing values in numerical features with the median
    3. Filled in remaining gaps using predictive models
  2. Unbalanced sample – As the target variable was highly imbalanced, we employed up-sampling, down-sampling and class-weight techniques.
  3. Outlier treatment – From our model training experiments, we observed that tree-based models generalise the target much better than other model types. Since tree-based algorithms are robust to outliers by default, there was no need for us to perform explicit outlier treatment.
  4. Variable transformation – Applied cyclic transformations to all the datetime features.
  5. Derived features – Derived features such as the lead’s proximity to the dealership location, purchase propensity at hyperlocal geographies, and seasonal indicators.
  6. Feature encoding – Applied categorical encoding techniques such as Target Encoding, CatBoost Encoding and Weight of Evidence, chosen based on each feature’s cardinality.
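Two of the steps above, median imputation and the cyclic transformation of datetime-derived features, can be sketched as follows; the column names are illustrative, not our actual features:

```python
import numpy as np
import pandas as pd

# Illustrative columns, not the real feature names
df = pd.DataFrame({
    "follow_up_delay_days": [1.0, np.nan, 3.0, 5.0, np.nan, 2.0],
    "enquiry_month": [1, 4, 7, 10, 12, 6],
})

# Median imputation for a numerical feature
median = df["follow_up_delay_days"].median()
df["follow_up_delay_days"] = df["follow_up_delay_days"].fillna(median)

# Cyclic transformation of a datetime-derived feature so that
# December (12) and January (1) sit close together in feature space
df["month_sin"] = np.sin(2 * np.pi * df["enquiry_month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["enquiry_month"] / 12)
```

Encoding month as (sin, cos) pairs lets tree and linear models alike see the year as a circle rather than a line with an artificial break at December.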

We used the ELI5 framework to understand the relative importance of variables, including derived features, to ensure better model explainability. Some of the important features of the training dataset are listed below in order of importance.
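The mechanics of permutation-based importance, which ELI5 also provides, can be illustrated on synthetic data; here we use scikit-learn's `permutation_importance`, a close analogue of ELI5's `PermutationImportance`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
# Only feature 0 drives the target; features 1 and 2 are pure noise
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

# Features ranked by mean importance drop, highest first
ranking = result.importances_mean.argsort()[::-1]
```

Shuffling a feature and measuring the score drop is model-agnostic, which is what makes it useful for explaining derived features alongside raw ones.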

Model Development and Evaluation

We built a linear model as a baseline and expanded to kernel-based models and ensembles. We observed that gradient boosting algorithms such as XGBoost, CatBoost and LightGBM generalised the target better than the others, with XGBoost performing best on the F2 score.

We used the F2 score as an evaluation metric to assess the performance of the models because it balances recall and precision with more weightage to recall. Additionally, we used the KS statistic to understand the rank-ordering capability of the model.
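Both metrics are available off the shelf; a minimal sketch on toy predictions, using scikit-learn's `fbeta_score` with beta=2 and SciPy's two-sample KS test on the score distributions of the two classes:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import fbeta_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 0])

# F2 weights recall twice as heavily as precision (beta=2)
f2 = fbeta_score(y_true, y_pred, beta=2)

# KS statistic: maximum separation between the model-score
# distributions of converters and non-converters
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1])
ks = ks_2samp(scores[y_true == 1], scores[y_true == 0]).statistic
```

A higher KS means the scores rank converters above non-converters more cleanly, which is what the follow-up prioritisation relies on.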

We divided our data into train, test and out-of-time (OOT) datasets. We compared KS scores across these datasets to assess model stability (as shown in the chart below) and finalise a production-ready model.
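A time-based split of this kind can be sketched as follows; the cutoff date and column names are illustrative:

```python
import pandas as pd

# Ten months of leads; column names and cutoff are hypothetical
df = pd.DataFrame({
    "enquiry_date": pd.date_range("2021-01-01", periods=10, freq="MS"),
    "lead_id": range(10),
})
cutoff = pd.Timestamp("2021-09-01")

# Rows before the cutoff feed the train/test split; the most
# recent months are held out as the out-of-time (OOT) set
in_time = df[df["enquiry_date"] < cutoff]
oot = df[df["enquiry_date"] >= cutoff]
```

Holding out the most recent period, rather than a random sample, is what lets the OOT comparison detect drift before the model reaches production.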

Additionally, to ease and automate model training, hyper-parameter tuning and model selection, we used the Fast Library for Automated Machine Learning (FLAML) with sklearn pipelines to train candidates in parallel and automatically find the best-performing model.

Stay tuned to learn about our deployment strategy and the different pipelines we set up to monitor the performance of the above solution.
