Optimizing Customer Advertisements in the Travel Industry with AI
Hassan Soliman
11/22/2022 · 4 min read
Introduction
Artificial Intelligence (AI) has revolutionized numerous industries, and the travel sector is no exception. With the increasing reliance on AI-powered tools, travel companies are enhancing processes ranging from travel planning to marketing optimization. This blog post delves into how AI can be leveraged to optimize customer advertisements for a travel provider, specifically focusing on predicting customers with a high tendency to book again—referred to as "repeaters."
Understanding which customers are likely to rebook enables companies to tailor marketing campaigns more effectively, optimizing advertising spend by targeting the most promising customers. This approach not only enhances customer engagement but also maximizes return on investment for marketing efforts.
Data Preparation
Data Overview
The analysis is based on three primary datasets:
Customers Dataset: Contains 166,932 entries with 14 features detailing customer information. The key identifier is kunden_id.
Bookings Dataset: Comprises 194,155 entries with 26 features related to booking information from 2014 to 2021. The key identifier is buchungsnr.
Regions Dataset: Includes 8,208 entries with 19 features providing regional information, identified by plz (postal code).
Label Creation
To predict repeaters, we first needed to define our target variable. Customers who made more than one booking were labeled as repeaters. We merged the Customers and Bookings datasets on kunden_id and then joined the resulting dataset with the Regions dataset on plz. The merged records were grouped by kunden_id, and customers with more than one booking were assigned a label of 1 (repeater), while all others were labeled 0 (non-repeater).
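The labeling steps above can be sketched with pandas. This is a minimal illustration on toy data, not the original pipeline: only the key columns kunden_id, buchungsnr and plz come from the post, and the assumption that plz lives in the Customers dataset, the join types, and the region_name column are invented for the example.

```python
import pandas as pd

# Toy stand-ins for the three datasets. Only the key columns are taken from
# the post; all values and the region_name column are made up.
customers = pd.DataFrame({"kunden_id": [1, 2, 3], "plz": ["10115", "80331", "10115"]})
bookings = pd.DataFrame({"buchungsnr": [100, 101, 102, 103], "kunden_id": [1, 1, 2, 3]})
regions = pd.DataFrame({"plz": ["10115", "80331"], "region_name": ["Berlin", "Munich"]})

# Join customers to their bookings, then attach regional data by postal code.
# Assumption: plz is a column of the Customers dataset.
merged = (
    customers.merge(bookings, on="kunden_id", how="inner")
             .merge(regions, on="plz", how="left")
)

# Count distinct bookings per customer; more than one booking means repeater.
booking_counts = merged.groupby("kunden_id")["buchungsnr"].nunique()
labels = (booking_counts > 1).astype(int).rename("repeater")

print(labels)
```

In this toy data only customer 1 has two bookings, so only that customer receives the label 1.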
The distribution revealed that approximately 27% of the customers were repeaters. The dataset is therefore imbalanced, a crucial consideration for model training: a model that always predicts "non-repeater" would already be right about 73% of the time.
Data Splitting
Before proceeding with exploratory data analysis (EDA), the data was split into training and testing sets so that no decision made during EDA or preprocessing could be influenced by the test data:
Training Set: 118,719 records (80% of the data)
Testing Set: 29,680 records (20% of the data)
A stratified split ensured the class distribution remained consistent across both sets.
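A stratified 80/20 split like the one described can be sketched with scikit-learn's train_test_split. The data below is synthetic, and the column name repeater and the random seeds are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 27% positives, mirroring the class ratio in the post.
rng = np.random.default_rng(0)
n = 1000
data = pd.DataFrame({
    "feature": rng.normal(size=n),
    "repeater": (rng.random(n) < 0.27).astype(int),
})

# stratify keeps the share of repeaters (almost) identical in both parts.
train, test = train_test_split(
    data, test_size=0.2, stratify=data["repeater"], random_state=42
)

print(len(train), len(test))
print(train["repeater"].mean(), test["repeater"].mean())
```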
Exploratory Data Analysis (EDA)
Data Cleaning
Initial data cleaning involved removing irrelevant identifiers like kunden_id and buchungsnr, which do not contribute to the predictive modeling.
Handling Missing Values
We inspected features with missing values to decide whether to impute or drop them:
Categorical Variables: Features like anrede (salutation) had significant missing values and were dropped due to the lack of a strong relationship with the target variable.
Continuous Variables: Features like entfernung (distance) showed no significant difference between repeaters and non-repeaters and were also dropped.
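The kind of inspection described above might look like the following sketch. The values are invented, and the specific checks (share of missing values per column, target rate by missingness, class-wise mean) are assumptions about how such a comparison could be done, not the checks used in the original analysis.

```python
import numpy as np
import pandas as pd

# Toy training frame; anrede and entfernung are named in the post,
# the values are made up.
df = pd.DataFrame({
    "anrede": ["Herr", None, None, "Frau", None, None],
    "entfernung": [120.0, 80.0, np.nan, 95.0, 110.0, 100.0],
    "repeater": [1, 0, 0, 1, 0, 1],
})

# Share of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Does missingness of a categorical feature relate to the target?
print(df.groupby(df["anrede"].isna())["repeater"].mean())

# Does a continuous feature differ between repeaters and non-repeaters?
print(df.groupby("repeater")["entfernung"].mean())

# Drop the features judged uninformative.
df_clean = df.drop(columns=["anrede", "entfernung"])
```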
Data Preprocessing
Categorical Features
Nominal categorical variables were one-hot encoded:
destination: This feature has only five unique values, so one-hot encoding adds few columns, and the differences in repeater rates between destinations justified keeping it.
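One-hot encoding of a nominal feature such as destination can be sketched with pandas.get_dummies. The destination values A, B and C are placeholders, since the post does not list the actual destinations.

```python
import pandas as pd

# Placeholder destinations; the real category names are not given in the post.
df = pd.DataFrame({
    "destination": ["A", "B", "A", "C"],
    "repeater": [1, 0, 0, 1],
})

# Repeater rate per destination, the kind of difference that motivates keeping the feature.
print(df.groupby("destination")["repeater"].mean())

# One indicator column per destination value.
encoded = pd.get_dummies(df, columns=["destination"], prefix="destination", dtype=int)
print(encoded)
```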
Datetime Features
Date-related features (buchungsdatum, anreisetag, abreisetag) were decomposed into day, month, and year components to extract temporal patterns.
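The decomposition into day, month and year might be done as follows. The three column names come from the post; the dates and the suffix naming scheme (_day, _month, _year) are assumptions for the example.

```python
import pandas as pd

# Made-up dates for the three date columns named in the post.
df = pd.DataFrame({
    "buchungsdatum": ["2019-03-15", "2020-11-02"],
    "anreisetag": ["2019-07-01", "2021-01-10"],
    "abreisetag": ["2019-07-14", "2021-01-17"],
})

date_columns = ["buchungsdatum", "anreisetag", "abreisetag"]
for col in date_columns:
    dates = pd.to_datetime(df[col])
    df[f"{col}_day"] = dates.dt.day
    df[f"{col}_month"] = dates.dt.month
    df[f"{col}_year"] = dates.dt.year

# The raw date columns are no longer needed once the parts are extracted.
df = df.drop(columns=date_columns)
print(df)
```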
Model Estimation
Feature Selection
To enhance model performance and reduce complexity, feature importance was assessed using an Extra Trees Classifier. The club_programm feature emerged as the most significant predictor, indicating that customers enrolled in the company's club program are more likely to be repeaters.
Highly correlated features were identified, and the redundant ones were removed to reduce multicollinearity and keep the model's coefficients stable.
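Both steps, ranking features with an Extra Trees Classifier and dropping one feature of each highly correlated pair, can be sketched as below. The data is synthetic and built so that club_programm drives the target; the noise features, the 0.9 correlation cutoff and the number of trees are assumptions, since the post does not state the values it used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n = 500
club = rng.integers(0, 2, size=n)

X = pd.DataFrame({"club_programm": club, "noise_a": rng.normal(size=n)})
# A near-duplicate of noise_a, to give the correlation filter something to remove.
X["noise_a_copy"] = X["noise_a"] * 2.0 + 0.001 * rng.normal(size=n)

# Synthetic target that agrees with club membership about 90% of the time.
y = np.where(rng.random(n) < 0.9, club, 1 - club)

# Impurity-based feature importances from an Extra Trees ensemble.
model = ExtraTreesClassifier(n_estimators=200, random_state=0)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Correlation filter: look only at the upper triangle so each pair is seen once,
# then drop the later column of any pair above the (assumed) 0.9 cutoff.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=redundant)
print(redundant)
```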
Class Balancing
Given the class imbalance, oversampling of the minority class (repeaters) was performed to equalize the number of repeater and non-repeater instances in the training set.
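The post does not say which oversampling method was used. The sketch below assumes plain random oversampling with replacement via sklearn.utils.resample, applied to the training set only; the toy data and seeds are made up.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training set with 3 repeaters and 7 non-repeaters.
train = pd.DataFrame({
    "feature": range(10),
    "repeater": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

majority = train[train["repeater"] == 0]
minority = train[train["repeater"] == 1]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Recombine and shuffle so the classes are not stored in blocks.
balanced = (
    pd.concat([majority, minority_upsampled])
      .sample(frac=1, random_state=42)
      .reset_index(drop=True)
)
print(balanced["repeater"].value_counts())
```

The test set is left untouched so that evaluation still reflects the real class distribution.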
Model Training and Evaluation
A Logistic Regression model was chosen for its simplicity and interpretability. Two experiments were conducted:
Using All Features: Achieved an accuracy of 85% but with a lower precision for the repeater class.
Using Selected Features: Focused on the most important features, resulting in improved precision (73%) and accuracy (86%) for the repeater class.
Surprisingly, training the model on the unbalanced dataset yielded even better results, with a precision of 97% and an accuracy of 92% for the repeater class. This suggests that, on this dataset, Logistic Regression did not need oversampling to separate the classes; it should not be read as a general property of the model.
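Training a Logistic Regression model and reporting accuracy and repeater-class precision might look like the sketch below. It runs on synthetic data from make_classification with a 73/27 class ratio, so its scores have no relation to the figures quoted above; the feature counts, seeds and max_iter value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: about 27% of samples belong to class 1.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.73, 0.27], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Train on the unbalanced training set, without oversampling.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Precision for class 1: of all predicted repeaters, the share that truly are.
precision = precision_score(y_test, y_pred, pos_label=1)
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```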
Hyperparameter Tuning
Grid Search with cross-validation was employed to fine-tune hyperparameters, focusing on regularization techniques and strength. The optimal configuration used L1 regularization with a strength of 0.01, although performance gains were minimal compared to default settings.
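A grid search over regularization type and strength could be set up as in this sketch. The grid values, the liblinear solver, the 5-fold setting and the precision scoring are assumptions; the post names only the winning configuration. Note that in scikit-learn the parameter C is the inverse of the regularization strength, and the post does not say whether its value of 0.01 refers to C or to the strength itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with a 73/27 class ratio.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.73, 0.27], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical grid: regularization type and inverse strength C.
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]}

# liblinear supports both L1 and L2 penalties.
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    scoring="precision",
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```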
The decision threshold was analyzed using a precision-recall curve, confirming that the default threshold of 0.5 was appropriate for this model.
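A threshold analysis of this kind can be sketched with precision_recall_curve, which returns precision and recall for every candidate threshold. The data and model below are synthetic stand-ins, and looking up the curve point nearest to 0.5 is just one assumed way to check the default threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.73, 0.27], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probability of the positive (repeater) class.
scores = model.predict_proba(X_test)[:, 1]

# precision and recall have one more entry than thresholds.
precision, recall, thresholds = precision_recall_curve(y_test, scores)

# Inspect the trade-off at the candidate threshold closest to the default 0.5.
idx = int(np.argmin(np.abs(thresholds - 0.5)))
print(f"threshold={thresholds[idx]:.3f} "
      f"precision={precision[idx]:.3f} recall={recall[idx]:.3f}")
```

Plotting precision and recall against thresholds shows whether moving away from 0.5 would buy meaningfully more precision for an acceptable loss in recall.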
Conclusion and Discussion
The analysis successfully demonstrated how AI can predict repeat customers in the travel industry, enabling more efficient allocation of marketing resources. The key takeaways include:
Model Performance: The Logistic Regression model achieved high precision and accuracy, especially when trained on unbalanced data.
Feature Importance: Enrollment in the club program is a strong indicator of repeat behavior.
Data Imbalance: Logistic Regression performed well without class balancing, contrary to initial expectations.
Business Implications
A precision of 97% for the repeater class means that, on the test data, about 97 out of every 100 customers the model flagged as repeaters did in fact rebook. The company can therefore target flagged customers with confidence, minimizing wasted marketing spend. By focusing on these high-value customers, the company can enhance customer loyalty and increase revenue.
Future Work
To further improve the model and address related business problems, the following steps are proposed:
Feature Engineering: Incorporate additional temporal features like weekdays or holidays, which may influence booking behavior.
Advanced Modeling: Experiment with ensemble methods or AutoML tools to identify models that could outperform Logistic Regression.
Clustering: Segment customers based on their probability scores to tailor marketing strategies more effectively.
Statistical Validation: Conduct hypothesis testing to ensure that improvements over baseline models are statistically significant.
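As a starting point for the segmentation idea in the list above, predicted probabilities can be grouped into bands. The sketch below uses fixed score bands with pandas.cut rather than a clustering algorithm; the cutoffs 0.3 and 0.7, the segment names and the scores are all hypothetical.

```python
import pandas as pd

# Hypothetical predicted rebooking probabilities for five customers.
scores = pd.Series([0.05, 0.45, 0.92, 0.30, 0.71], name="repeat_probability")

# Fixed bands: [0, 0.3] low, (0.3, 0.7] medium, (0.7, 1.0] high.
segments = pd.cut(
    scores,
    bins=[0.0, 0.3, 0.7, 1.0],
    labels=["low", "medium", "high"],
    include_lowest=True,
)

print(pd.DataFrame({"score": scores, "segment": segments}))
```

Each segment could then receive its own campaign, for example a loyalty offer for the high band and a reactivation offer for the medium band.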