Introduction
Accurately predicting health insurance costs is a critical challenge for insurers. Traditional methods often fall short in accounting for individual nuances, leading to suboptimal pricing. This project harnesses the power of machine learning to develop a robust model for personalized health insurance cost prediction, aiming to enhance pricing precision, boost market competitiveness, improve customer satisfaction through fair premiums, and refine risk assessment processes. Our approach leverages data cleaning, exploratory data analysis (EDA), hypothesis testing, and advanced regression models, all built upon a comprehensive dataset provided by Scaler.
Exploring the Data: Unveiling Health Insights
Our journey began with an in-depth Exploratory Data Analysis (EDA) on a dataset comprising 11 key attributes, including age, various health conditions (diabetes, blood pressure problems, transplants, chronic diseases, allergies, family cancer history), physical measurements (height, weight), number of major surgeries, and the target variable: PremiumPrice.
The dataset was thoroughly examined for data types, missing values, and outliers. Key observations included:
*   The PremiumPrice distribution was largely normal but exhibited some left skewness, with minimal outliers.
*   A significant portion of individuals fell into ‘overweight’ and ‘obese’ BMI categories.
*   Correlation analysis (Pearson and Spearman) helped identify relationships between features and the target variable, indicating the most influential factors for premium pricing.
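To give a sense of what this step might look like in code, here is a minimal sketch using pandas; the file name and column names are assumptions for illustration and may differ from the actual Scaler dataset.

```python
import pandas as pd

# Assumed file and column names -- adjust to the actual dataset schema.
df = pd.read_csv("insurance.csv")

# Pearson captures linear relationships; Spearman captures monotonic (rank-based) ones.
pearson = df.corr(method="pearson", numeric_only=True)["PremiumPrice"]
spearman = df.corr(method="spearman", numeric_only=True)["PremiumPrice"]

# Rank features by the absolute strength of their linear correlation with the premium.
summary = pd.DataFrame({"pearson": pearson, "spearman": spearman}).drop("PremiumPrice")
print(summary.sort_values("pearson", key=abs, ascending=False))
```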
Uncovering Significant Factors: Hypothesis Testing
To understand the statistical significance of various factors, we conducted extensive hypothesis testing:
- T-Tests: Compared mean premium prices and other attributes across binary health conditions.
  - Diabetes: Significant differences in age and number of major surgeries, but not in physical measurements.
  - Blood Pressure Problems: Strongly associated with higher age and more major surgeries.
  - Any Transplants: Primarily affected premium price, with age becoming non-significant.
  - Chronic Diseases: Led to significantly higher premium prices, while age remained non-significant.
  - Known Allergies: Showed a significant difference only in the number of major surgeries.
  - Family History of Cancer: Associated with significantly higher premiums and more major surgeries.
- ANOVA: Assessed the impact of categorical variables on premium prices.
  - Number of Major Surgeries: Age significantly impacted prices, more so than BMI or physical measurements.
  - Age Group: Identified as the most substantial predictor of insurance costs.
  - Health Score: Age remained a factor, but its influence lessened once health scores were considered.
- Chi-Squared Contingency Tests: Explored associations between health conditions and other categorical attributes.
  - Diabetes: Strongly associated with blood pressure problems, chronic diseases, allergies, surgeries, age group, and health score.
  - Blood Pressure Problems: Linked to diabetes, number of surgeries, age group, and health score.
  - Any Transplants: Showed no significant association with any other health conditions, suggesting a random distribution.
  - Chronic Diseases: Significantly associated with diabetes, age group, and health score.
  - Known Allergies: Linked to diabetes, family cancer history, number of surgeries, and health score.
  - Family History of Cancer: Associated with allergies, number of surgeries, and health score.
 
 
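The tests above map directly onto standard functions in scipy.stats. The sketch below is illustrative only; the column names (Diabetes, BloodPressureProblems, NumberOfMajorSurgeries, PremiumPrice) are assumed and may not match the dataset's exact schema.

```python
import pandas as pd
from scipy import stats

# Assumed file and column names, used purely for illustration.
df = pd.read_csv("insurance.csv")

# T-test: do mean premiums differ between people with and without diabetes?
with_diabetes = df.loc[df["Diabetes"] == 1, "PremiumPrice"]
without_diabetes = df.loc[df["Diabetes"] == 0, "PremiumPrice"]
t_stat, p_ttest = stats.ttest_ind(with_diabetes, without_diabetes, equal_var=False)

# One-way ANOVA: do premiums differ across number-of-major-surgeries groups?
groups = [g["PremiumPrice"] for _, g in df.groupby("NumberOfMajorSurgeries")]
f_stat, p_anova = stats.f_oneway(*groups)

# Chi-squared contingency test: are diabetes and blood pressure problems associated?
contingency = pd.crosstab(df["Diabetes"], df["BloodPressureProblems"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency)

print(f"t-test p={p_ttest:.4f}, ANOVA p={p_anova:.4f}, chi-squared p={p_chi2:.4f}")
```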
Building Predictive Models: The Machine Learning Pipeline
Our modeling phase focused on developing regression models to predict continuous insurance premium values. The data underwent a rigorous preprocessing pipeline:
- Missing Value Imputation: Utilized a Random Forest Iterative Imputer, a sophisticated method that predicts missing values based on other features, improving accuracy over simple statistical imputations.
- Feature Encoding: Ordinal encoding was applied to maintain the natural order in hierarchical categories like overall risk and BMI categories.
- Feature Scaling: Standard Scaling transformed numerical features to have a mean of 0 and a standard deviation of 1, optimizing algorithm performance.
 
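In scikit-learn terms, the preprocessing above could be sketched roughly as follows. The column names, category orderings, and hyperparameters are assumptions made for illustration, not the project's exact configuration.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Illustrative feature groupings -- the real feature lists are assumptions.
ordinal_cols = ["BMICategory", "OverallRisk"]
numeric_cols = ["Age", "Height", "Weight", "NumberOfMajorSurgeries"]

preprocess = ColumnTransformer([
    # Ordinal encoding preserves the natural order of hierarchical categories.
    ("ordinal", OrdinalEncoder(categories=[
        ["Underweight", "Normal", "Overweight", "Obese"],  # assumed BMI ordering
        ["Low", "Medium", "High"],                         # assumed risk ordering
    ]), ordinal_cols),
    # Random-forest-based iterative imputation, then standardization to mean 0 / std 1.
    ("numeric", Pipeline([
        ("impute", IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=100, random_state=42))),
        ("scale", StandardScaler()),
    ]), numeric_cols),
])
```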
We implemented five distinct regression algorithms, chosen for their diverse approaches to uncover underlying patterns:
1.  Linear Regression (baseline)
2.  Decision Tree Regressor
3.  Random Forest Regressor
4.  Gradient Boosting Regressor
5.  XGBoost Regressor
A 5-fold cross-validation strategy with shuffling ensured robust performance estimation and mitigated overfitting. Model performance was evaluated using RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and R² Score, along with confidence intervals and residual analysis for statistical validation.
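A rough sketch of that evaluation loop is shown below. The placeholder data exists only so the snippet runs standalone; in the project, X and y would come from the preprocessing pipeline and the PremiumPrice column, and the hyperparameters shown are defaults rather than tuned values.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

# Placeholder data so the sketch is self-contained (not the real dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
y = 20000 + 1500 * X[:, 0] + rng.normal(scale=2000, size=200)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
}

# 5-fold cross-validation with shuffling, scored on RMSE, MAE, and R².
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scoring = {"rmse": "neg_root_mean_squared_error",
           "mae": "neg_mean_absolute_error",
           "r2": "r2"}

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    print(f"{name}: RMSE={-scores['test_rmse'].mean():.2f}, "
          f"MAE={-scores['test_mae'].mean():.2f}, "
          f"R2={scores['test_r2'].mean():.3f}")
```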
Unveiling the Champion: Model Performance
After training and rigorous evaluation, the Random Forest Regressor emerged as the top-performing model.
| Model | RMSE | MAE | R² | 
|---|---|---|---|
| Linear Regression | 3542.13 | 2419.16 | 0.678 | 
| Decision Tree | 3889.32 | 1147.06 | 0.612 | 
| Random Forest | 2858.16 | 1249.14 | 0.791 | 
| Gradient Boosting | 3109.07 | 1724.89 | 0.752 | 
| XGBoost | 3039.57 | 1509.54 | 0.763 | 
- Random Forest achieved the lowest RMSE (2858.16) and the highest R² score (0.791), indicating superior accuracy and the ability to explain nearly 80% of the variance in premium prices.
- Ensemble methods (Random Forest, XGBoost, Gradient Boosting) consistently outperformed individual models, highlighting their power in predictive tasks.
- All models demonstrated tight confidence intervals, signifying reliable prediction capabilities.
 
Bringing the Model to Life: Deployment Strategy
To make this predictive power accessible, a deployment strategy was devised using a modern application architecture:
- Frontend: An interactive web interface built with Streamlit.
- Backend: Python with scikit-learn for model inference.
- Deployment: Containerized using Docker for scalability and consistent environments.
- Version Control: Managed with Git, following a structured repository organization.
 
The application features intuitive input forms with real-time validation, automatic calculation of derived features like BMI and health scores, and a robust prediction pipeline. Users receive premium estimates along with confidence intervals and risk analysis. Both local and Docker deployment options are provided, with considerations for scalability, monitoring, security, and maintenance for production environments.
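To illustrate, a minimal Streamlit frontend along these lines might look like the sketch below; the model path, feature names, and input fields are assumptions for illustration rather than the project's actual interface.

```python
import pickle

import pandas as pd
import streamlit as st

# Assumed artifact path -- the real repository layout may differ.
@st.cache_resource
def load_model(path="models/random_forest.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)

st.title("Health Insurance Premium Estimator")

age = st.number_input("Age", min_value=18, max_value=100, value=35)
height = st.number_input("Height (cm)", min_value=120, max_value=220, value=170)
weight = st.number_input("Weight (kg)", min_value=30, max_value=200, value=70)
diabetes = st.checkbox("Diabetes")

# Derived feature computed on the fly, as described above.
bmi = weight / (height / 100) ** 2
st.write(f"Computed BMI: {bmi:.1f}")

if st.button("Estimate premium"):
    features = pd.DataFrame([{
        "Age": age, "Height": height, "Weight": weight,
        "Diabetes": int(diabetes), "BMI": bmi,
    }])
    prediction = load_model().predict(features)[0]
    st.success(f"Estimated annual premium: {prediction:,.0f}")
```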
Project Repository
The complete codebase, detailed analysis notebooks, trained model artifacts, and deployment scripts are openly available on GitHub: Insurance Cost Prediction Repository
This project demonstrates a comprehensive, data-driven approach to predicting health insurance costs, offering a valuable tool for insurers to enhance efficiency and fairness in their pricing strategies.