How I Ranked Top 0.5% on Kaggle: Predicting Human Personality with AI
4,329 teams. One goal: predict introvert vs extrovert from 7 behavioural features. Here is the exact playbook that took me to rank 21 — external dataset merging, GradientBoosting, and 97.27% accuracy, explained from scratch.
Imagine the Olympics — but for data scientists. Thousands of competitors from around the world, all racing to build the most accurate AI model for the same problem, judged on a hidden test set. That is Kaggle. This competition (Playground Series S5E7) asked one question: given 7 behavioural features about a person, can you predict whether they are an introvert or an extrovert? I finished 21st out of 4,329 teams — top 0.5% globally, with 97.27% accuracy.
The Competition
Kaggle Playground Series S5E7 — predict introvert vs extrovert from 7 behavioural survey features. Metric: accuracy on a hidden test set. 18,524 training rows · 6,175 test rows · 4,329 competing teams.
The 7 Features: What Tells You Someone Is Introverted? 🕵️
The dataset had just 7 behavioural features — all survey responses about social habits. No demographics, no ages, no names. Just pure behaviour. And yet, these 7 features turned out to be remarkably powerful for predicting personality.
The 7 Behavioural Features
- Time_spent_Alone → hours alone per day (0–11)
- Stage_fear → Yes / No
- Social_event_attendance → events per month (0–10)
- Going_outside → times per week (0–7)
- Drained_after_socializing → Yes / No
- Friends_circle_size → number of close friends (0–15)
- Post_frequency → social media posts per week (0–10)
- Target: Personality → Introvert or Extrovert
The Secret Weapon: External Dataset Merging 🔀
The competition provided 18,524 training rows. But two additional external personality datasets with the same features were available on Kaggle. My key insight: merge these external datasets with the training data, using all 7 feature columns as the join key. If a row in the training set matched a row in the external data, I got a free cross-referenced label — extra signal that most other teams ignored.
Load & Merge External Datasets
Loaded personality_datasert.csv and personality_dataset.csv (2,439 + 2,439 rows). Deduplicated on all 7 feature columns and merged with the competition training data via a left join — any matching rows got an additional 'match_p' label column.
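In pandas, that merge step can be sketched like this — a minimal version assuming the column names listed above; the `merge_external` helper is illustrative, not the original notebook:

```python
import pandas as pd

# The 7 behavioural feature columns used as the join key.
FEATURES = [
    "Time_spent_Alone", "Stage_fear", "Social_event_attendance",
    "Going_outside", "Drained_after_socializing",
    "Friends_circle_size", "Post_frequency",
]

def merge_external(train: pd.DataFrame, external: pd.DataFrame) -> pd.DataFrame:
    """Left-join external labels onto the training data, keyed on all 7 features.

    Matching rows gain a 'match_p' column with the external label;
    non-matching rows get NaN. A left join keeps every training row.
    """
    # Deduplicate external rows on the feature columns so the join
    # cannot multiply training rows.
    ext = external.drop_duplicates(subset=FEATURES)
    ext = ext.rename(columns={"Personality": "match_p"})
    return train.merge(ext[FEATURES + ["match_p"]], on=FEATURES, how="left")
```

In the actual pipeline the external frame would come from concatenating the two CSVs, e.g. `pd.concat([pd.read_csv("personality_datasert.csv"), pd.read_csv("personality_dataset.csv")])`, before passing it to `merge_external`.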
Handle Missing Values
Numeric columns (Time_spent_Alone, Social_event_attendance etc.) imputed with median. Categorical columns (Stage_fear, Drained_after_socializing) imputed with most-frequent value. Test set uses the same fitted imputers — no data leakage.
OneHotEncoding for Categoricals
Stage_fear and Drained_after_socializing encoded with OneHotEncoder(drop='first') — this creates binary columns (Stage_fear_Yes, Drained_after_socializing_Yes) that gradient boosting can use directly.
GradientBoostingClassifier — 97.27% CV Accuracy
Trained sklearn's GradientBoostingClassifier on the enriched feature set. Cross-validation accuracy: 97.27%. The external dataset merge was the decisive factor — it gave the model richer pattern coverage on borderline cases.
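The training and evaluation step can be sketched as follows. The write-up does not state the hyperparameters or fold count, so defaults with a fixed seed and 5-fold CV are shown as assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def cv_accuracy(X, y, folds: int = 5) -> float:
    """Mean cross-validated accuracy of a GradientBoostingClassifier.

    Default hyperparameters with a fixed random_state; the article
    reports 97.27% on the enriched competition features.
    """
    model = GradientBoostingClassifier(random_state=42)
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    return scores.mean()
```

Called as `cv_accuracy(X_train, y_train)` on the imputed, encoded, merge-enriched feature matrix.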
The Winning Insight
Most teams used only the competition training data. By merging external personality datasets matched on all 7 feature columns, I effectively expanded the training signal on edge cases. The model saw more varied examples of the same behavioural patterns — and that is what pushed accuracy from ~95% to 97.27%.
- Final rank: 21st
- Competing teams: 4,329
- CV accuracy: 97.27%
- Global percentile: top 0.5%
The lesson: in competitions with small, clean datasets, data enrichment beats model complexity. I did not reach the top 0.5% with a fancy algorithm or 200 Optuna trials — I got there by finding and merging external data that other teams overlooked. The same principle applies in industry: the best model on bad or incomplete data will always lose to a simple model on rich, complete data.