How I Ranked Top 0.5% on Kaggle: Predicting Human Personality with AI
4,329 teams. One goal: predict introvert vs extrovert from 7 behavioural features. Here is the exact playbook that took me to rank 21 — external dataset merging, GradientBoosting, and 97.27% accuracy, explained from scratch.
Imagine the Olympics — but for data scientists. Thousands of competitors from around the world, all racing to build the most accurate AI model for the same problem, judged on a hidden test set. That is Kaggle. This competition (Playground Series S5E7) asked one question: given 7 behavioural features about a person, can you predict whether they are an introvert or an extrovert? I finished 21st out of 4,329 teams — top 0.5% globally, with 97.27% accuracy.
The Competition
Kaggle Playground Series S5E7 — predict introvert vs extrovert from 7 behavioural survey features. Metric: accuracy on a hidden test set. 18,524 training rows · 6,175 test rows · 4,329 competing teams.
The 7 Features: What Tells You Someone Is Introverted? 🕵️
The dataset had just 7 behavioural features — all survey responses about social habits. No demographics, no ages, no names. Just pure behaviour. And yet, these 7 features turned out to be remarkably powerful for predicting personality.
The 7 Behavioural Features
- Time_spent_Alone → hours alone per day (0–11)
- Stage_fear → Yes / No
- Social_event_attendance → events per month (0–10)
- Going_outside → times per week (0–7)
- Drained_after_socializing → Yes / No
- Friends_circle_size → number of close friends (0–15)
- Post_frequency → social media posts per week (0–10)
- Target: Personality → Introvert or Extrovert
The Secret Weapon: External Dataset Merging 🔀
The competition provided 18,524 training rows. But two additional external personality datasets with the same features were available on Kaggle. My key insight: merge these external datasets with the training data, using all 7 feature columns as the join key. If a row in the training set matched a row in the external data, I got a free cross-referenced label — extra signal that most other teams ignored.
Load & Merge External Datasets
Loaded personality_datasert.csv and personality_dataset.csv (2,439 + 2,439 rows). Deduplicated on all 7 feature columns and merged with the competition training data via a left join — any matching rows got an additional 'match_p' label column.
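In pandas, that merge step can be sketched like this — a minimal version assuming the column names listed above; the `merge_external` helper is illustrative, not the original notebook:

```python
import pandas as pd

# The 7 behavioural feature columns used as the join key.
FEATURES = [
    "Time_spent_Alone", "Stage_fear", "Social_event_attendance",
    "Going_outside", "Drained_after_socializing",
    "Friends_circle_size", "Post_frequency",
]

def merge_external(train: pd.DataFrame, external: pd.DataFrame) -> pd.DataFrame:
    """Left-join external labels onto the training data, keyed on all 7 features.

    Matching rows gain a 'match_p' column with the external label;
    non-matching rows get NaN. A left join keeps every training row.
    """
    # Deduplicate external rows on the feature columns so the join
    # cannot multiply training rows.
    ext = external.drop_duplicates(subset=FEATURES)
    ext = ext.rename(columns={"Personality": "match_p"})
    return train.merge(ext[FEATURES + ["match_p"]], on=FEATURES, how="left")
```

In the actual pipeline the external frame would come from concatenating the two CSVs, e.g. `pd.concat([pd.read_csv("personality_datasert.csv"), pd.read_csv("personality_dataset.csv")])`, before passing it to `merge_external`.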
Handle Missing Values
Numeric columns (Time_spent_Alone, Social_event_attendance etc.) imputed with median. Categorical columns (Stage_fear, Drained_after_socializing) imputed with most-frequent value. Test set uses the same fitted imputers — no data leakage.
OneHotEncoding for Categoricals
Stage_fear and Drained_after_socializing encoded with OneHotEncoder(drop='first') — this creates binary columns (Stage_fear_Yes, Drained_after_socializing_Yes) that gradient boosting can use directly.
GradientBoostingClassifier — 97.27% CV Accuracy
Trained sklearn's GradientBoostingClassifier on the enriched feature set. Cross-validation accuracy: 97.27%. The external dataset merge was the decisive factor — it gave the model richer pattern coverage on borderline cases.
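The training and evaluation step can be sketched as follows. The write-up does not state the hyperparameters or fold count, so defaults with a fixed seed and 5-fold CV are shown as assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def cv_accuracy(X, y, folds: int = 5) -> float:
    """Mean cross-validated accuracy of a GradientBoostingClassifier.

    Default hyperparameters with a fixed random_state; the article
    reports 97.27% on the enriched competition features.
    """
    model = GradientBoostingClassifier(random_state=42)
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    return scores.mean()
```

Called as `cv_accuracy(X_train, y_train)` on the imputed, encoded, merge-enriched feature matrix.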
The Winning Insight
Most teams used only the competition training data. By merging external personality datasets matched on all 7 feature columns, I effectively expanded the training signal on edge cases. The model saw more varied examples of the same behavioural patterns — and that is what pushed accuracy from ~95% to 97.27%.
- Final rank: 21st
- Competing teams: 4,329
- CV accuracy: 97.27%
- Global percentile: top 0.5%
The lesson: in competitions with small, clean datasets, data enrichment beats model complexity. I did not reach the top 0.5% with a fancy algorithm or 200 Optuna trials — I got there by finding and merging external data that other teams overlooked. The same principle applies in industry: the best model on bad or incomplete data will always lose to a simple model on rich, complete data.