Spam Prediction - 2003 Data Mining Cup
http://www.data-mining-cup.com/2003/Wettbewerb/Aufgabe/1190662447/
Problem:
- Minimize the spam email that gets through a filter: 8000 training records, 11000 test records
- Contest constraint: no more than 1% of real mail may be thrown away
Methodology:
- 1. Import data straight from contest
- 2. Optimize parameters using only training records
- 3. Build 400 random trees on 8000-sample training data, ap=0.05
- 4. Allow any email with 50% or higher confidence of being non-spam
- 5. Complete analysis took less than 1 hour
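
A minimal sketch of the decision rule in step 4. It assumes the confidence is the fraction of trees voting non-spam, which the original does not state. The names nonspam_confidence and allow_email are hypothetical, and this is not the tool actually used in the contest run.

```python
from typing import Sequence


def nonspam_confidence(tree_votes: Sequence[bool]) -> float:
    """Fraction of trees voting non-spam (True = non-spam).

    Assumption: confidence is a plain vote fraction across the ensemble.
    """
    if not tree_votes:
        raise ValueError("need at least one tree vote")
    return sum(tree_votes) / len(tree_votes)


def allow_email(tree_votes: Sequence[bool], threshold: float = 0.5) -> bool:
    """Let the email through when non-spam confidence reaches the threshold."""
    return nonspam_confidence(tree_votes) >= threshold


if __name__ == "__main__":
    # Toy example with 400 votes, matching the ensemble size in step 3.
    votes = [True] * 260 + [False] * 140
    print(nonspam_confidence(votes), allow_email(votes))
```

Raising the threshold would block more spam but also discard more real mail, so it trades directly against the 1% constraint.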
Results:
- Spam let through by our filter: 53 (1st place had 31, 2nd place had 68)
- Non-spam thrown out (%): 0.66, i.e. <1% of real mail is thrown away
- We would have ranked 2nd out of 514 participants and 86 entries
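
The two result figures above can be computed from per-email confidences and true labels roughly as follows. This is a sketch under assumptions: evaluate_filter and its argument names are hypothetical, and the toy data is made up for illustration, not taken from the contest.

```python
from typing import Sequence, Tuple


def evaluate_filter(
    p_nonspam: Sequence[float],
    is_spam: Sequence[bool],
    threshold: float = 0.5,
) -> Tuple[int, float]:
    """Return (spam let through, percent of real mail thrown away)."""
    spam_through = 0
    real_total = 0
    real_discarded = 0
    for p, spam in zip(p_nonspam, is_spam):
        allowed = p >= threshold  # same rule as step 4 of the methodology
        if spam:
            if allowed:
                spam_through += 1
        else:
            real_total += 1
            if not allowed:
                real_discarded += 1
    pct_discarded = 100.0 * real_discarded / real_total if real_total else 0.0
    return spam_through, pct_discarded


if __name__ == "__main__":
    # Illustrative values only: two real emails, two spam emails.
    confidences = [0.9, 0.4, 0.6, 0.2]
    labels = [False, False, True, True]
    spam_through, pct = evaluate_filter(confidences, labels)
    print(spam_through, pct, "constraint met:", pct <= 1.0)
```

A submission satisfies the contest constraint when the second value is at most 1.0; among such submissions, a lower first value ranks higher.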

Data Mining Cup 2003 Results