Spam Prediction - 2003 Data Mining Cup

http://www.data-mining-cup.com/2003/Wettbewerb/Aufgabe/1190662447/

Problem:

  • Build a filter that minimizes the spam let through: 8000 training records, 11000 test records
  • The contest requires that no more than 1% of real (non-spam) mail be thrown away

Methodology:

  1. Import the data directly from the contest
  2. Optimize parameters using only the training records
  3. Build 400 random trees on the 8000-sample training data, ap=0.05
  4. Let an email through when the confidence that it is non-spam is 50% or higher
  5. The complete analysis took less than 1 hour
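
The decision rule in step 4 can be sketched as below. This is a minimal Python illustration, not the tool used for the analysis. It assumes each tree casts one vote per email and that the confidence is simply the fraction of trees voting non-spam; the names non_spam_confidence and allow_email are hypothetical.

```python
from typing import Sequence


def non_spam_confidence(tree_votes: Sequence[bool]) -> float:
    """Fraction of trees that voted non-spam (True means non-spam).

    Assumption: confidence is a plain vote share across the ensemble.
    """
    if not tree_votes:
        raise ValueError("need at least one tree vote")
    return sum(tree_votes) / len(tree_votes)


def allow_email(tree_votes: Sequence[bool], threshold: float = 0.5) -> bool:
    """Let the email through when non-spam confidence reaches the threshold."""
    return non_spam_confidence(tree_votes) >= threshold


# Toy votes from a 400-tree ensemble (made-up numbers, not contest data).
votes = [True] * 220 + [False] * 180
print(non_spam_confidence(votes))  # 0.55
print(allow_email(votes))          # True
```

Raising the threshold above 0.5 would block more spam but also discard more real mail, so the threshold is the natural knob for staying inside the 1% constraint.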

Results:

  • Spam let through by our filter: 53 (1st place had 31, 2nd place had 68)
  • Non-spam filtered out: 0.66%, i.e. <1% of real mail is thrown away
  • We would have ranked 2nd out of 514 participants and 86 entries
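
The two figures reported above can be computed from the true labels and the filter decisions roughly as follows. This is a sketch under stated assumptions: the function names filter_metrics and meets_constraint are hypothetical, and the sample data is a toy example, not the contest test set.

```python
from typing import Sequence, Tuple


def filter_metrics(is_spam: Sequence[bool],
                   allowed: Sequence[bool]) -> Tuple[int, float]:
    """Return (spam emails let through, percent of real mail thrown away)."""
    if len(is_spam) != len(allowed):
        raise ValueError("labels and decisions must have the same length")
    # Spam that the filter allowed through.
    spam_through = sum(1 for s, a in zip(is_spam, allowed) if s and a)
    # Real (non-spam) mail in total, and the part the filter discarded.
    real_total = sum(1 for s in is_spam if not s)
    real_discarded = sum(1 for s, a in zip(is_spam, allowed) if not s and not a)
    pct_discarded = 100.0 * real_discarded / real_total if real_total else 0.0
    return spam_through, pct_discarded


def meets_constraint(pct_discarded: float, limit_pct: float = 1.0) -> bool:
    """Contest rule: no more than 1% of real mail may be thrown away."""
    return pct_discarded <= limit_pct


# Toy data: 2 spam and 4 real emails (made up for illustration).
is_spam = [True, True, False, False, False, False]
allowed = [False, True, True, True, True, False]
spam_through, pct = filter_metrics(is_spam, allowed)
print(spam_through, pct)      # 1 25.0
print(meets_constraint(pct))  # False
```

With the reported result, meets_constraint(0.66) returns True, which is why the entry would have been valid under the contest rule.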

Data Mining Cup 2003

Data Mining Cup 2003 Results