You will be working on a real dataset of spam emails. The dataset records the frequency of different words used in each email, and each row
represents one email. The last attribute, “is_spam”, indicates whether the email is spam or not.
1. Import the data from the [login to view URL] file into RapidMiner.
2. Measure the accuracy of a Decision Tree classifier without any data pre-processing. Also try changing the parameters of the classifier, and see which
parameters give the best results.
3. Measure accuracy after normalizing the data (using range and z-score normalizations).
4. Measure accuracy after discretizing the numeric attributes in 2, 3, and 4 bins (each).
5. Measure accuracy after reducing dimensions (use “Weight by ...” and “Select by weights” operators). Try different weighting schemes.
6. Measure accuracy using all of the pre-processing steps excluding dimensionality reduction (tasks 3 and 4).
7. Measure accuracy using all of the pre-processing steps including dimensionality reduction (tasks 3 to 5).
8. Try to find a combination of pre-processing steps which gives the best results.
9. Measure accuracy using a Neural Network classifier with suitable data pre-processing steps.
10. Measure accuracy using any 3 classifiers not used in the previous tasks, with suitable data pre-processing steps.
11. To measure accuracy in all the steps listed above, use 10-fold cross-validation (X-Validation). Use your student ID as the random seed
(wherever possible).
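The pre-processing in tasks 3 and 4 and the seeded evaluation in task 11 can be sketched outside RapidMiner to make clear what each step does to a numeric attribute. The following is only an illustrative Python sketch, not the RapidMiner implementation. It assumes equal-width bins for discretization and the population standard deviation for the z-score; the function names, the sample values, and the seed 12345 are all hypothetical.

```python
import random
import statistics


def range_normalize(values):
    """Rescale values linearly to [0, 1] (range / min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def zscore_normalize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population, not sample, std
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def equal_width_bins(values, n_bins):
    """Map each value to a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value falls exactly on the upper edge, so clamp it
    # into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def kfold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices with a fixed seed and deal them into k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


if __name__ == "__main__":
    # Made-up word-frequency column, not taken from the actual dataset.
    freq = [0.0, 0.5, 1.0, 2.0, 4.0]
    print("range  :", range_normalize(freq))
    print("z-score:", zscore_normalize(freq))
    for n in (2, 3, 4):
        print(f"{n} bins :", equal_width_bins(freq, n))
    # Hypothetical student ID used as the seed so the split is repeatable.
    folds = kfold_indices(n_rows=20, k=10, seed=12345)
    print("fold 0 :", folds[0])
```

Fixing the seed means every pre-processing variant is scored on the same ten folds, so differences in accuracy come from the pre-processing and not from a different random split.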
Deliverables:
1. RapidMiner process files (.rmp files) for each task. Each file should be named “[login to view URL]”. Where a task has multiple subtasks (such as tasks
3, 4, 5, 8, and 10), the files should be named “[login to view URL]”, “[login to view URL]”, “[login to view URL]”, ...
2. Write a report about the dataset and its characteristics. Discuss in detail which pre-processing steps were useful and which classifier
produced the best results. The report should also include the results of all the tasks in the form of a table. You can generate the plots for
comparing results using Microsoft Excel.
3. PowerPoint presentation of your project.