HW3 PVA

1. Look at the original PVA data (cup98lrn.jmp or cup98short.txt) and the
   associated "data dictionary" in cup98DIC.txt.

   (a) Give three different ways that missing data are coded.

   (b) How long do you estimate it would take you to determine, for all the
       features, whether they are real ("continuous" in JMP-speak) or
       categorical ("nominal" in JMP-speak)? (Try it for the first dozen;
       can you see what each is?) How well does JMP do in guessing? Look,
       e.g., at "TCODE", "Cluster", "ZIP", "Cluster2", "StateGov", "LocalGov".

   (c) Try running a linear regression. What happens?

2. Load the file cup98short.jmp or cup98.jmp into JMP.

   I coded some of the features as categorical ("nominal"). (I have not
   coded everything as categorical when it should be!) I have selected many
   (but not all) features as X's and targetB as Y.

   Note: targetB is left as a real number for now, even though technically
   it should be a Boolean ("nominal"). Linear regression is much faster than
   logistic regression.

   a) Run a stepwise regression.

      Select "run model" to get things set up. This will take a minute or
      two to do its preprocessing. Then click "go". Note the resulting R^2
      and R^2-adj, and glance at the features selected.

      Then change to backward selection, and again click "go". Note the
      resulting R^2 and R^2-adj, and glance at the features selected.

      You have removed a bunch of features. Was that a good thing to do?
      (We'll talk about this in class.)

      Select "make model" and then "run model" (on the window that pops up)
      to do the final regression on the features you have selected.

   b) Look at the features selected and their significances.

      (prob > |t|) gives a measure of how likely the coefficient is to have
      occurred by chance, *not* corrected for the fact that you looked at a
      lot of features.

      Do the features that came in make sense? Do their signs?

      Look at the features "LASTDATE, LASTGIFT, FIRSTDATE, NEXTDATE" - or
      rather the subset of those features in the model. Which ones are in
      the model?
      Which one is most statistically significant? Does the sign make
      sense? Which one is second most significant? Can you tell a reasonable
      story about its sign? Could you have told a reasonable story if the
      sign were the other way around?

   c) Make predictions.

      Just like you did in statistics class: under the little red arrow,
      pick "save columns / prediction formula". This gives a real-valued
      prediction. To make a send/don't send decision, you need to pick a
      threshold and call everything above the threshold "1" (send) and
      everything below it "0" (don't send). Pick 0.1 as the threshold.

      Make a new column called "Prediction" with the formula (cols/formula)
      ":Pred Formula TARGET_B>0.1".

      Look at the results by selecting "Tables/summary", and then putting
      the actual values (Target_B) and the predicted 0/1 values
      ("Prediction") into the "group" and clicking "OK".

      This will give a table that looks something like

             prediction   target_B   n_rows
         1   .            0               0
         2   .            1               0
         3   0            0           35839
         4   0            1            1710
         5   1            0            2113
         6   1            1             337

      This would more traditionally be shown as a "confusion matrix":

                               Actual
                            yes        no
         Predicted   yes    337     2,113
                     no   1,710    35,839

   d) Calculate the lift at your threshold.

      How many "yes" predictions did you make? If you send a solicitation
      to all of those "yes"es, how many responses do you get?
      (e.g., 337 responses for 2,113 + 337 = 2,450 solicitations in the
      example above)

      If you were to pick an equal number of targets at random, how many
      responses would you expect to get?
      (e.g., (337 + 2,113) * (337 + 1,710) / 40,000 = 125.4)

      The lift is then the ratio of these numbers
      (e.g., 337 / 125.4 = 2.69, or a 169% lift).

   e) So far everything you have done has been on the training data. Now
      let's look at the results on the test set.

      Here is a trick for switching which rows are excluded.
      - select "Rows/Row Selection/Select Excluded"
      - select "Rows/Exclude Include"
      - select "Rows/Row Selection/Invert Selection"
      - select "Rows/Exclude Include"

      You have now switched from training set to testing set.
      You already have the predictions you need, so proceed as above: look
      at the results by selecting "Tables/summary", and then putting the
      actual values (Target_B) and the predicted 0/1 values ("Prediction")
      into the "group" and clicking "OK".

      How good is the lift on the test set? How does it compare with the
      lift on the training set? What does this tell you?

   f) Analyze the same data set using trees.

      Select "Analyze/modeling/partition". Each time you click "split" it
      will add another split to the tree. Do this about a dozen times.

      What features are being added to the model? Are they the same ones as
      were found for linear regression?

      You could (but I will not ask you to do so) click on the little red
      square on the left, just as for regression, and save the predictions
      (or prediction formula) and compute the lift.
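
If you want to check the thresholding and lift arithmetic from parts (c) and (d) outside JMP, here is a minimal Python sketch. It is not part of the assignment and is not how JMP does it. The function names `confusion_counts` and `lift_at_threshold` are made up for illustration, and the four counts passed in at the bottom are the example numbers from the summary table in part (c), not results from your own model.

```python
def confusion_counts(scores, actuals, threshold=0.1):
    """Threshold real-valued predictions and count the four confusion-matrix cells.

    scores:  real-valued predictions (e.g. the saved prediction formula column)
    actuals: 0/1 values of the target
    Returns (tp, fp, fn, tn) where "positive" means predicted "send".
    """
    tp = fp = fn = tn = 0
    for score, actual in zip(scores, actuals):
        predicted = 1 if score > threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def lift_at_threshold(tp, fp, fn, tn):
    """Return (n_sent, expected_random, lift) for one confusion matrix.

    n_sent:          number of "yes" predictions (solicitations sent)
    expected_random: responses expected if n_sent targets were picked at random
    lift:            actual responses among the "yes" predictions / expected_random
    """
    n_total = tp + fp + fn + tn
    n_sent = tp + fp
    base_rate = (tp + fn) / n_total      # overall response rate
    expected_random = n_sent * base_rate
    lift = tp / expected_random
    return n_sent, expected_random, lift


# Example counts copied from the summary table in part (c).
n_sent, expected_random, lift = lift_at_threshold(tp=337, fp=2113, fn=1710, tn=35839)
print(f"sent={n_sent}, expected at random={expected_random:.1f}, lift={lift:.2f}")
```

The sketch uses the exact row total rather than a rounded 40,000, so its numbers may differ slightly from a hand calculation. To use it on your own results, replace the four counts with the n_rows values from your training-set and test-set summary tables and compare the two lifts.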