R Code Project

Description


Data Requirements:

You can pick any data you want as long as it is a classification problem.

Some sources are:

• Kaggle https://www.kaggle.com/datasets?tags=13302-Classification
• UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets.php?format=&task=cla&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table

Read your data into R and name the data frame `df`. For the rest of this document, `y` refers to the variable you are predicting.

• The grading rubric can be found below:

Percentage of assigned points by criterion:

• R code: 30%
• Decision/Why: 35%
• Communication of findings: 35%

• Decision/why?: Explain your reasoning behind your choice of the procedure, set of variables and such for the question.

Explain why you use the procedure/model/variable

• To exceed this criterion, describe the steps taken to implement the procedure in a non-technical way.

Communication of your findings: Explain your results in terms of training MSE, testing MSE (for these classification models, the training and testing error rates), and prediction of the variable `y`.

Explain why you think one model is better than the other.

To exceed this criterion, explain your model and how it predicts `y` in a non-technical way.

Part 1: Exploratory Data Analysis (20 points)

Check for the existence of NAs (missing data).

If necessary, classify all categorical variables except the one you are predicting as factors. Calculate the summary statistics of the entire data set.
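The missing-data check, factor conversion, and summary step might look like the sketch below. It is only an illustration under stated assumptions: the built-in `mtcars` data stands in for your own data, its 0/1 column `am` plays the role of `y`, and the list of categorical columns (`cat_vars`) is a hypothetical choice you would replace with your own.

```r
# Stand-in data (assumption): mtcars, with the 0/1 column `am` playing the role of y
df <- mtcars
names(df)[names(df) == "am"] <- "y"

# Count missing values in each column
na_counts <- colSums(is.na(df))
print(na_counts)

# Convert categorical predictors (not y) to factors; this column list is an assumption
cat_vars <- c("cyl", "vs", "gear", "carb")
df[cat_vars] <- lapply(df[cat_vars], factor)

# Summary statistics for the entire data set
print(summary(df))
```

If `na_counts` shows missing values in your own data, you would also need to decide (and justify) whether to drop or impute those rows before modeling.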

For the numerical variables, plot box plots based on values of `y`. Do you see a difference between the box plots for any of the variables you choose?

For the categorical variables, plot bar charts for the different values of `y`. Do you see a difference between plots for any of the variables you choose?
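As a rough illustration of the two kinds of plots, the sketch below uses base R graphics. The data are an assumption: `mtcars` stands in for your data, `am` plays the role of `y`, `mpg` is the example numerical variable, and `cyl` is the example categorical variable.

```r
# Stand-in data (assumption): mtcars, with `am` renamed to y
df <- mtcars
names(df)[names(df) == "am"] <- "y"
df$cyl <- factor(df$cyl)

# Box plot of a numerical variable, one box per value of y
boxplot(mpg ~ y, data = df, xlab = "y", ylab = "mpg", main = "mpg by y")

# Bar chart of a categorical variable, with side-by-side bars for each value of y
counts <- table(y = df$y, cyl = df$cyl)
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "cyl", ylab = "Count", main = "cyl by y")
```

Boxes that sit at clearly different heights, or bar groups with clearly different proportions, suggest that the variable may help separate the values of `y`.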

Test/training separation: Separate your data into 80% training and 20% testing data. Do not forget to set a seed. Use the same split for the whole assignment so that the models can be compared.
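One possible way to make a reproducible 80/20 split is sketched below. The built-in `iris` data (with `Species` renamed to `y`) is only a stand-in for your own data, and the seed value is arbitrary.

```r
# Stand-in data (assumption): iris, with Species renamed to y
df <- iris
names(df)[names(df) == "Species"] <- "y"

# Fix the seed so the same rows are selected on every run (the value 1 is arbitrary)
set.seed(1)

# Randomly pick 80% of the row indices for training; the rest form the test set
n <- nrow(df)
train_idx <- sample(n, size = floor(0.8 * n))
train <- df[train_idx, ]
test <- df[-train_idx, ]

cat("Training rows:", nrow(train), " Testing rows:", nrow(test), "\n")
```

Reusing `train` and `test` unchanged in every later part keeps the testing error rates comparable across models.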

Part 2: Logistic Regression or LDA (15 points)

Develop a classification model with `y` as the dependent variable, using logistic regression or LDA, the rest of the variables as predictors, and your training data set.

Obtain the confusion matrix and compute the testing error rate based on the logistic regression or LDA classification.
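A minimal sketch of the logistic regression option follows, using `glm()` from base R. The data are an assumption: two species from `iris` stand in for a binary `y`, and the 0.5 probability cutoff is a conventional but arbitrary choice.

```r
# Stand-in data (assumption): two iris species, with Species renamed to y
df <- droplevels(subset(iris, Species != "setosa"))
names(df)[names(df) == "Species"] <- "y"

# Same style of 80/20 split as in Part 1 (seed value is arbitrary)
set.seed(1)
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Logistic regression of y on all other variables
# (glm may warn if the classes are almost perfectly separated)
fit <- glm(y ~ ., data = train, family = binomial)

# glm models the probability of the second factor level of y
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, levels(df$y)[2], levels(df$y)[1]),
               levels = levels(df$y))

# Confusion matrix and testing error rate
conf_mat <- table(Predicted = pred, Actual = test$y)
test_error <- mean(pred != test$y)
print(conf_mat)
cat("Testing error rate:", test_error, "\n")
```

The testing error rate is the share of test rows that fall off the diagonal of the confusion matrix.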

Part 3: KNN (15 points)

Apply a KNN classification to the training data using.

Obtain the confusion matrix and compute the testing error rate based on the KNN classification.
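The KNN step could be sketched as below with `knn()` from the `class` package. Several things here are assumptions rather than requirements of the assignment: `iris` stands in for your data, `k = 5` is an arbitrary choice, and the predictors are standardized because KNN relies on distances.

```r
library(class)  # provides knn()

# Stand-in data (assumption): iris, with Species renamed to y
df <- iris
names(df)[names(df) == "Species"] <- "y"

# Same style of 80/20 split as in Part 1 (seed value is arbitrary)
set.seed(1)
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Standardize predictors using the training means and standard deviations
x_cols <- setdiff(names(df), "y")
train_x <- scale(train[, x_cols])
test_x <- scale(test[, x_cols],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))

# Classify each test row by majority vote among its k nearest training rows
pred <- knn(train = train_x, test = test_x, cl = train$y, k = 5)

# Confusion matrix and testing error rate
conf_mat <- table(Predicted = pred, Actual = test$y)
test_error <- mean(pred != test$y)
print(conf_mat)
cat("Testing error rate:", test_error, "\n")
```

Trying a few values of `k` and explaining the one you keep is one way to address the Decision/Why criterion.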

Part 4: Tree-Based Model (15 points)

Apply one of the following models to your training data: Classification Tree, Random Forest, Bagging, or Boosting.

Obtain the confusion matrix and compute the testing error rate based on your chosen tree-based model.
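For the classification tree option, a sketch with the `rpart` package might look like the following. The choice of `rpart` and the `iris` stand-in data are assumptions; a random forest, bagging, or boosting model would follow the same fit, predict, and tabulate pattern with a different package.

```r
library(rpart)  # provides rpart() for classification trees

# Stand-in data (assumption): iris, with Species renamed to y
df <- iris
names(df)[names(df) == "Species"] <- "y"

# Same style of 80/20 split as in Part 1 (seed value is arbitrary)
set.seed(1)
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Fit a classification tree of y on all other variables
fit <- rpart(y ~ ., data = train, method = "class")

# Predict class labels for the test set
pred <- predict(fit, newdata = test, type = "class")

# Confusion matrix and testing error rate
conf_mat <- table(Predicted = pred, Actual = test$y)
test_error <- mean(pred != test$y)
print(conf_mat)
cat("Testing error rate:", test_error, "\n")
```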

Part 5: SVM (15 points)

• Apply an SVM model to your training data.

Calculate the confusion matrix using the testing data.
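The SVM step could be sketched as follows. The assignment does not name a package, so the use of `e1071` (which is not part of a default R installation), the radial kernel, and `cost = 1` are all assumptions, and `iris` is again a stand-in for your own data.

```r
# Stand-in data (assumption): iris, with Species renamed to y
df <- iris
names(df)[names(df) == "Species"] <- "y"

# Same style of 80/20 split as in Part 1 (seed value is arbitrary)
set.seed(1)
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

if (requireNamespace("e1071", quietly = TRUE)) {
  # Fit an SVM of y on all other variables (kernel and cost are arbitrary choices)
  fit <- e1071::svm(y ~ ., data = train, kernel = "radial", cost = 1)

  # Predict class labels for the test set and tabulate them against the truth
  pred <- predict(fit, newdata = test)
  conf_mat <- table(Predicted = pred, Actual = test$y)
  print(conf_mat)
} else {
  message("Package e1071 is not installed; run install.packages(\"e1071\") first.")
}
```

Computing `mean(pred != test$y)` from the same predictions gives a testing error rate that can be compared with the other models in the conclusion.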

Part 6: Conclusion (20 points)

• (10 points) Based on the different classification models, which one do you think is the best model to predict `y`? Please consider the following in your response:

Accuracy/error rates

Do you think you can improve the model by adding any other information?
