Data Science Week 09

Memorization Method

Classification and Regression

Classification is a task that predicts a discrete outcome (a class).

- Is an e-mail spam or not? (binary)
- Does a patient have breast cancer or not? (binary)

- Which letter grade is a student expected to get in this class? (multi-class: A, B, C, D, F)

Regression is a task that predicts a continuous value (a score).

- expected housing price

- expected GPA

The variable we want to know is called the target variable.

Memorization Method

*The simplest methods, which answer with

- the majority category (in the case of classification), based on past records

- the average value (in the case of scoring)

*Single-variable models use one variable to produce the answer.

*Multi-variable models use more than one variable.

- These include decision trees, k-nearest neighbors, and Naive Bayes.

*Intuitive and straightforward
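The two simplest memorization models described above can be sketched on a small made-up data set. The data frame `toy` and its columns are hypothetical and serve only as an illustration.

```r
# Hypothetical toy data: one class label and one numeric score per row
toy <- data.frame(
  grade = c("A", "B", "B", "C", "B", "A"),
  gpa   = c(4.0, 3.1, 3.3, 2.4, 3.0, 3.8)
)

# Classification: always answer with the majority category seen in the past
class_counts   <- table(toy$grade)
majority_class <- names(class_counts)[which.max(class_counts)]

# Scoring (regression): always answer with the average value seen in the past
average_gpa <- mean(toy$gpa)

majority_class
average_gpa
```

Neither model looks at any input variable, which makes them a useful baseline: a single-variable model should at least beat them.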

Sample Dataset

(Data originally extracted from 1994 Census database. Prediction task is to determine whether a person makes over 50K a year.)

Classification with Single variable model

*Given a single input variable, we predict whether a person's yearly income is more than 50K USD.
*We can choose the predictor (input variable) from age, education, workclass, and so on.

Data Preparation (preprocessing)

load(url('adult.RData'))

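# Fix the random seed so the split is reproducible
set.seed(2020)

n_sample <- nrow(adult)

# Draw one uniform random number per row; rows with rgroup <= 0.8 go to training
rgroup <- runif(n_sample)

adult.train <- subset(adult, rgroup <= 0.8)

adult.test <- subset(adult, rgroup > 0.8)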

dim(adult.train)

## [1] 26040 14

dim(adult.test)

## [1] 6521 14

We partition the dataset into two groups with a ratio of roughly 8:2.

- adult.train is for building the prediction model

- adult.test is for evaluating our model
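The random partition can be tried on a small stand-in data set. This is a minimal sketch: the data frame `toy` and its columns are hypothetical, and only the splitting idea is taken from the notes.

```r
# Hypothetical toy data frame standing in for the real data set
set.seed(2020)
toy <- data.frame(id = 1:1000, x = rnorm(1000))

# One uniform random number per row decides which partition the row joins
rgroup <- runif(nrow(toy))
toy.train <- toy[rgroup <= 0.8, ]
toy.test  <- toy[rgroup >  0.8, ]

# The two partitions are disjoint and together cover every row
nrow(toy.train) + nrow(toy.test) == nrow(toy)
length(intersect(toy.train$id, toy.test$id)) == 0

# The split is only approximately 8:2 because it is random
nrow(toy.train) / nrow(toy)
```

Because each row is assigned independently, the training share fluctuates around 0.8 rather than hitting it exactly, which matches the 26040 versus 6521 rows reported above.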

Building a Single Variable Model

We first choose the "occupation" variable as the predictor.

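# Cross-tabulate occupation against the income flag
tble <- table(adult.train$occupation, adult.train$income_mt_50k)

tble

# Row-wise proportions: each occupation's row sums to 1
prop.table(tble, margin = 1)

# Keep the second column, the share of each occupation earning more than 50k
sv_model_job <- prop.table(tble, margin = 1)[, 2]

sort(sv_model_job, decreasing = TRUE)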

48% of executive-managers earn more than 50k yearly.

None of the private household servants earn more than 50k yearly.
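The same table-based model can be reproduced on a tiny made-up training set, where the proportions are easy to check by hand. All names and values below (`toy.train`, `sv_model_toy`, the three occupations) are hypothetical.

```r
# Hypothetical training data: occupation and whether income exceeds 50k
toy.train <- data.frame(
  occupation    = c("manager", "manager", "manager", "manager",
                    "clerk", "clerk", "clerk", "servant", "servant"),
  income_mt_50k = c(TRUE, TRUE, FALSE, FALSE,
                    TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Count each (occupation, outcome) pair
counts <- table(toy.train$occupation, toy.train$income_mt_50k)

# margin = 1 normalizes each row, so every occupation's proportions sum to 1
row_props <- prop.table(counts, margin = 1)

# The "TRUE" column is the memorized share of high earners per occupation
sv_model_toy <- row_props[, "TRUE"]
sort(sv_model_toy, decreasing = TRUE)
```

Selecting the column by its name ("TRUE") instead of by position is a small design choice that keeps the lookup correct even if the column order changes.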

adult.train$est_prob <- sv_model_job[adult.train$occupation]

head(adult.train[, c('occupation', 'est_prob', 'income_mt_50k')], 10)

- We classify a person as earning more than 50k if their estimated probability is greater than a threshold (0.4 here).

threshold <- 0.4

adult.train$prediction <- adult.train$est_prob > threshold
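The lookup-then-threshold step can be sketched in isolation. The probabilities, the `people` data frame, and the fallback of 0 for an unseen occupation are all assumptions made for this illustration, not part of the original notes.

```r
# Hypothetical per-occupation probabilities, as a single-variable model would store them
sv_model_toy <- c(manager = 0.48, clerk = 0.13, servant = 0.00)

# New people to score; "pilot" was never seen during training
people <- data.frame(occupation = c("manager", "clerk", "servant", "pilot"),
                     stringsAsFactors = FALSE)

# Look up each person's probability by occupation name
people$est_prob <- unname(sv_model_toy[people$occupation])

# An unseen occupation yields NA; falling back to 0 is one possible choice
people$est_prob[is.na(people$est_prob)] <- 0

# Predict "more than 50k" only when the probability exceeds the threshold
threshold <- 0.4
people$prediction <- people$est_prob > threshold
people
```

Raising the threshold makes the model predict "more than 50k" less often, and lowering it does the opposite.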

Accuracy

- Now we have predicted answers for the training set.

- Let's see how accurate it is.

- Accuracy = # of correct predictions / # of all examples.

conf.table <- table(pred = adult.train$prediction, actual = adult.train$income_mt_50k)

conf.table

accuracy <- sum(diag(conf.table)) / sum(conf.table)

accuracy

## [1] 0.7400154

## (16202 + 3068) / (16202 + 3068 + 3232 + 3538)
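The accuracy calculation can be checked on a handful of made-up predictions. The vectors `pred` and `actual` below are hypothetical. Fixing the factor levels is an added safeguard, so that the table stays 2 x 2 and `diag()` picks the correct cells.

```r
# Hypothetical predictions and true labels
pred   <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
actual <- c(TRUE, FALSE, TRUE,  TRUE, FALSE, FALSE, FALSE, FALSE)

# Fixed levels keep the table 2 x 2 even if one class is never predicted
lv <- c(FALSE, TRUE)
conf.table <- table(pred   = factor(pred,   levels = lv),
                    actual = factor(actual, levels = lv))
conf.table

# Correct predictions sit on the diagonal (true negatives and true positives)
accuracy <- sum(diag(conf.table)) / sum(conf.table)
accuracy

# Equivalent shortcut: the share of positions where prediction equals truth
mean(pred == actual)
```

Six of the eight predictions match the true label, so both expressions give 0.75.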

Prediction on Test Data

- Working well on the training dataset does not necessarily guarantee that the model works well in the real world.

- The model can memorize training examples to make accurate predictions on them. This is overfitting.

- We need a prediction model that generalizes.

- To measure generalized performance, we use a test set that is unseen during model training.

- We simulate future data with the test data.
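An extreme case makes the gap between training and test performance visible. The sketch below is a constructed example, not from the original notes: the label is a coin flip, and the "model" memorizes each training row by a hypothetical `id` column.

```r
set.seed(1)
n <- 200

# Hypothetical data: the label is a coin flip, so no input can truly predict it
toy <- data.frame(id = as.character(1:n),
                  label = runif(n) > 0.5,
                  stringsAsFactors = FALSE)
train <- toy[1:150, ]
test  <- toy[151:200, ]

# "Model" that memorizes the label of every training id
memory <- setNames(train$label, train$id)

# Unseen ids return NA, so fall back to the training majority class
predict_memo <- function(ids) {
  out <- unname(memory[ids])
  out[is.na(out)] <- mean(train$label) > 0.5
  out
}

train_acc <- mean(predict_memo(train$id) == train$label)
test_acc  <- mean(predict_memo(test$id)  == test$label)

train_acc   # perfect, because every training row was memorized
test_acc    # much lower on rows the model never saw
```

Training accuracy alone would suggest a perfect model here, and only the held-out rows reveal that nothing generalizable was learned.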

adult.test$est_prob <- sv_model_job[adult.test$occupation]

adult.test$prediction <- adult.test$est_prob > threshold

head(adult.test[, c('occupation', 'est_prob', 'prediction', 'income_mt_50k')], 10)

conf.table <- table(pred = adult.test$prediction, actual = adult.test$income_mt_50k)

conf.table

accuracy <- sum(diag(conf.table)) / sum(conf.table)

accuracy

## [1] 0.7511118

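TP (True Positive): predicted positive, and the prediction is correct.

TN (True Negative): predicted negative, and the prediction is correct.

FP (False Positive): predicted positive, but the prediction is wrong.

FN (False Negative): predicted negative, but the prediction is wrong.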

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Precision

- TP / (TP + FP)

- Among the customers we predicted to charge more than 10k USD, 100 of them actually charge more than 10k USD.

Recall

- TP / (TP + FN)

- Among all customers who actually charge more than 10k USD, we find only 43% of them.
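Both formulas can be computed directly from logical vectors. This is a minimal sketch with hypothetical `pred` and `actual` vectors, chosen so the counts are easy to verify by hand.

```r
# Hypothetical predictions and true labels
pred   <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
actual <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

TP <- sum(pred & actual)     # predicted positive, actually positive
FP <- sum(pred & !actual)    # predicted positive, actually negative
FN <- sum(!pred & actual)    # predicted negative, actually positive
TN <- sum(!pred & !actual)   # predicted negative, actually negative

precision <- TP / (TP + FP)  # share of predicted positives that are right
recall    <- TP / (TP + FN)  # share of actual positives that were found

c(TP = TP, FP = FP, FN = FN, TN = TN)
precision
recall
```

Here TP = 2, FP = 1, FN = 2 and TN = 3, so precision is 2/3 and recall is 1/2. The two measures answer different questions, so a model can score well on one and poorly on the other.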

ROC Curve

AUC (Area Under the Curve)

- 0.9 (90%): excellent

- 0.8 (80%): good

- 0.5 (50%): fail (no better than random guessing)
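An ROC curve and its AUC can be computed by hand in base R by sweeping the threshold over the scores. The `score` and `actual` vectors are hypothetical, and this sketch is one way to do the calculation under those assumptions, not necessarily the method used in the lecture.

```r
# Hypothetical scores (estimated probabilities) and true labels
score  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)
actual <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

# One ROC point per threshold: true positive rate against false positive rate
thresholds <- sort(unique(c(Inf, score, -Inf)), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) sum(score >= t & actual) / sum(actual))
fpr <- sapply(thresholds, function(t) sum(score >= t & !actual) / sum(!actual))

# Area under the ROC curve by the trapezoidal rule
auc_trapezoid <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Same quantity: the chance that a random positive outscores a random negative
pos <- score[actual]
neg <- score[!actual]
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

auc_trapezoid
auc_pairs
```

Both calculations give 0.8125 for this toy data, because 13 of the 16 positive-negative pairs are ranked correctly. Calling `plot(fpr, tpr, type = "l")` would draw the curve itself.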


혼밥맨
