Data Science Week 09

by 혼밥맨 2021. 5. 3.
 

Memorization Method

Classification and Regression

Classification is a task that predicts a discrete outcome (class).

 - Is an e-mail spam or not? (binary)
 - Does a patient have breast cancer or not? (binary)

 - Which letter grade is a student expected to get for this class? (multi-class: A, B, C, D, F)

 

Regression is a task that predicts a continuous value (score).

 - expected housing price

 - expected GPA

 

The variable we want to know is called the target variable.

 

 

Memorization Method

*The simplest methods, which answer based on past records with

 - the majority category (in the case of classification)

 - the average value (in the case of scoring)

 

*Single-variable models use one variable to make a prediction.

*Multi-variable models use more than one variable.

 - These include decision trees, k-nearest neighbors, and Naive Bayes methods.

*Intuitive and straightforward
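
The simplest memorization method above can be sketched in a few lines of base R. The vectors `past_labels` and `past_scores` are made-up values for illustration only and are not part of the adult dataset.

```r
# Hypothetical past records (made up for illustration).
past_labels <- c("no", "no", "yes", "no", "yes", "no")
past_scores <- c(3.1, 3.5, 2.8, 3.9, 3.3)

# Classification: always answer with the most frequent class seen so far.
class_counts <- table(past_labels)
majority_class <- names(class_counts)[which.max(class_counts)]

# Scoring: always answer with the average of the values seen so far.
average_score <- mean(past_scores)

majority_class
average_score
```

Such a baseline ignores every input variable, so it is mainly useful as a reference point that any real model should beat.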

 

 

Sample Dataset

(Data originally extracted from the 1994 Census database. The prediction task is to determine whether a person makes over 50K a year.)

 

 

Classification with Single variable model

*Given a single input variable, we predict whether a person's yearly income is more than 50K USD.
*We can choose the predictor (input variable) from age, education, workclass, ..

 

 

Data Preparation (preprocessing)

load(url('adult.RData'))

 

set.seed(2020)

n_sample <- nrow(adult)

rgroup <- runif(n_sample)

 

adult.train <- subset(adult, rgroup <= 0.8)

adult.test <- subset(adult, rgroup > 0.8)

 

dim(adult.train)

## [1] 26040 14

 

dim(adult.test)

## [1] 6521 14

 

We partition the dataset into two groups with a ratio of 8:2.

 - adult.train is for building the prediction model

 - adult.test is for evaluating our model

 

 

 

 

Building a Single Variable Model

We first choose the "occupation" variable as the predictor.

 

tble <- table(adult.train$occupation, adult.train$income_mt_50k)

tble

prop.table(tble, margin = 1)

 

sv_model_job <- prop.table(tble, margin = 1)[, 2]

sort(sv_model_job, decreasing = TRUE)

 

48% of executive-managerial workers earn more than 50k yearly.

None of the private household service workers earn more than 50k yearly.

 

 

 

adult.train$est_prob <- sv_model_job[adult.train$occupation]

 

head(adult.train[, c('occupation', 'est_prob', 'income_mt_50k')], 10)

 - We classify a person as earning more than 50k if their estimated probability is greater than a threshold (0.4 here).

threshold <- 0.4

adult.train$prediction <- adult.train$est_prob > threshold
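
Because adult.RData may not be at hand, the same single-variable steps can be tried on a tiny data frame. This is only a sketch: the `toy` data frame and its values are hypothetical, and `as.character()` is used so that the lookup is done by occupation name.

```r
# Hypothetical toy data standing in for adult.train (values are made up).
toy <- data.frame(
  occupation = c("clerk", "clerk", "clerk", "manager", "manager",
                 "manager", "manager", "farmer", "farmer", "farmer"),
  income_mt_50k = c(FALSE, FALSE, TRUE, TRUE, TRUE,
                    FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Rows are occupations, columns are the FALSE/TRUE income labels.
tble <- table(toy$occupation, toy$income_mt_50k)

# margin = 1 normalizes each row, so the "TRUE" column holds
# the share of each occupation that earns more than 50k.
sv_model <- prop.table(tble, margin = 1)[, "TRUE"]

# Look up each person's estimated probability by occupation name.
toy$est_prob <- unname(sv_model[as.character(toy$occupation)])

# Predict "more than 50k" when the estimate exceeds the threshold.
threshold <- 0.4
toy$prediction <- toy$est_prob > threshold

toy
```

In this toy data only "manager" (3 of 4 above 50k) passes the 0.4 threshold, so every manager is predicted TRUE and everyone else FALSE.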

 

 

Accuracy

- Now we have predicted answers for the training set.

- Let's see how accurate they are.

- Accuracy = # of correct predictions / # of all examples.

 

conf.table <- table(pred = adult.train$prediction, actual = adult.train$income_mt_50k)

conf.table

 

accuracy <- sum(diag(conf.table)) / sum(conf.table)

accuracy

## [1] 0.7400154

## (16202 + 3068) / (16202 + 3068 + 3232 + 3538)

 

 

Prediction on Test Data

- Working well on the training dataset does not necessarily guarantee that the model works well in the real world.

- A model can simply memorize the training examples to predict them accurately. This is called overfitting.

- We need a prediction model that generalizes.

- To see the generalized performance, we use the test set, which is unseen during model training.

- We simulate future data with the test data.

 

 

adult.test$est_prob <- sv_model_job[adult.test$occupation]

adult.test$prediction <- adult.test$est_prob > threshold

 

head(adult.test[, c('occupation', 'est_prob', 'prediction', 'income_mt_50k')], 10)

 

 

conf.table <- table(pred = adult.test$prediction, actual = adult.test$income_mt_50k)

conf.table

 

accuracy <- sum(diag(conf.table)) / sum(conf.table)

accuracy

## [1] 0.7511118

 

 

 

 

 

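TP - Predicted positive, and the prediction is correct. True Positive.

TN - Predicted negative, and the prediction is correct. True Negative.

FP - Predicted positive, but the prediction is wrong. False Positive.

FN - Predicted negative, but the prediction is wrong. False Negative.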

 

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

 

 

Precision

- TP / (TP + FP)

- Among the customers we predicted to charge more than 10k USD, 100 of them actually charge more than 10k USD.

 

Recall

- TP / (TP + FN)

- Among all customers charging more than 10k USD, we only find 43% of them.
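
As a rough illustration, precision and recall can be read off a confusion table built with `table()`. The `pred` and `actual` vectors here are made up for the example and do not come from the datasets discussed in this post.

```r
# Hypothetical predictions and true labels (made up for illustration).
actual <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
pred   <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)

# Fix the factor levels so the table always has both rows and both columns.
conf <- table(pred   = factor(pred,   levels = c(FALSE, TRUE)),
              actual = factor(actual, levels = c(FALSE, TRUE)))

TP <- conf["TRUE", "TRUE"]    # predicted positive, actually positive
FP <- conf["TRUE", "FALSE"]   # predicted positive, actually negative
FN <- conf["FALSE", "TRUE"]   # predicted negative, actually positive
TN <- conf["FALSE", "FALSE"]  # predicted negative, actually negative

precision <- TP / (TP + FP)   # share of positive predictions that are right
recall    <- TP / (TP + FN)   # share of actual positives that we found

precision
recall
```

With these made-up vectors, TP = 3, FP = 1, and FN = 2, so precision is 3/4 and recall is 3/5.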

 

 

 

ROC Curve

 

 

AUC (Area Under the Curve)

90% excellent

80% good

50% fail (no better than random guessing)
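
One way to see where the ROC curve and AUC come from is to sweep the threshold over the estimated probabilities and record the true positive rate and false positive rate at each step. This is a minimal base R sketch under assumptions: `est_prob` and `actual` are made-up values, and the area is approximated with the trapezoidal rule.

```r
# Hypothetical estimated probabilities and true labels (made up).
est_prob <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
actual   <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Sweep thresholds from high to low; -Inf makes every example positive.
thresholds <- c(sort(unique(est_prob), decreasing = TRUE), -Inf)

# At each threshold, predict positive when est_prob > threshold.
tpr <- sapply(thresholds, function(t) sum(est_prob > t & actual) / sum(actual))
fpr <- sapply(thresholds, function(t) sum(est_prob > t & !actual) / sum(!actual))

# Area under the (fpr, tpr) curve by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

auc
```

Calling `plot(fpr, tpr, type = "l")` afterwards draws the curve. For these made-up values the area is 0.8125, which is the fraction of positive/negative pairs in which the positive example receives the higher score.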

 

 
