Data Science Week 09
Memorization Method
Classification and Regression
Classification is a task that predicts a discrete outcome (class)
- is an e-mail spam or not (binary)
- does a patient have breast cancer or not (binary)
- predict the letter grade a student is expected to get for this class (multi-class: A, B, C, D, F)
Regression is a task that predicts a continuous value (score)
- expected housing price
- expected GPA
The variable we want to know is called the target variable.
Memorization Method
*The simplest methods, which generate answers from past records by returning
- the majority category (in the case of classification)
- the average value (in the case of scoring)
*Single-variable models use one variable to produce an answer
*Multi-variable models use more than one variable
- these include decision trees, k-nearest neighbors, and Naive Bayes methods.
*Intuitive and straightforward
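The two simplest answers described above (majority category and average value) can be sketched in a few lines. This is a minimal illustration on made-up data; the `train` data frame and its values are hypothetical and are not taken from the lecture dataset.

```r
# Toy training data (hypothetical values, for illustration only)
train <- data.frame(
  income_mt_50k = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  gpa           = c(3.1, 3.5, 2.8, 3.9, 3.3, 3.4)
)

# Classification: always answer with the most frequent class seen so far
class_counts   <- table(train$income_mt_50k)
majority_class <- names(class_counts)[which.max(class_counts)]

# Scoring (regression): always answer with the average observed value
mean_gpa <- mean(train$gpa)

majority_class
mean_gpa
```

Both answers ignore every input variable, which is why they are useful mainly as a baseline for the single-variable and multi-variable models.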
Sample Dataset
(The data were originally extracted from the 1994 Census database. The prediction task is to determine whether a person makes over 50K a year.)
Classification with Single variable model
*Given a single input variable, we predict whether a person's yearly income is more than 50K USD.
*We can choose the predictor (input variable) from age, education, workclass, ...
Data Preparation (preprocessing)
load(url('adult.RData'))
set.seed(2020)  # fix the random seed so the split is reproducible
n_sample <- nrow(adult)
rgroup <- runif(n_sample)  # one uniform random number in [0, 1] per row
adult.train <- subset(adult, rgroup <= 0.8)
adult.test <- subset(adult, rgroup > 0.8)
dim(adult.train)
## [1] 26040 14
dim(adult.test)
## [1] 6521 14
We partition the dataset into two groups with a ratio of 8:2.
-adult.train is for building the prediction model
-adult.test is for evaluating our model
Building a Single Variable Model
We first choose the "occupation" variable as the predictor.
tble <- table(adult.train$occupation, adult.train$income_mt_50k)
tble
prop.table(tble, margin = 1)
sv_model_job <- prop.table(tble, margin = 1)[, 2]
sort(sv_model_job, decreasing = TRUE)
In the training set, 48% of executive managers earn more than 50K a year.
None of the private household servants earn more than 50K a year.
adult.train$est_prob <- sv_model_job[adult.train$occupation]
head(adult.train[, c('occupation', 'est_prob', 'income_mt_50k')], 10)
- We classify a person as earning more than 50K if their estimated probability is greater than the threshold (0.4 here).
threshold <- 0.4
adult.train$prediction <- adult.train$est_prob > threshold
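The lookup-and-threshold steps above depend on the census data file, so here is a self-contained sketch of the same idea on a tiny made-up table. The `toy` data frame, its occupation names, and its values are hypothetical; only the 0.4 threshold comes from the notes.

```r
# Toy training data (hypothetical; not the census data)
toy <- data.frame(
  occupation    = c("exec", "exec", "exec", "clerk", "clerk", "clerk", "clerk", "sales"),
  income_mt_50k = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# "Memorize" the share of TRUE outcomes for each occupation
tbl      <- table(toy$occupation, toy$income_mt_50k)
sv_model <- prop.table(tbl, margin = 1)[, "TRUE"]

# Look up the memorized probability for each row by occupation name
toy$est_prob <- unname(sv_model[as.character(toy$occupation)])

# Predict TRUE when the estimated probability exceeds the threshold
threshold      <- 0.4
toy$prediction <- toy$est_prob > threshold
toy
```

Indexing by `as.character(...)` looks values up by name, which avoids relying on the internal integer codes when the column happens to be a factor.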
Accuracy
- Now we have predicted answers for the training set.
- Let's see how accurate they are.
- Accuracy = (# of correct predictions) / (# of all examples)
conf.table <- table(pred = adult.train$prediction, actual = adult.train$income_mt_50k)
conf.table
accuracy <- sum(diag(conf.table)) / sum(conf.table)
accuracy
## [1] 0.7400154
## (16202 + 3068) / (16202 + 3068 + 3232 + 3538)
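The arithmetic in the comment above can be reproduced without the dataset by rebuilding the confusion table from the four quoted counts. One assumption is labeled in the code: the notes do not say which off-diagonal count is the false positives and which is the false negatives, so their placement below is a guess. Accuracy is unaffected by that choice.

```r
# Confusion table rebuilt from the counts quoted above.
# Assumption: the placement of the two off-diagonal counts (3232, 3538)
# is a guess; only the diagonal (correct predictions) matters for accuracy.
conf <- matrix(c(16202, 3232,
                 3538,  3068),
               nrow = 2, byrow = TRUE,
               dimnames = list(pred   = c("FALSE", "TRUE"),
                               actual = c("FALSE", "TRUE")))

correct  <- sum(diag(conf))      # predictions that match the actual label
accuracy <- correct / sum(conf)  # share of all examples predicted correctly
accuracy
```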
Prediction on Test Data
- Working well on the training dataset does not necessarily guarantee that the model works well in the real world.
- A model can simply memorize the training examples to make accurate predictions on them. This is called overfitting.
- We need a prediction model that generalizes.
- To measure generalized performance, we use the test set, which is unseen during model training.
- We simulate future data with the test data.
adult.test$est_prob <- sv_model_job[adult.test$occupation]
adult.test$prediction <- adult.test$est_prob > threshold
head(adult.test[, c('occupation', 'est_prob', 'prediction', 'income_mt_50k')], 10)
conf.table <- table(pred = adult.test$prediction, actual = adult.test$income_mt_50k)
conf.table
accuracy <- sum(diag(conf.table)) / sum(conf.table)
accuracy
## [1] 0.7511118
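TP - we predicted Positive and the prediction was correct. True Positive.
TN - we predicted Negative and the prediction was correct. True Negative.
FP - we predicted Positive but the prediction was wrong. False Positive.
FN - we predicted Negative but the prediction was wrong. False Negative.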
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Precision
- TP / (TP + FP)
- Among the customers we predicted to charge more than 10k USD, 100 of them actually charge more than 10k USD.
Recall
- TP / (TP + FN)
- Among all customers who actually charge more than 10k USD, we find only 43% of them.
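A small sketch of the two formulas above, counting TP, FP, FN, and TN directly from logical vectors. The `actual` and `pred` vectors are hypothetical and chosen only to make the counts easy to check by hand.

```r
# Hypothetical labels and predictions, for illustration only
actual <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
pred   <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

tp <- sum(pred & actual)     # predicted positive, actually positive
fp <- sum(pred & !actual)    # predicted positive, actually negative
fn <- sum(!pred & actual)    # predicted negative, actually positive
tn <- sum(!pred & !actual)   # predicted negative, actually negative

precision <- tp / (tp + fp)  # how many of our positive predictions are right
recall    <- tp / (tp + fn)  # how many of the actual positives we found
c(precision = precision, recall = recall)
```

Raising the threshold usually trades recall for precision, since fewer but more confident positive predictions are made.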
ROC Curve
AUC (Area Under the Curve)
- 90%: excellent
- 80%: good
- 50%: fail (no better than random guessing)
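The notes give no code for the ROC curve, so the following is only a sketch in base R under stated assumptions: the `est_prob` scores and `actual` labels are hypothetical, and the area is computed with the trapezoid rule rather than with any particular package.

```r
# Hypothetical scores and labels, for illustration only
est_prob <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
actual   <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

# One ROC point per threshold, starting from "predict nothing as positive"
thresholds <- c(Inf, sort(unique(est_prob), decreasing = TRUE))

# True positive rate and false positive rate at each threshold
tpr <- sapply(thresholds, function(th) sum(est_prob >= th & actual) / sum(actual))
fpr <- sapply(thresholds, function(th) sum(est_prob >= th & !actual) / sum(!actual))

# Area under the ROC curve by the trapezoid rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

Calling `plot(fpr, tpr, type = "l")` afterwards would draw the curve; a model that ranks every positive above every negative reaches an AUC of 1.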