Binary classification with logistic regression
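- Probability distribution
- the response value represents a probability in the range [0, 1]
1 . Ordinary linear regression assumes that the response variable is normally distributed; the normal distribution is also called the Gaussian distribution or the bell curve
2 . If the response variable is not normally distributed, for example when it is the probability of an event, that assumption does not hold
3 . Generalized linear models use a link function to describe the relationship between the explanatory variables and the response variable
4 . Ordinary linear regression is a special case of the generalized linear model that uses the identity link function: a linear combination of the explanatory variables is linked directly to a normally distributed response variable
5 . In logistic regression, if the response variable exceeds a threshold, the prediction is positive; otherwise it is negative
6 . The response variable is modeled as a function of a linear combination of the explanatory variables using the logistic function. The logistic function returns a value between 0 and 1
7 . In the logistic function F(t) = 1 / (1 + e^(-t)), t is equal to a linear combination of the explanatory variables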
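As a minimal sketch of points 5 to 7, the logistic function and the threshold rule can be written in a few lines. The weights, intercept, inputs, and the 0.5 threshold below are made-up values for illustration; they are not taken from the SMS model in these notes.

```python
import math

def logistic(t):
    """Logistic function: maps a real number t to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))

def predict(features, weights, intercept, threshold=0.5):
    """Return (probability, label) for one instance.

    t is the linear combination of the explanatory variables; the label is
    1 (positive) when the probability reaches the threshold, otherwise 0.
    """
    t = intercept + sum(w * x for w, x in zip(weights, features))
    probability = logistic(t)
    return probability, int(probability >= threshold)

# Hypothetical model parameters, chosen only to illustrate the mapping.
weights = [0.8, -0.4]
intercept = -0.2

print(predict([2.0, 1.0], weights, intercept))  # t is about 1.0, so positive
print(predict([0.0, 1.0], weights, intercept))  # t is about -0.6, so negative
```

Raising the threshold makes the classifier more reluctant to predict the positive class, which is the trade-off the metrics later in these notes measure.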
Spam filtering for SMS messages
1 . explore the data and calculate some basic summary statistics using pandas
import pandas as pd
df=pd.read_table('/Users/enniu/Desktop/SMSSpamCollection',delimiter='\t',header=None)
print('Number of spam messages:', df[df[0] == 'spam'][0].count())
print('Number of ham messages:', df[df[0] == 'ham'][0].count())
2 . create a TfidfVectorizer, then fit it with training messages, and transform both the training and test messages
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
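from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
df=pd.read_table('/Users/enniu/Desktop/SMSSpamCollection',delimiter='\t',header=None)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])  # by default 25% of the rows go to the test set; each split is a pandas Series
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # learn the vocabulary from the training messages and build the TF-IDF matrix
X_test = vectorizer.transform(X_test_raw)  # reuse the fitted vocabulary; the result is a SciPy sparse matrix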
3 . create an instance of LogisticRegression and train the model
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
df=pd.read_table('/Users/enniu/Desktop/SMSSpamCollection',delimiter='\t',header=None)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])  # by default 25% of the rows go to the test set; each split is a pandas Series
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # learn the vocabulary from the training messages and build the TF-IDF matrix
X_test = vectorizer.transform(X_test_raw)  # reuse the fitted vocabulary; the result is a SciPy sparse matrix
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
for i, prediction in enumerate(predictions[:5]):
    print('Prediction:%s. Truelabel:%s. Message:%s' % (prediction, y_test.iloc[i], X_test_raw.iloc[i]))
# iloc (position-based indexing) is required here. X_test_raw[i] can raise an error because
# the train/test split keeps each row's original index label, which matters especially for integer indexes
Binary classification performance metrics
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True Positive | False Negative |
| Actual negative | False Positive | True Negative |
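When the code is actually run, the matrix is laid out as follows (rows are true labels, columns are predicted labels), with the positive class in the second row and column:
| | 0 | 1 |
|---|---|---|
| 0 | True Negative | False Positive |
| 1 | False Negative | True Positive |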
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
y_test = [0,0,0,0,0,1,1,1,1,1]
y_pred = [0,1,0,0,0,0,0,1,1,1]
cm = confusion_matrix(y_test, y_pred)  # separate name, so the imported function is not shadowed
print(cm)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
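To make the four cells of the matrix concrete, here is a small pure-Python sketch that counts them for the same `y_test` and `y_pred` lists. The helper name `confusion_counts` is hypothetical, and the sketch assumes the labels are 0 and 1 with 1 as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true negatives, false positives, false negatives, true positives."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tn, fp, fn, tp

y_test = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_counts(y_test, y_pred)
# Same layout as above: rows are true labels, columns are predicted labels.
print([[tn, fp], [fn, tp]])  # [[4, 1], [2, 3]]
```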
Accuracy
- Accuracy measures the fraction of the classifier's predictions that are correct
from sklearn.metrics import accuracy_score
y_pred=[0,1,1,0]
y_true=[1,1,1,1]
print('Accuracy:', accuracy_score(y_true, y_pred))  # outcome is 0.5
- evaluate the classifier's accuracy
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score  # sklearn.cross_validation was removed in newer scikit-learn releases
df=pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
scores = cross_val_score(classifier, X_train, y_train, cv=5)
# y_pred = classifier.predict(X_test)
# for i, pred in enumerate(y_pred[:5]):
#     print(pred, y_test.iloc[i], X_test_raw.iloc[i])
print('Accuracy', np.mean(scores), scores)
# Outcome: Accuracy 0.955980861244 [ 0.94976077 0.95933014 0.96052632 0.96291866 0.94736842]
- Drawbacks
1 . accuracy can't distinguish between false positive errors and false negative errors
2 . accuracy is not an informative metric if the proportions of the classes are skewed in the population
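The second drawback is easy to see with a toy example. The sketch below uses made-up labels (5 spam, 95 ham) and a "classifier" that always predicts ham; none of these numbers come from the SMS data set.

```python
# Hypothetical, heavily skewed data: 1 = spam (positive), 0 = ham (negative).
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # a useless classifier that never predicts spam

correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
accuracy = correct / len(y_true)

spam_total = sum(1 for actual in y_true if actual == 1)
spam_found = sum(1 for actual, predicted in zip(y_true, y_pred)
                 if actual == 1 and predicted == 1)
spam_recall = spam_found / spam_total

print('Accuracy:', accuracy)                 # 0.95
print('Fraction of spam found:', spam_recall)  # 0.0
```

The accuracy looks good even though not a single spam message is caught, which is why the metrics that follow look at each kind of error separately.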
Precision and recall
- definition
  - precision is the fraction of positive predictions that are correct: TP / (TP + FP)
  - recall is the fraction of truly positive instances that the classifier recognizes: TP / (TP + FN)
- calculate SMS classifier's precision and recall
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score  # sklearn.cross_validation was removed in newer scikit-learn releases
df=pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# This call raised an error when I ran it. A likely cause is string labels such as 'spam'/'ham':
# the 'precision', 'recall' and 'f1' scorers expect binary labels with the positive class encoded as 1
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision', np.mean(precisions), precisions)
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall', np.mean(recalls), recalls)
f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1:', np.mean(f1s), f1s)
# Outcome:
Precision 0.989910506899 [ 0.98591549 1. 0.98850575 0.98795181 0.98717949]
Recall 0.685907046477 [ 0.60344828 0.69565217 0.74782609 0.71304348 0.66956522]
F1: 0.806840977066 [ 0.84102564 0.81675393 0.8042328 0.79144385 0.78074866]
1 . Precision=0.9899 means almost all of the messages that it predicted as spam were actually spam
2 . Recall=0.686 means it incorrectly classified approximately 31 percent of the spam messages as ham
Calculating the F1 measure
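The F1 measure is the harmonic mean of precision and recall, F1 = 2 * precision * recall / (precision + recall). The sketch below computes all three from raw counts; the helper name `precision_recall_f1` is hypothetical, and the counts are the ones obtained from the ten-sample `y_test`/`y_pred` lists in the confusion matrix example (3 true positives, 1 false positive, 2 false negatives).

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=3, fp=1, fn=2)
print('Precision:', precision)  # 0.75
print('Recall:', recall)        # 0.6
print('F1:', round(f1, 4))      # 0.6667
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a classifier cannot score well by maximizing precision alone or recall alone.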
ROC AUC
- unlike accuracy, the ROC curve is insensitive to data sets with unbalanced class proportions
- ROC curves plot the classifier's recall against its fall-out
- fall-out, or the false positive rate, is the number of false positives divided by the total number of negatives
- AUC (area under the curve) represents the expected performance of the classifier
- plot the ROC curve for the SMS spam classifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score  # sklearn.cross_validation was removed in newer scikit-learn releases
from sklearn.metrics import roc_curve, auc
df=pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict_proba(X_test)
false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])  # compare y_test with the predicted probability of the positive class
roc_auc = auc(false_positive_rate, recall)  # compute the AUC value
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)  # 'b' draws a blue line
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
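To show what the curve is made of, the following pure-Python sketch sweeps the decision threshold over a handful of made-up scores, records (fall-out, recall) at each step, and integrates with the trapezoidal rule. It is an illustration of the definitions, not a replacement for `roc_curve` and `auc`; the function names and the four-sample data are hypothetical, and the sketch assumes 0/1 labels with both classes present.

```python
def roc_points(y_true, scores):
    """Return (fall_out, recall) pairs, sweeping the threshold from high to low."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal rule over consecutive (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)                    # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area_under_curve(points))  # 0.75
```

A classifier that ranks every positive above every negative would reach the point (0.0, 1.0) and an AUC of 1.0, while random scores hover around the dashed diagonal with an AUC near 0.5.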
Tuning models with grid search
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split  # replaces sklearn.grid_search and sklearn.cross_validation
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score
# Chain the vectorizer and the classifier so that both are tuned together
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # liblinear supports both the 'l1' and 'l2' penalties searched below
    ('clf', LogisticRegression(solver='liblinear'))
])
# Keys are '<step name>__<parameter name>'; every combination of values is tried
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
    X, y = df['message'], df['label']
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    # Evaluate the best model found on the held-out test set
    predictions = grid_search.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predictions))
    print('Precision:', precision_score(y_test, predictions))
    print('Recall:', recall_score(y_test, predictions))
# The following is the output of the script:
Fitting 3 folds for each of 1536 candidates, totalling 4608 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 4.7s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 23.8s
[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 52.3s
[Parallel(n_jobs=-1)]: Done 792 tasks | elapsed: 1.6min
[Parallel(n_jobs=-1)]: Done 1242 tasks | elapsed: 2.5min
[Parallel(n_jobs=-1)]: Done 1792 tasks | elapsed: 3.7min
[Parallel(n_jobs=-1)]: Done 2442 tasks | elapsed: 5.1min
[Parallel(n_jobs=-1)]: Done 3192 tasks | elapsed: 6.8min
[Parallel(n_jobs=-1)]: Done 4042 tasks | elapsed: 11.2min
[Parallel(n_jobs=-1)]: Done 4608 out of 4608 | elapsed: 12.4min finished
Best score: 0.985
Best parameters set:
clf__C: 10
clf__penalty: 'l2'
vect__max_df: 0.25
vect__max_features: 2500
vect__ngram_range: (1, 2)
vect__norm: 'l2'
vect__stop_words: None
vect__use_idf: True
Accuracy: 0.98493543759
Precision: 0.983333333333
Recall: 0.907692307692
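The "1536 candidates, totalling 4608 fits" line in the output follows directly from the size of the grid: the search tries every combination of values, and each combination is fitted once per cross-validation fold. A short sketch of that arithmetic, reusing the same `parameters` dictionary and `cv=3` (requires Python 3.8+ for `math.prod`):

```python
from math import prod

parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
cv = 3

# One candidate per combination: multiply the number of values of each parameter.
n_candidates = prod(len(values) for values in parameters.values())
n_fits = n_candidates * cv

print(n_candidates)  # 1536
print(n_fits)        # 4608
```

Since the cost grows multiplicatively, adding one more three-valued parameter would triple the running time, which is worth keeping in mind before widening the grid.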
Multi-class classification
- One-vs.-all classification uses one binary classifier for each of the possible classes. The class that is predicted with the greatest confidence is assigned to the instance
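A minimal sketch of the one-vs.-all decision rule, assuming each class already has its own trained binary logistic model. The class names, weights, and intercepts below are invented for illustration, and `one_vs_all_predict` is a hypothetical helper, not a scikit-learn function.

```python
import math

def logistic(t):
    """Logistic function: maps a real number t to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))

def one_vs_all_predict(features, binary_models):
    """Score the instance with every per-class binary model and pick the most confident.

    binary_models maps a class name to the (weights, intercept) of the binary
    classifier that separates that class from all the others.
    """
    confidences = {}
    for label, (weights, intercept) in binary_models.items():
        t = intercept + sum(w * x for w, x in zip(weights, features))
        confidences[label] = logistic(t)
    best_label = max(confidences, key=confidences.get)
    return best_label, confidences

# Hypothetical per-class models over two features.
binary_models = {
    'ham': ([-1.0, 0.5], 0.0),
    'spam': ([2.0, -0.5], -1.0),
    'promo': ([0.5, 1.0], -0.5),
}

label, confidences = one_vs_all_predict([1.0, 0.2], binary_models)
print(label)  # spam
print({name: round(value, 3) for name, value in confidences.items()})
```

The per-class confidences do not need to sum to 1, because each binary model is trained independently; only their ranking matters for the final prediction.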