情感分析的新方法

转载自:https://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis
中文翻译:http://datartisan.com/article/detail/48.html

Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment quantification has enjoyed many useful implementations, such as businesses gaining understanding about consumer reactions to a product, or detecting hateful speech in online comments.
The simplest form of sentiment analysis is to use a dictionary of good and bad words. Each word in a sentence has a score, typically +1 for positive sentiment and -1 for negative. Then, we simply add up the scores of all the words in the sentence to get a final sentiment total. Clearly, this has many limitations, the most important being that it neglects context and surrounding words. For example, in our simple model the phrase “not good” may be classified as 0 sentiment, given “not” has a score of -1 and “good” a score of +1. A human would likely classify “not good” as negative, despite the presence of “good”.
Another common method is to treat a text as a “bag of words”. We treat each text as a 1 by N vector, where N is the size of our vocabulary. Each column is a word, and the value is the number of times that word appears. For example, the phrase “bag of bag of words” might be encoded as [2, 2, 1]. This could then be fed into a machine learning algorithm for classification, such as logistic regression or SVM, to predict sentiment on unseen data. Note that this requires data with known sentiment to train on in a supervised fashion. While this is an improvement over the previous method, it still ignores context, and the size of the data increases with the size of the vocabulary.
Word2Vec and Doc2Vec
Recently, Google developed a method called Word2Vec that captures the context of words, while at the same time reducing the size of the data. Word2Vec is actually two different methods: Continuous Bag of Words (CBOW) and Skip-gram. In the CBOW method, the goal is to predict a word given the surrounding words. Skip-gram is the converse: we want to predict a window of words given a single word (see Figure 1). Both methods use artificial neural networks as their classification algorithm. Initially, each word in the vocabulary is a random N-dimensional vector. During training, the algorithm learns the optimal vector for each word using the CBOW or Skip-gram method.

Figure 1: Architecture for the CBOW and Skip-gram method, taken from Efficient Estimation of Word Representations in Vector Space. W(t) is the current word, while w(t-2), w(t-1), etc. are the surrounding words.
These word vectors now capture the context of surrounding words. This can be seen by using basic algebra to find word relations (i.e. “king” – “man” + “woman” = “queen”). These word vectors can be fed into a classification algorithm, as opposed to bag-of-words, to predict sentiment. The advantage is that we now have some word context, and our feature space is much lower (typically ~300 as opposed to ~100,000, which is the size of our vocabulary). We also had to do very little manual feature creation since the neural network was able to extract those features for us. Since text have varying length, one might take the average of all word vectors as the input to a classification algorithm to classify whole text documents.
However, even with the above method of averaging word vectors, we are ignoring word order. As a way to summarize bodies of text of varying length, Quoc Le and Tomas Mikolov came up with the Doc2Vec method. This method is almost identical to Word2Vec, except we now generalize the method by adding a paragraph/document vector. Like Word2Vec, there are two methods: Distributed Memory (DM) and Distributed Bag of Words (DBOW). DM attempts to predict a word given its previous words and a paragraph vector. Even though the context window moves across the text, the paragraph vector does not (hence distributed memory) and allows for some word-order to be captured. DBOW predicts a random group of words in a paragraph given only its paragraph vector (see Figure 2).

Figure 2: Architecture for Doc2Vec, taken from Distributed Representations of Sentences and Documents.
Once it has been trained, these paragraph vectors can be fed into a sentiment classifier without the need to aggregate words. This method is currently the state-of-the-art when it comes to sentiment classification on the IMDB movie review data set, achieving only a 7.42% error rate. Of course, none of this is useful if we cannot actually implement them. Luckily, a very-well optimized version of Word2Vec and Doc2Vec is available in gensim, a Python library.
Word2Vec Example in Python
In this section we show how one might use word vectors in a sentiment classification task. The gensim library comes standard with the Anaconda distribution or can be installed using pip. From there you can train word vectors on your own corpus (a dataset of text documents) or import pre-trained vectors from C text or binary format:
from gensim.models.word2vec import Word2Vec

model = Word2Vec.load_word2vec_format('vectors.txt', binary=False) #C text format
model = Word2Vec.load_word2vec_format('vectors.bin', binary=True) #C binary format
I find this especially useful when loading Google’s pre-trained word vectors trained over ~100 billion words from the Google News dataset found in the "Pre-trained word and phrase vectors" section here. Note that the file is ~3.5 GB unzipped. Using the Google word vectors we can see some interesting relationships between words:
from gensim.models.word2vec import Word2Vec

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)

[(u'queen', 0.711819589138031),
(u'monarch', 0.618967592716217),
(u'princess', 0.5902432799339294),
(u'crown_prince', 0.5499461889266968),
(u'prince', 0.5377323031425476)]
What's interesting is that it can find grammatical relationships, for example identifying superlatives or verb stems:
"biggest" - "big" + "small" = "smallest"
model.most_similar(positive=['biggest','small'], negative=['big'], topn=5)

[(u'smallest', 0.6086569428443909),
(u'largest', 0.6007465720176697),
(u'tiny', 0.5387299656867981),
(u'large', 0.456944078207016),
(u'minuscule', 0.43401968479156494)]
"ate" - "eat" + "speak" = "spoke"
model.most_similar(positive=['ate','speak'], negative=['eat'], topn=5)

[(u'spoke', 0.6965223550796509),
(u'speaking', 0.6261293292045593),
(u'conversed', 0.5754593014717102),
(u'spoken', 0.570488452911377),
(u'speaks', 0.5630602240562439)]
It's clear from the above examples that Word2Vec is able to learn non-trivial relationships between words. This is what makes them powerful for many NLP tasks, and in our case sentiment analysis. Before we move on to using them in sentiment analysis, let us first examine Word2Vec's ability to separate and cluster words. We will use three example word sets: food, sports, and weather words taken from a wonderful website called Enchanted Learning. Since these vectors are 300 dimensional, we will use Scikit-Learn's implementation of a dimensionality reduction algorithm called t-SNE in order to visualize them in 2D.
First we have to obtain the word vectors as follows:
import numpy as np

with open('food_words.txt', 'r') as infile:
food_words = infile.readlines()

with open('sports_words.txt', 'r') as infile:
sports_words = infile.readlines()

with open('weather_words.txt', 'r') as infile:
weather_words = infile.readlines()

def getWordVecs(words):
vecs = []
for word in words:
word = word.replace('\n', '')
try:
vecs.append(model[word].reshape((1,300)))
except KeyError:
continue
vecs = np.concatenate(vecs)
return np.array(vecs, dtype='float') #TSNE expects float type values

food_vecs = getWordVecs(food_words)
sports_vecs = getWordVecs(sports_words)
weather_vecs = getWordVecs(weather_words)
We can then use TSNE and matplotlib to visualize the clusters with the following code:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

ts = TSNE(2)
reduced_vecs = ts.fit_transform(np.concatenate((food_vecs, sports_vecs, weather_vecs)))

color points by word group to see if Word2Vec can separate them

for i in range(len(reduced_vecs)):
if i < len(food_vecs):
#food words colored blue
color = 'b'
elif i >= len(food_vecs) and i < (len(food_vecs) + len(sports_vecs)):
#sports words colored red
color = 'r'
else:
#weather words colored green
color = 'g'
plt.plot(reduced_vecs[i,0], reduced_vecs[i,1], marker='o', color=color, markersize=8)
The result is as follows:

Figure 3: T-SNE projected clusters of food words (blue), sports words (red), and weather words (green).
We can see from the above that Word2Vec does a good job of separating unrelated words, as well as clustering together like words.
Analyzing the Sentiment of Emoji Tweets
Now we will move on to an example in sentiment analysis with tweets gathered using emojis as search terms. We use these emojis as "fuzzy" labels for our data; a smiley emoji (:-)) corresponds to positive sentiment, and a frowny (:-() to negative. The data consists of an even split between positive and negative with a total of ~400,000 tweets. We randomly sample positive and negative tweets to construct an 80/20, train/test, split. We then train the Word2Vec model on the train tweets. In order to prevent data leakage from the test set, we do not train Word2Vec on the test set until after our classifier has been fit on the training set. To construct inputs for our classifier, we take the average of all word vectors in a tweet. We will be using Scikit-Learn to do a lot of the machine learning.
First we import our data and train the Word2Vec model.
from sklearn.cross_validation import train_test_split
from gensim.models.word2vec import Word2Vec

with open('twitter_data/pos_tweets.txt', 'r') as infile:
pos_tweets = infile.readlines()

with open('twitter_data/neg_tweets.txt', 'r') as infile:
neg_tweets = infile.readlines()

use 1 for positive sentiment, 0 for negative

y = np.concatenate((np.ones(len(pos_tweets)), np.zeros(len(neg_tweets))))

x_train, x_test, y_train, y_test = train_test_split(np.concatenate((pos_tweets, neg_tweets)), y, test_size=0.2)

Do some very minor text preprocessing

def cleanText(corpus):
corpus = [z.lower().replace('\n','').split() for z in corpus]
return corpus

x_train = cleanText(x_train)
x_test = cleanText(x_test)

n_dim = 300

Initialize model and build vocab

imdb_w2v = Word2Vec(size=n_dim, min_count=10)
imdb_w2v.build_vocab(x_train)

Train the model over train_reviews (this may take several minutes)

imdb_w2v.train(x_train)
Next we have to build word vectors for input text in order to average the value of all word vectors in the tweet using the following function:

Build word vector for training set by using the average value of all word vectors in the tweet, then scale

def buildWordVector(text, size):
vec = np.zeros(size).reshape((1, size))
count = 0.
for word in text:
try:
vec += imdb_w2v[word].reshape((1, size))
count += 1.
except KeyError:
continue
if count != 0:
vec /= count
return vec
Scaling moves our data set is part of the process of standardization where we move our dataset into a gaussian distribution with a mean of zero, meaning that values above the mean will be positive, and those below the mean will be negative. Many ML models require scaled datasets to perform effectively, especially those with many features (like text classifiers).
from sklearn.preprocessing import scale
train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])
train_vecs = scale(train_vecs)

Train word2vec on test tweets

imdb_w2v.train(x_test)
Finally we have to build our test set vectors and scale them for evaluation.

Build test tweet vectors then scale

test_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_test])
test_vecs = scale(test_vecs)
Next we want to validate our classifier by calculating the prediction accuracy on test data, as well as examining its Receiver Operating Characteristic (ROC) curve. ROC curves measure the true-positive rate vs. the false-positive rate of a classifier while adjusting a parameter of the model. In our case, we adjust the cut-off threshold probability for classifying a tweet as positive or negative sentiment. Generally, the larger the Area Under the Curve (AUC), the better our model does at maximizing true positives while minimizing false positives. More on ROC curves can be found here.
To start we'll train our classifier, in this case using Stochastic Gradient Descent for Logistic Regression.

Use classification algorithm (i.e. Stochastic Logistic Regression) on training set, then assess model performance on test set

from sklearn.linear_model import SGDClassifier

lr = SGDClassifier(loss='log', penalty='l1')
lr.fit(train_vecs, y_train)

print 'Test Accuracy: %.2f'%lr.score(test_vecs, y_test)
We'll then create the ROC curve for evaluation using matplotlib and the roc_curve method of Scikit-Learn's metric package.

Create ROC curve

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

pred_probas = lr.predict_proba(test_vecs)[:,1]

fpr,tpr,_ = roc_curve(y_test, pred_probas)
roc_auc = auc(fpr,tpr)
plt.plot(fpr,tpr,label='area = %.2f' %roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc='lower right')

plt.show()
The resulting curve is as follows:

Figure 4: ROC Curve for a logistic classifier on our training data of tweets.
Without any type of feature creation and minimal text preprocessing we can achieve 73% test accuracy using a simple linear model provided by Scikit-Learn. Interestingly, removing punctuation actually causes the accuracy to suffer, suggesting Word2Vec can find interesting features when characters such as "?" and "!" are present. Treating these as individual words, training for longer, doing more preprocessing, and adjusting parameters in both Word2Vec and the classifier could all help in improving accuracy. I have found that using Artificial Neural Networks (ANNs) can improve the accuracy by about 5% when using word vectors. Note that Scikit-Learn does not provide an implementation of ANN classifiers so I used a custom library I created:
from NNet import NeuralNet

nnet = NeuralNet(100, learn_rate=1e-1, penalty=1e-8)
maxiter = 1000
batch = 150
_ = nnet.fit(train_vecs, y_train, fine_tune=False, maxiter=maxiter, SGD=True, batch=batch, rho=0.9)

print 'Test Accuracy: %.2f'%nnet.score(test_vecs, y_test)
The resulting accuracy is 77%. As with any machine learning task, picking the right model is usually more a matter of art than science. If you'd like to use my custom library you can find it on my github. Be warned, it is likely very messy and not regularly maintained! If you would like to contribute please feel free to fork the repository. It could definitely use some TLC!
Using Doc2Vec to Analyze Movie Reviews
Using the averages of word vectors worked fine in the case of tweets. This is because tweets are typically only a few to tens of words in length, which allows us to preserve the relevant features even when averaging. Once we go to the paragraph scale, however, we risk throwing away rich features when we ignore word order and context. In this case it is better to use Doc2Vec to create our input features. As an example we will use the IMDB movie review dataset to test the usefulness of Doc2Vec in sentiment analysis. The data consists of 25,000 positive movie reviews, 25,000 negative, and 50,000 unlabeled reviews. We first train Doc2Vec over the unlabeled reviews. The methodology then identically follows that of the Word2Vec example above, except now we will use both DM and DBOW vectors as inputs by concatenating them.
import gensim

LabeledSentence = gensim.models.doc2vec.LabeledSentence

from sklearn.cross_validation import train_test_split
import numpy as np

with open('IMDB_data/pos.txt','r') as infile:
pos_reviews = infile.readlines()

with open('IMDB_data/neg.txt','r') as infile:
neg_reviews = infile.readlines()

with open('IMDB_data/unsup.txt','r') as infile:
unsup_reviews = infile.readlines()

use 1 for positive sentiment, 0 for negative

y = np.concatenate((np.ones(len(pos_reviews)), np.zeros(len(neg_reviews))))

x_train, x_test, y_train, y_test = train_test_split(np.concatenate((pos_reviews, neg_reviews)), y, test_size=0.2)

Do some very minor text preprocessing

def cleanText(corpus):
punctuation = """.,?!:;(){}[]"""
corpus = [z.lower().replace('\n','') for z in corpus]
corpus = [z.replace('<br />', ' ') for z in corpus]

#treat punctuation as individual words
for c in punctuation:
    corpus = [z.replace(c, ' %s '%c) for z in corpus]
corpus = [z.split() for z in corpus]
return corpus

x_train = cleanText(x_train)
x_test = cleanText(x_test)
unsup_reviews = cleanText(unsup_reviews)

Gensim's Doc2Vec implementation requires each document/paragraph to have a label associated with it.

We do this by using the LabeledSentence method. The format will be "TRAIN_i" or "TEST_i" where "i" is

a dummy index of the review.

def labelizeReviews(reviews, label_type):
labelized = []
for i,v in enumerate(reviews):
label = '%s_%s'%(label_type,i)
labelized.append(LabeledSentence(v, [label]))
return labelized

x_train = labelizeReviews(x_train, 'TRAIN')
x_test = labelizeReviews(x_test, 'TEST')
unsup_reviews = labelizeReviews(unsup_reviews, 'UNSUP')
These create LabeledSentence type objects:
<gensim.models.doc2vec.LabeledSentence at 0xedd70b70>
Next we instantiate our two Doc2Vec models, DM and DBOW. The gensim documentation suggests training over the data multiple times and either adjusting the learning rate or randomizing the order of input at each pass. We then collect the movie review vectors learned by the models.
import random

size = 400

instantiate our DM and DBOW models

model_dm = gensim.models.Doc2Vec(min_count=1, window=10, size=size, sample=1e-3, negative=5, workers=3)
model_dbow = gensim.models.Doc2Vec(min_count=1, window=10, size=size, sample=1e-3, negative=5, dm=0, workers=3)

build vocab over all reviews

model_dm.build_vocab(np.concatenate((x_train, x_test, unsup_reviews)))
model_dbow.build_vocab(np.concatenate((x_train, x_test, unsup_reviews)))

We pass through the data set multiple times, shuffling the training reviews each time to improve accuracy.

all_train_reviews = np.concatenate((x_train, unsup_reviews))
for epoch in range(10):
perm = np.random.permutation(all_train_reviews.shape[0])
model_dm.train(all_train_reviews[perm])
model_dbow.train(all_train_reviews[perm])

Get training set vectors from our models

def getVecs(model, corpus, size):
vecs = [np.array(model[z.labels[0]]).reshape((1, size)) for z in corpus]
return np.concatenate(vecs)

train_vecs_dm = getVecs(model_dm, x_train, size)
train_vecs_dbow = getVecs(model_dbow, x_train, size)

train_vecs = np.hstack((train_vecs_dm, train_vecs_dbow))

train over test set

x_test = np.array(x_test)

for epoch in range(10):
perm = np.random.permutation(x_test.shape[0])
model_dm.train(x_test[perm])
model_dbow.train(x_test[perm])

Construct vectors for test reviews

test_vecs_dm = getVecs(model_dm, x_test, size)
test_vecs_dbow = getVecs(model_dbow, x_test, size)

test_vecs = np.hstack((test_vecs_dm, test_vecs_dbow))
Now we are ready to train a classifier over our review vectors. We will again use sklearn's SGDClassifier.
from sklearn.linear_model import SGDClassifier

lr = SGDClassifier(loss='log', penalty='l1')
lr.fit(train_vecs, y_train)

print 'Test Accuracy: %.2f'%lr.score(test_vecs, y_test)
This model gives us a test accuracy of 0.86. We can also build a ROC curve for this classifier as follows:

Create ROC curve

from sklearn.metrics import roc_curve, auc
%matplotlib inline
import matplotlib.pyplot as plt

pred_probas = lr.predict_proba(test_vecs)[:,1]

fpr,tpr,_ = roc_curve(y_test, pred_probas)
roc_auc = auc(fpr,tpr)
plt.plot(fpr,tpr,label='area = %.2f' %roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc='lower right')

plt.show()

Figure 5: ROC Curve for a logistic classifier on our training data of IMDB movie reviews.
The original paper claimed they saw an improvement when using a 50 node neural network over a simple logistic regression classifier:
from NNet import NeuralNet

nnet = NeuralNet(50, learn_rate=1e-2)
maxiter = 500
batch = 150
_ = nnet.fit(train_vecs, y_train, fine_tune=False, maxiter=maxiter, SGD=True, batch=batch, rho=0.9)

print 'Test Accuracy: %.2f'%nnet.score(test_vecs, y_test)
Interestingly, here we see no such improvement. The test accuracy is 0.85, and we do not approach their claimed test error of 7.42%. This could be for many reasons: we did not train for enough epochs over the training/test data, their implementation of Doc2Vec/ANN is different, their hyperparameters are different, etc. It's hard to know exactly which since the paper does not go into great detail. In any case, we were able to obtain an 86% test accuracy with very little pre-processing and no feature creation/selection. No fancy convolutions or treebanks necessary!
Conclusion
I hope you have seen not only the utility but ease of use for Word2Vec and Doc2Vec using standard tools like Python and gensim. With a very simple algorithm we can gain rich word and paragraph vectors that can be used in all kinds of NLP applications. What's even better is Google's release of their own pre-trained word vectors trained on a much larger data set than anyone else can hope to obtain. If you want to train your own vectors over large data sets there is already an implementation of Word2Vec in Apache Spark's MLlib. Happy NLP'ing!

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 210,914评论 6 490
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 89,935评论 2 383
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 156,531评论 0 345
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,309评论 1 282
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,381评论 5 384
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,730评论 1 289
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,882评论 3 404
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,643评论 0 266
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,095评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,448评论 2 325
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,566评论 1 339
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,253评论 4 328
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,829评论 3 312
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,715评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,945评论 1 264
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,248评论 2 360
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,440评论 2 348

推荐阅读更多精彩内容

  • 白九九墨一阅读 170评论 0 1
  • 未及天亮,便下起了淅淅沥沥的小雨,气温猛然下降了许多,不像前几日那般闷热,然而这种天气也实在不能让人心情愉快。 昨...
    王韬博阅读 188评论 0 0