Sentiment Analysis: First Attempt

from snownlp import SnowNLP
import pandas as pd
import numpy as np

traindata=pd.read_csv('/Users/xuyizhou/Desktop/trainData.csv')

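Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc6 in position 8: inva
Opening the file in Sublime Text showed garbled characters; after fixing that, the text was no longer garbled.
Error:
ParserError: Error tokenizing data. C error: Expected 5 fields in line 17077, saw 7
The cause was stray \r characters in the file.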
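A minimal sketch of one way to work around both errors, assuming the file is GBK-encoded (the 0xc6 byte hints at this but does not prove it) and that real line endings are \n or \r\n. The helper name load_csv_text and the list of candidate encodings are hypothetical, not part of the original workflow.

```python
def load_csv_text(raw, encodings=("utf-8", "gbk")):
    """Decode raw CSV bytes and strip stray carriage returns.

    Tries each candidate encoding in order (an assumed list) and
    returns the first successful decoding.
    """
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("none of the candidate encodings could decode the data")
    # Normalise Windows line endings first, then drop any lone \r
    # left inside a field (assumes \r is never the real line terminator).
    return text.replace("\r\n", "\n").replace("\r", "")


# Small in-memory sample standing in for the real file
sample = "id,text\r\n1,热水\r器\r\n".encode("gbk")
cleaned = load_csv_text(sample)
```

The cleaned text can then be handed to pandas through a file-like object, for example pd.read_csv(io.StringIO(cleaned)).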

Try another way:

df=pd.read_xlsx('/Users/xuyizhou/Desktop/trainData.xlsx')

Wrong: pandas has no read_xlsx function. The correct call is read_excel:

df=pd.read_excel('/Users/xuyizhou/Desktop/trainData.xlsx')

Basic operations:

df.head()
df.head(1)
df.dtypes
df.index
df.describe()
df.iloc[3:5,1:4]

NLP
Converting an object to a string.

For example, with json:

import json
data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]
# Encode the Python object as a JSON string
data_string = json.dumps(data)
print('ENCODED:', data_string)
# Decode it back; the tuple comes back as a list because JSON only has arrays
decoded = json.loads(data_string)
print('DECODED:', decoded)
print('ORIGINAL:', type(data[0]['b']))
print('DECODED :', type(decoded[0]['b']))

Take content[1] as an example (assuming s = SnowNLP(content[1])):

s.words
Out[68]: 
['热水器',
 '加',
 '热',
 '时间',
 '太',
 '长',
 '，',
 '安装',
 '费',
 '太',
 '贵',
 '，',
 '预留',
 '太阳能',
 '口',
 '摆设',
 '，',
 '根本',
 '用',
 '不',
 '到',
 '，',
 '没有',
 '水位',
 '指示器',
 '，',
 '加',
 '满',
 '热水',
 '的',
 '指示',
 '灯',
 '放在',
 '了',
 '最',
 '侧面',
 '，',
 '不',
 '方便',
 '用户',
 '看',
 '指示',
 '灯',
 '，',
 '必须',
 '斜',
 '着',
 '看',
 '才',
 '能',
 '看到',
 '，']

The training data labels this sentence with the following fields:

theme (主题)                        加热时间;安装费;用户;
sentiment_word (情感关键词)         太长;太贵;不方便;

Use a loop.

The words were split successfully.
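The loop itself is not shown in these notes. As a rough sketch, assuming the label fields are semicolon-separated with a trailing semicolon (as in the sample row) and that the i-th theme lines up with the i-th sentiment word (an assumption the notes do not confirm), the fields could be split like this. split_field is a hypothetical helper name.

```python
def split_field(field):
    """Split a semicolon-separated label field, dropping empty items."""
    items = []
    for item in field.split(";"):
        item = item.strip()
        if item:  # the trailing ';' produces an empty string; skip it
            items.append(item)
    return items


themes = split_field("加热时间;安装费;用户;")
sentiment_words = split_field("太长;太贵;不方便;")

# Pair each theme with the sentiment word at the same position
# (assumed alignment, see the note above)
pairs = list(zip(themes, sentiment_words))
for theme, word in pairs:
    print(theme, word)
```

Note that a label such as 加热时间 spans several of the tokens in s.words, so matching labels to tokens would need the tokens to be rejoined first.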

To be continued (1102)

Last edited: 2017.12.11 07:32:28
