from snownlp import SnowNLP
import pandas as pd
import numpy as np
traindata=pd.read_csv('/Users/xuyizhou/Desktop/trainData.csv')
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc6 in position 8: inva
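Opened the file in Sublime Text and it showed garbled characters. After fixing that so it was no longer garbled, a different error appeared: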
ParserError: Error tokenizing data. C error: Expected 5 fields in line 17077, saw 7
Caused by \r characters in the data.
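A possible workaround for both errors, as a minimal sketch: decode the raw bytes explicitly, then remove the stray \r before parsing. The gb18030 encoding is an assumption (byte 0xc6 is consistent with a GBK-family file but does not prove it), and decode_and_clean / load_clean_csv are hypothetical helper names. Replacing a bare \r with a space is only one possible choice.

```python
import io


def decode_and_clean(raw: bytes, encoding: str = "gb18030") -> str:
    """Decode raw CSV bytes and neutralise carriage returns.

    The default encoding is an assumption, not something confirmed by the file.
    """
    text = raw.decode(encoding)
    # Normalise Windows line endings first so real row breaks survive,
    # then replace any remaining bare \r (inside a field) with a space.
    return text.replace("\r\n", "\n").replace("\r", " ")


def load_clean_csv(path: str, encoding: str = "gb18030"):
    """Read a CSV through decode_and_clean and hand the text to pandas."""
    import pandas as pd  # imported here so the cleaning step works without pandas

    with open(path, "rb") as f:
        raw = f.read()
    return pd.read_csv(io.StringIO(decode_and_clean(raw, encoding)))
```

If rows still have too many fields after this, the cause is more likely unquoted separators inside the text than \r.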
Try another way:
df=pd.read_xlsx('/Users/xuyizhou/Desktop/trainData.xlsx')
Wrong: pandas has no read_xlsx. The function is read_excel:
df=pd.read_excel('/Users/xuyizhou/Desktop/trainData.xlsx')
Basic operations
df.head()
df.head(1)
df.dtypes
df.index
df.describe()
df.iloc[3:5,1:4]
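To see what these calls return, here is a small sketch on a made-up frame. The column names id, content and score and all the values are hypothetical stand-ins. Only theme and sentiment_word appear in the real data.

```python
import pandas as pd

# Hypothetical 6-row frame with 5 columns, standing in for trainData
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "content": ["a", "b", "c", "d", "e", "f"],
    "theme": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "sentiment_word": ["w1", "w2", "w3", "w4", "w5", "w6"],
    "score": [1, -1, 0, 1, -1, 0],
})

first = df.head(1)        # first row only; df.head() defaults to 5 rows
kinds = df.dtypes         # text columns show up as dtype object
summary = df.describe()   # needs the parentheses; summarises numeric columns
sub = df.iloc[3:5, 1:4]   # rows at positions 3 and 4, columns at positions 1 to 3

print(sub)
```

iloc slices by position and excludes the end point, so 3:5 gives two rows and 1:4 gives three columns.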
NLP
object->string
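e.g.:
import json
data = [{'a': 'A', 'b': (2, 4), 'c': 3.0}]
data_string = json.dumps(data)      # serialize the object to a JSON string
print('ENCODED:', data_string)
decoded = json.loads(data_string)   # parse the string back into Python objects
print('DECODED:', decoded)
# JSON has no tuple type, so the tuple comes back as a list
print('ORIGINAL:', type(data[0]['b']))
print('DECODED :', type(decoded[0]['b']))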
Take content[1] as an example (s is the SnowNLP object built from it):
s.words
Out[68]:
['热水器',
'加',
'热',
'时间',
'太',
'长',
',',
'安装',
'费',
'太',
'贵',
',',
'预留',
'太阳能',
'口',
'摆设',
',',
'根本',
'用',
'不',
'到',
',',
'没有',
'水位',
'指示器',
',',
'加',
'满',
'热水',
'的',
'指示',
'灯',
'放在',
'了',
'最',
'侧面',
',',
'不',
'方便',
'用户',
'看',
'指示',
'灯',
',',
'必须',
'斜',
'着',
'看',
'才',
'能',
'看到',
',']
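The training data uses these label columns:
theme (主题): 加热时间;安装费;用户; (heating time; installation fee; user)
sentiment_word (情感关键词, sentiment keywords): 太长;太贵;不方便; (too long; too expensive; inconvenient)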
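The two columns look like parallel lists separated by ";", so they can be paired up. A minimal sketch, assuming the i-th theme always belongs to the i-th sentiment word (the example row suggests this, but it is not verified for the whole file). parse_pairs is a hypothetical helper name.

```python
def parse_pairs(theme: str, sentiment_word: str, sep: str = ";"):
    """Pair each theme with the sentiment word at the same position."""
    # Drop the trailing separator, then split. Empty entries in the middle
    # are kept so that positions stay aligned between the two columns.
    themes = theme.rstrip(sep).split(sep)
    words = sentiment_word.rstrip(sep).split(sep)
    return list(zip(themes, words))


pairs = parse_pairs("加热时间;安装费;用户;", "太长;太贵;不方便;")
print(pairs)
```

zip stops at the shorter list, so a row with mismatched lengths silently loses entries. Comparing len(themes) and len(words) first would catch that.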
Use a loop over the rows.
Successfully split the words.
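Roughly what such a loop could look like, as a sketch rather than the exact code used here. segment_all and snownlp_words are hypothetical names. The segmenter is passed in as a parameter so the loop can be tried without SnowNLP installed. Casting with str() covers the object -> string step from earlier.

```python
def snownlp_words(text):
    """Segment one string with SnowNLP (imported lazily)."""
    from snownlp import SnowNLP
    return SnowNLP(text).words


def segment_all(texts, segmenter=snownlp_words):
    """Run the segmenter over every entry and collect the word lists."""
    results = []
    for text in texts:
        # Missing entries get an empty word list instead of being segmented
        if text is None or text == "":
            results.append([])
            continue
        # object -> string, so numbers or other objects do not break the segmenter
        results.append(segmenter(str(text)))
    return results
```

With the real data this would be called as segment_all(df['content']), assuming the text column is named content. Note that pandas stores missing cells as NaN rather than None, so a dropna() or fillna('') beforehand would be needed for those.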
...to be continued 1102