- Parsing Python data types with eval
- Some data looks like JSON at first glance but is not. Parsing it with json raises an error, yet eval can parse it.
s = '{"a":None,"b":[1,2,3],2:"jk"}'
eval(s)
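Because eval executes whatever expression it is given, it is risky on text from an untrusted source. As a hedged alternative (not part of the original note), the standard library's ast.literal_eval parses the same kind of string but only accepts plain literals. The sketch below reuses the sample string from above:

```python
import ast

# Same sample as above: a Python literal, not valid JSON
# (None instead of null, and an integer key).
s = '{"a":None,"b":[1,2,3],2:"jk"}'

# literal_eval only accepts literals: dicts, lists, strings, numbers, None, ...
data = ast.literal_eval(s)
print(data)

# Anything beyond a plain literal, such as a function call, is rejected.
try:
    ast.literal_eval('__import__("os").getcwd()')
except ValueError as exc:
    print("rejected:", exc)
```

If the input is fully under your control, eval works as shown above. literal_eval is simply the more cautious choice.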
- Extracting all IP addresses from text with a regex
import re
s = '<a>67.17.12.56sjao22&k89.121.45.200.1s<div>111.0.89.12</div>'
re.findall(r'\D(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\D',s)
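The pattern above needs a non-digit character on both sides, so it misses an address at the very start or end of the text, and it accepts out-of-range octets such as 999. A minimal sketch of one possible refinement follows, assuming lookarounds plus the standard ipaddress module for validation. The names pattern and find_ipv4 are made up for illustration:

```python
import ipaddress
import re

s = '<a>67.17.12.56sjao22&k89.121.45.200.1s<div>111.0.89.12</div>'

# Lookarounds instead of \D: a candidate must not be glued to another digit,
# but it may sit at the very start or end of the text.
pattern = re.compile(r'(?<!\d)\d{1,3}(?:\.\d{1,3}){3}(?!\d)')

def find_ipv4(text):
    """Return dotted-quad candidates that ipaddress accepts as valid IPv4."""
    found = []
    for candidate in pattern.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # e.g. an octet above 255
        found.append(candidate)
    return found

print(find_ipv4(s))
print(find_ipv4('999.1.1.1 and 10.0.0.1'))
```

On the sample string this returns the same three addresses as the original pattern. The second call shows the difference: 999.1.1.1 is dropped and the trailing 10.0.0.1 is kept.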
- Extracting all URLs from text with a regex
- The HTML below is a fragment of the Taobao home page.
html = '''
<li data-closeper="" aria-label="查看更多" role="menuitem" aria-haspopup="true" data-groupid="107" class="J_Cat a-all">
<a href="https://www.taobao.com/markets/coolcity/coolcityHome" data-cid="1" data-dataid="222880">运动</a> /
<a href="https://www.taobao.com/markets/coolcity/coolcityHome" data-cid="1" data-dataid="222913">户外</a> /
<a href="https://www.taobao.com/markets/amusement/home" data-cid="1" data-dataid="222910">乐器</a>
</span>
<i aria-hidden="true" class="tb-ifont service-arrow"></i>
</li>
<li data-closeper="" aria-label="查看更多" role="menuitem" aria-haspopup="true" data-groupid="108" class="J_Cat a-all">
<a href="https://s.taobao.com/search?q=%E6%B8%B8%E6%88%8F&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306" data-cid="1" data-dataid="222882">游戏</a> /
<a href="https://s.taobao.com/search?q=%E5%8A%A8%E6%BC%AB&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20181010&ie=utf8" data-cid="1" data-dataid="222883">动漫</a> /
<a href="https://www.taobao.com/markets/acg/yingshi" data-cid="1" data-dataid="222921">影视</a>
</span>
<i aria-hidden="true" class="tb-ifont service-arrow"></i>
</li>
<li data-closeper="" aria-label="查看更多" role="menuitem" aria-haspopup="true" data-groupid="109" class="J_Cat a-all">
<a href="https://s.taobao.com/search?q=%E7%BE%8E%E9%A3%9F&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20180724&ie=utf8" data-cid="1" data-dataid="222899">美食</a> /
<a href="https://s.taobao.com/search?ie=utf8&initiative_id=staobaoz_20180724&stats_click=search_radio_all%3A1&js=1&imgfile=&q=%E7%94%9F%E9%B2%9C&suggest=history_1&_input_charset=utf-8&wq=%E7%94%9F%E9%B2%9C&suggest_query=%E7%94%9F%E9%B2%9C&source=suggest" data-cid="1" data-dataid="222905">生鲜</a> /
<a href="https://s.taobao.com/search?q=%E9%9B%B6%E9%A3%9F&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306" data-cid="1" data-dataid="222881">零食</a>
</span>
<i aria-hidden="true" class="tb-ifont service-arrow"></i>
</li>
<li data-closeper="" aria-label="查看更多" role="menuitem" aria-haspopup="true" data-groupid="110" class="J_Cat a-all">
<a href="https://s.taobao.com/search?q=%E5%9B%AD%E8%89%BA&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170419" data-cid="1" data-dataid="222911">鲜花</a> /
<a href="https://s.taobao.com/search?ie=utf8&initiative_id=staobaoz_20170419&stats_click=search_radio_all%3A1&js=1&imgfile=&q=%E8%BF%9B%E5%8F%A3%E7%8B%97%E7%B2%AE&suggest=history_3&_input_charset=utf-8&wq=&suggest_query=&source=suggest" data-cid="1" data-dataid="222894">宠物</a> /
<a href="https://s.taobao.com/search?q=%E5%86%9C%E8%B5%84&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170221" data-cid="1" data-dataid="222920">农资</a>
</span>
<i aria-hidden="true" class="tb-ifont service-arrow"></i>
</li>
<li data-closeper="" aria-label="查看更多" role="menuitem" aria-haspopup="true" data-groupid="111" class="J_Cat a-all">
<a href="https://wujin.taobao.com/?spm=a21bo.2017.201867-links-10.1.5af911d97XswKm" data-cid="1" data-dataid="222914">工具</a> /
<a href="https://s.taobao.com/list?spm=a21bo.50862.201867-links-10.27.iQWRJS&source=youjia&cat=50097129" data-cid="1" data-dataid="222877">装修</a> /
<a href="https://www.jiyoujia.com/markets/youjia/zhuangxiucailiao" data-cid="1" data-dataid="222919">建材</a>
</span>
<i aria-hidden="true" class="tb-ifont service-arrow"></i>
</li>
<li data-closeper="" aria-label="查看更多" role="menuitem" aria-haspopup="true" data-groupid="112" class="J_Cat a-all">
<a href="https://s.taobao.com/list?spm=a21bo.7932212.202572.1.rtUtMQ&source=youjia&q=%E5%AE%B6%E5%85%B7" data-cid="1" data-dataid="222915">家具</a> /
<a href="https://s.taobao.com/list?source=youjia&cat=50065206%2C50065205" data-cid="1" data-dataid="222922">家饰</a> /
<a href="https://s.taobao.com/list?spm=a21bo.50862.201867-links-11.80.K6jN68&source=youjia&cat=50008163&bcoffset=0&s=240" data-cid="1" data-dataid="222884">家纺</a>
</span>'''
import re
re.findall(r'https?://[a-zA-Z0-9_\./\?=&%\-]+',html)
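- The result of this pattern is reasonably good. No single pattern can match every possible URL, so adjust the character class to suit your data; if you know how to train a model for the task, that works too.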
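When the URLs of interest are specifically link targets in HTML, reading the href attributes with a parser avoids having to guess which characters a URL may contain. This is a sketch of that different technique, using the standard library's html.parser rather than a regex. HrefCollector and sample are hypothetical names, and sample is a shortened two-link excerpt built from URLs in the fragment above:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

sample = (
    '<li><a href="https://www.taobao.com/markets/acg/yingshi" data-cid="1">影视</a> / '
    '<a href="https://s.taobao.com/list?source=youjia&cat=50065206%2C50065205">家饰</a></li>'
)

collector = HrefCollector()
collector.feed(sample)
print(collector.hrefs)
```

Note that the parser decodes character references such as `&amp;` inside attribute values, so the collected strings can differ slightly from the raw markup.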
- Extracting all Chinese characters from text with a regex
s = 'nihao27919&阿尔法狗**【】‘’ssuajk^&*!@@#{}||请你说中文'
re.findall("[\u4e00-\u9fa5]",s)
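The pattern above returns one list element per character. If whole runs of Chinese text are more useful, adding a `+` quantifier groups consecutive characters together. A small sketch on the same sample string (chars and words are illustrative names):

```python
import re

s = 'nihao27919&阿尔法狗**【】‘’ssuajk^&*!@@#{}||请你说中文'

# One element per character, as in the snippet above.
chars = re.findall("[\u4e00-\u9fa5]", s)

# One element per consecutive run of characters in the same range.
words = re.findall("[\u4e00-\u9fa5]+", s)

print(chars)
print(words)  # ['阿尔法狗', '请你说中文']
```

The range \u4e00-\u9fa5 covers the common CJK ideographs only. Full-width punctuation such as 【】 lies outside it and is therefore not matched.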
- Matching the non-Chinese characters in text with a regex
s = 'nihao27919&阿尔法狗**【】‘’ssuajk^&*!@@#{}||请你说中文'
re.findall("[^\u4e00-\u9fa5]",s)
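The negated class is also handy with re.sub, for example to strip everything that is not a Chinese character. A brief sketch under the same assumptions as the snippet above (runs and only_chinese are illustrative names):

```python
import re

s = 'nihao27919&阿尔法狗**【】‘’ssuajk^&*!@@#{}||请你说中文'

# Consecutive runs of characters outside the \u4e00-\u9fa5 range.
runs = re.findall("[^\u4e00-\u9fa5]+", s)

# Delete those runs, keeping only the Chinese characters.
only_chinese = re.sub("[^\u4e00-\u9fa5]+", "", s)

print(runs)
print(only_chinese)  # 阿尔法狗请你说中文
```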
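- Converting Unicode escapes that carry an extra "\"
- If you come across text like \\u4f60\\u597d\\u5417, Python cannot decode it by simply replacing \\ with \, since \uXXXX escapes are only interpreted in string literals in source code, not in text already held in a string. You are welcome to try it yourself.
- The json library, however, parses it with no trouble.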
import json
s = '\\u4f60\\u597d\\u5417'
json.loads('"{}"'.format(s))
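To make the point concrete, the sketch below first shows that the replace attempt changes nothing, then decodes the string two ways: with json.loads as above, and with the standard unicode_escape codec as an alternative that is not in the original note. It assumes the text contains only ASCII characters and escape sequences:

```python
import json

s = '\\u4f60\\u597d\\u5417'

# The string holds a literal backslash followed by "u4f60" and so on:
# 18 characters in total, none of them Chinese.
print(len(s))  # 18

# There is no doubled backslash inside the string, so replace() is a no-op.
unchanged = s.replace('\\\\', '\\')
print(unchanged == s)  # True

# Approach from above: wrap the text in quotes and let the JSON parser decode it.
via_json = json.loads('"{}"'.format(s))

# Alternative: the unicode_escape codec (assumes ASCII-only input).
via_codec = s.encode('ascii').decode('unicode_escape')

print(via_json, via_codec)
```

The json approach fails if the text itself contains an unescaped double quote, so which of the two fits better depends on where the data comes from.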