比如下面一段html想提取出图片路径与影片名称,主演,上映时间通过一行正则表达式怎么提取.
<dd>
<i class="board-index board-index-1">1</i>
<a href="//www.greatytc.com/films/1203" title="霸王别姬" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c" alt="霸王别姬" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="//www.greatytc.com/films/1203" title="霸王别姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王别姬</a></p>
<p class="star">
主演:张国荣,张丰毅,巩俐
</p>
<p class="releasetime">上映时间:1993-01-01</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</div>
</div>
</div>
</dd>
<dd>
<i class="board-index board-index-2">2</i>
<a href="/films/1297" title="肖申克的救赎" class="image-link" data-act="boarditem-click" data-val="{movieId:1297}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg@160w_220h_1e_1c" alt="肖申克的救赎" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/1297" title="肖申克的救赎" data-act="boarditem-click" data-val="{movieId:1297}">肖申克的救赎</a></p>
<p class="star">
主演:蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿
</p>
<p class="releasetime">上映时间:1994-09-10(加拿大)</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</div>
</div>
</div>
</dd>
提取图片路径,
<dd>.*? src="(.*?)".*?</dd>
re模块默认的是贪婪模式,适当的加上?可以防止过多的匹配.
提取图片路径与影片名称
<dd>.*? src="(.*?)".*?boarditem-click.*?>(.*?)</a>.*?</dd>
我匹配影片名称是通过<p class="name"><a href="//www.greatytc.com/films/1203" title="霸王别姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王别姬</a></p>
这一行来匹配的.
因为p标签很多,为了能定位到这一行,加上了boarditem-click.*?>
,这样就能找到离boarditem-click
最近的>
然后就能匹配到名称了.
主演与上映时间同理可得
<dd>.*?<img src="(.*?)".*?"boarditem-click".*?>(.*?)</a>.*?"star".*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?</dd>
最后程序
import re
test_str = '''
<dd>
<i class="board-index board-index-1">1</i>
<a href="//www.greatytc.com/films/1203" title="霸王别姬" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c" alt="霸王别姬" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="//www.greatytc.com/films/1203" title="霸王别姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王别姬</a></p>
<p class="star">
主演:张国荣,张丰毅,巩俐
</p>
<p class="releasetime">上映时间:1993-01-01</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</div>
</div>
</div>
</dd>
<dd>
<i class="board-index board-index-2">2</i>
<a href="/films/1297" title="肖申克的救赎" class="image-link" data-act="boarditem-click" data-val="{movieId:1297}">
<img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg@160w_220h_1e_1c" alt="肖申克的救赎" class="board-img" />
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/1297" title="肖申克的救赎" data-act="boarditem-click" data-val="{movieId:1297}">肖申克的救赎</a></p>
<p class="star">
主演:蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿
</p>
<p class="releasetime">上映时间:1994-09-10(加拿大)</p> </div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</div>
</div>
</div>
</dd>
'''
result = re.findall('<dd>.*?<img src="(.*?)".*?"boarditem-click".*?>(.*?)</a>.*?"star".*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?</dd>',test_str,re.S)
for item in result:
print('图片路径:{}'.format(item[0]))
print('影片名称:{}'.format(item[1]))
print('主演:{}'.format(re.sub('\s|主演:','',item[2])))
print('上映时间:{}'.format(item[3].replace('上映时间:','')))
print('\n')
结果
图片路径://s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png
影片名称:霸王别姬
主演:张国荣,张丰毅,巩俐
上映时间:1993-01-01
图片路径://s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png
影片名称:肖申克的救赎
主演:蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿
上映时间:1994-09-10(加拿大)