1. Download the pdfminer library (pdfminer.six)
git clone https://github.com/pdfminer/pdfminer.six.git
2. Parse the PDF
from pdfminer.high_level import extract_pages

infile = 'test.pdf'
# Walk each page's layout and print every element's type and left x-coordinate.
for page_layout in extract_pages(infile):
    for element in page_layout:
        print(type(element), element.x0)
3. Extract specific content by font name and font size
# Example: extract section headings
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextBoxHorizontal

infile = 'test.pdf'
ele_set = set()
for page_layout in extract_pages(infile):
    for element in page_layout:
        if isinstance(element, LTTextBoxHorizontal):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        # To inspect a document's fonts, uncomment:
                        # print(character.get_text(), character.fontname, character.size)
                        if character.fontname == 'ATDGLO+CheltenhamITCbyBT-Bold' and round(character.size, 1) == 10.0:
                            # Alternatively, filter on the text box position:
                            # if element.x0 == 40.8189:
                            #     print(element.get_text(), element.x0)
                            ele_set.add(element.get_text().strip())
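
The font name and size used in the condition have to be known in advance. One way to find them, sketched here on the assumption that headings use a font the body text does not, is to count every (font name, size) pair in the file and inspect the candidates. `iter_font_keys` and `rank_fonts` are hypothetical helper names and are not part of pdfminer.

```python
from collections import Counter


def iter_font_keys(pdf_path):
    """Yield a (fontname, rounded size) pair for every character in the PDF."""
    # Imported lazily so that rank_fonts() can be used without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            # Skip figures, lines, rectangles and other non-text elements.
            if not isinstance(element, LTTextContainer):
                continue
            for text_line in element:
                if not isinstance(text_line, LTTextContainer):
                    continue
                for character in text_line:
                    if isinstance(character, LTChar):
                        yield character.fontname, round(character.size, 1)


def rank_fonts(font_keys, top=10):
    """Return the `top` most frequent (fontname, size) pairs with their counts."""
    return Counter(font_keys).most_common(top)
```

Calling `rank_fonts(iter_font_keys('test.pdf'))` lists the most frequent pairs. The body font usually dominates, so a heading font is more likely to be among the less frequent entries.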
4. Result
print(ele_set)
{'Acknowledgment',
'Air temperature mapping techniques',
'Introduction',
'References\n[1]',
'Regression techniques',
'Simulation techniques',
'Summary and conclusions'}
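
The 'References\n[1]' entry shows that a text box can contain more than the heading line, because the whole box is stored once any of its characters matches. A minimal cleanup sketch follows. It assumes the heading is always the first non-empty line of the box, and `first_lines` is a hypothetical helper name.

```python
def first_lines(text_blocks):
    """Keep only the first non-empty line of each extracted text block."""
    headings = set()
    for block in text_blocks:
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            headings.add(lines[0])
    return sorted(headings)


sample = {'Acknowledgment', 'Introduction', 'References\n[1]'}
print(first_lines(sample))  # ['Acknowledgment', 'Introduction', 'References']
```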
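In short, the library still has many limitations. For academic papers at least, layout and typesetting vary from one paper to the next, so a rule written for one PDF, such as a fixed font name and size, parses few others.
For PDFs with a fairly fixed format, this package gives very good results.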
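
One thing that can make a hard-coded font name less portable is the subset tag. When a PDF embeds a font subset, the name gets a prefix of six uppercase letters and a plus sign ('ATDGLO+' in the example), and that prefix can differ between files that use the same font. The sketch below strips the tag before comparing. `base_font_name`, `matches_font` and the `tolerance` parameter are illustrative assumptions, not pdfminer API.

```python
import re

# A subset tag is six uppercase letters followed by "+", e.g. "ATDGLO+".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def base_font_name(fontname):
    """Return the font name without a leading subset tag, if one is present."""
    return _SUBSET_TAG.sub("", fontname)


def matches_font(fontname, size, wanted_name, wanted_size, tolerance=0.05):
    """Compare the tag-free font name and the size within a small tolerance."""
    return (base_font_name(fontname) == wanted_name
            and abs(size - wanted_size) <= tolerance)
```

In the extraction loop, the condition could then read `matches_font(character.fontname, character.size, 'CheltenhamITCbyBT-Bold', 10.0)`. This relaxes only the name prefix. Papers set in a different typeface still need their own font name and size.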