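The previous article (//www.greatytc.com/p/c35ed87ebb30) described how to use the pymupdf and pillow modules to convert an A3-size PDF page into two A4-size pages, which basically met my needs. The result was still not ideal, though. Converting a PDF page to an image rasterizes it at a fixed resolution (and may compress it as well), which inevitably reduces sharpness. A comparison at 4x magnification is shown below; the converted image is visibly blurry when enlarged:
The ideal approach is to crop the original file directly, the way the "A-PDF Page Cut" software does. I have always believed that, for most people, whatever you can think of has very likely already been implemented by someone. In that spirit I kept digging into what Python can do, and the effort paid off: I found the PyPDF2 module. Compared with pymupdf, PyPDF2 is very simple and implements only four classes itself, yet it can crop PDF files directly; small but capable. One thing to note is that this module's coordinate system differs from JavaScript's: its origin is the lower-left corner of the page, whereas in JS the origin is the upper-left corner.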
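To make the coordinate difference concrete, here is a minimal sketch that converts a rectangle given in top-left-origin coordinates into the bottom-left-origin coordinates that PDF page boxes use. The helper name `top_left_to_pdf` is made up for illustration and is not part of PyPDF2.

```python
def top_left_to_pdf(rect, page_height):
    """Convert (x0, y0, x1, y1) measured from the top-left corner
    into PDF coordinates measured from the bottom-left corner.

    Hypothetical helper for illustration; not a PyPDF2 function.
    """
    x0, y0, x1, y1 = rect
    # x is unchanged; y is mirrored, so the old bottom edge (y1)
    # becomes the new lower y value.
    return (x0, page_height - y1, x1, page_height - y0)


# A3 landscape is roughly 1190.55 x 841.89 points (1 point = 1/72 inch).
A3_WIDTH, A3_HEIGHT = 1190.55, 841.89

# Top half of the page, described screen-style (y grows downward).
top_half_screen = (0, 0, A3_WIDTH, A3_HEIGHT / 2)
top_half_pdf = top_left_to_pdf(top_half_screen, A3_HEIGHT)
print(top_half_pdf)
```

For a left/right split the mirroring of y makes no difference, because both halves span the full page height. It only matters when cutting a page into top and bottom parts.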
The implementation is as follows:
# -*- coding: UTF-8 -*-
# Written against the legacy PyPDF2 API (PdfFileReader / PdfFileWriter),
# i.e. PyPDF2 releases before 3.0.
from pathlib import Path

from PyPDF2 import PdfFileReader, PdfFileWriter


def write_half(stream, page_index, lower_left, upper_right, out_file):
    """Write one page, cropped to the given rectangle, to out_file."""
    # Re-read the document so that each half starts from an unmodified page.
    reader = PdfFileReader(stream)
    page = reader.getPage(page_index)
    page.mediaBox.lowerLeft = lower_left
    page.mediaBox.upperRight = upper_right
    writer = PdfFileWriter()
    writer.addPage(page)
    with open(out_file, 'wb') as outfile:
        writer.write(outfile)


def split_pdf(infile, out_path):
    """Split every page of infile into a left and a right half."""
    infile = Path(infile)
    out_path = Path(out_path)
    out_path.mkdir(parents=True, exist_ok=True)
    # Use a separate name for the file object so the path is not shadowed.
    with open(infile, 'rb') as stream:
        reader = PdfFileReader(stream)
        for i in range(reader.getNumPages()):
            box = reader.getPage(i).mediaBox
            # PDF coordinates: origin at the lower-left corner, y grows upward.
            width = float(box.getWidth())
            height = float(box.getHeight())
            prefix = f'{infile.stem}{i + 1}'
            # Left half
            write_half(stream, i, (0, 0), (width / 2, height),
                       out_path / f'{prefix}_left.pdf')
            # Right half
            write_half(stream, i, (width / 2, 0), (width, height),
                       out_path / f'{prefix}_right.pdf')


if __name__ == '__main__':
    # Split every PDF in the current directory into ./Single/
    for pdf_file in Path.cwd().glob('*.pdf'):
        split_pdf(pdf_file, './Single/')
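The result of the new method at 4x magnification is shown in the figure:
With the new method the text is exactly as sharp as in the original. At this point the problem can be considered perfectly solved.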
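One caveat: the script assumes that every page's MediaBox starts at (0, 0). That holds for most files, but the PDF format does not guarantee it. The sketch below shows how the two halves could be computed from an arbitrary box. The function name `split_box_left_right` is hypothetical, and it works on plain tuples rather than PyPDF2 objects.

```python
def split_box_left_right(box):
    """Split a page box (x0, y0, x1, y1) in PDF coordinates
    (origin at the bottom-left) into a left half and a right half.

    Hypothetical helper for illustration; it uses plain tuples,
    not PyPDF2's rectangle objects.
    """
    x0, y0, x1, y1 = box
    mid_x = (x0 + x1) / 2  # vertical cut line, valid even when x0 != 0
    left = (x0, y0, mid_x, y1)
    right = (mid_x, y0, x1, y1)
    return left, right


# A box whose lower-left corner is offset from the origin.
left, right = split_box_left_right((10, 20, 1210, 870))
print(left)   # (10, 20, 610.0, 870)
print(right)  # (610.0, 20, 1210, 870)
```

In the script, the first two values of each tuple would be assigned to `mediaBox.lowerLeft` and the last two to `mediaBox.upperRight`.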
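The previous solution is not entirely useless, though. If you do not want others to copy the text out of a file, or at least want to make extraction harder, you can convert the PDF pages to images and then print them back to PDF. Anyone who wants the content would then have to rely on OCR, so the file's confidentiality improves a little.
Next I plan to study this module's source code to see whether it can be ported to C++ and compiled into an exe; the huge size of a packaged Python program really hinders distribution.
Addendum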
1. In some spare time I reread the pymupdf documentation more carefully and got quite a jolt; I am rather ashamed of not having read it properly the first time. A lossless conversion approach was right there all along! The source is as follows:
"""
Create a PDF copy with split-up pages (posterize)
---------------------------------------------------
License: GNU AFFERO GPL V3
(c) 2018 Jorj X. McKie

Usage
------
python posterize.py input.pdf

Result
-------
A file "poster-input.pdf" with 4 output pages for every input page.

Notes
-----
(1) Output file is chosen to have page dimensions of 1/4 of input.

(2) Easily adapt the example to make n pages per input, or decide per each
    input page or whatever.

Dependencies
------------
PyMuPDF 1.12.2 or later
"""
import fitz, sys

infile = sys.argv[1]  # input file name
src = fitz.open(infile)
doc = fitz.open()  # empty output PDF

for spage in src:  # for each page in input
    r = spage.rect  # input page rectangle
    d = fitz.Rect(spage.cropbox_position,  # CropBox displacement if not
                  spage.cropbox_position)  # starting at (0, 0)
    #--------------------------------------------------------------------------
    # example: cut input page into 2 x 2 parts
    #--------------------------------------------------------------------------
    r1 = r / 2  # top left rect
    r2 = r1 + (r1.width, 0, r1.width, 0)  # top right rect
    r3 = r1 + (0, r1.height, 0, r1.height)  # bottom left rect
    r4 = fitz.Rect(r1.br, r.br)  # bottom right rect
    rect_list = [r1, r2, r3, r4]  # put them in a list

    for rx in rect_list:  # run thru rect list
        rx += d  # add the CropBox displacement
        page = doc.new_page(-1,  # new output page with rx dimensions
                            width = rx.width,
                            height = rx.height)
        page.show_pdf_page(
            page.rect,  # fill all new page with the image
            src,  # input document
            spage.number,  # input page number
            clip = rx,  # which part to use of input page
        )

# that's it, save output file
doc.save("poster-" + src.name,
         garbage=3,  # eliminate duplicate objects
         deflate=True,  # compress stuff where possible
)
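The rectangle arithmetic is the part of this example most worth adapting. As a rough sketch of the "n pages per input" idea mentioned in its notes, the pure-Python function below cuts a page rectangle into `cols` x `rows` tiles. The name `grid_rects` is hypothetical, and it works on plain tuples instead of `fitz.Rect`. For the A3-to-A4 case in this article, `cols=2, rows=1` would be enough.

```python
def grid_rects(page_rect, cols, rows):
    """Cut (x0, y0, x1, y1) into cols x rows equal tiles.

    Tiles are returned row by row, left to right, using the
    top-left-origin convention of PyMuPDF page rectangles.
    Hypothetical helper for illustration; not a PyMuPDF function.
    """
    x0, y0, x1, y1 = page_rect
    tile_w = (x1 - x0) / cols
    tile_h = (y1 - y0) / rows
    tiles = []
    for row in range(rows):
        for col in range(cols):
            left = x0 + col * tile_w
            top = y0 + row * tile_h
            tiles.append((left, top, left + tile_w, top + tile_h))
    return tiles


# An A3 landscape page (rounded to whole points) cut into two A4 portrait tiles.
for tile in grid_rects((0, 0, 1190, 842), cols=2, rows=1):
    print(tile)
```

Each tuple could then be turned into a rectangle with `fitz.Rect(*tile)` and used in place of the hand-built `r1` to `r4` in the loop.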
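2. There is also a simpler option
This approach suits beginners with no programming experience. Download and install the mupdf software (it is free), then enter the following at the command line:
mutool poster [-x |-y] [number] input.pdf output.pdf
This quickly splits the given file. The parameters mean:
- -x: split evenly along the horizontal direction
- -y: split evenly along the vertical direction
- number: number of pieces
- input.pdf: name of the original file
- output.pdf: name of the split file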
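For batch use, the command can also be assembled from Python. This is only a sketch: the helper name `build_poster_command` is hypothetical, and it assumes, as described in the list above, that `-x` and `-y` are each followed by the number of pieces in that direction. It builds the argument list without running anything.

```python
import shlex


def build_poster_command(input_pdf, output_pdf, x=1, y=1):
    """Build the argument list for `mutool poster`.

    x: number of pieces across (horizontal direction)
    y: number of pieces down (vertical direction)
    Hypothetical helper; it only assembles the command and does not run it.
    """
    if x < 1 or y < 1:
        raise ValueError("x and y must be positive integers")
    cmd = ["mutool", "poster"]
    if x > 1:
        cmd += ["-x", str(x)]
    if y > 1:
        cmd += ["-y", str(y)]
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


# Cut each A3 landscape page into two side-by-side halves.
cmd = build_poster_command("input.pdf", "output.pdf", x=2)
print(shlex.join(cmd))

# To actually run it (requires mutool on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.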