【Record】Trying to use pyPdf to convert a copy-protected PDF to text or HTML

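【Background】

While working on:

【Unsolved】Exporting table data from a copy-protected PDF and converting it to XML

I decided to try pyPdf to convert a copy-protected PDF file to text or HTML.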

 

【Process】

1. Following:

Convert PDF to text with pyPDF and PDFMiner: First Impression | victorwyee

I found:

pyPdf

and downloaded:

pyPdf-1.13.win32.exe

2. But the installer could not find Python:

 

The likely reason:

The Python installed here is the x64 build, and the win32 installer does not recognize it.
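
To confirm which build is actually installed before picking an installer, the interpreter can report its own pointer size. This is only a small sketch; the helper name interpreter_bits is made up for illustration. As far as I know, a win32 .exe installer only looks for a registered 32-bit Python, which would explain why it finds nothing on an x64-only setup.

```python
import struct
import sys


def interpreter_bits():
    """Return 32 or 64, based on the pointer size of the running Python."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8


# Works on both Python 2 and Python 3.
print("Python %d.%d, %d-bit" % (sys.version_info[0], sys.version_info[1], interpreter_bits()))
```

If this reports 64-bit, installing from the source .zip with setup.py avoids the installer's lookup altogether.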

3. So I downloaded the source package instead:

pyPdf-1.13.zip

then unpacked and installed it:

D:\tmp\dev_tools\python\pdf\pyPdf-1.13\pyPdf-1.13>python setup.py install
running install
running build
running build_py
creating build
creating build\lib
creating build\lib\pyPdf
copying pyPdf\filters.py -> build\lib\pyPdf
copying pyPdf\generic.py -> build\lib\pyPdf
copying pyPdf\pdf.py -> build\lib\pyPdf
copying pyPdf\utils.py -> build\lib\pyPdf
copying pyPdf\xmp.py -> build\lib\pyPdf
copying pyPdf\__init__.py -> build\lib\pyPdf
running install_lib
creating D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\filters.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\generic.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\pdf.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\utils.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\xmp.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\__init__.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\filters.py to filters.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\generic.py to generic.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\pdf.py to pdf.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\utils.py to utils.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\xmp.py to xmp.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\__init__.py to __init__.pyc
running install_egg_info
Writing D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf-1.13-py2.7.egg-info

Then I tried it out:

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
[Unsolved] Export table data from a copy-protected PDF and convert it to XML
http://www.crifan.com/non_copy_pdf_table_data_export_to_xml

Author:     Crifan Li
Version:    2014-01-26
Contact:    http://www.crifan.com/about/me
"""

import os
import glob
from pyPdf import PdfFileReader

def pdf_table_to_xml():
    """Operate PDF file, extract table data, save to xml"""
    parent = "D:/tmp/tmp_dev_root/python/answer_question/self/pdf_table_to_xml/pdf"
    os.chdir(parent)
    pdfFilename = "spec183r21.0.pdf";
    filename = os.path.abspath(pdfFilename)

    input = PdfFileReader(file(filename, "rb"))
    for page in input.pages:
        print page.extractText()

if __name__ == "__main__":
    pdf_table_to_xml();

Running it failed with an error saying the file has not been decrypted:

D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml>pdf_table_to_xml.py
Traceback (most recent call last):
  File "D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf_table_to_xml.py", line 29, in <module>
    pdf_table_to_xml();
  File "D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf_table_to_xml.py", line 25, in pdf_table_to_xml
    for page in input.pages:
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\utils.py", line 78, in __getitem__
    len_self = len(self)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\utils.py", line 73, in __len__
    return self.lengthFunction()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 431, in getNumPages
    self._flatten()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 596, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\generic.py", line 480, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\generic.py", line 165, in getObject
    return self.pdf.getObject(self).getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 655, in getObject
    raise Exception, "file has not been decrypted"
Exception: file has not been decrypted

4. Then I tried to solve the problem above:

I found no solution.

Among the search results:

How can I read a pdf web page? | DaniWeb

the poster says the code works fine for other PDFs, so the bug was simply ignored...
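
One thing not tried above: pyPdf's PdfFileReader has an isEncrypted attribute and a decrypt(password) method, and decrypt() returns 0 when the password is rejected. The following is a minimal sketch, not a verified fix. It assumes the PDF only restricts copying and therefore has an empty user password, which has not been confirmed for spec183r21.0.pdf. The function names extract_pages_text and open_and_extract are hypothetical. pyPdf may also raise NotImplementedError if the file uses an encryption algorithm it does not support.

```python
def extract_pages_text(reader, password=""):
    """Return a list with the extracted text of every page.

    `reader` is expected to behave like pyPdf's PdfFileReader: it has an
    `isEncrypted` attribute, a `decrypt(password)` method and `pages`.
    """
    if reader.isEncrypted:
        # Assumption: the file only restricts copying, so the user
        # password is empty. decrypt() returns 0 on a wrong password.
        if reader.decrypt(password) == 0:
            raise ValueError("password rejected, cannot decrypt the PDF")
    return [page.extractText() for page in reader.pages]


def open_and_extract(pdf_path, password=""):
    """Hypothetical wrapper: open pdf_path with pyPdf and extract its text."""
    # Imported here so extract_pages_text has no hard dependency on pyPdf.
    from pyPdf import PdfFileReader
    with open(pdf_path, "rb") as pdf_file:
        return extract_pages_text(PdfFileReader(pdf_file), password)
```

For the file used here the call would be open_and_extract("spec183r21.0.pdf"). Even if decryption succeeds, extractText() may still return little usable text, depending on how the PDF stores its content.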

 

【Summary】

For now, pyPdf could not convert the copy-protected PDF above into the desired text or HTML.



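3 Thoughts on “【Record】Trying to use pyPdf to convert a copy-protected PDF to text or HTML”

  1. Actually, PyPDF2 can easily decrypt this kind of protected PDF.
    I don't know whether PyPDF can.

  2. If I need to process a large number of online PDFs and don't want to download them one by one first, is there a good way to do it?

  3. The file is encrypted... and password-encrypted at that, with content extraction not allowed. What can be done about this? Is decryption really that easy?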
