【Record】Trying to use xpdf to convert a copy-protected PDF to text or HTML

【Background】

While working on:

【Unsolved】Export the table data from a copy-protected PDF and convert it to XML

I decided to try xpdf to convert a copy-protected PDF file to text or HTML.

 

【Process】

1. Following:

PDFTOHTML conversion program

I went to:

xpdf 2.02

->

http://www.foolabs.com/xpdf/

->

Xpdf: Download

->

xpdfbin-win-3.03.zip

->

http://gd.tuwien.ac.at/publishing/xpdf/

->

xpdfbin-win-3.03.zip

Then, in the directory:

D:\tmp\dev_tools\python\pdf\xpdfbin-win-3.03\xpdfbin-win-3.03\bin64

I ran:

pdftotext.exe

The result: the document is still protected and its text cannot be copied:

D:\tmp\dev_tools\python\pdf\xpdfbin-win-3.03\xpdfbin-win-3.03\bin64>pdftotext.exe
pdftotext version 3.03
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -layout           : maintain original physical layout
  -fixed <fp>       : assume fixed-pitch (or tabular) text
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -cfg <string>     : configuration file to use in place of .xpdfrc
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information

D:\tmp\dev_tools\python\pdf\xpdfbin-win-3.03\xpdfbin-win-3.03\bin64>pdftotext.exe -htmlmeta D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf\spec183r21.0.pdf hart183.html
Permission Error: Copying of text from this document is not allowed.
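
The failing call above can also be driven from a script, which makes it easier to tell this permission failure apart from other errors when trying several PDFs. What follows is only a minimal Python sketch, not part of xpdf: the helper names (`build_pdftotext_cmd`, `is_permission_error`, `convert`) are made up for illustration, the only flag used is `-htmlmeta` from the usage text above, and it assumes the message is printed to stdout or stderr, so both are checked.

```python
import subprocess
from pathlib import Path

# Message prefix copied from the pdftotext 3.03 output shown above.
PERMISSION_MARKER = "Permission Error"


def build_pdftotext_cmd(exe, pdf_path, out_path, html_meta=True):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = [str(exe)]
    if html_meta:
        cmd.append("-htmlmeta")  # flag listed in the usage text above
    cmd += [str(pdf_path), str(out_path)]
    return cmd


def is_permission_error(tool_output):
    """Return True if the captured output contains the permission message."""
    return PERMISSION_MARKER in tool_output


def convert(exe, pdf_path, out_path, html_meta=True):
    """Run pdftotext and return (ok, combined_output)."""
    result = subprocess.run(
        build_pdftotext_cmd(exe, pdf_path, out_path, html_meta),
        capture_output=True,
        text=True,
    )
    # Assumption: the message may land on either stream, so merge them.
    combined = result.stdout + result.stderr
    ok = (
        result.returncode == 0
        and not is_permission_error(combined)
        and Path(out_path).exists()
    )
    return ok, combined
```

Calling `convert("pdftotext.exe", "spec183r21.0.pdf", "hart183.html")` from the `bin64` directory would be expected to return `ok` as `False` for this file, with the Permission Error text in the second element. The sketch only reports the failure; it does not work around the protection.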

2. So I went on to try to solve the problem above:

【Unsolved】Error when trying to convert a PDF to HTML with xpdf's pdftotext: Permission Error: Copying of text from this document is not allowed

But I could not solve it...

 

【Summary】

For now, I still cannot use xpdf to convert this PDF into the HTML I want.


