【Record】Trying pdftohtml to convert a copy-protected PDF file to HTML while keeping the table formatting

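【Background】

While working on:

【Unsolved】Exporting table data from a copy-protected PDF and converting it to XML

I decided to try pdftohtml to convert a copy-protected PDF file to text or HTML.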

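【Process】

1. Continuing with the reference:

Howto Convert PDF files to HTML files | Ubuntu Geek

I tracked down pdftohtml (it ships in the poppler-utils package), confirmed it was installed, and with the -nodrm option it did produce HTML output:

The log: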

crifan@crifan-Ubuntu:~$ sudo apt-get install poppler-utils
[sudo] password for crifan: 
Reading package lists... Done
Building dependency tree
Reading state information... Done
poppler-utils is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 26 not upgraded.
crifan@crifan-Ubuntu:~$ pdf
pdf2dsc      pdffonts     pdfseparate  pdftoppm     pdfunite     
pdf2ps       pdfimages    pdftocairo   pdftops      
pdfdetach    pdfinfo      pdftohtml    pdftotext    
crifan@crifan-Ubuntu:~$ pdftohtml /media/sf_win7_to_ubuntu/
19#21#_(101~303).dwg             spec183r21.0.pdf
examples.desktop                 test_share
python_beginner_tutorial.html    unbuntu 13.04 in virtualbox.png
crifan@crifan-Ubuntu:~$ pdftohtml /media/sf_win7_to_ubuntu/spec183r21.0.pdf /home/crifan/develop/
crosstool-ng/ ubuntu_share/ 
crifan@crifan-Ubuntu:~$ pdftohtml /media/sf_win7_to_ubuntu/spec183r21.0.pdf /home/crifan/develop/^C
crifan@crifan-Ubuntu:~$ pwd
/home/crifan
crifan@crifan-Ubuntu:~$ cd develop/
crifan@crifan-Ubuntu:~/develop$ mkdir pdf_to_html
crifan@crifan-Ubuntu:~/develop$ cd pdf_to_html/
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ pdftohtml^Cmedia/sf_win7_to_ubuntu/spec183r21.0.pdf hart183.html
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ pdftohtml --help
I/O Error: Couldn't open file '--help': --help.
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ pdftohtml -h
pdftohtml version 0.20.5
Copyright 2005-2012 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2011 Glyph & Cog, LLC

Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -q                : don't print any messages or errors
  -h                : print usage information
  -help             : print usage information
  -p                : exchange .pdf links by .html
  -c                : generate complex document
  -s                : generate single document that includes all pages
  -i                : ignore images
  -noframes         : generate no frames
  -stdout           : use standard output
  -zoom <fp>        : zoom the pdf document (default 1.5)
  -xml              : output for XML post-processing
  -hidden           : output hidden text
  -nomerge          : do not merge paragraphs
  -enc <string>     : output text encoding name
  -dev <string>     : output device name for Ghostscript (png16m, jpeg etc)
  -fmt <string>     : image file format for Splash output (png or jpg)
  -v                : print copyright and version info
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -nodrm            : override document DRM settings
  -wbt <fp>         : word break threshold (default 10 percent)
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ pdftohtml -nodrm /media/sf_win7_to_ubuntu/spec183r21.0.pdf hart183.html
Document has copy-protection bit set.
Page-1
Page-2
Page-3
Page-4
Page-5
Page-6
Page-7
Page-8
Page-9
Page-10
Page-11
Page-12
Page-13
Page-14
Page-15
Page-16
Page-17
Page-18
Page-19
Page-20
Page-21
Page-22
Page-23
Page-24
Page-25
Page-26
Page-27
Page-28
Page-29
Page-30
Page-31
Page-32
Page-33
Page-34
Page-35
Page-36
Page-37
Page-38
Page-39
Page-40
 link to page 41 Page-41
Page-42
Page-43
Page-44
Page-45
Page-46
Page-47
Page-48
Page-49
Page-50
Page-51
Page-52
Page-53
Page-54
Page-55
Page-56
Page-57
Page-58
Page-59
Page-60
Page-61
Page-62
Page-63
Page-64
Page-65
Page-66
Page-67
Page-68
Page-69
Page-70
Page-71
Page-72
Page-73
Page-74
Page-75
Page-76
Page-77
Page-78
Page-79
Page-80
Page-81
Page-82
Page-83
Page-84
Page-85
Page-86
Page-87
Page-88
Page-89
Page-90
Page-91
Page-92
Page-93
Page-94
Page-95
Page-96
Page-97
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ ls
hart183-1_1.png  hart183-1_2.png  hart183-2_1.png  hart183.html  hart183_ind.html  hart183s.html
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ cp * /media/sf_win7_to_ubuntu/^C
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ mkdir /media/sf_win7_to_ubuntu/pdf_to_html
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ cp * /media/sf_win7_to_ubuntu/pdf_to_html/
crifan@crifan-Ubuntu:~/develop/pdf_to_html$
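
Stripped of the tab completions and aborted attempts, the session above comes down to three steps: make an output directory, run pdftohtml with -nodrm, and look at the generated files. Here is a minimal sketch of that sequence, assuming poppler-utils is already installed. The function name convert_pdf and its argument layout are hypothetical, made up for illustration; the paths in the usage comment are the ones from the log.

```shell
#!/bin/sh
# Sketch only: wraps the pdftohtml call from the log above.
# convert_pdf is a hypothetical helper, not part of poppler-utils.
convert_pdf() {
    pdf="$1"      # input PDF file
    outdir="$2"   # directory that will receive the generated files
    base="$3"     # base name for the output, e.g. hart183

    # Fail early if the input file does not exist.
    if [ ! -f "$pdf" ]; then
        echo "no such file: $pdf" >&2
        return 1
    fi

    # pdftohtml is provided by the poppler-utils package.
    if ! command -v pdftohtml >/dev/null 2>&1; then
        echo "pdftohtml not found; install poppler-utils first" >&2
        return 127
    fi

    mkdir -p "$outdir" || return 1

    # -nodrm overrides the document's DRM settings (see the -h output above);
    # in the log, the copy-protected PDF was converted with this option.
    pdftohtml -nodrm "$pdf" "$outdir/$base.html"
}

# Usage with the paths from the log:
#   convert_pdf /media/sf_win7_to_ubuntu/spec183r21.0.pdf \
#       "$HOME/develop/pdf_to_html" hart183
#   ls "$HOME/develop/pdf_to_html"
```

In the run logged above, this produced hart183.html, hart183_ind.html and hart183s.html plus a few PNG images in the output directory.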

Checking the result:

Frustratingly:

the converted HTML has lost the tables: