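【Record】Trying to convert Docbook into a well-formatted Word document

【Background】

I got Docbook-to-RTF conversion working a long time ago.

However, the converted result is not perfect and has many problems, as described in:

【Record】Generating RTF (Word-compatible, can be opened in Word) from Docbook XML source via xsltproc and FOP

What I want now is:

either to fix the problems in the Docbook-to-RTF conversion above,

or to find some other way. Either way, the goal is to get a perfectly formatted Word file from the Docbook XML.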

【Process】

1. I searched for:

docbook to word (rtf)

and found a lot of related material:

(1)Microsoft Word

(2)Apache(tm) FOP Development: RTFLib (jfor)

(3)DocBook to Word Conversion? – Stack Overflow

(4)docbook to word

The official site already explains three methods here:

Microsoft Word

(5)DocBook to WordML

Only from that post did I understand that

"round-tripping"

means two-way support:

    docbook to word

    word to docbook

that is, conversion in both directions between Word/Open Office and Docbook.

(6)DocBook Roundtripping

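2. Later I stumbled upon:

http://johnmacfarlane.net/pandoc/README.html

So I decided to try using pandoc to convert the HTML that my Docbook build outputs into docx or rtf.

Went to download it from:

Releases · jgm/pandoc · GitHub

and got:

pandoc-1.12.4.2-1-windows.msi

The installer, surprisingly, does not let you choose the install path.

Then I went to try it out and, annoyingly, could not find where the newly installed files had gone.

In the end I located pandoc's HTML documentation through the Win7 search, and got the path from that:

C:\Users\Administrator.xxxxx\AppData\Local\Pandoc\

[Screenshot: win7 found pandoc html location]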
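A minimal sketch for locating the install without going through the Windows search. It assumes the MSI installs per-user under %LOCALAPPDATA%\Pandoc, which matches the path found above but is an assumption rather than documented installer behaviour. guess_pandoc_dir and find_pandoc are hypothetical helper names.

```python
import ntpath
import os
import shutil


def guess_pandoc_dir(env):
    """Guess the per-user pandoc folder from a mapping of environment variables.

    Assumption: the installer puts pandoc in %LOCALAPPDATA%\\Pandoc,
    as observed in this post. Returns None if LOCALAPPDATA is not set.
    """
    local_app_data = env.get("LOCALAPPDATA")
    if not local_app_data:
        return None
    # ntpath always uses Windows path rules, whatever OS this runs on.
    return ntpath.join(local_app_data, "Pandoc")


def find_pandoc(env=None):
    """Prefer whatever is on PATH, then fall back to the guessed folder."""
    env = os.environ if env is None else env
    return shutil.which("pandoc") or guess_pandoc_dir(env)


if __name__ == "__main__":
    print(find_pandoc())
```

If pandoc is already on PATH, shutil.which returns the executable itself; otherwise the result is only a guess at the folder and still needs checking by hand.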

3. Then I copied an HTML file (plus its Images folder) into the pandoc directory to see how well it works.

Reference:

Pandoc – Demos

First, a look at the help:

C:\Users\Administrator.PC-20131018OHXV\AppData\Local\Pandoc>pandoc.exe --help
pandoc.exe [OPTIONS] [FILES]
Input formats:  docbook, haddock, html, json, latex, markdown, markdown_github,
                markdown_mmd, markdown_phpextra, markdown_strict, mediawiki,
                native, opml, org, rst, textile
Output formats: asciidoc, beamer, context, docbook, docx, dzslides, epub, epub3,
                fb2, html, html5, icml, json, latex, man, markdown,
                markdown_github, markdown_mmd, markdown_phpextra,
                markdown_strict, mediawiki, native, odt, opendocument, opml,
                org, pdf*, plain, revealjs, rst, rtf, s5, slideous, slidy,
                texinfo, textile
                [*for pdf output, use latex or beamer and -o FILENAME.pdf]
Options:
  -f FORMAT, -r FORMAT  --from=FORMAT, --read=FORMAT
  -t FORMAT, -w FORMAT  --to=FORMAT, --write=FORMAT
  -o FILENAME           --output=FILENAME
                        --data-dir=DIRECTORY
                        --strict
  -R                    --parse-raw
  -S                    --smart
                        --old-dashes
                        --base-header-level=NUMBER
                        --indented-code-classes=STRING
  -F PROGRAM            --filter=PROGRAM
                        --normalize
  -p                    --preserve-tabs
                        --tab-stop=NUMBER
  -s                    --standalone
                        --template=FILENAME
  -M KEY[:VALUE]        --metadata=KEY[:VALUE]
  -V KEY[:VALUE]        --variable=KEY[:VALUE]
  -D FORMAT             --print-default-template=FORMAT
                        --print-default-data-file=FILE
                        --no-wrap
                        --columns=NUMBER
                        --toc, --table-of-contents
                        --toc-depth=NUMBER
                        --no-highlight
                        --highlight-style=STYLE
  -H FILENAME           --include-in-header=FILENAME
  -B FILENAME           --include-before-body=FILENAME
  -A FILENAME           --include-after-body=FILENAME
                        --self-contained
                        --offline
  -5                    --html5
                        --html-q-tags
                        --ascii
                        --reference-links
                        --atx-headers
                        --chapters
  -N                    --number-sections
                        --number-offset=NUMBERS
                        --no-tex-ligatures
                        --listings
  -i                    --incremental
                        --slide-level=NUMBER
                        --section-divs
                        --default-image-extension=extension
                        --email-obfuscation=none|javascript|references
                        --id-prefix=STRING
  -T STRING             --title-prefix=STRING
  -c URL                --css=URL
                        --reference-odt=FILENAME
                        --reference-docx=FILENAME
                        --epub-stylesheet=FILENAME
                        --epub-cover-image=FILENAME
                        --epub-metadata=FILENAME
                        --epub-embed-font=FILE
                        --epub-chapter-level=NUMBER
                        --latex-engine=PROGRAM
                        --bibliography=FILE
                        --csl=FILE
                        --citation-abbreviations=FILE
                        --natbib
                        --biblatex
  -m[URL]               --latexmathml[=URL], --asciimathml[=URL]
                        --mathml[=URL]
                        --mimetex[=URL]
                        --webtex[=URL]
                        --jsmath[=URL]
                        --mathjax[=URL]
                        --gladtex
                        --trace
                        --dump-args
                        --ignore-args
  -v                    --version
  -h                    --help

C:\Users\Administrator.PC-20131018OHXV\AppData\Local\Pandoc>
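One thing the format lists above suggest: docbook appears among the input formats, and both docx and rtf among the output formats, so feeding the Docbook XML to pandoc directly (rather than going through HTML) might be worth a try. That route was not attempted here. As a rough sketch, the lists can be pulled out of the help text and checked programmatically; parse_formats is a hypothetical helper, and it assumes the help layout shown above (a label line followed by indented continuation lines).

```python
def parse_formats(help_text, label):
    """Collect the comma-separated names listed after `label` in pandoc --help output.

    Assumes continuation lines are indented and the list ends at the next
    unindented line or at a bracketed footnote line.
    """
    names = []
    collecting = False
    for line in help_text.splitlines():
        if line.startswith(label):
            collecting = True
            chunk = line[len(label):]
        elif collecting and line[:1].isspace():
            chunk = line  # indented continuation of the list
        elif collecting and line.strip():
            break  # next unindented section, e.g. "Options:"
        else:
            continue  # text before the label, or a blank line
        chunk = chunk.strip()
        if chunk.startswith("["):
            break  # footnote such as "[*for pdf output, ...]"
        # Drop the footnote marker on entries like "pdf*".
        names.extend(n.strip().rstrip("*") for n in chunk.split(",") if n.strip())
    return names


def supports(help_text, source, target):
    """True if `source` is a listed input format and `target` a listed output format."""
    return (source in parse_formats(help_text, "Input formats:")
            and target in parse_formats(help_text, "Output formats:"))
```

With the help text above pasted into a string, supports(help_text, "docbook", "docx") would tell whether that pandoc build lists both formats. It says nothing about how good the resulting Word file is.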

Then I tried an actual conversion:

C:\Users\Administrator.PC-20131018OHXV\AppData\Local\Pandoc>pandoc.exe -s -S test\src_html\arm_vs_mips.html -o test\output\arm_vs_mips.docx

pandoc.exe: Could not find image `file:///E:/dev_root/docbook/dev/config/images/system/tip.png', skipping...

pandoc.exe: Could not find image `images/cp0_reg16_select1.png', skipping...

C:\Users\Administrator.PC-20131018OHXV\AppData\Local\Pandoc>
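Both warnings are about image lookup. A plausible cause (an assumption, not confirmed here) is that pandoc looks up a relative path such as images/cp0_reg16_select1.png from the directory it is run in rather than from the folder holding the HTML file, and that the file:///E:/... URI is something it cannot fetch. The sketch below lists every img src in an HTML string and classifies it relative to a chosen base directory. classify_images and its exists parameter are hypothetical names.

```python
import os
from html.parser import HTMLParser


class _ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def classify_images(html_text, base_dir, exists=os.path.isfile):
    """Map each img src to "uri", "found" or "missing".

    "uri"     : contains a scheme separator (file://, http://, ...)
    "found"   : a relative path that exists under base_dir
    "missing" : a relative path that does not exist under base_dir
    `exists` is injectable so the check can be tried without real files.
    """
    collector = _ImgCollector()
    collector.feed(html_text)
    result = {}
    for src in collector.sources:
        if "://" in src:
            result[src] = "uri"
        elif exists(os.path.join(base_dir, src)):
            result[src] = "found"
        else:
            result[src] = "missing"
    return result
```

Calling it once with base_dir set to the folder pandoc was started from, and once with the folder containing arm_vs_mips.html, would show whether starting pandoc from test\src_html (or copying the images) is likely to help. Entries reported as "uri" would probably need rewriting to relative paths before conversion.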

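Then I checked whether an output file had been produced.

The converted result looks like this:

[Screenshot: use pandoc converted word effect]

[Screenshot: use pandoc converted word effect 2]

[Screenshot: use pandoc converted word effect 3]

[Screenshot: use pandoc converted word effect 4]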

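【Summary】

My overall impression of using pandoc to produce a Word document so far:

(1)Images are lost.

(2)Leaving the lost images aside, the overall look of the Word document is better than the RTF produced by FOP.

However:

overall, it is still far from the ideal target.

So I am dropping this approach for now,

and will keep looking for other, possibly better, approaches when I find time.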


