[Unresolved] Parsing the HTML source of an Evernote note with the PHP HTML parsing library simplehtmldom

PHP crifan
While working on:
[Unresolved] Using an HTML parsing library in PHP to parse and process Evernote HTML source
I decided to try a different library:
PHP Simple HTML DOM Parser
PHP Simple HTML DOM Parser – Browse Files at SourceForge.net
https://sourceforge.net/projects/simplehtmldom/files/
PHP Simple HTML DOM Parser: Manual
PHP Simple HTML DOM Parser – Browse /simplehtmldom/1.9 at SourceForge.net
https://sourceforge.net/projects/simplehtmldom/files/simplehtmldom/1.9/
simplehtmldom_1_9.zip
The downloaded archive includes examples.
Let's try them.
Along the way I referred to:
PHP Simple HTML DOM Parser: Manual
Attribute Filters
[attribute*=value]
Matches elements that have the specified attribute and it contains a certain value.
Then I also referred to:
Extract contents from
echo file_get_html('http://www.google.com/')->plaintext;
The code I used:
<?php
include_once('./simple_html_dom.php');

$originEvernoteHtml = '<div><br /></div><div>此处包含要测试的内容,包括code代码:</div><div style="box-sizing: border-box; padding: 8px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 12px; color: rgb(51, 51, 51); border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; background-color: rgb(251, 250, 248); border: 1px solid rgba(0, 0, 0, 0.14902);-en-codeblock:true;"><div><span style="font-size: 12px; font-family: Monaco;">some code include</span></div><div><span style="font-size: 12px; font-family: Monaco;">little &lt;</span></div><div><span style="font-size: 12px; font-family: Monaco;">greater &gt;</span></div><div><span style="font-size: 12px; font-family: Monaco;">at &amp;</span></div><div><span style="font-size: 12px; font-family: Monaco;">和其他字符</span></div></div><div>希望同步后,不要:</div><div>有多余的code</div><div>html字符不要被转义</div><div><br /></div><div>另外再去看看,之前出bug的代码</div><div>好像是中间包含多个空行?的代码</div><div style="box-sizing: border-box; padding: 8px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 12px; color: rgb(51, 51, 51); border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; background-color: rgb(251, 250, 248); border: 1px solid rgba(0, 0, 0, 0.14902);-en-codeblock:true;"><div># Author: Crifan Li</div><div># Function: Batch make for all gitbooks</div><div># Version: 20190716</div><div>#</div><div># [Note]</div><div># 1. 
this makefile should be located in</div><div># /Users/crifan/dev/dev_root/gitbook/gitbook_src_root/common</div><div><div><br /></div><div><br /></div></div><div><div>SUB_BOOKS=$(shell ls ../books)</div><div><br /></div></div><div><div>BOOKS_SRC_ROOT=$(shell cd ../books &amp;&amp; pwd)</div><div><br /></div></div><div><div><br /></div><div><br /></div></div><div># Batch make for all gitbooks</div><div><div>help debug_dir init sync_content clean_all website pdf epub mobi all upload commit deploy:</div><div><br /></div></div><div>  @echo "Current path="`pwd`;</div><div>  @echo "LS_OUTPUT="$(SUB_BOOKS);</div><div>  @echo "BOOKS_SRC_ROOT="$(BOOKS_SRC_ROOT);</div><div><div>  @for each_item in $(SUB_BOOKS); \</div><div><br /></div></div><div><div>  do \</div><div><br /></div></div><div><div>    if [ -d $(BOOKS_SRC_ROOT)/$$each_item ]; then \</div><div><br /></div></div><div><div>      cd $(BOOKS_SRC_ROOT)/$$each_item; \</div><div><br /></div></div><div><div>      echo `pwd`; \</div><div><br /></div></div><div><div>      if [ -f Makefile ]; then \</div><div><br /></div></div><div><div>        make $@ || exit "$$?"; \</div><div><br /></div></div><div><div>      fi; \</div><div><br /></div></div><div><div>      cd ..; \</div><div><br /></div></div><div><div>    fi; \</div><div><br /></div></div><div>  done;</div></div><div>看看效果</div><div><br /></div>';
// $originEvernoteHtml = "<div>" . $originEvernoteHtml . "</div>";
// $originEvernoteHtml = "<html><head><title>parse evernote html</title></head><body>" . $originEvernoteHtml . "</body></html>";

$html = str_get_html($originEvernoteHtml);
// print $html;

$codeBlockList = $html->find('div[style*="en-codeblock"]');
foreach($codeBlockList as $codeBlockHtml){
  // print $codeBlockHtml;

  $codeBlockStr = $codeBlockHtml->save();
  print $codeBlockStr;
} 

?>
It really does find the two code blocks.
The output rendered in the web page
shows 2 code sections.
Good.
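Note that `[style*="en-codeblock"]` is a plain substring match, so it would also hit a style that merely mentions `en-codeblock` somewhere in a value. A stricter check could parse the inline style into declarations and test the `-en-codeblock` property itself. The following is only a minimal sketch in plain PHP: `parseInlineStyle` and `isEvernoteCodeBlock` are hypothetical helper names (not part of simplehtmldom), and it assumes declarations are separated by `;` and that Evernote always writes the marker as `-en-codeblock:true`, as in the sample note above.

```php
<?php
// Hypothetical helpers, plain PHP only (no simplehtmldom calls).

// Split an inline style attribute into a [property => value] map.
// Assumption: values contain no ';' once HTML entities are decoded.
function parseInlineStyle(string $style): array
{
    // Decode entities such as &quot; first, so their ';' is not taken as a separator.
    $style = html_entity_decode($style, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    $declarations = [];
    foreach (explode(';', $style) as $declaration) {
        $parts = explode(':', $declaration, 2);
        if (count($parts) !== 2) {
            continue; // empty piece or something that is not "name: value"
        }
        $name = strtolower(trim($parts[0]));
        if ($name !== '') {
            $declarations[$name] = trim($parts[1]);
        }
    }
    return $declarations;
}

// True only when the style declares the property -en-codeblock with value "true".
function isEvernoteCodeBlock(string $style): bool
{
    $declarations = parseInlineStyle($style);
    return ($declarations['-en-codeblock'] ?? '') === 'true';
}
```

Inside the `foreach` loop above, this could be used as an extra guard, for example `if (!isEvernoteCodeBlock($codeBlockHtml->style)) { continue; }`.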
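Then I continued debugging:
print the HTML as a string,
and in particular replace the code-block div with pre
and remove the divs nested inside that div.
The result: for code of the form
<div><span style
the line breaks are lost:
whereas the code block in the div further down does keep its line breaks:
When I have time I will check other code blocks to see whether line breaks normally survive,
and how to make sure this span-style div keeps its line breaks as well.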
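One possible direction for the pre conversion is to treat every innermost div of a code block as one line, join the lines with "\n", and put the result into a single pre. The sketch below is not the simplehtmldom approach from this post: it swaps in PHP's built-in DOM extension (`DOMDocument`, `DOMXPath`) so that it is self-contained. The function names `hasDivChild`, `collectCodeLines` and `codeBlocksToPre` are hypothetical, and it assumes every code line sits in its own div, as in the sample note above.

```php
<?php
// Sketch only: uses PHP's built-in DOM extension, not simplehtmldom.
// All function names here are hypothetical helpers.

// Return true if $element has at least one direct <div> child.
function hasDivChild(DOMElement $element): bool
{
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMElement && strtolower($child->tagName) === 'div') {
            return true;
        }
    }
    return false;
}

// Collect one string per code line, descending through wrapper divs.
function collectCodeLines(DOMElement $block, array &$lines): void
{
    foreach ($block->childNodes as $child) {
        if (!($child instanceof DOMElement)) {
            continue; // assumption: code lines always live inside child elements
        }
        if (hasDivChild($child)) {
            collectCodeLines($child, $lines); // wrapper div: flatten it
        } else {
            // Innermost element = one line; <div><br /></div> yields an empty line.
            $lines[] = $child->textContent;
        }
    }
}

// Replace each Evernote code-block div with a single <pre> element.
function codeBlocksToPre(string $evernoteHtml): string
{
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc = new DOMDocument();
    // The meta tag tells libxml to read the fragment as UTF-8.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $evernoteHtml . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Same idea as the div[style*="en-codeblock"] filter: a substring match on style.
    $blocks = $xpath->query('//div[contains(@style, "en-codeblock")]');
    foreach ($blocks as $block) {
        $lines = [];
        collectCodeLines($block, $lines);
        // A text node is escaped on output, so "<", ">" and "&" remain literal code.
        $pre = $doc->createElement('pre');
        $pre->appendChild($doc->createTextNode(implode("\n", $lines)));
        $block->parentNode->replaceChild($pre, $block);
    }

    // Serialize only the children of <body>, dropping the wrapper added above.
    $result = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    foreach ($body->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }
    return $result;
}
```

For example, `echo codeBlocksToPre($originEvernoteHtml);` would print the note with each code block collapsed into one pre. Because the span-style lines and the plain div lines go through the same path here, both should end up with one line break per div, but that has not been verified against the real sync pipeline.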
