【Solved】C#: when parsing HTML with HtmlAgilityPack, InnerText contains JavaScript that needs to be removed

【Problem】

In C#, I was using HtmlAgilityPack to parse:

http://www.amazon.com/Kindle-Fire-HD/dp/B0083PWAPW/ref=lp_1055398_1_2?ie=UTF8&qid=1369721900&sr=1-2

and, within that page's HTML, this piece of text:

World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini)

The corresponding source turned out to be:

<span>World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (<a href="#" id="kpp-popover-0">compared to the iPad mini</a><script type="text/javascript">
amznJQ.available('jQuery', function() {
(function ($) {
amznJQ.available('popover', function() {
    var content = '<h2 style="font-size: 17px;">Two Antennas, Better Bandwidth</h2>'

    + '<img src="http://g-ec2.images-amazon.com/images/G/01/kindle/dp/2012/KT/tate_feature-wifi._V395653267_.gif"/>'
   
    $('#kpp-popover-0').amazonPopoverTrigger({
        literalContent: content,
        closeText: 'Close',
        title: '&nbsp;',
        width: 550,
        location: 'centered'
    });

});
}(jQuery));
});

</script>)</span>

After parsing this with HtmlAgilityPack, the node's InnerText came out as:

World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini\namznJQ.available('jQuery', function() { \n(function ($) {\namznJQ.available('popover', function() {\n\tvar content = '<h2 style=\"font-size: 17px;\">Two Antennas, Better Bandwidth</h2>' \n\n\t+ '<img src=\"http://g-ec2.images-amazon.com/images/G/01/kindle/dp/2012/KT/tate_feature-wifi._V395653267_.gif\"/>'\n\t\n\t$('#kpp-popover-0').amazonPopoverTrigger({\n\t\tliteralContent: content,\n\t\tcloseText: 'Close',\n\t\ttitle: '&nbsp;',\n\t\twidth: 550,\n\t\tlocation: 'centered'\n\t});\n\n});\n}(jQuery)); \n}); \n\n)

instead of the expected:

World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini)

In other words, the JavaScript has to be stripped out of InnerText.

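【Solution Process】

1. I went back to a post I had read earlier:

An Apology to HtmlAgilityPack: You Are Still the Best for Parsing HTML

and the related question:

C#: HtmlAgilityPack extract inner text

After a long debugging session, I ended up with: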

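// Remove child nodes from the current HTML node.
// subNodeToRemove is an XPath expression evaluated relative to curHtmlNode,
// e.g. "script" matches direct children such as <script type="text/javascript">.
// Matches that are not direct children cannot be removed with RemoveChild.
public HtmlNode removeSubHtmlNode(HtmlNode curHtmlNode, string subNodeToRemove)
{
    // Same object as curHtmlNode: the node is modified in place, not copied.
    HtmlNode afterRemoved = curHtmlNode;
    // SelectNodes may return null when nothing matches, hence the null check.
    HtmlNodeCollection foundAllSub = curHtmlNode.SelectNodes(subNodeToRemove);
    if ((foundAllSub != null) && (foundAllSub.Count > 0))
    {
        foreach (HtmlNode subNode in foundAllSub)
        {
            curHtmlNode.RemoveChild(subNode);
        }
    }

    // The following attempt fails, because it removes nodes while enumerating them:
    //foreach (var subNode in afterRemoved.Descendants(subNodeToRemove))
    //{
    //    //An unhandled exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
    //    //Additional information: Collection was modified; enumeration operation may not execute.
    //    afterRemoved.RemoveChild(subNode);
    //    curHtmlNode.RemoveChild(subNode);

    //    //subNode.Remove();
    //}

    return afterRemoved;
}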

HtmlNode curBulletNode = allBulletNodeList[idx];

// Both calls modify curBulletNode in place and return that same node.
HtmlNode noJsNode = crl.removeSubHtmlNode(curBulletNode, "script");
HtmlNode noStyleNode = crl.removeSubHtmlNode(curBulletNode, "style");

string bulletStr = noStyleNode.InnerText;

which solved the problem.
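As a variation on the helper above, here is a minimal sketch that removes matching elements at any depth, not only direct children. `HtmlNodeCleanup` and `RemoveDescendants` are hypothetical names chosen for illustration and are not part of HtmlAgilityPack; the sketch assumes the library's `Descendants(string)` and `Remove()` members, and copies the matches into a list before removing anything.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
public static class HtmlNodeCleanup
{
    // Removes every descendant element whose tag name is listed in tagNames.
    // The node is modified in place and returned for convenience.
    public static HtmlNode RemoveDescendants(HtmlNode node, params string[] tagNames)
    {
        foreach (string tagName in tagNames)
        {
            // ToList() copies the lazily produced sequence first, so removing
            // nodes inside the loop does not touch a collection that is
            // still being enumerated.
            List<HtmlNode> matches = node.Descendants(tagName).ToList();
            foreach (HtmlNode match in matches)
            {
                // Detach the node from its own parent, whatever its depth.
                match.Remove();
            }
        }
        return node;
    }
}
```

With this sketch, the two removeSubHtmlNode calls could be replaced by something like HtmlNodeCleanup.RemoveDescendants(curBulletNode, "script", "style").InnerText.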

 

A few observations:

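1. In the example given in that answer, finding the child nodes with

htmlDoc.DocumentNode.Descendants("script")

and then deleting each one with

script.Remove();

works fine.

2. But here, doing the same kind of thing on the current HTML node: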

foreach (var subNode in afterRemoved.Descendants(subNodeToRemove))
{
    //An unhandled exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
    //Additional information: Collection was modified; enumeration operation may not execute.
    afterRemoved.RemoveChild(subNode);
    curHtmlNode.RemoveChild(subNode);
  
    //subNode.Remove();
}

produces the error quoted in the comments:

Additional information: Collection was modified; enumeration operation may not execute.


That is, removing items from a collection while it is being enumerated is not allowed.

That is why I had to find another way to get the same remove effect.
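The same error can be reproduced without HtmlAgilityPack at all. The sketch below uses a plain List<string> (the class and method names are made up for illustration): the first method removes an item while enumerating the live list and reports whether the runtime threw, and the second enumerates a ToList() copy instead, which is the usual way around the restriction.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative names only; nothing here comes from HtmlAgilityPack.
public static class EnumerationDemo
{
    // Removes an item while enumerating the live list.
    // Returns true if InvalidOperationException was raised.
    public static bool RemoveWhileEnumeratingThrows()
    {
        var tags = new List<string> { "a", "script", "style" };
        try
        {
            foreach (string tag in tags)
            {
                if (tag == "script")
                {
                    // The removal itself succeeds; the exception comes from
                    // the enumerator when the loop asks for the next item.
                    tags.Remove(tag);
                }
            }
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    // Enumerates a snapshot (ToList) and removes from the original list.
    public static List<string> RemoveViaSnapshot()
    {
        var tags = new List<string> { "a", "script", "style" };
        foreach (string tag in tags.ToList())
        {
            if (tag == "script" || tag == "style")
            {
                tags.Remove(tag);
            }
        }
        return tags;
    }
}
```

The helper shown earlier sidesteps the problem in a similar way: it loops over the separate collection returned by SelectNodes, not over the child list that is being modified.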

 

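【Summary】

Getting this kind of removal to work was truly exhausting...

Removing child nodes under the root node: easy.

Removing nodes under some arbitrary current node: hard. (Later, while debugging, I found that subNode.Remove(); had in fact already removed the node successfully; the foreach loop then tried to continue, and that is what raised the error.)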


