【Solved】C#: Strip HTML tags and remove comments at the same time

【Problem】

In C#, I want to strip the HTML tags from a string and remove the HTML comments at the same time.

 

【Solution process】

1. Reference:

How can I strip HTML tags from a string in ASP.NET?

Based on that, I tried:

    public string htmlRemoveTag(string html)
    {
        // LoadHtml throws on a null string, so guard the input up front
        if (string.IsNullOrEmpty(html))
        {
            return "";
        }

        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.LoadHtml(html);

        string filteredHtml = "";
        foreach (var node in htmlDoc.DocumentNode.ChildNodes)
        {
            filteredHtml += node.InnerText;
        }

        return filteredHtml;
    }

The result: all the tags are removed.

But the HTML comments:

<!-- A+ Content Begins Here -->  <!-- BRAND LOGO -->      <!-- TITLE -->  Frigidaire Mini Air Conditioner  <!-- GENERAL DESCRIPTION -->     Frigidaire’s FRA052XT7 5,000 BTU 115-Volt Window-Mounted Mini-Compact Air Conditioner is perfect for rooms up to 150 square feet.  It quickly cools a room on hot days and quie...

are still there.

2. Next, remove the comments.

Reference:

Removing HTML Comments

Then use:

    public string htmlRemoveTag(string html)
    {
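        // LoadHtml throws on a null string, so guard the input up front
        if (string.IsNullOrEmpty(html))
        {
            return "";
        }

        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.LoadHtml(html);

        // 1. remove all comments
        // (1) get all comment nodes using XPath; SelectNodes returns null when nothing matches
        HtmlNodeCollection comments = htmlDoc.DocumentNode.SelectNodes("//comment()");
        if (comments != null)
        {
            foreach (HtmlNode comment in comments)
            {
                // (2) remove the comment node itself
                comment.ParentNode.RemoveChild(comment);
            }
        }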

        //2. get all content
        string filteredHtml = "";
        foreach (var node in htmlDoc.DocumentNode.ChildNodes)
        {
            filteredHtml += node.InnerText;
        }

        return filteredHtml;
    }

That does the job. The result is the content of the HTML, with no tags and no comments:

"          Frigidaire Mini Air Conditioner       Frigidaire’s FRA052XT7 5,000 BTU 115-Volt Window-Mounted Mini-Compact Air Conditioner is perfect for rooms up to 150 square feet.  It quickly cools a room on hot days and quiet operation keeps you cool without keeping you awake. This unit features mechanical rotary  controls and top, full-width, 2-way air direction control. The antimicrobial mesh filter with side, slide-out access cleans the air  removing harmful bacteria. Low voltage start-up conserves energy and saves you money ...
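
The result above still contains long runs of spaces left over from the original markup. If that matters, the whitespace can be collapsed afterwards. The following is only a minimal sketch: `TextCleanup` and `CollapseWhitespace` are hypothetical names that are not part of the code above, and it assumes that turning every run of whitespace (including line breaks) into a single space is acceptable for your text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleanup
{
    // Hypothetical helper (an assumption, not from the original post):
    // replaces every run of whitespace with one space and trims both ends.
    public static string CollapseWhitespace(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return "";
        }

        // \s+ matches one or more whitespace characters (spaces, tabs, newlines)
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

It could then be chained onto the method above, for example `TextCleanup.CollapseWhitespace(htmlRemoveTag(html))`.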

 

【Summary】

To strip the tags from HTML without keeping the comments, you can use:

    using HtmlAgilityPack;

    public string htmlRemoveTag(string html)
    {
        // LoadHtml throws on a null string, so guard the input up front
        if (string.IsNullOrEmpty(html))
        {
            return "";
        }

        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.LoadHtml(html);

        // 1. remove all comments
        // (1) get all comment nodes using XPath; SelectNodes returns null when nothing matches
        HtmlNodeCollection comments = htmlDoc.DocumentNode.SelectNodes("//comment()");
        if (comments != null)
        {
            foreach (HtmlNode comment in comments)
            {
                // (2) remove the comment node itself
                comment.ParentNode.RemoveChild(comment);
            }
        }

        // 2. get all content
        string filteredHtml = "";
        foreach (var node in htmlDoc.DocumentNode.ChildNodes)
        {
            filteredHtml += node.InnerText;
        }

        return filteredHtml;
    }


