【Solved】Filtering HTML tags in Python (while keeping the content inside the tags)

【Problem】

I have already obtained the corresponding soup through BeautifulSoup in Python:

LINE 253 : INFO foundDescription=<td valign="top" colspan="2" itemprop="description">
 BAD CREDIT 
NO CREDIT 
NO PROBLEM!!! 
 
CALL/TEXT DAVID FOR MORE INFO AT 210-473-9820 </td>

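Now I want to get the content of the description and filter out the br and similar tags inside it.

【Solution process】

1. The crudest and clumsiest approach, of course, is to strip the br tags by hand with a regular expression.

But I wanted to find a better way.

2. Later, from:

Python HTML sanitizer / scrubber / filter

I found that BeautifulSoup has a renderContents method, so I checked the official documentation:

http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html

After finding the relevant explanation, I gave it a try: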

        descContents = foundDescription.renderContents()
        logging.info("descContents=%s", descContents)

The result was:

LINE 257 : INFO descContents=

BAD CREDIT

NO CREDIT

NO PROBLEM!!!

CALL/TEXT DAVID FOR MORE INFO AT 210-473-9820

The br tags are still there.
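renderContents returns the inner markup of a tag, so child tags such as br come back together with the text. As a rough sketch of what "text only" extraction can look like, the following uses Python 3's standard html.parser module instead of BeautifulSoup. TextOnlyParser and strip_all_tags are hypothetical names, and the sample string is made up for illustration; it is not the real markup of the page.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collects text nodes and drops every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain text
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are never stored.
        self.chunks.append(data)


def strip_all_tags(html):
    """Return only the text of an HTML fragment."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Illustrative input only (assumed markup)
sample = "BAD CREDIT<br />\nNO CREDIT<br>\nNO PROBLEM!!!"
print(strip_all_tags(sample))
```

This drops every tag, not just br, so it only fits cases where none of the markup needs to survive.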

3. So I gave up on that idea and, for the time being, handled it by hand with regular expressions.

The code became:

        descContents = crifanLib.soupContentsToUnicode(foundDescription.contents)
        #descContents = foundDescription.renderContents()
        logging.info("descContents=%s", descContents)
        descHtmlDecoded = crifanLib.decodeHtmlEntity(descContents)
        logging.info("descHtmlDecoded=%s", descHtmlDecoded)
        # strip both <br> and <br /> forms; the raw string avoids the invalid "\s" escape
        descHtmlFiltered = re.sub(r"<br\s*/?>", "", descHtmlDecoded)
        logging.info("descHtmlFiltered=%s", descHtmlFiltered)

The result:

LINE 262 : INFO descHtmlFiltered=

BAD CREDIT

NO CREDIT

NO PROBLEM!!!

CALL/TEXT DAVID FOR MORE INFO AT 210-473-9820

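This basically meets the need here.

I will leave it at that for now.

When I run into a more complex case, I will look for a better approach.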
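For reference, the br handling can be written as one compiled, case-insensitive pattern. This is only a minimal sketch: BR_PATTERN and replace_br are hypothetical names, the sample string is invented, and the optional replacement argument is an addition of this sketch for cases where a line break should become a newline instead of vanishing.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants such as <BR>.
BR_PATTERN = re.compile(r"<br\s*/?>", re.IGNORECASE)


def replace_br(html, replacement=""):
    """Replace every br tag with `replacement` and leave other text untouched."""
    return BR_PATTERN.sub(replacement, html)


# Illustrative input only (assumed markup)
sample = "BAD CREDIT<br>NO CREDIT<br/>NO PROBLEM!!!<BR />"
print(replace_br(sample))        # br tags deleted
print(replace_br(sample, "\n"))  # br tags turned into newlines
```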

【Summary】

For now, HTML tags still have to be handled with regular expressions.

【Postscript 2013-05-03】

1. Later I kept experimenting:

    # tags that are allowed to stay in the output
    VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br', 'a']
    soup = BeautifulSoup(origHtml)
    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.hidden = True

    filteredHtml = soup.renderContents()
    logging.info("processed, filteredHtml=%s", filteredHtml)

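It turned out that this only filters out the disallowed tags; it does not strip the tags themselves while keeping the content inside them.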
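To make the intended behaviour concrete (keep a whitelist of tags, and keep only the content of everything else), here is a hedged sketch built on Python 3's standard html.parser rather than on BeautifulSoup. UnwrapParser, unwrap_tags and KEEP_TAGS are hypothetical names, and the whitelist is an assumption of this sketch that leaves out br and a on purpose so that those two get unwrapped.

```python
from html.parser import HTMLParser

# Assumed whitelist for this sketch; every other tag is unwrapped.
KEEP_TAGS = {"strong", "em", "p", "ul", "li"}


class UnwrapParser(HTMLParser):
    """Re-emits whitelisted tags and keeps only the content of all other tags."""

    def __init__(self, keep_tags):
        # convert_charrefs=False so entities pass through unchanged
        super().__init__(convert_charrefs=False)
        self.keep_tags = keep_tags
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep_tags:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />: emit it once, without an end tag.
        if tag in self.keep_tags:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def unwrap_tags(html, keep_tags=KEEP_TAGS):
    """Drop tags outside keep_tags but keep the text they wrap."""
    parser = UnwrapParser(keep_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(unwrap_tags('<p>Call <a href="x">David</a><br /> <strong>now</strong></p>'))
```

Unlike a per-tag regular expression, this handles nested tags and attributes in any order, at the cost of a little more code.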

2. In the end I had to delete the tags by hand while keeping the content inside them:

import re

def filterHtmlTag(origHtml):
    """
    filter html tag, but retain its contents
    eg:
        Brooklyn, NY 11220<br />
        Brooklyn, NY 11220
        
        <a href="mailto:[email protected]">[email protected]</a><br />
        [email protected]
        
        <a href="javascript:void(0);" onClick="window.open(new Array('http','',':','//','stores.ebay.com','/Bay-Ridge-Nissan-of-New-York?_rdc=1').join(''), '_blank')">stores.ebay.com</a>
        stores.ebay.com
        
        <a href="javascript:void(0);" onClick="window.open(new Array('http','',':','//','www.carfaxonline.com','/cfm/Display_Dealer_Report.cfm?partner=AXX_0&UID=C367031&vin=JH4KB2F61AC001005').join(''), '_blank')">www.carfaxonline.com</a>
        www.carfaxonline.com        
    """
    #logging.info("html tag, origHtml=%s", origHtml)
    filteredHtml = origHtml

    #Method 1: auto remove tag use re
    #remove br (both <br> and <br /> forms)
    filteredHtml = re.sub(r"<br\s*/?>", "", filteredHtml, flags=re.I)
    #logging.info("remove br, filteredHtml=%s", filteredHtml)
    #remove a
    filteredHtml = re.sub(r"<a\s+[^<>]+>(?P<aContent>[^<>]+?)</a>", r"\g<aContent>", filteredHtml, flags=re.I)
    #logging.info("remove a, filteredHtml=%s", filteredHtml)
    #remove b,strong
    #note: re.I must be passed as flags=...; the 4th positional argument of re.sub is count
    filteredHtml = re.sub(r"<b>(?P<bContent>[^<>]+?)</b>", r"\g<bContent>", filteredHtml, flags=re.I)
    filteredHtml = re.sub(r"<strong>(?P<strongContent>[^<>]+?)</strong>", r"\g<strongContent>", filteredHtml, flags=re.I)
    #logging.info("remove b,strong, filteredHtml=%s", filteredHtml)

    return filteredHtml
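One detail worth checking in a function like this is how re.I reaches re.sub. The fourth positional parameter of re.sub is count, not flags, so a bare re.I in that position is read as "replace at most 2 matches" and the match stays case-sensitive. The short demonstration below uses an invented sample string; the variable names are only for illustration.

```python
import re
import warnings

text = "<B>one</B> <B>two</B>"
pattern = r"<b>(?P<bContent>[^<>]+?)</b>"

# Positional: re.I (integer value 2) lands in the count parameter,
# so the pattern is still case-sensitive and nothing matches.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # newer Pythons warn about a positional count
    positional = re.sub(pattern, r"\g<bContent>", text, re.I)

# Keyword: the flag is applied and the upper-case tags are matched.
keyword = re.sub(pattern, r"\g<bContent>", text, flags=re.I)

print(positional)
print(keyword)
```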

3. I will keep updating this function in the future.

Please credit the source when reposting: 在路上 » 【Solved】Filtering HTML tags in Python (while keeping the content inside the tags)
