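【Solved】Filtering HTML tags in Python (while keeping the content inside the tags)

【Problem】

I have already obtained the corresponding soup object through BeautifulSoup in Python: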

 

LINE 253  : INFO     foundDescription=<td valign="top" colspan="2" itemprop="description">
                        BAD CREDIT <br />
NO CREDIT<br />
NO PROBLEM!!!<br />
<br />
CALL/TEXT DAVID FOR MORE INFO AT 210-473-9820           </td>

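Now I want to get the content of the description and filter out the br (and similar) tags inside it.

【Solution Process】

1. The crudest approach, of course, is to strip the br tags by hand with a regular expression.

But I wanted to find a better way.

2. Later, from: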

Python HTML sanitizer / scrubber / filter

I found that BeautifulSoup has a renderContents method, so I checked the official documentation:

http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html

After finding the relevant explanation, I gave it a try:

        descContents = foundDescription.renderContents()
        logging.info("descContents=%s", descContents)

The result was:

LINE 257  : INFO     descContents=

                        BAD CREDIT <br />

NO CREDIT<br />

NO PROBLEM!!!<br />

<br />

CALL/TEXT DAVID FOR MORE INFO AT 210-473-9820

The br tags are still there.
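For comparison, keeping only the text can also be done without regular expressions. The following is a minimal sketch, assuming Python 3 and only the standard library's html.parser (so it is not BeautifulSoup, and not the approach used in this post). TextExtractor and extractText are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes; every tag is dropped (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are never stored.
        self.chunks.append(data)


def extractText(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample = "BAD CREDIT <br />\nNO CREDIT<br />\nNO PROBLEM!!!<br />"
print(extractText(sample))
```

This drops every tag, so it only fits the case where none of the markup needs to survive.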

3. So I gave up on that, and for now handle it manually with regular expressions.

Written as:

        descContents = crifanLib.soupContentsToUnicode(foundDescription.contents)
        #descContents = foundDescription.renderContents()
        logging.info("descContents=%s", descContents)
        descHtmlDecoded = crifanLib.decodeHtmlEntity(descContents)
        logging.info("descHtmlDecoded=%s", descHtmlDecoded)
        # raw strings so that \s reaches the regex engine unchanged
        descHtmlFiltered = re.sub(r"<br\s*>", "", descHtmlDecoded)
        descHtmlFiltered = re.sub(r"<br\s*/>", "", descHtmlFiltered)
        logging.info("descHtmlFiltered=%s", descHtmlFiltered)

The effect is:

LINE 262  : INFO     descHtmlFiltered=

                        BAD CREDIT

NO CREDIT

NO PROBLEM!!!

CALL/TEXT DAVID FOR MORE INFO AT 210-473-9820

This basically meets the need here.
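The two br substitutions can also be folded into a single pattern. This is only a sketch, assuming Python 3. BR_PATTERN, removeBr and tidyLines are hypothetical names, and the whitespace tidy-up is an optional extra that the code in this post does not do.

```python
import re

# One pattern for <br>, <br/> and <br /> in any letter case.
BR_PATTERN = re.compile(r"<br\s*/?>", re.IGNORECASE)


def removeBr(html):
    """Delete br tags and leave everything else untouched."""
    return BR_PATTERN.sub("", html)


def tidyLines(text):
    """Strip each line and drop the lines that end up empty."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = "   BAD CREDIT <br />\nNO CREDIT<br>\n<BR/>\nNO PROBLEM!!!<br />"
print(tidyLines(removeBr(sample)))
```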

I will leave it at that for now.

When I run into something more complex, I will look for a better approach.

 

【Summary】

For now, the HTML tags still have to be handled with regular expressions.

 

【Postscript 2013-05-03】

1. Later I kept experimenting:

    VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br', 'a']
    soup = BeautifulSoup(origHtml)
    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.hidden = True

    filteredHtml = soup.renderContents()
    logging.info("processed, filteredHtml=%s", filteredHtml)

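Result: this only filters out the tags that are not in VALID_TAGS. It does not do what I want here, which is to remove a tag while keeping the content inside it.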
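To make the whitelist idea concrete, here is a minimal sketch, assuming Python 3 and the standard library's html.parser rather than the BeautifulSoup code above. WhitelistFilter and filterByWhitelist are hypothetical names. Tags outside the whitelist are dropped while their text is kept, and whitelisted tags are written back unchanged.

```python
from html import escape
from html.parser import HTMLParser

VALID_TAGS = {"strong", "em", "p", "ul", "li", "br", "a"}


class WhitelistFilter(HTMLParser):
    """Re-emit whitelisted tags; drop all other tags but keep their text."""

    def __init__(self, validTags):
        super().__init__(convert_charrefs=True)
        self.validTags = validTags
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.validTags:
            # get_starttag_text() returns the start tag exactly as written.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />: emit once, without an end tag.
        if tag in self.validTags:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.validTags:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again for the output.
        self.out.append(escape(data, quote=False))


def filterByWhitelist(html, validTags=VALID_TAGS):
    parser = WhitelistFilter(validTags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(filterByWhitelist("<div><b>Hi</b> <em>there</em><br /></div>"))
# Hi <em>there</em><br />
```

With this shape, stripping br and a as well would just mean leaving them out of the whitelist, for example filterByWhitelist(html, set()) to drop every tag.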

2. In the end, I could only remove the tags manually while keeping their content:

import re

def filterHtmlTag(origHtml):
    """
    filter html tag, but retain its contents
    eg:
        Brooklyn, NY 11220<br />
        Brooklyn, NY 11220
        
        <a href="mailto:Bayridgenissan42@yahoo.com">Bayridgenissan42@yahoo.com</a><br />
        Bayridgenissan42@yahoo.com
        
        <a href="javascript:void(0);" onClick="window.open(new Array('http','',':','//','stores.ebay.com','/Bay-Ridge-Nissan-of-New-York?_rdc=1').join(''), '_blank')">stores.ebay.com</a>
        stores.ebay.com
        
        <a href="javascript:void(0);" onClick="window.open(new Array('http','',':','//','www.carfaxonline.com','/cfm/Display_Dealer_Report.cfm?partner=AXX_0&UID=C367031&vin=JH4KB2F61AC001005').join(''), '_blank')">www.carfaxonline.com</a>
        www.carfaxonline.com        
    """
    #logging.info("html tag, origHtml=%s", origHtml)
    filteredHtml = origHtml

    #Method 1: auto remove tag use re
    #remove br (raw strings keep \s and \g intact for the regex engine)
    filteredHtml = re.sub(r"<br\s*>", "", filteredHtml, flags=re.I)
    filteredHtml = re.sub(r"<br\s*/>", "", filteredHtml, flags=re.I)
    #logging.info("remove br, filteredHtml=%s", filteredHtml)
    #remove a
    filteredHtml = re.sub(r"<a\s+[^<>]+>(?P<aContent>[^<>]+?)</a>", r"\g<aContent>", filteredHtml, flags=re.I)
    #logging.info("remove a, filteredHtml=%s", filteredHtml)
    #remove b,strong
    #flags must be passed by keyword: the 4th positional argument of re.sub is count
    filteredHtml = re.sub(r"<b>(?P<bContent>[^<>]+?)</b>", r"\g<bContent>", filteredHtml, flags=re.I)
    filteredHtml = re.sub(r"<strong>(?P<strongContent>[^<>]+?)</strong>", r"\g<strongContent>", filteredHtml, flags=re.I)
    #logging.info("remove b,strong, filteredHtml=%s", filteredHtml)

    return filteredHtml
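A note on the flags argument: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a re.I passed as the fourth positional argument is taken as count, and the match stays case-sensitive. The small sketch below (assuming Python 3, with a made-up sample string) shows the difference. It spells the positional case out as count=int(re.I) to make explicit what actually happens.

```python
import re

pattern = r"<b>(?P<bContent>[^<>]+?)</b>"
html = "<B>bold</B>"

# flags passed by keyword: the match ignores case and the tag is removed.
withFlags = re.sub(pattern, r"\g<bContent>", html, flags=re.I)
print(withFlags)  # bold

# What a positional re.I really means: count=2 and flags=0,
# so the uppercase <B> does not match and nothing is replaced.
asCount = re.sub(pattern, r"\g<bContent>", html, count=int(re.I))
print(asCount)  # <B>bold</B>
```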

3. I will keep updating this function in the future.
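One possible direction for such an update, given only as a rough sketch and not as the final version of filterHtmlTag: a single generic pattern that removes any tag and keeps the text. It assumes Python 3, stripAllTags and TAG_PATTERN are hypothetical names, and it breaks on input where an attribute value itself contains < or >.

```python
import re
from html import unescape

# Any opening, closing or self-closing tag: <a ...>, </a>, <br />, ...
TAG_PATTERN = re.compile(r"</?[a-zA-Z][^<>]*>")


def stripAllTags(html):
    """Remove every tag, keep the text, then decode entities such as &amp;."""
    return unescape(TAG_PATTERN.sub("", html))


sample = '<a href="mailto:Bayridgenissan42@yahoo.com">Bayridgenissan42@yahoo.com</a><br />'
print(stripAllTags(sample))
```

Unlike the per-tag substitutions, this does not need a new line for each tag type, at the cost of removing tags that might have been worth keeping.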


