[Solved] Extracting nodes whose class matches a pattern with BeautifulSoup in Python

BeautifulSoup · crifan

While working on:

https://github.com/crifan/BlogsToWordpress/issues/1

I wanted to use Python's BeautifulSoup to extract, from this markup:

<div class="ui-1582983425 noselect js-zfrg-6541" style="display: block;">
  <span class="pgi pgb iblock fc03 bgc9 bdc0 js-znpg-097">上一页</span>
  <span class="pgi zpg1 iblock fc03 bgc9 bdc0 js-zslt-987 fc05">1</span>
  <span class="frg fgp fc06">…</span>
  <span class="pgi zpg2 iblock fc03 bgc9 bdc0">2</span>
  <span class="pgi zpg3 iblock fc03 bgc9 bdc0">3</span>
  <span class="pgi zpg4 iblock fc03 bgc9 bdc0">4</span>
  <span class="pgi zpg5 iblock fc03 bgc9 bdc0">5</span>
  <span class="pgi zpg6 iblock fc03 bgc9 bdc0">6</span>
  <span class="pgi zpg7 iblock fc03 bgc9 bdc0">7</span>
  <span class="pgi zpg8 iblock fc03 bgc9 bdc0">8</span>
  <span class="frg fgn fc06">…</span>
  <span class="pgi zpg9 iblock fc03 bgc9 bdc0">58</span>
  <span class="pgi pgb iblock fc03 bgc9 bdc0">下一页</span>
</div>

the following node:

<span class="pgi zpg9 iblock fc03 bgc9 bdc0">58</span>

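So the goal is to find the nodes whose

class starts with pgi zpg

(or, to be more precise, whose class is:

class="pgi zpgN iblock fc03 bgc9 bdc0"

where N is a number with any count of digits),

and then, from the resulting list, take the last one.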
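
To make the rule concrete before involving any parser, here is a minimal sketch that applies it to plain class attribute strings copied from the markup above. The names `class_values`, `page_class_pattern`, `matching` and `last_match` are illustrative only, and anchoring the pattern with `^` is an assumption that follows from the "starts with" wording.

```python
import re

# Class attribute values copied from the pagination markup above (abridged).
class_values = [
    "pgi pgb iblock fc03 bgc9 bdc0 js-znpg-097",
    "pgi zpg1 iblock fc03 bgc9 bdc0 js-zslt-987 fc05",
    "frg fgp fc06",
    "pgi zpg2 iblock fc03 bgc9 bdc0",
    "pgi zpg9 iblock fc03 bgc9 bdc0",
    "pgi pgb iblock fc03 bgc9 bdc0",
]

# "pgi zpgN iblock fc03 bgc9 bdc0", where N is one or more digits.
# Extra trailing classes (such as "js-zslt-987 fc05") are still accepted.
page_class_pattern = re.compile(r"^pgi zpg\d+ iblock fc03 bgc9 bdc0")

# Keep only the class strings that follow the rule, in document order.
matching = [value for value in class_values if page_class_pattern.search(value)]

# The wanted entry is the last one in the resulting list.
last_match = matching[-1]
print(last_match)
```

The "previous page" and "next page" buttons use `pgb` instead of `zpgN`, so they drop out, as do the `frg` ellipsis entries.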

Searched for: beautifulsoup find class contains

python – Beautiful Soup if Class "Contains" or Regex? – Stack Overflow

soup.select does not seem flexible enough for this?
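
As a rough illustration of that doubt, the sketch below uses `select` from Beautiful Soup 4 (an assumption: the post itself runs BeautifulSoup 3.0.6, which has no `select`). A CSS attribute selector can express "class contains ` zpg`", but it cannot express "followed by digits only", so a regex is still the more precise tool. The sample HTML is abridged from the markup above.

```python
from bs4 import BeautifulSoup  # Beautiful Soup 4, not the 3.0.6 used in the post

html = """
<div>
  <span class="pgi pgb iblock fc03 bgc9 bdc0">上一页</span>
  <span class="pgi zpg1 iblock fc03 bgc9 bdc0">1</span>
  <span class="frg fgp fc06">…</span>
  <span class="pgi zpg9 iblock fc03 bgc9 bdc0">58</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# span elements that have the class "pgi" and whose class attribute
# contains the substring " zpg"; the digits after "zpg" are not checked.
nodes = soup.select('span.pgi[class*=" zpg"]')
texts = [node.get_text(strip=True) for node in nodes]
print(texts)
```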

python 2.7 – Beautiful Soup – Class contains ‘a’ and not contains ‘b’ – Stack Overflow

Searched for: beautifulsoup

The BeautifulSoup version in use here is 3.0.6.

Searched for: beautifulsoup 3

Beautiful Soup documentation

"soup.findAll(attrs={'id' : re.compile("para$")})"

Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation

Beautiful Soup 4.4.0 documentation (Chinese translation)

So, try matching with a regular expression from the re module.

[Summary]

The final solution:

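import logging
import re

# htmlToSoup and respHtml are defined elsewhere (not shown here):
# the helper turns the fetched page HTML into a BeautifulSoup object
soup = htmlToSoup(respHtml)
# BeautifulSoup 3 matches the regex against the whole class attribute string
pageClassPattern = re.compile(r"pgi zpg\d+ iblock fc03 bgc9 bdc0")
logging.debug("pageClassPattern=%s", pageClassPattern)
allPageNodeList = soup.findAll(attrs={"class" : pageClassPattern})
logging.debug("allPageNodeList=%s", allPageNodeList)
if allPageNodeList:
    # the last matching node carries the highest page number (58 in the sample)
    lastPageNumNode = allPageNodeList[-1]
    logging.debug("lastPageNumNode=%s", lastPageNumNode)
    lastPageNumStr = lastPageNumNode.string.strip()
    logging.debug("lastPageNumStr=%s", lastPageNumStr)
    lastPageNum = int(lastPageNumStr)
    logging.debug("lastPageNum=%s", lastPageNum)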

This finds the page-number nodes and extracts the required number (58 for the sample above).
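
For readers on Beautiful Soup 4 rather than the 3.0.6 used here, the same idea might be sketched as follows. This is not the code from the post: `get_last_page_num` and `SAMPLE_HTML` are hypothetical names, the HTML is abridged from the markup above, and the pattern is narrowed to the single `zpgN` class name because Beautiful Soup 4 treats `class` as a multi-valued attribute and tests a regex against each class name.

```python
import re

from bs4 import BeautifulSoup  # Beautiful Soup 4

# Abridged copy of the pagination markup from the post.
SAMPLE_HTML = """
<div class="ui-1582983425 noselect js-zfrg-6541" style="display: block;">
  <span class="pgi zpg1 iblock fc03 bgc9 bdc0 js-zslt-987 fc05">1</span>
  <span class="frg fgp fc06">…</span>
  <span class="pgi zpg2 iblock fc03 bgc9 bdc0">2</span>
  <span class="frg fgn fc06">…</span>
  <span class="pgi zpg9 iblock fc03 bgc9 bdc0">58</span>
  <span class="pgi pgb iblock fc03 bgc9 bdc0">下一页</span>
</div>
"""


def get_last_page_num(html):
    """Return the number in the last zpgN span, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # Match spans that carry a class name of the form zpg<digits>.
    page_nodes = soup.find_all("span", class_=re.compile(r"^zpg\d+$"))
    if not page_nodes:
        return None
    # Nodes come back in document order, so the last one is the final page.
    return int(page_nodes[-1].get_text(strip=True))


print(get_last_page_num(SAMPLE_HTML))
```

Returning `None` when nothing matches mirrors the `if allPageNodeList` guard in the original code, so a page without pagination does not raise an exception.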
