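[Solved] Extracting information from Baidu search results with pyppeteer

While working on:

[Unsolved] Automating the browser with puppeteer on Mac to run a Baidu search

I could already trigger a Baidu search with pyppeteer. The next step is to extract the search results.

First, figure out how to match the elements.

Take the element found in:

[Notes] Viewing the HTML source of elements on the Baidu home page with Chrome or Chromium

namely: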

<h3 class="t"><a data-click="{
            'F':'778317EA',
            'F1':'9D73F1E4',
            'F2':'4CA6DE6B',
            'F3':'54E5243F',
            'T':'1616767238',
                        'y':'EFBCEFBE'
                                                }" href="https://www.baidu.com/link?url=nDSbU9I2MSInD6Tq7Je06wZD-CiTQ-ckokscP4kiXneJcS0UWUPIqWHMjLDyn5uW&wd=&eqid=919e8ff000236bc300000004605de906" target="_blank"><em>crifan</em> (<em>Crifan</em> Li) · GitHub</a></h3>

Now see how to write a selector that matches it:

    h3ASelector = "h3[class^='t'] a"
    aElemList = await page.querySelectorAll(h3ASelector)
    print("aElemList=%s" % aElemList)

The elements are matched.

Continuing the investigation.
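As a side note, the selector logic can be checked offline against a saved HTML snippet, without launching a browser. The following is only a minimal sketch using Python's standard `html.parser`, not what pyppeteer does internally; the names `H3LinkParser` and `find_h3_links` are made up for illustration. It imitates `h3[class^='t'] a`, where `^=` is a prefix match on the whole `class` attribute value (so `class="title"` would also match, unlike the stricter `h3.t a`).

```python
from html.parser import HTMLParser


class H3LinkParser(HTMLParser):
    """Collect <a> tags nested in an <h3> whose class attribute starts with 't'."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False    # True while inside a matching <h3>
        self.current = None   # the <a> currently being read, if any
        self.links = []       # collected {"href": ..., "text": ...} dicts

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and (attrs.get("class") or "").startswith("t"):
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            self.current = {"href": attrs.get("href"), "text": ""}

    def handle_data(self, data):
        # Text inside nested tags such as <em> is appended too, like textContent
        if self.current is not None:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.links.append(self.current)
            self.current = None
        elif tag == "h3":
            self.in_h3 = False


def find_h3_links(html):
    """Return the links that the selector h3[class^='t'] a would roughly match."""
    parser = H3LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Feeding it the `<h3 class="t">` snippet above should give one entry whose text is the concatenation of the text nodes inside the `<a>`, including the parts wrapped in `<em>`.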

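Next, figure out how to extract the element's value.

Searching for:

puppeteer extract text

turned up:

javascript – how to get text inside div in puppeteer – Stack Overflow

I then noticed that the official repository

https://github.com/pyppeteer/pyppeteer

already shows exactly this: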

element = await page.querySelector('h1')
title = await page.evaluate('(element) => element.textContent', element)
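One thing to watch with the snippet above: `page.querySelector` returns `None` when nothing matches, and passing `None` into `page.evaluate` would then fail inside the JavaScript function. A small wrapper can guard against that. This is just a sketch; `get_text` is a hypothetical helper name, and `page` is assumed to be a pyppeteer `Page` object.

```python
async def get_text(page, selector):
    """Return the textContent of the first element matching selector, or None."""
    element = await page.querySelector(selector)
    if element is None:
        # Nothing matched, so there is no text to read
        return None
    # Same call as in the pyppeteer README, with the element passed as argument
    return await page.evaluate("(element) => element.textContent", element)
```

Inside a coroutine it would be called as `title = await get_text(page, 'h1')`.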

According to:

Pyppeteer 使用笔记 (Pyppeteer usage notes) – 拐弯 – 博客园

you can use:

    # elements = await page.xpath('//div[@class="title-box"]/a')
    elements = await page.querySelectorAll(".title-box a")
    for item in elements:
        print(await item.getProperty('textContent'))
        # <pyppeteer.execution_context.JSHandle object at 0x000002220E7FE518>

        # Get the text
        title_str = await (await item.getProperty('textContent')).jsonValue()

        # Get the link
        title_link = await (await item.getProperty('href')).jsonValue()
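The `getProperty(...)` followed by `jsonValue()` pattern shows up for every property, so it can be folded into a helper. A minimal sketch, assuming `element` is a pyppeteer `ElementHandle`; the names `get_property_value` and `link_info` are hypothetical and not part of pyppeteer.

```python
async def get_property_value(element, name):
    """Read a DOM property of an element handle as a plain Python value."""
    handle = await element.getProperty(name)  # a JSHandle, not the value itself
    return await handle.jsonValue()           # convert the JSHandle to a Python value


async def link_info(element):
    """Return the stripped text and the href of an <a> element handle."""
    text = await get_property_value(element, "textContent")
    href = await get_property_value(element, "href")
    return {"title": (text or "").strip(), "href": href}
```

In the loop above this would read `info = await link_info(item)`.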

Continuing with my own code:

    searchResultANum = len(searchResultAList)
    print("searchResultANum=%s" % searchResultANum)
    for curIdx, aElem in enumerate(searchResultAList):
      curNum = curIdx + 1
      print("%s [%d] %s" % ("-"*20, curNum, "-"*20))
      aTextJSHandle = await aElem.getProperty('textContent')
      print("type(aTextJSHandle)=%s" % type(aTextJSHandle))
      print("aTextJSHandle=%s" % aTextJSHandle)
      title = await aTextJSHandle.jsonValue()
      print("type(title)=%s" % type(title))
      print("title=%s" % title)


      baiduLinkUrl = await (await aElem.getProperty("href")).jsonValue()
      print("baiduLinkUrl=%s" % baiduLinkUrl)

While debugging, I ran into:

[Solved] page.querySelectorAll returns no results at runtime in pyppeteer

After fixing that, the code:

    resultASelector = "h3[class^='t'] a"
    searchResultAList = await page.querySelectorAll(resultASelector)
    print("searchResultAList=%s" % searchResultAList)
    searchResultANum = len(searchResultAList)
    print("searchResultANum=%s" % searchResultANum)
    for curIdx, aElem in enumerate(searchResultAList):
      curNum = curIdx + 1
      print("%s [%d] %s" % ("-"*20, curNum, "-"*20))
      aTextJSHandle = await aElem.getProperty('textContent')
      print("type(aTextJSHandle)=%s" % type(aTextJSHandle))
      print("aTextJSHandle=%s" % aTextJSHandle)
      title = await aTextJSHandle.jsonValue()
      print("type(title)=%s" % type(title))
      print("title=%s" % title)


      baiduLinkUrl = await (await aElem.getProperty("href")).jsonValue()
      print("baiduLinkUrl=%s" % baiduLinkUrl)

ran correctly on the first try.

Output:

searchResultAList=[<pyppeteer.element_handle.ElementHandle object at 0x10309e860>, <pyppeteer.element_handle.ElementHandle object at 0x10309e278>, <pyppeteer.element_handle.ElementHandle object at 0x10309e0f0>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0b00>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0710>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0198>, <pyppeteer.element_handle.ElementHandle object at 0x1030b06d8>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0160>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0ef0>, <pyppeteer.element_handle.ElementHandle object at 0x1030b30b8>]
searchResultANum=10
-------------------- [1] --------------------
type(aTextJSHandle)=<class 'pyppeteer.execution_context.JSHandle'>
aTextJSHandle=<pyppeteer.execution_context.JSHandle object at 0x10309c9b0>
type(title)=<class 'str'>
title=在路上on the way - 走别人没走过的路,让别人有路可走
baiduLinkUrl=http://www.baidu.com/link?url=1l0yCSpJcbVYGRoCdaXUal3NIDZXTn9F1Q4XwZcd6KzpAmUVPLca_wIIqFVlaBs6

[Summary]

The final approach is to first use:

    ################################################################################
    # Wait page reload complete
    ################################################################################
    SearchFoundWordsSelector = 'span.nums_text'
    SearchFoundWordsXpath = "//span[@class='nums_text']"


    # await page.waitForSelector(SearchFoundWordsSelector)
    # await page.waitFor(SearchFoundWordsSelector)
    # await page.waitForXPath(SearchFoundWordsXpath)
    # Note: all of the above raise an exception: ElementHandleError Evaluation failed: TypeError: MutationObserver is not a constructor
    #   so use the following instead


    # # Method 1: just wait
    # await page.waitFor(2000) # millisecond


    # Method 2: wait element showing
    SingleWaitSeconds = 1
    while not await page.querySelector(SearchFoundWordsSelector):
      print("Still not found %s, wait %s seconds" % (SearchFoundWordsSelector, SingleWaitSeconds))
      await asyncio.sleep(SingleWaitSeconds)
      # pass

to make sure the page content has finished loading.
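Note that the polling loop above never gives up: if the selector never appears (for example because Baidu changes its markup), the script hangs forever. A variant with a deadline might look like the sketch below. The function name `wait_for_element` and the `interval` and `timeout` parameters are assumptions for illustration, and `page` is assumed to be a pyppeteer `Page`.

```python
import asyncio
import time


async def wait_for_element(page, selector, interval=1.0, timeout=30.0):
    """Poll page.querySelector until selector matches; raise after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        element = await page.querySelector(selector)
        if element is not None:
            return element
        if time.monotonic() >= deadline:
            raise asyncio.TimeoutError(
                "%s not found within %s seconds" % (selector, timeout)
            )
        # Yield to the event loop between polls instead of busy-waiting
        await asyncio.sleep(interval)
```

It would replace the `while` loop with `await wait_for_element(page, SearchFoundWordsSelector)`.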

Then use:

    ################################################################################
    # Extract result
    ################################################################################


    resultASelector = "h3[class^='t'] a"
    searchResultAList = await page.querySelectorAll(resultASelector)
    # print("searchResultAList=%s" % searchResultAList)
    searchResultANum = len(searchResultAList)
    print("Found %s search result:" % searchResultANum)
    for curIdx, aElem in enumerate(searchResultAList):
      curNum = curIdx + 1
      print("%s [%d] %s" % ("-"*20, curNum, "-"*20))
      aTextJSHandle = await aElem.getProperty('textContent')
      # print("type(aTextJSHandle)=%s" % type(aTextJSHandle))
      # type(aTextJSHandle)=<class 'pyppeteer.execution_context.JSHandle'>
      # print("aTextJSHandle=%s" % aTextJSHandle)
      # aTextJSHandle=<pyppeteer.execution_context.JSHandle object at 0x10309c9b0>
      title = await aTextJSHandle.jsonValue()
      # print("type(title)=%s" % type(title))
      # type(title)=<class 'str'>
      print("title=%s" % title)


      baiduLinkUrl = await (await aElem.getProperty("href")).jsonValue()
      print("baiduLinkUrl=%s" % baiduLinkUrl)

to extract the desired results.
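As an alternative to calling `getProperty` and `jsonValue` once per element, pyppeteer's `page.querySelectorAllEval` runs a single JavaScript function over all matched nodes and returns plain data in one round trip. This is not the approach used above, only a sketch of another option; `extract_results` and `EXTRACT_LINKS_JS` are hypothetical names, and I have not verified it against the same Baidu page.

```python
# JavaScript run in the page: map every matched <a> to a plain object
EXTRACT_LINKS_JS = """(nodes) => nodes.map(a => ({
    title: a.textContent.trim(),
    href: a.href,
}))"""


async def extract_results(page, selector="h3[class^='t'] a"):
    """Return a list of {"title": ..., "href": ...} dicts for all matched links."""
    results = await page.querySelectorAllEval(selector, EXTRACT_LINKS_JS)
    # Fall back to an empty list if the evaluation returned nothing
    return results or []
```

Returning a list of dicts instead of printing makes it easier to save the results or process them further.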

Output:

Found 10 search result:
-------------------- [1] --------------------
title=在路上on the way - 走别人没走过的路,让别人有路可走
baiduLinkUrl=http://www.baidu.com/link?url=eGTzEXXlMw-hnvXYSFk8t4VSZPck1dougn7YhfCwBf3ZzGJEHdZYsoAQK-4GBJuP
-------------------- [2] --------------------
title=crifan – 在路上
baiduLinkUrl=http://www.baidu.com/link?url=l6jXejlgARrWj34ODgKWZ9BeNKwyYZLRhLb5B8oDFVqNpHoco8a_qbAdD1m-t_cf
-------------------- [3] --------------------
title=crifan简介_crifan的专栏-CSDN博客_crifan
baiduLinkUrl=http://www.baidu.com/link?url=IIqPM5wuVE_QP7S357-1bJWGGU1kpFcAZ945BaXUQNpaDzXihf_98wAVi05Gk6-8Qu4aGLv2Rv65WJm6Qr5kk_
-------------------- [4] --------------------
title=crifan的微博_微博
baiduLinkUrl=http://www.baidu.com/link?url=NnqeMlu4Jr_Ld-zoui8pbQO4eRMMO9pLd_DHXagqcdZ46NF4CSuyEziKSTpqCNEi
-------------------- [5] --------------------
title=Crifan的电子书大全 | crifan.github.io
baiduLinkUrl=http://www.baidu.com/link?url=uOZ-AmgNBNr3mGdETezIjTvtedH_ueM6-LNOc2QxbjcNeS8LuVBY-kirwogX7qLl
-------------------- [6] --------------------
title=GitHub - crifan/crifanLib: crifan's library
baiduLinkUrl=http://www.baidu.com/link?url=t42I1rYfn32DGw9C6cw_5lB-z1worKzEuROOtWj-Jyf1l2IBNBcz-l85hSKv9s9T
-------------------- [7] --------------------
title=在路上www.crifan.com - 网站排行榜
baiduLinkUrl=http://www.baidu.com/link?url=WwLwfXA72vK08Obyx2hwqA3-wmq8jAisi4VVSt2R0Ml3ccCy_yxeYfxD2xouAX-i5AyUU1U_2EghwVbJ2p-ipa
-------------------- [8] --------------------
title=crifan的专栏_crifan_CSDN博客-crifan领域博主
baiduLinkUrl=http://www.baidu.com/link?url=Cmcn2mXwiZr87FBGQBq85Np0hgGTP_AK2yLUW6GDeA21r7Q5WvUOUjaKZo5Jhb0f
-------------------- [9] --------------------
title=User crifan - Stack Overflow
baiduLinkUrl=http://www.baidu.com/link?url=yGgsq1z2vNDAAeWY-5VDWbHv7e7zPILHI4GVFPZd6MaFrGjYHsb3Onir1Vi6vvZqD7QAGJrZehIYZpcBfh_Gq_
-------------------- [10] --------------------
title=crifan - Bing 词典
baiduLinkUrl=http://www.baidu.com/link?url=UatxhUBL3T_1ikPco5OazvJaWkVqCeCHh4eoA6AX_lP4t_Bx3GVHlMHZjgu3YAwE


Please credit the source when reposting: 在路上 » [Solved] Extracting information from Baidu search results with pyppeteer
