[Solved] Getting the child elements of a PyQuery node in PySpider

While working on:

[Unsolved] Crawling detailed car model and series data from Autohome (汽车之家) with Python

I needed to start from this HTML:

    <ul class="rank-list-ul" 0>

      <li id="s3170">
        <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
        <div>指导价：<a class="red" href="//www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div>
        <div><a href="//car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170"
            href="//car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170"
            class="js-che168link" href="//www.che168.com/china/series3170/">二手车</a> <a
            href="//club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a
            href="//k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div>

      </li>


      <li id="s692">
...

and extract each list item such as:

<li id="s3170">

I tried several ways of writing it:

            merchantRankDoc = merchantRankDocList[curIdx]
            print("merchantRankDoc=%s" % merchantRankDoc)
            print("type(merchantRankDoc)=%s" % type(merchantRankDoc)) # type(merchantRankDoc)=<class 'lxml.html.HtmlElement'>
            merchantRankHtml = merchantRankDoc.html()
            print("merchantRankHtml=%s" % merchantRankHtml)
            # <li id="s3170">
            # carSeriesDocGenerator = merchantRankDoc.find("li")
            carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
            print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))
            # carSeriesDocGenerator = merchantRankDoc.items("li[id*=s]")
            # carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")

None of them worked. The result was almost always:

None

or only 3 child nodes.

Printing the object showed:

merchantRankDoc=<Element ul at 0x109b69c78>
type(merchantRankDoc)=<class 'lxml.html.HtmlElement'>

That is, the object is a:

lxml.html.HtmlElement

So the question became how to get the multiple li children of a ul that is an lxml.html.HtmlElement.

Reference:

The lxml.etree Tutorial

>>> print(etree.tostring(root,pretty_print=True))
<root>
  <child1/>
  <child2/>
  <child3/>
</root>

>>> children = list(root)

>>> for child in root:
...     print(child.tag)
child1
child2
child3

Trying it:

            carSeriesDocList = list(merchantRankDoc)
            print("carSeriesDocList=%s" % carSeriesDocList)

Then print the HTML:

from lxml import etree
            merchantRankHtml = etree.tostring(merchantRankDoc)
            print("merchantRankHtml=%s" % merchantRankHtml)

Output:

merchantRankHtml=b'<ul class="rank-list-ul">&#13;\n                                                &#13;\n                                                <li id="s3170">&#13;\n                                                <h4><a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">&#22885;&#36842;A3</a></h4><div>&#25351;&#23548;&#20215;&#65306;<a class="red" href="https://www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46&#19975;</a></div><div><a href="https://car.autohome.com.cn/price/series-3170.html#pvareaid=103446">&#25253;&#20215;</a> <a id="atk_3170" href="https://car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">&#22270;&#24211;</a> <a data-value="3170" class="js-che168link" href="https://www.che168.com/china/series3170/">&#20108;&#25163;&#36710;</a> <a href="https://club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">&#35770;&#22363;</a> <a href="https://k.autohome.com.cn/3170/#pvareaid=103459">&#21475;&#30865;</a></div>&#13;\n                                                    &#13;\n                                                </li>&#13;\n                                                &#13;\n                                                。。。
。。。
。。。
href="https://k.autohome.com.cn/812/#pvareaid=103459">&#21475;&#30865;</a></div>&#13;\n                                                    &#13;\n                                                </li>&#13;\n                                                &#13;\n                                            </ul>&#13;\n                                            &#13;\n                                            '

Then get its child elements:

            carSeriesDocList = list(merchantRankDoc)
            print("carSeriesDocList=%s" % carSeriesDocList)
            carSeriesDocListLen = len(carSeriesDocList)
            print("carSeriesDocListLen=%s" % carSeriesDocListLen)

Output:

carSeriesDocList=[<Element li at 0x109b92c28>, <Element li at 0x109b92048>, <Element li at 0x109ba2548>, <Element li at 0x109ba2b38>, <Element li at 0x109ba2048>, <Element li at 0x109ba22c8>, <Element li at 0x109ba2908>, <Element li at 0x109ba2188>, <Element li at 0x109ba26d8>, <Element li at 0x109ba2b88>, <Element li at 0x109ba2ea8>, <Element li at 0x109ba2098>, <Element li at 0x109ba2e58>, <Element li at 0x109ba2368>, <Element li at 0x109ba2138>]
carSeriesDocListLen=15

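This does seem to return the li elements among the child nodes.

But it cannot directly search for only the ones that match a condition. For example, what I want is: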

      <li id="s4871">
。。。

but not:

<li class="dashline"></li>

So the next step was to find out how to match only those.
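As a minimal sketch of one way to do this filtering directly on the lxml element, the example below uses an XPath expression on the ul. The sample HTML is a shortened stand-in for the real page, the helper name series_items is hypothetical, and matching ids that start with "s" is an assumption (the attempts above use the looser "contains" selector li[id*='s']).

```python
from lxml import html

# Shortened stand-in for the real Autohome markup (assumption, not the full page).
SAMPLE_HTML = """
<ul class="rank-list-ul">
  <li id="s3170"><h4><a href="//www.autohome.com.cn/3170/">奥迪A3</a></h4></li>
  <li class="dashline"></li>
  <li id="s692"><h4><a href="//www.autohome.com.cn/692/">奥迪A4L</a></h4></li>
</ul>
""".strip()


def series_items(ul_element):
    """Return the direct <li> children whose id attribute starts with 's'."""
    return ul_element.xpath("./li[starts-with(@id, 's')]")


ul = html.fromstring(SAMPLE_HTML)

# list(element) gives every direct child, including the unwanted dashline <li>.
all_children = list(ul)

# The XPath filter keeps only the car-series items.
wanted = series_items(ul)

print(len(all_children), [li.get("id") for li in wanted])
```

This stays entirely in lxml, so the results are still raw lxml.html.HtmlElement objects rather than PyQuery objects.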

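Then it occurred to me: items() returned a generator earlier, so perhaps looping over it with for yields PyQuery objects rather than lxml ones.

That turned out to be the case. Whether the generator is converted to a list: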

        merchantDocGenerator = response.doc("dd div[class='h3-tit'] a").items()
        merchantDocList = list(merchantDocGenerator)
        print("merchantDocList=%s" % merchantDocList)

or looped over directly with for, every item is a PyQuery:

type(merchantItem)=<class 'pyquery.pyquery.PyQuery'>

rather than an lxml element.

pyquery – PyQuery complete API — pyquery 1.2.4 documentation

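I went to check whether items() can take a selector argument directly.

Then I realized why I had been getting lxml elements rather than PyQuery objects all along: I had forgotten to call items(). Adding it: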

        # merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']")
        merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']").items()

With that, it works:

merchantRankDocListLen=24

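The later attempt to get the child elements failed because of a typo (the wrong variable was used). Changing it back to the correct one: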

            # carSeriesDocList = list(merchantRankDoc)
            carSeriesDocList = list(carSeriesDocGenerator)

At least the logic was now correct.

Next, looking again at:

carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")

the subsequent output was:

carSeriesDocListLen=13
--------------------------------------------------------------------------------
[0] eachCarSeriesDoc=<Element li at 0x1082f8228>
type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
carSeriesInfoDoc=<Element h4 at 0x10831d0e8>

After switching to items(), the subsequent output was:

[0] eachCarSeriesDoc=<li id="s3170">&#13;
                                                <h4><a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&amp;pvareaid=101594">奥迪A3</a></h4><div>指导价：<a class="red" href="https://www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div><div><a href="https://car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170" href="https://car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170" class="js-che168link" href="https://www.che168.com/china/series3170/">二手车</a> <a href="https://club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a href="https://k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div>&#13;
                                                    &#13;
                                                </li>&#13;
                                                &#13;
                                                
type(eachCarSeriesDoc)=<class 'pyquery.pyquery.PyQuery'>
carSeriesInfoDoc=<h4><a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&amp;pvareaid=101594">奥迪A3</a></h4>
carSeriesName=奥迪A3

so the child elements under each li can be retrieved.

Further debugging showed that, for:

    <ul class="rank-list-ul" 0>

      <li id="s3170">
。。。
      </li>

      <li id="s692">
。。。
      </li>

with find():

carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))

the return value is a PyQuery:

type(carSeriesDocGenerator)=<class 'pyquery.pyquery.PyQuery'>

Converting that PyQuery to a list and iterating over it:

carSeriesDocList = list(carSeriesDocGenerator)
for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
    print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc))

each element is:

type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>

With items() instead:

carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))

the return value is a generator:

type(carSeriesDocGenerator)=<class 'generator'>

After converting the generator to a list:

carSeriesDocList = list(carSeriesDocGenerator)
for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
    print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc))

each element is:

type(eachCarSeriesDoc)=<class 'pyquery.pyquery.PyQuery'>

This matches the official documentation:

pyquery – PyQuery complete API — pyquery 1.2.4 documentation

PyQuery.items(selector=None)
    
Iter over elements. Return PyQuery objects:

-> items() returns a generator of PyQuery (= pyquery.pyquery.PyQuery) objects

PyQuery.find(selector)
    
Find elements using selector traversing down from self:

-> find() returns a PyQuery object, but iterating over it (or calling list() on it) yields raw elements of type lxml.html.HtmlElement

[Summary]

Here, for the HTML:

    <ul class="rank-list-ul" 0>

      <li id="s3170">
        <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
        <div>指导价：<a class="red" href="//www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div>
        <div><a href="//car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170"
            href="//car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170"
            class="js-che168link" href="//www.che168.com/china/series3170/">二手车</a> <a
            href="//club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a
            href="//k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div>
      </li>

      <li id="s692">
        <h4><a href="//www.autohome.com.cn/692/#levelsource=000000000_0&pvareaid=101594">奥迪A4L</a></h4>
        <div>指导价：<a class="red" href="//www.autohome.com.cn/692/price.html#pvareaid=101446">30.58-39.68万</a></div>
        <div><a href="//car.autohome.com.cn/price/series-692.html#pvareaid=103446">报价</a> <a id="atk_692"
            href="//car.autohome.com.cn/pic/series/692.html#pvareaid=103448">图库</a> <a data-value="692"
            class="js-che168link" href="//www.che168.com/china/series692/">二手车</a> <a
            href="//club.autohome.com.cn/bbs/forum-c-692-1.html#pvareaid=103447">论坛</a> <a
            href="//k.autohome.com.cn/692/#pvareaid=103459">口碑</a></div>
      </li>
。。。

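the goal was to get the multiple li nodes under the ul.

The earlier problems had two main causes.

First, a typo: the wrong variable was used.

Second, I was not aware that find() and items() give different results: looping over what find() returns yields lxml.html.HtmlElement objects, while items() yields PyQuery objects.

Here PyQuery objects are wanted, so items() should be used.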

The code using find() was:

            carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
            # carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
            print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))

            carSeriesDocList = list(carSeriesDocGenerator)
            print("type(carSeriesDocList)=%s" % type(carSeriesDocList))
            print("carSeriesDocList=%s" % carSeriesDocList)
            carSeriesDocListLen = len(carSeriesDocList)
            print("carSeriesDocListLen=%s" % carSeriesDocListLen)

            for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
                print("%s" % "-"*80)
                print("[%d] eachCarSeriesDoc=%s" % (curSeriesIdx, eachCarSeriesDoc))
                print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc)) # type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
                # <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
                carSeriesInfoDoc = eachCarSeriesDoc.find("h4 a")
                print("type(carSeriesInfoDoc)=%s" % type(carSeriesInfoDoc))
                print("carSeriesInfoDoc=%s" % carSeriesInfoDoc)
                carSeriesName = carSeriesInfoDoc.text()
                print("carSeriesName=%s" % carSeriesName)
                carSeriesUrl = carSeriesInfoDoc.attr.href
                print("carSeriesUrl=%s" % carSeriesUrl)

Output (it fails with ValueError: Empty tag name):

type(carSeriesDocGenerator)=<class 'pyquery.pyquery.PyQuery'>
type(carSeriesDocList)=<class 'list'>
carSeriesDocList=[<Element li at 0x109bc3a98>, <Element li at 0x109bc36d8>, <Element li at 0x109bc3908>, <Element li at 0x109bc3b88>, <Element li at 0x109bc3e58>, <Element li at 0x109b9c908>, <Element li at 0x109bc2c78>, <Element li at 0x109bc2d68>, <Element li at 0x109bc21d8>, <Element li at 0x109bc2958>, <Element li at 0x109bc2db8>, <Element li at 0x109bc2908>, <Element li at 0x109bc27c8>]
carSeriesDocListLen=13
--------------------------------------------------------------------------------
[0] eachCarSeriesDoc=<Element li at 0x109bc3a98>
type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
[E 200815 22:15:25 base_handler:203] Empty tag name
    Traceback (most recent call last):
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 176, in _run_task
        return self._run_func(function, response, task)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 155, in _run_func
        ret = function(*arguments[:len(args) - 1])
      File "<autohome_20200814>", line 138, in gradCarHtmlPage
      File "src/lxml/etree.pyx", line 1532, in lxml.etree._Element.find
      File "src/lxml/_elementpath.py", line 325, in lxml._elementpath.find
      File "src/lxml/_elementpath.py", line 102, in select
      File "src/lxml/_elementpath.py", line 103, in select
      File "src/lxml/etree.pyx", line 1437, in lxml.etree._Element.iterchildren
      File "src/lxml/etree.pyx", line 2841, in lxml.etree.ElementChildIterator.__cinit__
      File "src/lxml/etree.pyx", line 2812, in lxml.etree._ElementMatchIterator._initTagMatcher
      File "src/lxml/etree.pyx", line 2679, in lxml.etree._MultiTagMatcher.__cinit__
      File "src/lxml/etree.pyx", line 2718, in lxml.etree._MultiTagMatcher.initTagMatch
      File "src/lxml/etree.pyx", line 2749, in lxml.etree._MultiTagMatcher._storeTags
      File "src/lxml/etree.pyx", line 2736, in lxml.etree._MultiTagMatcher._storeTags
      File "src/lxml/apihelpers.pxi", line 1657, in lxml.etree._getNsTag
      File "src/lxml/apihelpers.pxi", line 1692, in lxml.etree.__getNsTag
    ValueError: Empty tag name

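Here eachCarSeriesDoc is an lxml.html.HtmlElement, so eachCarSeriesDoc.find("h4 a") calls lxml's own find(), which expects an ElementPath expression rather than a CSS selector, hence the error. Replacing find() with items():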

            carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
            print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))

            carSeriesDocList = list(carSeriesDocGenerator)
            print("type(carSeriesDocList)=%s" % type(carSeriesDocList))
            print("carSeriesDocList=%s" % carSeriesDocList)
            carSeriesDocListLen = len(carSeriesDocList)
            print("carSeriesDocListLen=%s" % carSeriesDocListLen)
            
            for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
                print("%s" % "-"*80)
                print("[%d] eachCarSeriesDoc=%s" % (curSeriesIdx, eachCarSeriesDoc))
                print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc)) # type(eachCarSeriesDoc)=<class 'pyquery.pyquery.PyQuery'>
                # <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
                carSeriesInfoDoc = eachCarSeriesDoc.find("h4 a")
                print("type(carSeriesInfoDoc)=%s" % type(carSeriesInfoDoc))
                print("carSeriesInfoDoc=%s" % carSeriesInfoDoc)
                carSeriesName = carSeriesInfoDoc.text()
                print("carSeriesName=%s" % carSeriesName)
                carSeriesUrl = carSeriesInfoDoc.attr.href
                print("carSeriesUrl=%s" % carSeriesUrl)

and it works:

type(carSeriesDocGenerator)=<class 'generator'>
type(carSeriesDocList)=<class 'list'>
carSeriesDocList=[[<li#s3170>], [<li#s692>], [<li#s18>], [<li#s4526>], [<li#s4871>], [<li#s5240>], [<li#s2951>], [<li#s4851>], [<li#s3304>], [<li#s5765>], [<li#s19>], [<li#s509>], [<li#s812>]]
carSeriesDocListLen=13
--------------------------------------------------------------------------------
[0] eachCarSeriesDoc=<li id="s3170">&#13;
                                                <h4><a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&amp;pvareaid=101594">奥迪A3</a></h4><div>指导价：<a class="red" href="https://www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div><div><a href="https://car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170" href="https://car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170" class="js-che168link" href="https://www.che168.com/china/series3170/">二手车</a> <a href="https://club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a href="https://k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div>&#13;
                                                    &#13;
                                                </li>&#13;
                                                &#13;
                                                
type(eachCarSeriesDoc)=<class 'pyquery.pyquery.PyQuery'>
type(carSeriesInfoDoc)=<class 'pyquery.pyquery.PyQuery'>
carSeriesInfoDoc=<a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&amp;pvareaid=101594">奥迪A3</a>
carSeriesName=奥迪A3
carSeriesUrl=https://www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594

That is all that is needed.

When reposting, please credit: 在路上 » [Solved] Getting the child elements of a PyQuery node in PySpider
