
[Solved] Car model and series data missing for some models such as 红旗H5 (Hongqi H5)

Category: Data, Author: crifan
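Data had already been crawled earlier, as described in:
[Solved] 汽车之家 (Autohome) car model and series data: supporting the old-style series page
Later I found that part of that data was missing.
So now let's look into the cause.
Take the brand 红旗 (Hongqi) as an example,
and look at its models currently on sale:
红旗H5 has 12 models on sale,
but the data for the 2020 models was not crawled.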
【车型大全】汽车车型大全_汽车之家
【红旗H5】红旗_红旗H5报价_红旗H5图片_汽车之家
2020款 1.5T DCT旗悦版
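First check whether the 2020 红旗H5 data is really missing from the result data.
->
It is indeed:
the year field here is null.
Opening the corresponding pages: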
https://www.autohome.com.cn/spec/43344/#pvareaid=3454492
https://www.autohome.com.cn/spec/45625/#pvareaid=3454492
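shows that they are 2020 models.
So I debugged to find the cause.
It turned out that, across several nested for loops, the car dict was assigned directly instead of being copied,
so the 2020 data, which had in fact already been crawled: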
------------------------------ [0] ------------------------------
carModelYear=2020款
carModelEmissionStandards=国VI
carModelPower=1.5T
carModelGearBox=7挡双离合
carModelName=2020款 1.5T DCT旗悦版
carModelSpecUrl=https://www.autohome.com.cn/spec/46112/#pvareaid=3454492
typeDefaultListDoc=<generator object PyQuery.items at 0x10ae5cba0>
typeDefaultList=[[<span.type-default>], [<span.type-default>]]
spanTypeDefault0=<span class="type-default">前置前驱</span>
carModelDriveType=前置前驱
spanTypeDefault1=<span class="type-default">7挡双离合</span>
carModelGearBox=7挡双离合
carModelMsrp=14.58万
was later overwritten and turned into the 2019 data.
So I modified the code.
The core part:
            carModelDict = copy.deepcopy(carSeriesDict)

            for curLiIdx, eachHatADoc in enumerate(haltAListDoc):
                curHaltCarDict = copy.deepcopy(carModelDict)
...

            for curSpecWrapIdx, eachSpecWrapDoc in enumerate(carSpecWrapListDoc):
                print("%s [%d] %s" % ('#'*30, curSpecWrapIdx, '#'*30))

                curSpecWrapCarDict = copy.deepcopy(carModelDict)
...
                for curDlIdx, eachDlDoc in enumerate(dlDocList):
                    print("%s [%d] %s" % ('='*30, curDlIdx, '='*30))
    
                    curDlCarDict = copy.deepcopy(curSpecWrapCarDict)
...
                    for curDdIdx, eachDdDoc in enumerate(ddListDoc):
                        print("%s [%d] %s" % ('-'*30, curDdIdx, '-'*30))

                        curDdCarDict = copy.deepcopy(curDlCarDict)
...
                        self.send_message(self.project_name, curDdCarDict, url=carModelSpecUrl)
With that change the car info is output correctly,
and the 2020 data is crawled:
[
  [
    "autohome_20200819",
    {
      "carBrandId": "91",
      "carBrandLogoUrl": "https://car3.autoimg.cn/cardfs/series/g26/M05/AE/94/100x100_f40_autohomecar__wKgHEVs9tm6ASWlTAAAUz_2mWTY720.png",
      "carBrandName": "红旗",
      "carMerchantName": "一汽红旗",
      "carMerchantUrl": "https://car.autohome.com.cn/price/brand-91-190.html#pvareaid=2042363",
      "carModelDriveType": "前置前驱",
      "carModelEmissionStandards": "国VI",
      "carModelGearBox": "7挡双离合",
      "carModelGroupName": "1.5升 涡轮增压 169马力 国VI",
      "carModelMsrp": "14.58万",
      "carModelName": "2020款 1.5T DCT旗悦版",
      "carModelPower": "1.5T",
      "carModelSpecUrl": "https://www.autohome.com.cn/spec/46112/#pvareaid=3454492",
      "carModelYear": "2020款",
      "carSeriesId": "4410",
      "carSeriesLevelId": "4",
      "carSeriesLevelName": "中型车",
      "carSeriesMainImgUrl": "https://car2.autoimg.cn/cardfs/product/g3/M04/92/40/380x285_0_q87_autohomecar__ChsEkV8G1BiAFN2JAAlzGHoYv9M868.jpg",
      "carSeriesMaxPrice": "19.08",
      "carSeriesMinPrice": "14.58",
      "carSeriesMsrp": "14.58-19.08万",
      "carSeriesMsrpUrl": "https://www.autohome.com.cn/4410/price.html#pvareaid=101446",
      "carSeriesName": "红旗H5",
      "carSeriesUrl": "https://www.autohome.com.cn/4410/#levelsource=000000000_0&pvareaid=101594"
    },
    "https://www.autohome.com.cn/spec/46112/#pvareaid=3454492"
  ],
Then re-run the crawl; that should probably be enough.
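The root cause described above (reusing one dict across loop iterations instead of copying it) can be reproduced outside the crawler. The following is only a minimal sketch: the function names `collect_shared` and `collect_copied` and the sample data are made up for illustration and are not part of the crawler code.

```python
import copy


def collect_shared(base, years):
    """Buggy pattern: every iteration mutates and stores the SAME dict."""
    results = []
    for year in years:
        car = base                      # alias of base, not a copy
        car["carModelYear"] = year      # overwrites the value set earlier
        results.append(car)
    return results


def collect_copied(base, years):
    """Fixed pattern: every iteration works on its own deep copy."""
    results = []
    for year in years:
        car = copy.deepcopy(base)       # independent dict per iteration
        car["carModelYear"] = year
        results.append(car)
    return results


years = ["2020款", "2019款"]
shared = collect_shared({"carSeriesName": "红旗H5"}, years)
copied = collect_copied({"carSeriesName": "红旗H5"}, years)

print([c["carModelYear"] for c in shared])  # ['2019款', '2019款']
print([c["carModelYear"] for c in copied])  # ['2020款', '2019款']
```

In the shared version both list entries are the same object, so the 2020 value is lost as soon as the 2019 value is written. `copy.deepcopy` is used here to mirror the fix above; for a flat dict of strings a shallow `dict(base)` would behave the same way.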

In addition, I refined some details, for example:
supporting pages where an electric vehicle has 3 type-default spans instead of 2:
【图】名爵6新能源 2020款 混动Trophy旗舰版报价_图片_名爵_汽车之家
Code:
        if typeDefaultList:
            """
            Normal:
                <p>
                    <span class="type-default">前置前驱</span>
                    <span class="type-default">7挡双离合</span>
                </p>


            Special:
                https://www.autohome.com.cn/4605/


                <p>
                    <span class="type-default">电动</span>
                    <span class="type-default">前置前驱</span>
                    <span class="type-default">AMT(组合10挡)</span>
                </p>
            """
            # spanTypeDefault0 = typeDefaultList[0]
            spanTypeDefault0 = typeDefaultList[-2]
            print("spanTypeDefault0=%s" % spanTypeDefault0)
            carModelDriveType = spanTypeDefault0.text()
            print("carModelDriveType=%s" % carModelDriveType)
            # spanTypeDefault1 = typeDefaultList[1]
            spanTypeDefault1 = typeDefaultList[-1]
            print("spanTypeDefault1=%s" % spanTypeDefault1)
            carModelGearBox = spanTypeDefault1.text()
            print("carModelGearBox=%s" % carModelGearBox)
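The negative-index idea can be checked without PyQuery. This is a minimal sketch that assumes the span texts have already been extracted into a plain list of strings. The helper name `pick_drive_and_gearbox` is hypothetical, and the length guard is an addition that the code above does not have (there, a single span would make `typeDefaultList[-2]` raise an `IndexError`).

```python
def pick_drive_and_gearbox(type_texts):
    """Return (drive_type, gearbox) from the texts of the type-default spans.

    Drive type and gearbox are read from the END of the list, so an extra
    leading span such as "电动" does not shift them.
    """
    if len(type_texts) < 2:
        # Not enough spans to tell the two fields apart; return empty values.
        return "", ""
    return type_texts[-2], type_texts[-1]


print(pick_drive_and_gearbox(["前置前驱", "7挡双离合"]))
print(pick_drive_and_gearbox(["电动", "前置前驱", "AMT(组合10挡)"]))
```

Both calls return the drive type and gearbox in the same positions, whether the page has two spans or three.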

And also:
when the sift element does not exist -> the model's year is empty,
extract the year from the model's name instead:
        if not curDdCarDict["carModelYear"]:
            foundYearType = re.search(r"(?P<yearType>\d{4}款)", carModelName)
            if foundYearType:
                yearType = foundYearType.group("yearType")
                print("yearType=%s" % yearType)
                carModelYear = yearType
                print("extract year=%s from modelName=%s" % (carModelYear, carModelName))
                curDdCarDict["carModelYear"] = carModelYear
This extracts, from:
2019款 50T Pro
the value:
2019款
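The year fallback can be tried in isolation. This is a minimal standalone sketch of the same regular expression; the function name `extract_model_year` is made up for illustration, and returning an empty string when nothing matches is an assumption, not something the crawler code does.

```python
import re


def extract_model_year(model_name):
    """Return the 'NNNN款' year prefix found in a model name, or ''."""
    # Raw string so that \d reaches the regex engine unchanged.
    match = re.search(r"(?P<yearType>\d{4}款)", model_name)
    return match.group("yearType") if match else ""


print(extract_model_year("2019款 50T Pro"))        # 2019款
print(extract_model_year("2020款 1.5T DCT旗悦版"))  # 2020款
```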
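I also took the opportunity to tidy up the overall code structure:
everything that could be extracted into a function has been extracted,
which makes the logic easier to review later and easier to debug.
The final complete code is: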
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2020-08-25 21:48:28
# Project: autohome_20200825


import string
import re
import copy


from lxml import etree


from pyspider.libs.base_handler import *


AutohomeHost = "https://www.autohome.com.cn"
CarSpecPrefix = "%s/spec" % AutohomeHost # "https://www.autohome.com.cn/spec/%s/"


class Handler(BaseHandler):
    UserAgent_Mac_Chrome = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
    crawl_config = {
        "headers": {
            "User-Agent": UserAgent_Mac_Chrome,
        }
    }


    def genSpecUrl(self, specId):
        # return "%s/%s" % (CarSpecPrefix, specId)
        return "%s/%s/" % (CarSpecPrefix, specId)


    def genConfigSpecUrl(self, specId):
        configSpecTemplate = "https://car.autohome.com.cn/config/spec/%s.html"
        # https://car.autohome.com.cn/config/spec/43593.html
        return configSpecTemplate % specId
    
    def to10KPrice(self, originPrice):
        tenKPrice = ""
        # 19.08 / '19.08' -> '19.08万'
        if isinstance(originPrice, str):
            tenKPrice = "%s万" % originPrice
        elif isinstance(originPrice, float):
            tenKPrice = "%.2f万" % originPrice
        elif isinstance(originPrice, int):
            tenKPrice = "%s.00万" % originPrice
        
        return tenKPrice


    def extractSpecId(self, specUrl):
        carSpedId = ""
        # https://www.autohome.com.cn/spec/41511/#pvareaid=3454492
        # https://www.autohome.com.cn/spec/2304/
        foundSpecId = re.search(r"spec/(?P<specId>\d+)", specUrl)
        print("foundSpecId=%s" % foundSpecId)
        if foundSpecId:
            carSpedId = foundSpecId.group("specId")
            print("carSpedId=%s" % carSpedId)
        return carSpedId


    # @every(minutes=24 * 60)
    def on_start(self):
        # autohomeEntryUrl = "https://www.autohome.com.cn/car/"
        # self.crawl(autohomeEntryUrl, callback=self.carBrandListCallback)
        for eachLetter in list(string.ascii_lowercase):
            letterUpper = eachLetter.upper()
            # # for debug
            # letterUpper = "H"
            print("letterUpper=%s" % letterUpper)
            self.crawl("https://www.autohome.com.cn/grade/carhtml/%s.html" % eachLetter,
                save={"initials": letterUpper},
                callback=self.gradCarHtmlPage)


    # # @config(age=10 * 24 * 60 * 60)
    # def carBrandListCallback(self, response):
    #     print("response.url=%s" % response.url)
    #     # <div vos="gs" class="uibox" id="boxA" style="">
    #     for eachVosGs in response.doc('div[vos="gs"]').items():
    #         print("eachVosGs=%s" % eachVosGs)
    #         # self.crawl(each.attr.href, callback=self.detail_page)


    # # @config(priority=2)
    # def detail_page(self, response):
    #     return {
    #         "url": response.url,
    #         "title": response.doc('title').text(),
    #     }


    @catch_status_code_error
    def gradCarHtmlPage(self, response):
        print("gradCarHtmlPage: response=", response)


        # picSeriesItemList = response.doc('.rank-list-ul li div a[href*="/pic/series"]').items()
        # print("picSeriesItemList=", picSeriesItemList)
        # print("len(picSeriesItemList)=%s"%(len(picSeriesItemList)))
        # for each in picSeriesItemList:
        #     self.crawl(each.attr.href, callback=self.picSeriesPage)


        saveDict = response.save
        print("saveDict=", saveDict)
        initials = saveDict["initials"]
        print("initials=", initials)
        respText = response.text
        # print("respText=", respText)


        """
        <dl id="33" olr="6">
            <dt><a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362"><img width="50" height="50"
                    src="//car2.autoimg.cn/cardfs/series/g26/M0B/AE/B3/100x100_f40_autohomecar__wKgHEVs9u5WAV441AAAKdxZGE4U148.png"></a>
                <div><a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362">奥迪</a></div>
            </dt>
        """
        # brandDoc = response.doc('dl dt')
        # print("brandDoc=%s" % brandDoc)
        # brandListDoc = response.doc('dl[id and orl] dt')
        # dlListDoc = response.doc('dl[id and orl]').items()
        # dlListDoc = response.doc("dl[id*=''][orl*='']").items()
        # dlListDoc = response.doc("dl[orl*='']").items()
        # dlListDoc = response.doc("dl").items()
        # dlListDoc = response.doc("dl:regex(id, \d+)").items()
        # dlListDoc = response.doc("dl:regex(id,[0-9])").items()
        # dlListDoc = response.doc("dl[id]").items()
        dlListDoc = response.doc("dl[olr]").items()
        print("type(dlListDoc)=%s" % type(dlListDoc))
        dlList = list(dlListDoc)
        print("len(dlList)=%s" % len(dlList))
        print("dlList=%s" % dlList)
        for curBrandIdx, eachDlDoc in enumerate(dlList):
            print("%s [%d] %s" % ('#'*30, curBrandIdx, '#'*30))


            dtDoc = eachDlDoc.find("dt")
            # print("dtDoc=%s" % dtDoc)
            # <a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362"><img width="50" height="50" src="//car2.autoimg.cn/cardfs/series/g26/M0B/AE/B3/100x100_f40_autohomecar__wKgHEVs9u5WAV441AAAKdxZGE4U148.png"></a>
            brandLogoDoc = dtDoc.find('a img')
            # print("brandLogoDoc=%s" % brandLogoDoc)
            carBrandLogoUrl = brandLogoDoc.attr["src"]
            print("carBrandLogoUrl=%s" % carBrandLogoUrl)
            # <div><a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362">奥迪</a></div>
            brandNameDoc = dtDoc.find('div a')
            # print("brandNameDoc=%s" % brandNameDoc)
            carBrandName = brandNameDoc.text()
            print("carBrandName=%s" % carBrandName)


            # <div class="h3-tit"><a href="//car.autohome.com.cn/price/brand-33-9.html#pvareaid=2042363">一汽-大众奥迪</a></div>
            # merchantDocGenerator = response.doc("dd div[class='h3-tit'] a").items()
            # ddDoc = eachDlDoc.find("dd")
            ddDoc = eachDlDoc.find("dd")
            # print("ddDoc=%s" % ddDoc)


            merchantDocGenerator = ddDoc.items("div[class='h3-tit'] a")
            merchantDocList = list(merchantDocGenerator)
            # print("merchantDocList=%s" % merchantDocList)
            merchantDocLen = len(merchantDocList)
            print("merchantDocLen=%s" % merchantDocLen)


            # <ul class="rank-list-ul" 0>
            # merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']")
            # merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']").items()
            merchantRankDocGenerator = ddDoc.items("ul[class='rank-list-ul']")
            merchantRankDocList = list(merchantRankDocGenerator)
            # print("merchantRankDocList=%s" % merchantRankDocList)
            merchantRankDocListLen = len(merchantRankDocList)
            print("merchantRankDocListLen=%s" % merchantRankDocListLen)


            for curIdx, merchantItem  in enumerate(merchantDocList):
            # for curIdx, merchantItem  in enumerate(merchantDocGenerator):
                # print("%s" % "="*80)
                print("%s [%d] %s" % ('='*30, curIdx, '='*30))
                # print("type(merchantItem)=%s" % type(merchantItem))
                # print("[%d] merchantItem=%s" % (curIdx, merchantItem))
                # print("[%d] merchantItem=%s" % (curIdx, merchantItem))
                carMerchantName = merchantItem.text()
                print("carMerchantName=%s" % carMerchantName)
                merchantItemAttr = merchantItem.attr
                # print("merchantItemAttr=%s" % merchantItemAttr)
                carMerchantUrl = merchantItemAttr["href"]
                print("carMerchantUrl=%s" % carMerchantUrl)


                # curSubBrandDict = {
                #     "brandName": brandName,
                #     "carBrandLogoUrl": carBrandLogoUrl,
                #     "carMerchantName": carMerchantName,
                #     "carMerchantUrl": carMerchantUrl,
                # }
                # self.send_message(self.project_name, curSubBrandDict, url=carMerchantUrl)


                merchantRankDoc = merchantRankDocList[curIdx]
                # print("merchantRankDoc=%s" % merchantRankDoc)
                # print("type(merchantRankDoc)=%s" % type(merchantRankDoc))


                # type(merchantRankDoc)=<class 'lxml.html.HtmlElement'>
                # merchantRankHtml = etree.tostring(merchantRankDoc)


                # type(merchantRankDoc)=<class 'pyquery.pyquery.PyQuery'>
                # merchantRankHtml = merchantRankDoc.html()


                # print("merchantRankHtml=%s" % merchantRankHtml)


                # <li id="s3170">
                # carSeriesDocGenerator = merchantRankDoc.find("li")
                # carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
                carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
                # print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))
                carSeriesDocList = list(carSeriesDocGenerator)
                # print("type(carSeriesDocList)=%s" % type(carSeriesDocList))
                # print("carSeriesDocList=%s" % carSeriesDocList)
                carSeriesDocListLen = len(carSeriesDocList)
                # print("carSeriesDocListLen=%s" % carSeriesDocListLen)
                
                for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
                    print("%s [%d] %s" % ('-'*30, curSeriesIdx, '-'*30))
                    # print("[%d] eachCarSeriesDoc=%s" % (curSeriesIdx, eachCarSeriesDoc))
                    # print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc)) # type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
                    # <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
                    carSeriesInfoDoc = eachCarSeriesDoc.find("h4 a")
                    # print("type(carSeriesInfoDoc)=%s" % type(carSeriesInfoDoc))
                    # print("carSeriesInfoDoc=%s" % carSeriesInfoDoc)
                    carSeriesName = carSeriesInfoDoc.text()
                    print("carSeriesName=%s" % carSeriesName)
                    carSeriesUrl = carSeriesInfoDoc.attr.href
                    print("carSeriesUrl=%s" % carSeriesUrl)


                    # <div>指导价:<a class="red" href="//www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div>
                    # 厂商指导价 (manufacturer's guide price) = 厂商建议零售价格 = MSRP = Manufacturer's Suggested Retail Price
                    # carSeriesMsrpDoc = eachCarSeriesDoc.find("div a")
                    carSeriesMsrpDoc = eachCarSeriesDoc.find("div a[class='red']")
                    # print("carSeriesMsrpDoc=%s" % carSeriesMsrpDoc)
                    carSeriesMsrp = carSeriesMsrpDoc.text()
                    print("carSeriesMsrp=%s" % carSeriesMsrp)
                    carSeriesMsrpUrl = carSeriesMsrpDoc.attr.href
                    print("carSeriesMsrpUrl=%s" % carSeriesMsrpUrl)


                    carSeriesDict = {
                        "carBrandName": carBrandName,
                        "carBrandLogoUrl": carBrandLogoUrl,
                        "carMerchantName": carMerchantName,
                        "carMerchantUrl": carMerchantUrl,
                        "carSeriesName": carSeriesName,
                        "carSeriesUrl": carSeriesUrl,
                        "carSeriesMsrp": carSeriesMsrp,
                        "carSeriesMsrpUrl": carSeriesMsrpUrl,
                    }
                    # self.send_message(self.project_name, carSeriesDict, url=carSeriesUrl)
                    self.crawl(carSeriesUrl,
                        callback=self.carSeriesDetailPage,
                        save=carSeriesDict,
                    )


    def on_message(self, project, msg):
        print("on_message: msg=%s" % msg)
        return msg


    @catch_status_code_error
    def carSeriesDetailPage(self, response):
        carSeriesDict = response.save
        print("carSeriesDict=%s" % carSeriesDict)


        carSeriesUrl = response.url
        print("carSeriesUrl=%s" % carSeriesUrl)


        carSeriesMainImgUrl = ""
        carSeriesId = ""
        carSeriesLevelId = ""
        carSeriesMsrp = ""
        carSeriesMinPrice = ""
        carSeriesMaxPrice = ""


        carSeriesHtml = response.text
        print("type(carSeriesHtml)=%s" % type(carSeriesHtml)) # <class 'str'>
        # print("carSeriesHtml=%s" % carSeriesHtml)


        foundLevelId = re.search(r"var\s+levelid\s+=", carSeriesHtml)
        print("foundLevelId=%s" % foundLevelId)
        isNewLayoutHtml = bool(foundLevelId)
        print("isNewLayoutHtml=%s" % isNewLayoutHtml)
        foundShowCityId = re.search(r"var\s+showCityId\s+=", carSeriesHtml)
        print("foundShowCityId=%s" % foundShowCityId)
        isOldLayoutHtml = bool(foundShowCityId)
        print("isOldLayoutHtml=%s" % isOldLayoutHtml)


        if isOldLayoutHtml:
            # brands starting with Q
            # https://www.autohome.com.cn/grade/carhtml/q.html
            # ->
            # 东风悦达起亚-千里马
            # https://www.autohome.com.cn/142/#levelsource=000000000_0&pvareaid=101594
            # others:
            # 
            # 一汽丰田-花冠
            # https://www.autohome.com.cn/109/#levelsource=000000000_0&pvareaid=101594
            # 
            # 昶洧-昶洧 SUV
            # https://www.autohome.com.cn/4550/#levelsource=000000000_0&pvareaid=101594


            """
            <div class="car_detail " id="tab1-2">
                <div class="models">
                <!--年代-->
                    <div class="header">
                        <div class="car_price">
                            <span class="years">2005款</span>
                            <span class="price">指导价(停售):<strong class="red">6.28万-9.18万</strong></span>
                            <span class="price">二手车价格:<strong class="red"><a class='cd60000' href='//www.che168.com/china/qiya/qianlima/a0_0msdgscncgpiltocsp1exs276/?pvareaid=103693'>0.39万-1.30万</a></strong></span>
            。。。
            <div class="car_detail current" id="tab1-1">
                <div class="models">
                    <!--年代-->
                    <div class="header">
                        <div class="car_price">
                            <span class="years">2006款</span>
                            <span class="price">指导价(停售):<strong class="red">7.28万-8.58万</strong></span>
            。。。
            """
            carDetailDivGenerator = response.doc("div[class^='car_detail']").items()
            print("carDetailDivGenerator=%s" % carDetailDivGenerator)
            carDetailDivList = list(carDetailDivGenerator)
            print("carDetailDivList=%s" % carDetailDivList)
            for curDivIdx, eachCarDetailDoc in enumerate(carDetailDivList):
                print("%s [%d] %s" % ('#'*30, curDivIdx, '#'*30))


                if curDivIdx == 0:
                    # use first car model as series: main img, msrp, ...
                    """
                    <div class="models_info">
                        <dl class='models_pics'>
                            <dt><a href='//car.autohome.com.cn/photolist/series/2305/23796.html?pvareaid=101468'><img
                                src='https://car0.autoimg.cn/upload/spec/1344/t_1344388912334.jpg' width='240'
                                height='180' /></a></dt>
                    """
                    # modelMainImgDocListGenerator = response.doc("div[class='models_info'] dl[class='models_pics'] dt a img").items()
                    # modelMainImgDocList = list(modelMainImgDocListGenerator)
                    # firstModelMainImgDoc = modelMainImgDocList[0]
                    firstModelMainImgDoc = eachCarDetailDoc.find("div[class='models_info'] dl[class='models_pics'] dt a img")
                    firstModelMainImgUrl = firstModelMainImgDoc.attr["src"]
                    print("firstModelMainImgUrl=%s" % firstModelMainImgUrl)
                    carSeriesMainImgUrl = firstModelMainImgUrl
                    print("carSeriesMainImgUrl=%s" % carSeriesMainImgUrl)


                    carSeriesDict["carSeriesMainImgUrl"] = carSeriesMainImgUrl


                    # <div class="car_price">
                    #   <span class="price">指导价(停售):<strong class="red">7.28万-8.58万</strong></span>
                    carPriceStrongDocList = list(eachCarDetailDoc.items("div[class='car_price'] span[class='price'] strong[class='red']"))
                    print("carPriceStrongDocList=%s" % carPriceStrongDocList)
                    # a generator object is always truthy, so test the materialized list instead
                    if carPriceStrongDocList:
                        carPriceStrongDoc = carPriceStrongDocList[0]
                        print("carPriceStrongDoc=%s" % carPriceStrongDoc)
                        carPriceMinMax = carPriceStrongDoc.text()
                        print("carPriceMinMax=%s" % carPriceMinMax)
                        if carPriceMinMax:
                            foundMinMax = re.search(r"(?P<minPrice>[\d\.]+)万-(?P<maxPrice>[\d\.]+)万", carPriceMinMax)
                            print("foundMinMax=%s" % foundMinMax)
                            if foundMinMax:
                                minPrice = foundMinMax.group("minPrice")
                                print("minPrice=%s" % minPrice)
                                minPriceFloat = float(minPrice)
                                print("minPriceFloat=%s" % minPriceFloat)
                                maxPrice = foundMinMax.group("maxPrice")
                                print("maxPrice=%s" % maxPrice)
                                maxPriceFloat = float(maxPrice)
                                print("maxPriceFloat=%s" % maxPriceFloat)
                                averageMsrcPrice = (minPriceFloat + maxPriceFloat) / 2.0
                                print("averageMsrcPrice=%s" % averageMsrcPrice)


                                # carSeriesMsrp = "%.2f万" % averageMsrcPrice
                                carSeriesMsrp = self.to10KPrice(averageMsrcPrice)
                                print("carSeriesMsrp=%s" % carSeriesMsrp)
                                # carSeriesMinPrice = "%.2f万" % minPriceFloat
                                carSeriesMinPrice = self.to10KPrice(minPriceFloat)
                                print("carSeriesMinPrice=%s" % carSeriesMinPrice)
                                # carSeriesMaxPrice = "%.2f万" % maxPriceFloat
                                carSeriesMaxPrice = self.to10KPrice(maxPriceFloat)
                                print("carSeriesMaxPrice=%s" % carSeriesMaxPrice)


                                carSeriesDict["carSeriesMsrp"] = carSeriesMsrp
                                carSeriesDict["carSeriesMinPrice"] = carSeriesMinPrice
                                carSeriesDict["carSeriesMaxPrice"] = carSeriesMaxPrice
                    print("")
                
                self.processSingleCarDetailDiv(carSeriesDict, eachCarDetailDoc)


        elif isNewLayoutHtml:
            carModelDict = copy.deepcopy(carSeriesDict)


            # carSeriesUrl=https://www.autohome.com.cn/2123/#levelsource=000000000_0&pvareaid=101594
            foundSeriesId = re.search(r"www\.autohome\.com\.cn/(?P<seriesId>\d+)/", carSeriesUrl)
            carSeriesId = foundSeriesId.group("seriesId")
            # carSeriesId = int(carSeriesId)
            print("carSeriesId=%s" % carSeriesId) # 2123
            carModelDict["carSeriesId"] = carSeriesId


            """
            <div class="information-pic">
                <div class="pic-main">
                。。。
                        <picture>
                            。。。
                            <img sizes="380px" width="380" height="285"
                                src="//car2.autoimg.cn/cardfs/product/g1/M04/0B/F0/380x285_0_q87_autohomecar__ChwFqV8YG-aACch8AAkAdoJoSYM874.jpg"
                                srcset="//car2.autoimg.cn/cardfs/product/g1/M04/0B/F0/380x285_0_q87_autohomecar__ChwFqV8YG-aACch8AAkAdoJoSYM874.jpg 380w, //car2.autoimg.cn/cardfs/product/g1/M04/0B/F0/760x570_0_q87_autohomecar__ChwFqV8YG-aACch8AAkAdoJoSYM874.jpg 760w">
                        </picture>
            """
            mainImgDoc = response.doc("div[class='information-pic'] div[class='pic-main'] picture img")
            print("mainImgDoc=%s" % mainImgDoc)
            carSeriesMainImgUrl = mainImgDoc.attr["src"]
            print("carSeriesMainImgUrl=%s" % carSeriesMainImgUrl)
            carModelDict["carSeriesMainImgUrl"] = carSeriesMainImgUrl


            """
            <script type="text/javascript">
                。。。
                var seriesid = '2123';
                var seriesname='哈弗H6';
                var yearid = '0';
                var brandid = '181';
                var levelid = '17';
                var levelname='紧凑型SUV';
                var fctid = '4';
                var SeriesMinPrice='9.80';
                var SeriesMaxPrice='14.10';
            """


            infoKeyList = [
                "seriesid",
                # "seriesname", # has got
                # "yearid", # no need
                "brandid",
                "levelid",
                "levelname",
                # "fctid", # unknown meaning
                "SeriesMinPrice",
                "SeriesMaxPrice",
            ]
            InfoDict = {}
            for eachInfoKey in infoKeyList:
                curPattern = r"var\s+%s\s*=\s*'(?P<infoValue>[^']+)'\s*;" % eachInfoKey
                print("curPattern=%s" % curPattern)
                foundInfo = re.search(curPattern, carSeriesHtml)
                print("foundInfo=%s" % foundInfo)
                # if foundInfo:
                infoValue = foundInfo.group("infoValue")
                print("infoValue=%s" % infoValue)
                InfoDict[eachInfoKey] = infoValue
            print("InfoDict=%s" % InfoDict)


            # if "seriesid" in InfoDict:
            carSeriesId = InfoDict["seriesid"] # 2123
            carModelDict["carSeriesId"] = carSeriesId
            # carModelDict["carSeriesName"] = InfoDict["seriesname"] # 哈弗H6
            # if "brandid" in InfoDict:
            carModelDict["carBrandId"] = InfoDict["brandid"] # 181
            # if "levelid" in InfoDict:
            carSeriesLevelId = InfoDict["levelid"] # 17
            carModelDict["carSeriesLevelId"] = carSeriesLevelId
            # if "levelname" in InfoDict:
            carModelDict["carSeriesLevelName"] = InfoDict["levelname"] # 紧凑型SUV
            # if "SeriesMinPrice" in InfoDict:
            carSeriesMinPrice = InfoDict["SeriesMinPrice"] # 9.80
            carModelDict["carSeriesMinPrice"] = self.to10KPrice(carSeriesMinPrice)
            # if "SeriesMaxPrice" in InfoDict:
            carSeriesMaxPrice = InfoDict["SeriesMaxPrice"] # 14.10
            carModelDict["carSeriesMaxPrice"] = self.to10KPrice(carSeriesMaxPrice)


            """
            <div class="series-list">
            。。。
                <li class="more-dropdown">
                    <a href="javascript:void(0);" target="_self" data-toggle="tab" class="tab-disabled" data-target="#specWrap-3">停售款 <i class="athm-iconfont athm-iconfont-arrowdown"></i></a>
                    <ul class="dropdown-con" id="haltList">
                        <li><a href="javascript:void(0);" target="_self" data-toggle="tab" data-yearid="11691">2019款</a></li>
                        ...
                        <li><a href="javascript:void(0);" target="_self" data-toggle="tab" data-yearid="3100">2011款</a></li>
                    </ul>
                </li>
            """
            haltADocGenerator = response.doc("li[class='more-dropdown'] ul[id='haltList'] li a").items()
            print("type(haltADocGenerator)=%s" % type(haltADocGenerator))
            print("haltADocGenerator=%s" % haltADocGenerator)
            haltADocList = list(haltADocGenerator)
            print("haltADocList=%s" % haltADocList)
            for curLiIdx, eachHatADoc in enumerate(haltADocList):
                print("%s [%d] %s" % ('%'*30, curLiIdx, '%'*30))
                self.processSingleHaltA(carModelDict, eachHatADoc)


            # """
            # <div class="information-summary">
            #     <dl class="information-price">
            #         ...
            #         <dd class="type">
            #             <span class="type__item">紧凑型车</span>
            # """
            # carLevelDoc = response.doc("div[class='information-summary'] dl[class='information-price'] dd[class='type'] span[class='type__item']").eq(0)
            # print("carLevelDoc=%s" % carLevelDoc)
            # carSeriesLevelName = carLevelDoc.text()
            # print("carSeriesLevelName=%s" % carSeriesLevelName)
            # carModelDict["carSeriesLevelName"] = carSeriesLevelName


            carSeriesContentDoc = response.doc("div[class='series-content']")
            # print("carSeriesContentDoc=%s" % carSeriesContentDoc)
            # carSpecWrapDoc = carSeriesContentDoc.find("div[class^='spec-wrap']")
            # carSpecWrapDoc = carSeriesContentDoc.find("div[class^='spec-wrap active']")
            carSpecWrapDocGenerator = carSeriesContentDoc.items("div[class^='spec-wrap']")
            print("carSpecWrapDocGenerator=%s" % carSpecWrapDocGenerator)
            carSpecWrapDocList = list(carSpecWrapDocGenerator)
            print("carSpecWrapDocList=%s" % carSpecWrapDocList)
            for curSpecWrapIdx, eachSpecWrapDoc in enumerate(carSpecWrapDocList):
                print("%s [%d] %s" % ('#'*30, curSpecWrapIdx, '#'*30))
                self.processSingleSpecWrapDiv(carModelDict, eachSpecWrapDoc)


    def processSingleCarDetailDiv(self, carSeriesDict, curCarDetailDoc):
        print("in processSingleCarDetailDiv")
        curCarModelGroupDict = copy.deepcopy(carSeriesDict)


        # <span class="years">2006款</span>
        modelYearDoc = curCarDetailDoc.find("span[class='years']")
        print("modelYearDoc=%s" % modelYearDoc)
        carModelYear = modelYearDoc.text()
        print("carModelYear=%s" % carModelYear)
        curCarModelGroupDict["carModelYear"] = carModelYear


        """
        <div class="modelswrap">
            <!-- 信息 start -->
            <div class="models_info">
                <dl class='models_prop'>
                    <dt>发动机:</dt>
                    <dd><span>1.3L</span><span>1.6L</span></dd>
                </dl>
                <dl class='models_prop'>
                    <dt>变速箱:</dt>
                    <dd><span>手动</span><span>自动</span></dd>
                    <dt>车身结构:</dt>
                    <dd><span>三厢</span></dd>
                </dl>
        """
        # modelsPropDdList = curCarDetailDoc.find("div[class='modelswrap'] div[class='models_info'] dl[class='models_prop'] dd")
        modelsPropDdGenerator = curCarDetailDoc.items("div[class='modelswrap'] div[class='models_info'] dl[class='models_prop'] dd")
        print("modelsPropDdGenerator=%s" % modelsPropDdGenerator)
        modelsPropDdList = list(modelsPropDdGenerator)
        print("modelsPropDdList=%s" % modelsPropDdList)
        engineValueDoc = modelsPropDdList[0]
        print("engineValueDoc=%s" % engineValueDoc)
        engineValue = engineValueDoc.text()
        print("engineValue=%s" % engineValue)
        gearBoxValueDoc = modelsPropDdList[1]
        print("gearBoxValueDoc=%s" % gearBoxValueDoc)
        gearBoxValue = gearBoxValueDoc.text()
        print("gearBoxValue=%s" % gearBoxValue)
        bodyStructureValueDoc = modelsPropDdList[2]
        print("bodyStructureValueDoc=%s" % bodyStructureValueDoc)
        bodyStructureValue = bodyStructureValueDoc.text()
        print("bodyStructureValue=%s" % bodyStructureValue)


        carModelGearBox = gearBoxValue
        print("carModelGearBox=%s" % carModelGearBox)
        curCarModelGroupDict["carModelGearBox"] = carModelGearBox # e.g. 手动 自动 (manual, automatic)
        curCarModelGroupDict["carModelDriveType"] = ""


        curCarModelGroupDict["carModelEmissionStandards"] = ""
        carModelPower = engineValue
        print("carModelPower=%s" % carModelPower)
        curCarModelGroupDict["carModelPower"] = carModelPower


        carModelGroupName = "%s %s %s" % (engineValue, gearBoxValue, bodyStructureValue)
        print("carModelGroupName=%s" % carModelGroupName)
        curCarModelGroupDict["carModelGroupName"] = carModelGroupName


        """
        <table class='models_tab tableline' cellspacing='0' cellpadding='0' border='0'>
            <tr>
                <td class='name_d'>
                    <div class='name'><a title='2006款 1.6L MT特别版GL' href='spec/2304/'>2006款 1.6L MT特别版GL</a></div>
                </td>
                <td class='price_d'>
                    <div class='price01'>8.18万</div>
                </td>
        """
        modelsTrDocGenerator = curCarDetailDoc.items("table[class^='models_tab'] tr")
        print("modelsTrDocGenerator=%s" % modelsTrDocGenerator)
        modelsTrDocList = list(modelsTrDocGenerator)
        print("modelsTrDocList=%s" % modelsTrDocList)
        for curTabIdx, eachModelTrDoc in enumerate(modelsTrDocList):
            print("%s [%d] %s" % ('='*30, curTabIdx, '='*30))
            self.processSingleModelsTr(curCarModelGroupDict, eachModelTrDoc)


    def processSingleModelsTr(self, curCarModelGroupDict, curModelTrDoc):
        curTrCarModeDict = copy.deepcopy(curCarModelGroupDict)
        print("curModelTrDoc=%s" % curModelTrDoc)
        nameADoc = curModelTrDoc.find("td[class='name_d'] div[class='name'] a")
        print("nameADoc=%s" % nameADoc)
        carModelName = nameADoc.text()
        print("carModelName=%s" % carModelName)


        carModelSpecUrl = nameADoc.attr["href"]
        # The href here is relative (e.g. spec/2304/), so it resolves to a wrong URL:
        #   https://www.autohome.com.cn/142/spec/2304/
        # Extract the spec id and rebuild the correct URL instead:
        #   https://www.autohome.com.cn/spec/2304/
        foundSpecId = re.search(r"spec/(?P<specId>\d+)", carModelSpecUrl)
        carModelSpecId = foundSpecId.group("specId")
        print("carModelSpecId=%s" % carModelSpecId) # 2304
        carModelSpecUrl = self.genSpecUrl(carModelSpecId)
        print("carModelSpecUrl=%s" % carModelSpecUrl)


        priceDivDoc = curModelTrDoc.find("td[class='price_d'] div[class='price01']")
        print("priceDivDoc=%s" % priceDivDoc)
        carModelMsrp = priceDivDoc.text()
        print("carModelMsrp=%s" % carModelMsrp)
        if "暂无" in carModelMsrp:
            carModelMsrp = ""
            print("carModelMsrp=%s" % carModelMsrp)


        curTrCarModeDict["carModelName"] = carModelName
        curTrCarModeDict["carModelSpecUrl"] = carModelSpecUrl
        curTrCarModeDict["carModelMsrp"] = carModelMsrp


        self.send_message(self.project_name, curTrCarModeDict, url=carModelSpecUrl)
        # self.processCarSpecConfig(curTrCarModeDict)


    def processSingleHaltA(self, carModelDict, curHatADoc):
        curHaltCarDict = copy.deepcopy(carModelDict)
        print("curHatADoc=%s" % curHatADoc)
        yearName = curHatADoc.text()
        print("yearName=%s" % yearName)
        yearId = curHatADoc.attr["data-yearid"]
        print("yearId=%s" % yearId)


        # getHaltSpecUrl = "https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=%s&syearid=%s&levelid=%s" % (curHaltCarDict["carSeriesId"], yearId, curHaltCarDict["carSeriesLevelId"])
        carSeriesId = curHaltCarDict["carSeriesId"]
        carSeriesLevelId = curHaltCarDict["carSeriesLevelId"]
        if carSeriesId and carSeriesLevelId:
            getHaltSpecUrl = "https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=%s&syearid=%s&levelid=%s" % (carSeriesId, yearId, carSeriesLevelId)
            # https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=2123&syearid=10379&levelid=17
            print("getHaltSpecUrl=%s" % getHaltSpecUrl)
            self.crawl(getHaltSpecUrl,
                callback=self.haltCarSpecCallback,
                save=curHaltCarDict,
            )


    def processSingleSpecWrapDiv(self, curCarModelDict, curSpecWrapDoc):
        curSpecWrapCarDict = copy.deepcopy(curCarModelDict)
        # print("curSpecWrapDoc=%s" % curSpecWrapDoc)
        """
        <!--即将上市 start-->
        <div class="spec-wrap  active" id="specWrap-1">
            
            <dl class="halt-spec">
                <dt>
                    <div class="spec-name">
                        <span>参数配置未公布</span>
                    </div>


        <dl class="halt-spec">
            <dt>
                <div class="spec-name">
                    <span>1.5升 涡轮增压 169马力 国VI</span>
                </div>
        """
        # dlDoc = curSpecWrapDoc.find("dl[class='']")
        # dlDoc = curSpecWrapDoc.find("dl")
        dlListDocGenerator = curSpecWrapDoc.items("dl")
        print("dlListDocGenerator=%s" % dlListDocGenerator)
        dlDocList = list(dlListDocGenerator)
        print("dlDocList=%s" % dlDocList)
        for curDlIdx, eachDlDoc in enumerate(dlDocList):
            print("%s [%d] %s" % ('='*30, curDlIdx, '='*30))
            self.processSingleSpecDl(curSpecWrapCarDict, eachDlDoc)
    
    def processSingleSpecDl(self, curSpecWrapCarDict, curDlDoc):
        curDlCarDict = copy.deepcopy(curSpecWrapCarDict)
        # print("curDlDoc=%s" % curDlDoc)
        """
            <dt>
                <div class="spec-name">
                    <span>1.5升 涡轮增压 169马力 国VI</span>
        """
        dtDoc = curDlDoc.find("dt")
        # print("dtDoc=%s" % dtDoc)
        groupSpecNameSpanDoc = dtDoc.find("div[class='spec-name'] span")
        print("groupSpecNameSpanDoc=%s" % groupSpecNameSpanDoc)
        carModelGroupName = ""
        if groupSpecNameSpanDoc:
            carModelGroupName = groupSpecNameSpanDoc.text()
            print("carModelGroupName=%s" % carModelGroupName)
        
        curDlCarDict["carModelGroupName"] = carModelGroupName


        # <dd data-sift1="2020款" data-sift2="国VI" data-sift3="1.5T" data-sift4="7挡双离合" class="">
        ddListDoc = curDlDoc.items("dd")
        print("ddListDoc=%s" % ddListDoc)
        for curDdIdx, eachDdDoc in enumerate(ddListDoc):
            print("%s [%d] %s" % ('-'*30, curDdIdx, '-'*30))
            self.processSingleSiftDd(curDlCarDict, eachDdDoc)
    
    def processSingleSiftDd(self, curDlCarDict, curDdDoc):
        print("in processSingleSiftDd")
        curDdCarDict = copy.deepcopy(curDlCarDict)


        curDdAttr = curDdDoc.attr
        """
        Normal:
            <dd data-sift1="2020款" data-sift2="国VI" data-sift3="1.5T" data-sift4="7挡双离合" class="">
                ...
        Special case:
            no data-sift attributes:
                <dd data-electricspecid="47050">
        """
        # print("curDdAttr=%s" % curDdAttr)
        carModelYear = curDdAttr["data-sift1"]
        print("carModelYear=%s" % carModelYear)
        carModelEmissionStandards = curDdAttr["data-sift2"]
        print("carModelEmissionStandards=%s" % carModelEmissionStandards)
        carModelPower = curDdAttr["data-sift3"]
        print("carModelPower=%s" % carModelPower)
        carModelGearBox = curDdAttr["data-sift4"]
        print("carModelGearBox=%s" % carModelGearBox)


        curDdCarDict["carModelYear"] = carModelYear
        curDdCarDict["carModelEmissionStandards"] = carModelEmissionStandards
        curDdCarDict["carModelPower"] = carModelPower
        curDdCarDict["carModelGearBox"] = carModelGearBox


        """
        <div class="spec-name">
            <div class="name-param">
                <p data-gcjid="41511" id="spec_41511">
                    <a href="/spec/41511/#pvareaid=3454492" class="name">2020款 1.5GDIT 自动铂金舒适版</a>
                    <span class="athm-badge athm-badge--grey is-plain">停产在售</span>
                <span class="athm-badge athm-badge--orange">特惠</span></p>
                <p><span class="type-default">前置前驱</span><span class="type-default">7挡双离合</span></p>
            </div>
        </div>
        """
        specNameDoc = curDdDoc.find("div[class='spec-name']")
        # print("specNameDoc=%s" % specNameDoc)
        specADoc = specNameDoc.find("p a[class='name']")
        # print("specADoc=%s" % specADoc)
        carModelName = specADoc.text()
        print("carModelName=%s" % carModelName) # 2020款 1.5GDIT 自动铂金舒适版
        carModelSpecUrl = specADoc.attr["href"]
        print("carModelSpecUrl=%s" % carModelSpecUrl) # https://www.autohome.com.cn/spec/41511/#pvareaid=3454492
        typeDefaultListDoc = specNameDoc.items("p span[class='type-default']")
        print("typeDefaultListDoc=%s" % typeDefaultListDoc)
        typeDefaultList = list(typeDefaultListDoc)
        print("typeDefaultList=%s" % typeDefaultList)
        carModelDriveType = ""
        carModelGearBox = ""
        # The code below indexes [-2] and [-1], so at least two spans are required
        if len(typeDefaultList) >= 2:
            """
            Normal:
                <p>
                    <span class="type-default">前置前驱</span>
                    <span class="type-default">7挡双离合</span>
                </p>


            Special case:
                https://www.autohome.com.cn/4605/


                <p>
                    <span class="type-default">电动</span>
                    <span class="type-default">前置前驱</span>
                    <span class="type-default">AMT(组合10挡)</span>
                </p>
            """
            # spanTypeDefault0 = typeDefaultList[0]
            spanTypeDefault0 = typeDefaultList[-2]
            print("spanTypeDefault0=%s" % spanTypeDefault0)
            carModelDriveType = spanTypeDefault0.text()
            print("carModelDriveType=%s" % carModelDriveType)
            # spanTypeDefault1 = typeDefaultList[1]
            spanTypeDefault1 = typeDefaultList[-1]
            print("spanTypeDefault1=%s" % spanTypeDefault1)
            carModelGearBox = spanTypeDefault1.text()
            print("carModelGearBox=%s" % carModelGearBox)


        curDdCarDict["carModelName"] = carModelName
        if not curDdCarDict["carModelYear"]:
            foundYearType = re.search(r"(?P<yearType>\d{4}款)", carModelName)
            if foundYearType:
                yearType = foundYearType.group("yearType")
                print("yearType=%s" % yearType)
                carModelYear = yearType
                print("extract year=%s from modelName=%s" % (carModelYear, carModelName))
                curDdCarDict["carModelYear"] = carModelYear


        curDdCarDict["carModelSpecUrl"] = carModelSpecUrl
        curDdCarDict["carModelDriveType"] = carModelDriveType # 前置前驱
        curDdCarDict["carModelGearBox"] = carModelGearBox # 7挡双离合


        """
        <div class="spec-guidance">
            <p class="guidance-price">
                <span>10.40万</span>
                <a href="//j.autohome.com.cn/pc/carcounter?type=1&specId=41511&pvareaid=3454617"><i class="athm-iconpng athm-iconpng-calculator"></i></a>
            </p>
        </div>


        <div class="spec-guidance">
            <p class="guidance-price">
                <span><span>暂无</span></span>
        """
        specGuidanceDoc = curDdDoc.find("div[class='spec-guidance']")
        # print("specGuidanceDoc=%s" % specGuidanceDoc)
        guidancePriceSpanDoc = specGuidanceDoc.find("p[class='guidance-price'] span")
        # print("guidancePriceSpanDoc=%s" % guidancePriceSpanDoc)
        carModelMsrp = guidancePriceSpanDoc.text()
        print("carModelMsrp=%s" % carModelMsrp)
        if "暂无" in carModelMsrp:
            carModelMsrp = ""
            print("carModelMsrp=%s" % carModelMsrp)
        curDdCarDict["carModelMsrp"] = carModelMsrp


        self.send_message(self.project_name, curDdCarDict, url=carModelSpecUrl)
        # self.processCarSpecConfig(curDdCarDict)


    @catch_status_code_error
    def haltCarSpecCallback(self, response):
        prevCarModelDict = response.save
        carModelDict = copy.deepcopy(prevCarModelDict)
        print("carModelDict=%s" % carModelDict)


        respJson = response.json
        print("respJson=%s" % respJson)


        """
        [
            {
                "name": "1.5升 涡轮增压 169马力",
                "speclist": [
                    {
                        "specid": 36955,
                        "specname": "2019款 红标 1.5GDIT 自动舒适版",
                        "specstate": 40,
                        "minprice": 102000,
                        "maxprice": 102000,
                        "fueltype": 1,
                        "fueltypedetail": 1,
                        "driveform": "前置前驱",
                        "drivetype": "前驱",
                        "gearbox": "7挡双离合",
                        "evflag": "",
                        "newcarflag": "",
                        "subsidy": "",
                        "paramisshow": 1,
                        "videoid": 0,
                        "link2sc": "http://www.che168.com/china/hafu/hafuh6/7_8/",
                        "price2sc": "7.58万",
                        "price": "10.20万",
                        "syear": 2019
                    }, {
                        "specid": 36956,
                        "specname": "2019款 红标 1.5GDIT 自动都市版",
                        "specstate": 40,
                        "minprice": 109000,
                        "maxprice": 109000,
                        "fueltype": 1,
                        "fueltypedetail": 1,
                        "driveform": "前置前驱",
                        "drivetype": "前驱",
                        "gearbox": "7挡双离合",
                        "evflag": "",
                        "newcarflag": "",
                        "subsidy": "",
                        "paramisshow": 1,
                        "videoid": 0,
                        "link2sc": "",
                        "price2sc": "",
                        "price": "10.90万",
                        "syear": 2019
                    },
                    ...
        """
        if respJson:
            for eachModelGroupDict in respJson:
                modelGroupName = eachModelGroupDict["name"]
                modelSpecList = eachModelGroupDict["speclist"]
                for eachModelDict in modelSpecList:
                    curCarModelDict = copy.deepcopy(carModelDict)


                    carModelYear = "%s款" % eachModelDict["syear"]
                    # carModelSpecUrl = "%s/%s" % (CarSpecPrefix, eachModelDict["specid"])
                    carModelSpecUrl = self.genSpecUrl(eachModelDict["specid"])


                    curCarModelDict["carModelGroupName"] = modelGroupName
                    curCarModelDict["carModelYear"] = carModelYear
                    curCarModelDict["carModelEmissionStandards"] = ""
                    curCarModelDict["carModelPower"] = ""
                    curCarModelDict["carModelDriveType"] = eachModelDict["drivetype"]
                    curCarModelDict["carModelGearBox"] = eachModelDict["gearbox"]
                    curCarModelDict["carModelName"] = eachModelDict["specname"]
                    curCarModelDict["carModelSpecUrl"] = carModelSpecUrl
                    curCarModelDict["carModelMsrp"] = eachModelDict["price"]


                    self.send_message(self.project_name, curCarModelDict, url=carModelSpecUrl)
                    # self.processCarSpecConfig(curCarModelDict)


    @catch_status_code_error
    def processCarSpecConfig(self, curCarModelDict):
        carModelDict = copy.deepcopy(curCarModelDict)
        print("processCarSpecConfig: carModelDict=%s" % carModelDict)
        carModelSpecUrl = carModelDict["carModelSpecUrl"]
        print("carModelSpecUrl=%s" % carModelSpecUrl)
        carModelSpecId = self.extractSpecId(carModelSpecUrl)
        print("carModelSpecId=%s" % carModelSpecId)
        carModelDict["carModelSpecId"] = carModelSpecId # 43593
        carConfigSpecUrl = self.genConfigSpecUrl(carModelSpecId)
        # https://car.autohome.com.cn/config/spec/43593.html
        print("carConfigSpecUrl=%s" % carConfigSpecUrl)
        self.crawl(carConfigSpecUrl,
            fetch_type="js",
            callback=self.carConfigSpecCallback,
            save=carModelDict,
        )


    @catch_status_code_error
    def carConfigSpecCallback(self, response):
        curCarModelDict = response.save
        print("curCarModelDict=%s" % curCarModelDict)
        carModelDict = copy.deepcopy(curCarModelDict)


        configSpecHtml = response.text
        # print("configSpecHtml=%s" % configSpecHtml)
        # print("")


        """
        <table class="tbcs" id="tab_0" style="width: 932px;">
            <tbody>
                <tr>
                    <th class="cstitle" show="1" pid="tab_0" id="nav_meto_0" colspan="5">
                    <h3><span>基本参数</span></h3>
                    </th>
                </tr>
                <tr data-pnid="1_-1" id="tr_0">
        """
        tbodyDoc = response.doc("table[id='tab_0'] tbody")
        print("tbodyDoc=%s" % tbodyDoc)


        carEnergyType = self.getItemFirstValue(tbodyDoc, 2) # 纯电动 (pure electric) / 燃油 (fuel) / 插电式混合动力 (plug-in hybrid)
        carModelDict["carEnergyType"] = carEnergyType


        if carEnergyType == "燃油":


            # carModelEmissionStandards = 
            print("TODO: 燃油")


        elif carEnergyType == "纯电动":
            carReleaseTime = self.getItemFirstValue(tbodyDoc, 3) # 2019.11
            carModelDict["carReleaseTime"] = carReleaseTime


            # MIIT-rated pure-electric range (km)
            carMiitEnduranceMileagePureElectric = self.getItemFirstValue(tbodyDoc, 4) # 265
            carModelDict["carMiitEnduranceMileagePureElectric"] = carMiitEnduranceMileagePureElectric


            # Fast-charge time (hours)
            carQuickCharge = self.getItemFirstValue(tbodyDoc, 5) # 0.6
            carModelDict["carQuickCharge"] = carQuickCharge


            # Slow-charge time (hours)
            carSlowCharge = self.getItemFirstValue(tbodyDoc, 6) # 17
            carModelDict["carSlowCharge"] = carSlowCharge


            # Fast-charge capacity (percent)
            carQuickChargePercent = self.getItemFirstValue(tbodyDoc, 7) # 80
            carModelDict["carQuickChargePercent"] = carQuickChargePercent


            # Maximum power (kW)
            carMaxPower = self.getItemFirstValue(tbodyDoc, 8) # 100
            carModelDict["carMaxPower"] = carMaxPower


            # Maximum torque (N·m)
            carMaxTorque = self.getItemFirstValue(tbodyDoc, 9) # 290
            carModelDict["carMaxTorque"] = carMaxTorque


            # Electric motor power (Ps)
            carHorsePowerElectric = self.getItemFirstValue(tbodyDoc, 10) # 136
            carModelDict["carHorsePowerElectric"] = carHorsePowerElectric


            # Length * width * height (mm)
            carSize = self.getItemFirstValue(tbodyDoc, 11) # 4237*1785*1548
            carModelDict["carSize"] = carSize


            # Body structure
            carBodyStructure = self.getItemFirstValue(tbodyDoc, 12) # 5门5座SUV
            carModelDict["carBodyStructure"] = carBodyStructure


            # Top speed (km/h)
            carMaxSpeed = self.getItemFirstValue(tbodyDoc, 13) # 150
            carModelDict["carMaxSpeed"] = carMaxSpeed


            # Official 0-100 km/h acceleration (s)
            carOfficialSpeedupTime = self.getItemFirstValue(tbodyDoc, 14) # -
            carModelDict["carOfficialSpeedupTime"] = carOfficialSpeedupTime


            # Measured 0-100 km/h acceleration (s)
            carActualTestSpeedupTime = self.getItemFirstValue(tbodyDoc, 15) # -
            carModelDict["carActualTestSpeedupTime"] = carActualTestSpeedupTime


            # Measured 100-0 km/h braking distance (m)
            carActualTestBrakeDistance = self.getItemFirstValue(tbodyDoc, 16) # -
            carModelDict["carActualTestBrakeDistance"] = carActualTestBrakeDistance


            # Measured range (km)
            carActualTestEnduranceMileage = self.getItemFirstValue(tbodyDoc, 17) # -
            carModelDict["carActualTestEnduranceMileage"] = carActualTestEnduranceMileage


            # Measured fast-charge time (hours)
            carActualTestQuickCharge = self.getItemFirstValue(tbodyDoc, 18) # -
            carModelDict["carActualTestQuickCharge"] = carActualTestQuickCharge


            # Measured slow-charge time (hours)
            carActualTestSlowCharge = self.getItemFirstValue(tbodyDoc, 19) # -
            carModelDict["carActualTestSlowCharge"] = carActualTestSlowCharge


            # Whole-vehicle warranty
            firstDivDoc = self.getItemFirstValue(tbodyDoc, 20, isRespDoc=True)
            # <div>三<span class="hs_kw7_configxv"></span>10<span class="hs_kw1_configxv"></span>公里</div>
            print("firstDivDoc=%s" % firstDivDoc)
            firstDivHtml = firstDivDoc.html()
            # carWholeWarranty = firstDivDoc.text() # 三10公里
            print("firstDivHtml=%s" % firstDivHtml)
            # 三<span class="hs_kw7_configCC"></span>10<span class="hs_kw1_configCC"></span>公里
            # carWholeQualityQuarantee = re.sub("[^<>]+(?P<firstSpan><span.+?></span>)[^<>]+(?P<secondSpan><span.+?></span>)[^<>]+", )
            foundYearDistance = re.search("(?P<warrantyYear>[^<>]+)<span.+?></span>(?P<distanceNumber>[^<>]+)<span.+?></span>(?P<distanceUnit>[^<>]+)", firstDivHtml or "")
            carWholeWarranty = ""
            if foundYearDistance:
                # Only build the warranty string when the expected text/span layout matched
                warrantyYear = foundYearDistance.group("warrantyYear")
                distanceNumber = foundYearDistance.group("distanceNumber")
                distanceUnit = foundYearDistance.group("distanceUnit")
                carWholeWarranty = "%s年或%s万%s" % (warrantyYear, distanceNumber, distanceUnit)
            print("carWholeWarranty=%s" % carWholeWarranty) # 三年或10万公里
            carModelDict["carWholeWarranty"] = carWholeWarranty


        elif carEnergyType == "插电式混合动力":
            print("TODO: 插电式混合动力")


        else:
            errMsg = "TODO: add support %s!" % carEnergyType
            raise Exception(errMsg)


    @catch_status_code_error
    def getItemFirstValue(self, rootDoc, trNumber, isRespDoc=False):
        """
        <tr data-pnid="1_-1" id="tr_2">
            <th>
                <div id="1149"><a href="https://car.autohome.com.cn/baike/detail_7_18_1149.html#pvareaid=2042252">能源类型</a>
                </div>
            </th>
            <td style="background:#F0F3F8;">
                <div>纯电动</div>
            </td>


        <tr data-pnid="1_-1" id="tr_3">
            <th>
                <div id="0">上市<span class="hs_kw40_configxv"></span></div>
            </th>
            <td style="background:#F0F3F8;">
                <div>2019.11</div>
            </td>
            <td>
                <div>2019.11</div>
            </td>
            <td>
                <div></div>
            </td>
            <td>
                <div></div>
            </td>
        </tr>
        """
        trQuery = "tr[id='tr_%s']" % trNumber
        # print("trQuery=%s" % trQuery)
        trDoc = rootDoc.find(trQuery)
        # print("trDoc=%s" % trDoc)
        tdDocGenerator = trDoc.items("td")
        # print("tdDocGenerator=%s" % tdDocGenerator)
        tdDocList = list(tdDocGenerator)
        # print("tdDocList=%s" % tdDocList)
        firstTdDoc = tdDocList[0]
        # print("firstTdDoc=%s" % firstTdDoc)
        firstTdDivDoc = firstTdDoc.find("div")
        print("firstTdDivDoc=%s" % firstTdDivDoc)
        if isRespDoc:
            respItem = firstTdDivDoc
        else:
            firstItemValue = firstTdDivDoc.text()
            respItem = firstItemValue
        print("respItem=%s" % respItem)
        return respItem
The code above is provided for reference.
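
The root cause named at the top of this post (a car dict that is assigned directly instead of copied across nested loop levels) can be reproduced in isolation. The following is only a minimal sketch, not the crawler itself: `collect_shared` and `collect_copied` are hypothetical helper names, and the sample dict is made up for illustration.

```python
import copy

def collect_shared(base, years):
    """Buggy pattern: every loop iteration reuses the same dict object."""
    results = []
    for year in years:
        cur = base                      # alias, not a copy
        cur["carModelYear"] = year      # overwrites the value set in earlier iterations
        results.append(cur)
    return results

def collect_copied(base, years):
    """Fixed pattern: every loop iteration works on its own deep copy."""
    results = []
    for year in years:
        cur = copy.deepcopy(base)       # independent dict per item
        cur["carModelYear"] = year
        results.append(cur)
    return results

base = {"carSeriesName": "红旗H5"}      # hypothetical sample record
years = ["2020款", "2019款"]

shared = collect_shared(dict(base), years)
copied = collect_copied(dict(base), years)

print([d["carModelYear"] for d in shared])
print([d["carModelYear"] for d in copied])
```

In the shared version both list entries are the same object, so the earlier 2020款 value is replaced by the later one. That is why each `processSingle...` method in the listing starts with `copy.deepcopy(...)`.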

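With this change, the records that used to be wrong are now correct.
Searching the results for:
2020款 1.5T DCT旗悦版
now finds a record with the model year 2020款: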
https://www.autohome.com.cn/spec/46112/#pvareaid=3454492    91    https://car3.autoimg.cn/cardfs/series/g26/M05/AE/94/100x100_f40_autohomecar__wKgHEVs9tm6ASWlTAAAUz_2mWTY720.png    红旗    一汽红旗    https://car.autohome.com.cn/price/brand-91-190.html#pvareaid=2042363    前置前驱    国VI    7挡双离合    1.5升 涡轮增压 169马力 国VI    14.58万    2020款 1.5T DCT旗悦版    1.5T    https://www.autohome.com.cn/spec/46112/#pvareaid=3454492    2020款    4410    4    中型车    https://car2.autoimg.cn/cardfs/product/g3/M04/92/40/380x285_0_q87_autohomecar__ChsEkV8G1BiAFN2JAAlzGHoYv9M868.jpg    19.08万    14.58万    14.58-19.08万    https://www.autohome.com.cn/4410/price.html#pvareaid=101446    红旗H5    https://www.autohome.com.cn/4410/#levelsource=000000000_0&pvareaid=101594    {}
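
Separately from the copy fix, `processSingleSiftDd` in the listing falls back to the model name when `data-sift1` is empty. A standalone sketch of that fallback, assuming the model name starts with a four-digit year followed by 款 (`extract_model_year` is a hypothetical name, not a function in the crawler):

```python
import re

def extract_model_year(sift_year, model_name):
    """Return sift_year if present, else the 'NNNN款' token in model_name, else ''."""
    if sift_year:
        return sift_year
    found = re.search(r"(?P<yearType>\d{4}款)", model_name or "")
    return found.group("yearType") if found else ""

# data-sift1 missing: the year is recovered from the model name
print(extract_model_year(None, "2020款 1.5T DCT旗悦版"))
```

Returning an empty string when nothing matches keeps the field present in the record, so downstream code never has to deal with a missing key.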

Please credit when reposting: 在路上 » [Solved] Missing car model and series data for some models such as 红旗H5
