【Solved】Error when using re.sub in Python: TypeError: sequence item 1: expected string or Unicode, int found

【Problem】

In Python, I used the following code:

    #http://autoexplosion.com/cars/buy/150954.php
    #error :
    #TypeError: sequence item 1: expected string or Unicode, int found
    #when use:
    logging.debug("origHtml=%s", origHtml);
    def _nameToCodepoint(matched):
        logging.info("matched=%s", matched);
        decodedCodepoint = "";
        entityName = matched.group("entityName");
        logging.info("entityName=%s", entityName);
        if(entityName in htmlentitydefs.name2codepoint):
            decodedCodepoint = htmlentitydefs.name2codepoint[entityName];
            logging.info("decodedCodepoint=%s", decodedCodepoint);
            unicodeChar = unichr(decodedCodepoint);
            logging.info("unicodeChar=%s", unicodeChar);
        else:
            #invalid key, just omit it
            
            #http://autoexplosion.com/RVs/buy/9882.php
            #Washer&Dryer;, Awning,
            decodedCodepoint = entityName;
        return decodedCodepoint;
    decodedEntityName = re.sub('&(?P<entityName>[a-zA-Z]{2,10});', _nameToCodepoint, origHtml);

to process this input:

LINE 300   DEBUG    origHtml=

            This 2010 Acura RL 3.7 w/CMBS Super Handling All-Wheel Drive all-whe is offered to you for sale by Bay Ridge Nissan.<br />

<br />

When your newly purchased Acura from Bay Ridge Nissan comes with the CARFAX BuyBack Guarantee, you know you’re buying smart. The RL 3.7 w/CMBS Super Handling All-Wheel Drive all-whe’s pristine good looks were combined with the Acura high standard of excellence in order to make this a unique find.<br />

<br />

This RL 3.7 w/CMBS Super Handling All-Wheel Drive all-whe comes equipped with all wheel drive, which means no limitations as to how or where you can drive. Different terrains and varying weather conditions will have little effect as to how this vehicle performs. This vehicle has had only 40,313 miles put on it’s odometer. That amount of mileage makes this vehicle incomparable to the other vehicles on this market and is ready for you to come and see at Bay Ridge Nissan.<br />

<br />

More information about the 2010 Acura RL:<br />

<br />

Standard all-wheel drive stabilizes the car around corners, which is a nice thing to know with the 3.7-liter V6 under the hood. This is the kind of Acura technology that has earned the RL reviews for being quiet and surefooted, as well as quick and refined. The luxurious interior is very well insulated, and the Bose sound system adapts its volume to the speed of the vehicle.<br />

<br />

Strengths of this model include high-tech options available, luxurious interior, All-wheel drive standard, and quick, controlled handling<br />

If you have any questions, need more information or need more pictures please contact us at: <br />

Bay Ridge Nissan<br />

6501 5TH Ave<br />

Brooklyn, NY 11220<br />

<a href="mailto:Bayridgenissan42@yahoo.com">Bayridgenissan42@yahoo.com</a><br />

Darrel or Adriana<br />

1-866-980-2123<br />

1-917-834-1290<br />

1-347-225-0975 FAX<br />

PLEASE VIEW OUR INVENTORY ON EBAY.<br />

<a href="javascript:void(0);" onclick="window.open(new Array('http','',':','//','stores.ebay.com','/Bay-Ridge-Nissan-of-New-York?_rdc=1').join(''), '_blank')">stores.ebay.com</a><br />

<a href="javascript:void(0);" onclick="window.open(new Array('http','',':','//','www.carfaxonline.com','/cfm/Display_Dealer_Report.cfm?partner=AXX_0&amp;UID=C367031&amp;vin=JH4KB2F61AC001005').join(''), '_blank')">www.carfaxonline.com</a>

The result was an error:

LINE 302   INFO     matched=<_sre.SRE_Match object at 0x0000000002EBEBE8>

LINE 305   INFO     entityName=amp

LINE 308   INFO     decodedCodepoint=38

LINE 310   INFO     unicodeChar=&

LINE 302   INFO     matched=<_sre.SRE_Match object at 0x0000000002EBEBE8>

LINE 305   INFO     entityName=amp

LINE 308   INFO     decodedCodepoint=38

LINE 310   INFO     unicodeChar=&

LINE 628   ERROR    Unknown Error !

Traceback (most recent call last):

  File "E:\Dev_Root\freelance\Elance\projects\40377988_data_mining\40377988_data_mining\40377988_data_mining.py", line 626, in <module>

    main();

  File "E:\Dev_Root\freelance\Elance\projects\40377988_data_mining\40377988_data_mining\40377988_data_mining.py", line 602, in main

    processEachPageHtml(curType, singlePageHtml);

  File "E:\Dev_Root\freelance\Elance\projects\40377988_data_mining\40377988_data_mining\40377988_data_mining.py", line 510, in processEachPageHtml

    itemInfoDict = processEachItem(itemLink);

  File "E:\Dev_Root\freelance\Elance\projects\40377988_data_mining\40377988_data_mining\40377988_data_mining.py", line 287, in processEachItem

    descHtmlDecoded = crifanLib.decodeHtmlEntity(descContents);

  File "libs\crifanLib.py", line 318, in decodeHtmlEntity

    decodedEntityName = re.sub('&(?P<entityName>[a-zA-Z]{2,10});', _nameToCodepoint, origHtml);

  File "E:\dev_install_root\Python27\lib\re.py", line 151, in sub

    return _compile(pattern, flags).sub(repl, string, count)

TypeError: sequence item 1: expected string or Unicode, int found

【Solution Process】

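1. Going by the error message, it means:

an item was expected to be a string or unicode, but an int was supplied.

However, looking at this line from the code above:

decodedEntityName = re.sub('&(?P<entityName>[a-zA-Z]{2,10});', _nameToCodepoint, origHtml);

I really could not see where any int could have gotten into origHtml.

In other words, I could not locate where the error actually came from.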

2. Next, I tested this string on its own:

def debugReSub():
    import htmlentitydefs;

    origHtml="""            This 2010 Acura RL 3.7 w/CMBS Super Handling All-Wheel Drive all-whe is offered to you for sale by Bay Ridge Nissan.<br />

<br />

When your newly purchased Acura from Bay Ridge Nissan comes with the CARFAX BuyBack Guarantee, you know you’re buying smart. The RL 3.7 w/CMBS Super Handling All-Wheel Drive all-whe’s pristine good looks were combined with the Acura high standard of excellence in order to make this a unique find.<br />

<br />

This RL 3.7 w/CMBS Super Handling All-Wheel Drive all-whe comes equipped with all wheel drive, which means no limitations as to how or where you can drive. Different terrains and varying weather conditions will have little effect as to how this vehicle performs. This vehicle has had only 40,313 miles put on it’s odometer. That amount of mileage makes this vehicle incomparable to the other vehicles on this market and is ready for you to come and see at Bay Ridge Nissan.<br />

<br />

More information about the 2010 Acura RL:<br />

<br />

Standard all-wheel drive stabilizes the car around corners, which is a nice thing to know with the 3.7-liter V6 under the hood. This is the kind of Acura technology that has earned the RL reviews for being quiet and surefooted, as well as quick and refined. The luxurious interior is very well insulated, and the Bose sound system adapts its volume to the speed of the vehicle.<br />

<br />

Strengths of this model include high-tech options available, luxurious interior, All-wheel drive standard, and quick, controlled handling<br />

If you have any questions, need more information or need more pictures please contact us at: <br />

Bay Ridge Nissan<br />

6501 5TH Ave<br />

Brooklyn, NY 11220<br />

<a href="mailto:Bayridgenissan42@yahoo.com">Bayridgenissan42@yahoo.com</a><br />

Darrel or Adriana<br />

1-866-980-2123<br />

1-917-834-1290<br />

1-347-225-0975 FAX<br />

PLEASE VIEW OUR INVENTORY ON EBAY.<br />

<a href="javascript:void(0);" onclick="window.open(new Array('http','',':','//','stores.ebay.com','/Bay-Ridge-Nissan-of-New-York?_rdc=1').join(''), '_blank')">stores.ebay.com</a><br />

<a href="javascript:void(0);" onclick="window.open(new Array('http','',':','//','www.carfaxonline.com','/cfm/Display_Dealer_Report.cfm?partner=AXX_0&amp;UID=C367031&amp;vin=JH4KB2F61AC001005').join(''), '_blank')">www.carfaxonline.com</a> """;

    def _nameToCodepoint(matched):
        logging.info("matched=%s", matched);
        logging.info("matched=%s", matched);
        decodedCodepoint = "";
        entityName = matched.group("entityName");
        logging.info("entityName=%s", entityName);
        if(entityName in htmlentitydefs.name2codepoint):
            decodedCodepoint = htmlentitydefs.name2codepoint[entityName];
            logging.info("decodedCodepoint=%s", decodedCodepoint);
            unicodeChar = unichr(decodedCodepoint);
            logging.info("unicodeChar=%s", unicodeChar);
        else:
            #invalid key, just omit it

            #http://autoexplosion.com/RVs/buy/9882.php
            #Washer&Dryer;, Awning,
            decodedCodepoint = entityName;
        return decodedCodepoint;
    decodedEntityName = re.sub('&(?P<entityName>[a-zA-Z]{2,10});', _nameToCodepoint, origHtml);
    logging.info("decodedEntityName=%s", decodedEntityName);

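The error persisted.

It felt like a bug in re.sub???

After all, it is obvious that only two substrings here match the &xxxx; form, both of them

&amp;

and nothing else. Yet the corresponding debug output was: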

LINE 596   INFO     matched=<_sre.SRE_Match object at 0x0000000002C6D8A0>

LINE 597   INFO     matched=<_sre.SRE_Match object at 0x0000000002C6D8A0>

LINE 600   INFO     entityName=amp

LINE 603   INFO     decodedCodepoint=38

LINE 605   INFO     unicodeChar=&

LINE 596   INFO     matched=<_sre.SRE_Match object at 0x0000000002C6D8A0>

LINE 597   INFO     matched=<_sre.SRE_Match object at 0x0000000002C6D8A0>

LINE 600   INFO     entityName=amp

LINE 603   INFO     decodedCodepoint=38

LINE 605   INFO     unicodeChar=&

LINE 687   ERROR    Unknown Error !

Traceback (most recent call last):

  File "E:\Dev_Root\freelance\Elance\projects\40377988_data_mining\40377988_data_mining\40377988_data_mining.py", line 685, in <module>

    main();

  File "E:\Dev_Root\freelance\Elance\projects\40377988_data_mining\40377988_data_mining\40377988_data_mining.py", line 618, in main

    debugReSub();

  File "E:\Dev_Root\freelance\Elance\projects\40377988_data_mining\40377988_data_mining\40377988_data_mining.py", line 613, in debugReSub

    decodedEntityName = re.sub('&(?P<entityName>[a-zA-Z]{2,10});', _nameToCodepoint, origHtml);

  File "E:\dev_install_root\Python27\lib\re.py", line 151, in sub

    return _compile(pattern, flags).sub(repl, string, count)

TypeError: sequence item 1: expected string, int found

So the callback ran without complaint for both matches, and the error only appeared afterwards, somewhere inside re.sub.

3. I narrowed the input down:

def debugReSub():
    import htmlentitydefs;
    
    origHtml="""partner=AXX_0&amp;UID=C367031&amp;vin=JH4KB2F61AC001005').join(''), '_blank')">www.carfaxonline.com</a> """;
    def _nameToCodepoint(matched):
        logging.info("matched=%s", matched);
        logging.info("matched=%s", matched);
        decodedCodepoint = "";
        entityName = matched.group("entityName");
        logging.info("entityName=%s", entityName);
        if(entityName in htmlentitydefs.name2codepoint):
            decodedCodepoint = htmlentitydefs.name2codepoint[entityName];
            logging.info("decodedCodepoint=%s", decodedCodepoint);
            unicodeChar = unichr(decodedCodepoint);
            logging.info("unicodeChar=%s", unicodeChar);
        else:
            #invalid key, just omit it
            
            #http://autoexplosion.com/RVs/buy/9882.php
            #Washer&Dryer;, Awning,
            decodedCodepoint = entityName;
        return decodedCodepoint;
    decodedEntityName = re.sub('&(?P<entityName>[a-zA-Z]{2,10});', _nameToCodepoint, origHtml);
    logging.info("decodedEntityName=%s", decodedEntityName);

The error persisted.

4. As a last resort, I went to read the re source code:

E:\dev_install_root\Python27\Lib\re.py

I found the relevant function and added debugging code to it, so that it read:

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    print "pattern=",pattern;
    print "flags=",flags;
    print "repl=",repl;
    print "string=",string;
    print "count=",count;
    return _compile(pattern, flags).sub(repl, string, count)

which produced:

pattern= &(?P<entityName>[a-zA-Z]{2,10});

flags= 0

repl= <function _nameToCodepoint at 0x0000000002C86208>

string= partner=AXX_0&amp;UID=C367031&amp;

count= 0

This also looked normal, with nothing wrong in it.
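Editing the standard library's re.py is not strictly necessary for this kind of debugging. A less invasive option is to wrap the callback and check what it returns before re.sub ever sees it. The following is only a sketch, written for Python 3 (on Python 2 the check would be against basestring); `checked_repl` and `bad_repl` are hypothetical names introduced here for illustration, not part of re or of the original script.

```python
import functools
import re


def checked_repl(repl):
    """Wrap a re.sub callback and fail early if it returns a non-string.

    Hypothetical debugging helper, not part of the re module.
    """
    @functools.wraps(repl)
    def wrapper(matched):
        result = repl(matched)
        if not isinstance(result, str):
            raise TypeError(
                "%s returned %s (%r) for match %r; re.sub needs a string"
                % (repl.__name__, type(result).__name__, result, matched.group(0))
            )
        return result
    return wrapper


def bad_repl(matched):
    # Mimics the mistake in this post: returns the int codepoint of '&'.
    return 38


try:
    re.sub(r"&(?P<entityName>[a-zA-Z]{2,10});", checked_repl(bad_repl), "a&amp;b")
except TypeError as err:
    message = str(err)
    print(message)
```

With a wrapper like this, the exception names the callback and the offending value directly, instead of pointing at a line inside re.py.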

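5. Eventually I spotted my own mistake:

I had carelessly assigned the int value returned by

htmlentitydefs.name2codepoint[entityName];

to decodedCodepoint,

so _nameToCodepoint ended up returning decodedCodepoint, an int, while the unicodeChar it had computed was never returned.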
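This also accounts for the wording of the message. As far as I can tell from CPython's behaviour, re.sub collects the unmatched pieces of the input together with the callback's return values and joins them at the end, so the complaint about "sequence item 1" comes from that join step, not from the input string. The sketch below is written for Python 3, where the message says "str instance" rather than "string or Unicode". It imitates the join with a plain str.join; the `pieces` list is a reconstruction for illustration only, not something re exposes.

```python
import re

# Hypothetical reconstruction of what gets joined for the shortened input:
# text before the match, the callback's return value, text after the match.
pieces = ["partner=AXX_0", 38, "UID=C367031"]

try:
    "".join(pieces)
except TypeError as err:
    join_error = str(err)
    print(join_error)

# The same kind of failure through re.sub itself: the callback returns an int.
try:
    re.sub(r"&(?P<entityName>[a-zA-Z]{2,10});",
           lambda matched: 38,
           "partner=AXX_0&amp;UID=C367031")
    sub_failed = False
except TypeError:
    sub_failed = True
print(sub_failed)
```

Item 1 is the first replacement value, which is why the callback's log lines for both matches appear before the exception does.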

In the end I changed it to:

    def _nameToCodepoint(matched):
        logging.debug("matched=%s", matched);
        wholeStr = matched.group(0);
        logging.debug("wholeStr=%s", wholeStr);
        decodedUnicodeChar = "";
        entityName = matched.group("entityName");
        logging.debug("entityName=%s", entityName);
        if(entityName in htmlentitydefs.name2codepoint):
            decodedCodepoint = htmlentitydefs.name2codepoint[entityName];
            logging.debug("decodedCodepoint=%s", decodedCodepoint);
            decodedUnicodeChar = unichr(decodedCodepoint);
        else:
            #invalid key, just omit it
            
            #http://autoexplosion.com/RVs/buy/9882.php
            #&Dryer;
            #from
            #Washer&Dryer;, Awning,
            decodedUnicodeChar = wholeStr;
        logging.debug("decodedUnicodeChar=%s", decodedUnicodeChar);
        return decodedUnicodeChar;
    decodedEntityName = re.sub('&(?P<entityName>[a-zA-Z]{2,10});', _nameToCodepoint, origHtml);

and it worked correctly.

 

【Summary】

The error was caused by my own carelessness:

the repl function _nameToCodepoint that re.sub calls

returned decodedCodepoint, which is wrongly an int.

Returning decodedUnicodeChar, a proper string, fixes the problem.


