【Notes】Differences and connections between re.search and re.findall in Python + how re.findall behaves with named groups, unnamed groups, non-capturing groups, and no groups

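These two functions used to confuse me again and again.

After using them many times, I finally have a clear picture of how they differ and how they relate,

including some of the finer points to watch out for.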

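Before reading the summary and code below, make sure you are already familiar with these basics:

【Tutorial】Python regular expressions explained

【Tutorial】Python regular expressions explained: (…) group

【Tutorial】Python regular expressions explained: (?P<name>…) named group

Here is a brief summary: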

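Differences and connections between re.search and re.findall

re.search

Return value: a single Match object (or None if nothing matches).

Usual way to get the values: use the Match object's group number or group name.

Common question: Why does search match only one item instead of all matching items?
Answer: That is simply what search does:
it scans from left to right and returns as soon as it finds a match.
So it returns at most one match, never several.
To get all matches, use re.findall.

re.findall

Return value: a list. The type of each element depends on how the regular expression is written:

a tuple, when the regular expression contains more than one capturing group (roughly: more than one pair of plain parentheses); the tuple holds the value of each group
a string, when the regular expression has no capturing group (no groups at all, or only non-capturing groups); the string is the complete text matched by the regular expression. With exactly one capturing group, each element is also a string, but it holds that group's value rather than the full match.

Usual way to get the values: use the returned list directly;
its elements are usually the values you want.

Note: see the detailed explanation below, and mind how the four types of regular expression differ in effect.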
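
The basic contrast can be seen in a few lines. This is a minimal sketch written for Python 3; the sample text and the pattern are made up for illustration and do not come from the script further down.

```python
import re

# Hypothetical sample text and pattern, for illustration only.
text = "first: cat.jpg, second: dog.jpg, third: bird.jpg"
pattern = r"\w+\.jpg"  # no groups at all

# search stops at the first match and returns a Match object (or None).
first = re.search(pattern, text)

# findall keeps going and returns every non-overlapping match as a list.
everything = re.findall(pattern, text)

print(first.group(0))  # cat.jpg
print(everything)      # ['cat.jpg', 'dog.jpg', 'bird.jpg']

# When nothing matches, search returns None and findall returns an empty list.
print(re.search(pattern, "no pictures here"))   # None
print(re.findall(pattern, "no pictures here"))  # []
```

Because search may return None, it is safest to test the result before calling group on it.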

For re.findall in particular, the four types of regular expression each behave a little differently:

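Type of regular expression passed to re.findall	What the return values have in common	How the return values differ	Use case
No group	Always a list	Each element is the complete matched string	First collect the complete matched strings this way, then process each string to extract the fields (groups) you need
Non-capturing group	Always a list	Each element is the complete matched string	Same as above, except that the form of the regular expression lines up one-to-one with the grouped versions (unnamed or named), which makes the values to be processed later easier to reason about
Unnamed group	Always a list	Each element is a tuple holding the value of every group (with exactly one group, each element is just that group's string)	findall alone gives you the contents of the specific groups for every match, saving a second extraction pass with re.search
Named group	Always a list	Each element is a tuple holding the value of every group (with exactly one group, each element is just that group's string)	Same as above, but the regular expression itself shows more clearly what each group means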
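
The four rows can be compared side by side. The following is a small Python 3 sketch on made-up sample text (the text and patterns are assumptions chosen for brevity, not the ones used in the script below). It also includes the single-group case, where findall returns plain strings holding that group's value instead of tuples.

```python
import re

# Hypothetical sample text, for illustration only.
text = "cat.jpg dog.png"

no_group = re.findall(r"\w+\.\w+", text)
non_capturing = re.findall(r"(?:\w+)\.(?:\w+)", text)
unnamed = re.findall(r"(\w+)\.(\w+)", text)
named = re.findall(r"(?P<name>\w+)\.(?P<ext>\w+)", text)
single_group = re.findall(r"(\w+)\.\w+", text)

print(no_group)       # ['cat.jpg', 'dog.png']
print(non_capturing)  # ['cat.jpg', 'dog.png']
print(unnamed)        # [('cat', 'jpg'), ('dog', 'png')]
print(named)          # [('cat', 'jpg'), ('dog', 'png')]
print(single_group)   # ['cat', 'dog']
```

Note that findall discards the group names: named and unnamed groups produce identical tuples, so the names only help when reading the pattern.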

To really understand the above, a detailed code demonstration helps:

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
【Notes】Differences and connections between re.search and re.findall in Python + how re.findall behaves with named groups, unnamed groups, non-capturing groups, and no groups

Version:    2012-11-16
Author:     Crifan
"""

import re

# Note:
# before reading this tutorial, make sure you already understand the following:
# 【Tutorial】Python regular expressions explained
# https://www.crifan.com/detailed_explanation_about_python_regular_express/
# 【Tutorial】Python regular expressions explained: (…) group
# https://www.crifan.com/detailed_explanation_about_python_regular_express_about_group/
# 【Tutorial】Python regular expressions explained: (?P<name>…) named group
# https://www.crifan.com/detailed_explanation_about_python_regular_express_named_group/

searchVsFindallStr = """
pic url test 1 http://1821.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35f9d5g213.jpg
pic url test 2 http://1881.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35ee46g213.jpg
pic url test 2 http://1802.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae361ac6g213.jpg
"""

# Raw strings keep the backslashes intact; the dot before "jpg" is escaped so it matches a literal "."
singlePicUrlP_noGroup = r"http://\w+\.\w+\.\w+.+?/\w+?\.jpg" # no parentheses, i.e. no group
singlePicUrlP_nonCapturingGroup = r"http://(?:\w+)\.(?:\w+)\.(?:\w+).+?/(?:\w+?)\.jpg" # non-capturing group
singlePicUrlP_namedGroup = r"http://(?P<field1>\w+)\.(?P<field2>\w+)\.(?P<field3>\w+).+?/(?P<filename>\w+?)\.jpg" # named group
singlePicUrlP_unnamedGroup = r"http://(\w+)\.(\w+)\.(\w+).+?/(\w+?)\.jpg" # unnamed group

# 1. re.search
# search can only return a single match,
# because, unlike findall, it does not look for every match
foundSinglePicUrl = re.search(singlePicUrlP_namedGroup, searchVsFindallStr)
# search stops searching as soon as it finds the first match
print "foundSinglePicUrl=",foundSinglePicUrl #foundSinglePicUrl= <_sre.SRE_Match object at 0x01F75230>
# and then returns the corresponding Match object
print "type(foundSinglePicUrl)=",type(foundSinglePicUrl) #type(foundSinglePicUrl)= <type '_sre.SRE_Match'>
if foundSinglePicUrl:
    # if the pattern has parentheses, i.e. groups, the values can be read through group()
    field1 = foundSinglePicUrl.group("field1")
    field2 = foundSinglePicUrl.group("field2")
    field3 = foundSinglePicUrl.group("field3")
    filename = foundSinglePicUrl.group("filename")

    group1 = foundSinglePicUrl.group(1)
    group2 = foundSinglePicUrl.group(2)
    group3 = foundSinglePicUrl.group(3)
    group4 = foundSinglePicUrl.group(4)

    #field1=1821, field2=img, field3=pp, filename=u121516081_136ae35f9d5g213
    print "field1=%s, field2=%s, field3=%s, filename=%s"%(field1, field2, field3, filename)

    # as you can see, named groups are still reachable through the indexes 1, 2, 3, 4
    # the two are equivalent; reading a group by name is just more readable and avoids mixing up group numbers
    #group1=1821, group2=img, group3=pp, group4=u121516081_136ae35f9d5g213
    print "group1=%s, group2=%s, group3=%s, group4=%s"%(group1, group2, group3, group4)

# 2. re.findall - no group
# to get the complete strings from findall, use a pattern without parentheses, i.e. without groups
foundAllPicUrl = re.findall(singlePicUrlP_noGroup, searchVsFindallStr)
# findall finds all the matching strings
print "foundAllPicUrl=",foundAllPicUrl #foundAllPicUrl= ['http://1821.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35f9d5g213.jpg', 'http://1881.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35ee46g213.jpg', 'http://1802.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae361ac6g213.jpg']
# and returns them as a list
print "type(foundAllPicUrl)=",type(foundAllPicUrl) #type(foundAllPicUrl)= <type 'list'>
if foundAllPicUrl:
    for eachPicUrl in foundAllPicUrl:
        print "eachPicUrl=",eachPicUrl # eachPicUrl= http://1821.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35f9d5g213.jpg

        # the usual next step is to take each complete matched string
        # and run re.search on it to extract the values we need
        foundEachPicUrl = re.search(singlePicUrlP_namedGroup, eachPicUrl)
        print "type(foundEachPicUrl)=",type(foundEachPicUrl) #type(foundEachPicUrl)= <type '_sre.SRE_Match'>
        print "foundEachPicUrl=",foundEachPicUrl #foundEachPicUrl= <_sre.SRE_Match object at 0x025D45F8>
        if foundEachPicUrl:
            field1 = foundEachPicUrl.group("field1")
            field2 = foundEachPicUrl.group("field2")
            field3 = foundEachPicUrl.group("field3")
            filename = foundEachPicUrl.group("filename")

            #field1=1821, field2=img, field3=pp, filename=u121516081_136ae35f9d5g213
            print "field1=%s, field2=%s, field3=%s, filename=%s"%(field1, field2, field3, filename)

# 3. re.findall - non-capturing group
# using findall with non-capturing groups gives the same result as the no-group pattern above:
foundAllPicUrlNonCapturing = re.findall(singlePicUrlP_nonCapturingGroup, searchVsFindallStr)
# findall again finds all the complete matching strings
print "foundAllPicUrlNonCapturing=",foundAllPicUrlNonCapturing #foundAllPicUrlNonCapturing= ['http://1821.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35f9d5g213.jpg', 'http://1881.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35ee46g213.jpg', 'http://1802.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae361ac6g213.jpg']
# and again returns them as a list
print "type(foundAllPicUrlNonCapturing)=",type(foundAllPicUrlNonCapturing) #type(foundAllPicUrlNonCapturing)= <type 'list'>
if foundAllPicUrlNonCapturing:
    for eachPicUrlNonCapturing in foundAllPicUrlNonCapturing:
        print "eachPicUrlNonCapturing=",eachPicUrlNonCapturing #eachPicUrlNonCapturing= http://1821.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35f9d5g213.jpg

        # as in the no-group example above, you can now process each string to extract the values you need

# 4. re.findall - named group
# next, the result of using named groups with findall:
foundAllPicGroups = re.findall(singlePicUrlP_namedGroup, searchVsFindallStr)
# it still finds every match
# and still returns a list
print "type(foundAllPicGroups)=",type(foundAllPicGroups) #type(foundAllPicGroups)= <type 'list'>
# but each element of the list now holds the values of the individual groups
print "foundAllPicGroups=",foundAllPicGroups #foundAllPicGroups= [('1821', 'img', 'pp', 'u121516081_136ae35f9d5g213'), ('1881', 'img', 'pp', 'u121516081_136ae35ee46g213'), ('1802', 'img', 'pp', 'u121516081_136ae361ac6g213')]
if foundAllPicGroups:
    for eachPicGroups in foundAllPicGroups:
        # since the groups are named, the values correspond in order to
        # (?P<field1>\w+) (?P<field2>\w+) (?P<field3>\w+) (?P<filename>\w+?)
        print "eachPicGroups=",eachPicGroups #eachPicGroups= ('1821', 'img', 'pp', 'u121516081_136ae35f9d5g213')
        # because there are several groups, each element is a tuple made up of the four groups above
        print "type(eachPicGroups)=",type(eachPicGroups) #type(eachPicGroups)= <type 'tuple'>

        # the group values are already extracted here, so each tuple can be used directly as needed

# 5. re.findall - unnamed group
# finally, the effect of using groups that are not named (unnamed groups) with findall:
foundAllPicGroupsUnnamed = re.findall(singlePicUrlP_unnamedGroup, searchVsFindallStr)
# this also returns a list
print "type(foundAllPicGroupsUnnamed)=",type(foundAllPicGroupsUnnamed) #type(foundAllPicGroupsUnnamed)= <type 'list'>
# and each element of the list is again the combination of the group values
print "foundAllPicGroupsUnnamed=",foundAllPicGroupsUnnamed #foundAllPicGroupsUnnamed= [('1821', 'img', 'pp', 'u121516081_136ae35f9d5g213'), ('1881', 'img', 'pp', 'u121516081_136ae35ee46g213'), ('1802', 'img', 'pp', 'u121516081_136ae361ac6g213')]
if foundAllPicGroupsUnnamed:
    for eachPicGroupsUnnamed in foundAllPicGroupsUnnamed:
        # as before, each element is a tuple
        print "type(eachPicGroupsUnnamed)=",type(eachPicGroupsUnnamed) #type(eachPicGroupsUnnamed)= <type 'tuple'>
        # and each tuple holds the values of the unnamed groups
        print "eachPicGroupsUnnamed=",eachPicGroupsUnnamed #eachPicGroupsUnnamed= ('1821', 'img', 'pp', 'u121516081_136ae35f9d5g213')

        # the group values are already extracted here, so each tuple can be used directly as needed
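
The equivalence between group names and group numbers noted in part 1 of the script can be checked directly. This is a minimal Python 3 sketch; the URL and the simplified pattern are hypothetical and chosen only to keep the example short.

```python
import re

# Hypothetical sample URL and simplified pattern, for illustration only.
url = "http://img.example.com/photos/cat.jpg"
pattern = r"http://(?P<host>[\w.]+)/.*/(?P<filename>\w+)\.jpg"

match = re.search(pattern, url)
if match:
    # A named group is still numbered, so both lookups give the same value.
    print(match.group("host") == match.group(1))      # True
    print(match.group("filename") == match.group(2))  # True

    print(match.group(0))     # the complete matched string
    print(match.groups())     # ('img.example.com', 'cat')
    print(match.groupdict())  # only the named groups, keyed by name
```

A Match object therefore gives access to both the complete string (group 0) and the individual groups, which the tuples returned by findall do not.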

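【Summary】

The simplest summary:

re.search looks for a single match in a string and extracts the individual fields from it, i.e. the values of the different groups;

re.findall extracts all matches in one pass, each element being either

a single complete string (which you can then pass to re.search to extract the values of the different groups), or
a tuple containing the value of each group -> which saves a further re.search to extract the group values.

Pick whichever function fits your needs.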

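One more reminder, based on a situation I ran into:

I needed every matching complete string (the picture URLs);

and for each picture URL I also needed to download the picture and extract the individual fields from the URL.

In that case re.findall with named groups alone does not work, because it returns only the group values and not the complete string.

What I did instead was the approach shown above:

first use re.findall to get each complete matching string;

then, for each string, download the picture and use re.search to extract the fields needed.

So, once again: choose the function that fits your actual requirements.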
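
The two-step approach described here might look roughly like the following Python 3 sketch. The sample text is made up, the patterns are shortened variants of the ones in the script, and download_picture is a hypothetical helper that is only mentioned in a comment, not implemented.

```python
import re

# Hypothetical sample text, for illustration only.
text = """
pic 1 http://a1.img.example.com/images/2012/cat.jpg
pic 2 http://b2.img.example.com/images/2012/dog.jpg
"""

url_no_group = r"http://\w+\.\w+\.\w+.+?/\w+?\.jpg"
url_named = (r"http://(?P<field1>\w+)\.(?P<field2>\w+)\.(?P<field3>\w+)"
             r".+?/(?P<filename>\w+?)\.jpg")

results = []
# Step 1: findall without groups returns each complete URL.
for url in re.findall(url_no_group, text):
    # Step 2: search with named groups extracts the fields from that URL.
    match = re.search(url_named, url)
    if match:
        # A call such as download_picture(url) would go here (hypothetical helper).
        results.append((url, match.group("field1"), match.group("filename")))

print(results)

# An alternative worth knowing: re.finditer yields Match objects, so the
# complete string (group 0) and the named groups are available in one pass.
one_pass = [(m.group(0), m.group("field1"), m.group("filename"))
            for m in re.finditer(url_named, text)]
print(one_pass == results)  # True
```

The two-step version mirrors what the post describes; the finditer variant is an additional option, not the method used in the original script.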
