【教程】详解Python正则表达式之： ‘|’ vertical bar 竖杠

Python的手册中，是这么解释的：

'|'
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].

关于竖杠，表示或者的关系，其简单的用法，暂且不多解释。

下面只针对相对复杂一些的用法中，竖杠’|’，即或者，是如何使用的。

1.使用竖杠，匹配多个可能的字符串中的其中一种

最经典的例子要数匹配文件名后缀，图片后缀了。

此处，以图片后缀为例。

【需求】

希望匹配各种图片后缀，比如jpg，jpeg，gif，png，bmp等等，其中的一种。

【代码】

经过折腾，下述代码，是可以正常匹配的：

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
【教程】详解Python正则表达式之： '|' vertical bar 竖杠
【教程】详解Python正则表达式之： ‘|’ vertical bar 竖杠


Version:    2012-11-05
Author:     Crifan
"""

import re;

testStrList = [
    "http://www.example.com/picture_name.jpg",
    "http://www.example.com/picture_name.jpeg",
    "http://www.example.com/picture_name.gif",
    "http://www.example.com/picture_name.bmp",
    "http://www.example.com/picture_name.png",
    "http://www.example.com/picture_name.ige",
];

for eachTestStr in testStrList:
    #foundPictureSuffix = re.search("http://[^:]+?\.(?P<pictureSuffix>[(jpg)|(jpeg)|(gif)|(png)|(bmp)])", eachTestStr); # all will match e
    #foundPictureSuffix = re.search("http://[^:]+?\.(?P<pictureSuffix>[(jpg)|(jpeg)|(gif)|(png)|(bmp)]{3,4})", eachTestStr); # also match .ige
    foundPictureSuffix = re.search("http://[^:]+?\.(?P<pictureSuffix>(jpg)|(jpeg)|(gif)|(png)|(bmp))", eachTestStr); # work ok, not match ige
    
    if(foundPictureSuffix):
        print "eachTestStr=%s, pictureSuffix=%s"%(eachTestStr, foundPictureSuffix.group("pictureSuffix"));
    else:
        print "eachTestStr=%s, pictureSuffix=NOT MATCH"%(eachTestStr);

print "-----------------------------------------------------------------------";

for eachTestStr in testStrList:
    foundPictureSuffixNoGroupName = re.search("http://[^:]+?\.((jpg)|(jpeg)|(gif)|(png)|(bmp))", eachTestStr); # work ok, not match ige
    #foundPictureSuffixNoGroupName = re.search("http://[^:]+?\.(jpg)|(jpeg)|(gif)|(png)|(bmp)", eachTestStr); # only match .jpg
    #foundPictureSuffixNoGroupName = re.search("http://[^:]+?(\.jpg)|(\.jpeg)|(\.gif)|(\.png)|(\.bmp)", eachTestStr); #still only match .jpg
    #foundPictureSuffixNoGroupName = re.search("http://[^:]+?[(\.jpg)|(\.jpeg)|(\.gif)|(\.png)|(\.bmp)]", eachTestStr); # will error: IndexError: no such group
    #foundPictureSuffixNoGroupName = re.search("http://[^:]+?([(\.jpg)|(\.jpeg)|(\.gif)|(\.png)|(\.bmp)])", eachTestStr); # only match .
    
    if(foundPictureSuffixNoGroupName):
        print "eachTestStr=%s, pictureSuffix=%s"%(eachTestStr, foundPictureSuffixNoGroupName.group(1));
    else:
        print "eachTestStr=%s, pictureSuffix=NOT MATCH"%(eachTestStr);

其中，被注释掉的代码，是各种尝试，其中对应的输出，已经标注出来了，感兴趣的，自己去试试即可。

【总结】

总的来说，还是在最外层需要一个圆括号，表示一个group，然后group内部，再用多个竖杠，区分出多个可能的字符串，然后每个字符串，也被括号括起来，表示自己本身是完整的集合，需要严格匹配的。

即，类似于

((aaa)|(bbb)|(ccc))

之类的形式，就可以匹配多种字符串中的其中一种了。

另外对应的，如果想要给组添加命名，则就是这样的形式了：

(?P<groupName>(aaa)|(bbb)|(ccc))

转载请注明：在路上 » 【教程】详解Python正则表达式之： ‘|’ vertical bar 竖杠

Post Views: 3,736

与本文相关的文章