6.1. Watch out for the vertical bar "|" when searching with the re module

On one occasion, given the string

footerUni = u"分类: | 标签:"

I used:


import re

# The trailing "|" is not escaped here, so it acts as alternation, not as a literal bar
foundCatZhcn = re.search(u"分类:(?P<catName>.+)|", footerUni)
print "foundCatZhcn=", foundCatZhcn
if foundCatZhcn:
    print "foundCatZhcn.group(0)=", foundCatZhcn.group(0)
    print "foundCatZhcn.group(1)=", foundCatZhcn.group(1)
    catName = foundCatZhcn.group("catName")
    print "catName=", catName

    

The result, however, was:


foundCatZhcn= <_sre.SRE_Match object at 0x027E3C20>
foundCatZhcn.group(0)=
foundCatZhcn.group(1)= None
catName= None

    

Here group(0) is not the whole matched string that I expected, and group(1) should have been a single space character rather than None.

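After a long debugging session I finally found the cause: in a regular expression the vertical bar "|" means "or". The Python documentation for the re module describes it as follows: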

'|'

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].

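So the pattern above really means "分类:(?P<catName>.+) or the empty string". The empty branch matches without consuming any characters, which is why the search returned an empty match with catName left unset.

This also explains why, during testing, foundCatZhcn always came back as a non-None match object no matter how the rest of the expression was changed: the empty branch can match anywhere.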
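The behaviour described above can be reproduced with a small sketch. The sample text and the variable names (sample, loose, unrelated) are made up for illustration and are not from the original code; the sample deliberately has some text before the label, so the left-hand branch cannot match at position 0 and the empty branch wins there.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Hypothetical sample text: the label "分类:" does not start at position 0
sample = u"footer 分类: | 标签:"

# Unescaped trailing "|": the pattern is "left branch OR empty string"
loose = re.search(u"分类:(?P<catName>.+)|", sample)

# At position 0 the left branch fails, so the empty branch matches there
print(repr(loose.group(0)))        # an empty string
print(loose.group("catName"))      # None, the group took no part in the match

# The empty branch even "finds" something in completely unrelated text
unrelated = re.search(u"分类:(?P<catName>.+)|", u"nothing to see here")
print(unrelated is not None)       # True
```

Because the empty branch always succeeds, testing the result with `if foundCatZhcn:` says nothing about whether the label was actually found.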

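The fix is to put a backslash in front of the vertical bar so that it stands for the literal character (written as \\| below because the pattern is not a raw string). The group is also made non-greedy with .*? so that it stops at the first bar:

foundCatZhcn = re.search(u"分类:(?P<catName>.*?)\\|", footerUni)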

Only then do we get the result we actually want.
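As a rough check of the fix, the sketch below runs the escaped pattern against the same text as footerUni. It also tries the character-class form [|] mentioned in the documentation quote, plus re.escape, a standard-library helper that the original text does not use and which is shown here only as an alternative. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

footer_uni = u"分类: | 标签:"  # same text as footerUni above

# Three ways to require a literal vertical bar after the named group
escaped = re.search(u"分类:(?P<catName>.*?)\\|", footer_uni)
in_class = re.search(u"分类:(?P<catName>.*?)[|]", footer_uni)
via_escape = re.search(u"分类:(?P<catName>.*?)" + re.escape(u"|"), footer_uni)

for m in (escaped, in_class, via_escape):
    # group(0) is the text up to and including the bar; catName is one space
    print(repr(m.group(0)), repr(m.group("catName")))

# Without a literal bar in the text, the search now fails and returns None
no_bar = re.search(u"分类:(?P<catName>.*?)\\|", u"分类: 标签:")
print(no_bar)
```

With the bar escaped, a None result once again means "not found", so the `if foundCatZhcn:` check becomes meaningful.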