[Solved] Beautiful Soup 4: finding a p whose text contains a given value, then the href of the a inside that p

While working on:

[Solved] How to parse a partial HTML string in Scrapy's Python code and format it as HTML page source

I ran into

the following HTML:

<code>&lt;h4&gt;Subtitles in .srt format:&lt;/h4&gt;
&lt;p&gt;
&lt;a href="/download/?url=http%3A%2F%2Fwww.youtube.com%2Fapi%2Ftimedtext%3Fsignature%3D28DC9DEF32E87382F97D0A1EA5C18C598FBBF791.5BE63BA820226EC354042E43654F6F35308A1D49%26hl%3Den_US%26expire%3D1520328653%26caps%3Dasr%26v%3DUhO0bkdC4pQ%26asr_langs%3Dru%252Cko%252Cde%252Cpt%252Cja%252Cnl%252Cen%252Cit%252Ces%252Cfr%26key%3Dyttt1%26sparams%3Dasr_langs%252Ccaps%252Cv%252Cexpire%26kind%3Dasr%26lang%3Den&amp;title=Humf+-+23+Humf+Bakes+Biscuits+%28full+episode%29"&gt;
&lt;b&gt;Download&lt;/b&gt;
&lt;/a&gt;&amp;nbsp;&amp;nbsp;English (auto-generated)
&lt;br/&gt;
&lt;/p&gt;
&lt;br/&gt;
&lt;h4&gt;Translated Subtitles:&lt;/h4&gt;
&lt;p&gt;
&lt;a href="/download/?url=http%3A%2F%2Fwww.youtube.com%2Fapi%2Ftimedtext%3Fsignature%3D28DC9DEF32E87382F97D0A1EA5C18C598FBBF791.5BE63BA820226EC354042E43654F6F35308A1D49%26hl%3Den_US%26expire%3D1520328653%26caps%3Dasr%26v%3DUhO0bkdC4pQ%26asr_langs%3Dru%252Cko%252Cde%252Cpt%252Cja%252Cnl%252Cen%252Cit%252Ces%252Cfr%26key%3Dyttt1%26sparams%3Dasr_langs%252Ccaps%252Cv%252Cexpire%26kind%3Dasr%26lang%3Den%26tlang%3Daf&amp;title=Humf+-+23+Humf+Bakes+Biscuits+%28full+episode%29"&gt;
&lt;b&gt;Download&lt;/b&gt;
&lt;/a&gt;&amp;nbsp;&amp;nbsp;Afrikaans
&lt;br/&gt;
&lt;/p&gt;
...
</code>

What I wanted to extract:

first check that the content under the p is English (auto-generated),

then get the href of the a under that p.

Reference:

Beautiful Soup 4.2.0 documentation

I first tried:

<code># englishNode = linksSoup.find("p", text=re.compile(r"English \(auto-generated\)"))
# englishNode = linksSoup.find("p", text=re.compile("English"))
# englishNode = linksSoup.find("p", text=re.compile("English \(auto\-generated\)"))
</code>

None of these found anything.

So I searched for just the p:

<code>englishNode = linksSoup.find("p")
</code>

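That did find a node, but in the debugger:

the children of the p showed up as a list rather than as nested HTML elements.

I was wondering how to make soup find the p whose [1] child contains the string

when I noticed that

the value of text is actually:

u'Download English (auto-generated)'

So it should still be searchable.

Something is odd here, though. Debugging confirms that:

englishNode.text

is exactly what we want: u'Download English (auto-generated)'

So why does:

# englishNode = linksSoup.find("p", text=re.compile(".*?English.*"))

find nothing?

I then spent quite a while trying more variants: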

<code># englishNode = linksSoup.find("p", text=re.compile(r"English \(auto-generated\)"))
# englishNode = linksSoup.find("p", text=re.compile("English"))
# englishNode = linksSoup.find("p")
# englishNode = linksSoup.find("p", text=re.compile(".*?English \(auto\-generated\)"))
# englishNode = linksSoup.find("p", text=re.compile(".*?English.*"))
# englishNode = linksSoup.find("p", text=re.compile(r".*English.*"))
# englishNode = linksSoup.find("p", text="Download  English (auto-generated)")
# englishNode = linksSoup.find_all("p", text=re.compile(r".*English.*"))
# englishNode = linksSoup.find("p", text=re.compile("^Download"))
englishNode = linksSoup.find("p", text=re.compile("^Download"))
self.logger.info("englishNode.contents=%s", englishNode.contents)
self.logger.info("englishNode.text=%s", englishNode.text)
self.logger.info("englishNode=%s", englishNode)
</code>

Still no luck.

Searched for: bs4 re.compile not work

Found: python - regex not working in bs4 - Stack Overflow

Tried the approach from there,

and indeed this works:

<code>englishNode = linksSoup.find(lambda tag : tag.name == "p" and "English" in tag.get_text())
</code>

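This found the node I wanted.

The reason:

the p node here has three child elements, and only one of them is a plain string.

So the .string of this p is None,

but its .text does have a value,

which is why matching against the text finds the node.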
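The difference between .string and the text can be checked directly. The following is a minimal sketch, assuming the built-in html.parser backend and a shortened placeholder href (not the real download URL from the page above):

```python
from bs4 import BeautifulSoup

# Simplified version of the <p> above; the href is a placeholder.
html = """<p>
<a href="/download/?lang=en"><b>Download</b></a>&nbsp;&nbsp;English (auto-generated)
<br/>
</p>"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# The p has several children (an <a>, a string, a <br/>, plus whitespace),
# so bs4 cannot pick a single string for it.
p_string = p.string

# get_text() concatenates every string found below the tag.
p_text = p.get_text()

# The <a> has exactly one child (<b>), which has exactly one string,
# so .string resolves all the way down.
a_string = p.a.string

print(repr(p_string))
print(repr(p_text))
print(repr(a_string))
```

Here p_string is None while p_text still contains "English (auto-generated)", which matches what the debugger showed.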

[Summary]

The bs4 search with re.compile used here was:

<code>englishNode = linksSoup.find("p", text=re.compile(".*?English.*"))
</code>

The reason it finds no node:

in bs4, the definition of .string has changed:

when a node has multiple child nodes, as with this one:

<code>&lt;p&gt;
&lt;a href="/download/?url=http%3A%2F%2Fwww.youtube.com…..."&gt;
&lt;b&gt;Download&lt;/b&gt;
&lt;/a&gt;&amp;nbsp;&amp;nbsp;English (auto-generated)
&lt;br/&gt;
&lt;/p&gt;
</code>

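bs4 cannot tell which child .string should refer to, so .string defaults to None.

.string only has a value when the node has a single child (a string, or a single tag that itself has a .string).

Here the p has multiple children, and text=re.compile is matched against .string internally, so nothing is found.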
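This behaviour can be reproduced with two paragraphs that contain the same words. A small sketch, assuming bs4 4.4.0 or later, where the text argument is also available under the name string; the id attributes and the "#" href are made up for the illustration:

```python
import re
from bs4 import BeautifulSoup

# Two paragraphs with the same words: one holds a single string,
# the other mixes a tag, a string and a <br/>.
html = """
<p id="simple">English (auto-generated)</p>
<p id="mixed"><a href="#"><b>Download</b></a>&nbsp;&nbsp;English (auto-generated)<br/></p>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("English")

# The string filter is compared against each tag's .string,
# which is None for the mixed-content paragraph.
matches = soup.find_all("p", string=pattern)
ids = [p["id"] for p in matches]

print(ids)
```

Only the paragraph with a single string child is returned; the mixed one is skipped even though its text contains "English".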

The fix:

match against the text instead.

Like this:

<code>englishNode = linksSoup.find(lambda tag : tag.name == "p" and "English" in tag.get_text())
</code>

With that, the node can be found.
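To finish the original goal, the href of the a under the matched p still has to be read. A possible sketch follows; find_subtitle_href is a hypothetical helper name, the hrefs are shortened placeholders, and html.parser is assumed as the parser:

```python
from bs4 import BeautifulSoup


def find_subtitle_href(html, label):
    """Return the href of the first <a> inside the <p> whose text contains label."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on the combined text of the <p>, not on .string.
    node = soup.find(lambda tag: tag.name == "p" and label in tag.get_text())
    if node is None:
        return None
    # href=True keeps only <a> tags that actually have an href attribute.
    link = node.find("a", href=True)
    return link["href"] if link is not None else None


# Same structure as the page above, with placeholder hrefs.
sample = """<h4>Subtitles in .srt format:</h4>
<p>
<a href="/download/?lang=en">
<b>Download</b>
</a>&nbsp;&nbsp;English (auto-generated)
<br/>
</p>
<h4>Translated Subtitles:</h4>
<p>
<a href="/download/?lang=en&amp;tlang=af">
<b>Download</b>
</a>&nbsp;&nbsp;Afrikaans
<br/>
</p>"""

english_href = find_subtitle_href(sample, "English (auto-generated)")
afrikaans_href = find_subtitle_href(sample, "Afrikaans")
print(english_href)
print(afrikaans_href)
```

Returning None when either the p or the a is missing avoids an AttributeError on pages that have no matching subtitle entry.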

