最新消息:20210816 当前crifan.com域名已被污染,为防止失联,请关注(页面右下角的)公众号

【已解决】PySpider中PyQuery中把得到的html的text转换为带换行的纯文本字符串

HTML crifan 1168浏览 0评论
折腾:
【记录】用PySpider去爬取scholastic的绘本书籍数据
期间,现在对于:
Riff Raff Sails the High Cheese by Susan Schade | Scholastic
已经用代码:
(注意 html的节点,在PySpider爬取出来的 和 浏览器中的 略有不同)
        description = ""
        descriptionElement = response.doc('div[id="description"]')
        print("descriptionElement=%s" % descriptionElement)
        # descLessElement = descriptionElement.find('span[class="description-less-con"]')
        # print("descLessElement=%s" % descLessElement)
        # descLessText = descLessElement.text()
        # print("descLessText=%s" % descLessText)
        # descMoreLement = descriptionElement.find('span[class="description-more-con"]')
        # print("descMoreLement=%s" % descMoreLement)
        # if descMoreLement:
        #     descMoreText = descMoreLement.text()
        #     print("descMoreText=%s" % descMoreText)
        descriptionText = descriptionElement.text()
        print("descriptionText=%s" % descriptionText)
已经得到了:
descriptionText=Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?<br />With colorful illustrations from Anne Kennedy and lively text from Susan Schade, <i>Riff Raff Sails the High Cheese</i> strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.<br /><i>Riff Raff Sails the High Cheese</i> is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.
现在希望对于此处的,带br,带i的html的内容,转换为字符串,纯文本
-》去掉i,把br变成回车之类的
-》虽然可以手动替换,但是不严谨
所以要去找到更好的办法
pyquery – PyQuery complete API — pyquery 1.2.4 documentation
看了一堆,貌似没什么好办法。
不过看到一个lxml.etree.tostring,所以去:
lxml.etree.tostring html to string
python – Incredibly basic lxml questions: getting HTML/string content of lxml.etree._Element? – Stack Overflow
python – lxml.tostring incorrectly replacing text with HTML entities – Stack Overflow
Python Why lxml etree tostring method returns bytes – Python – Makble
pyspider html remove tag to string
Python code to remove HTML tags from a string – Stack Overflow
用re正则:不方便,还是手动自己替换
用bs:太重了
去试试:
def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
strip – How to remove tags from a string in python using regular expressions? (NOT in HTML) – Stack Overflow
【已解决】Python中xml.etree.ElementTree出错:AttributeError: module ‘xml’ has no attribute ‘etree’
参考:
https://stackoverflow.com/questions/35663687/lxml-tostring-incorrectly-replacing-text-with-html-entities
>>> from lxml import etree, html
>>> element = etree.fromstring('<p>Hel-lo World</p>')
去改为lxml
import lxml
# import xml

def htmlToString(htmlText):
    # return ''.join(xml.etree.ElementTree.fromstring(htmlText).itertext())
    return ''.join(lxml.etree.ElementTree.fromstring(htmlText).itertext())
结果:
出错:
[E 181011 11:55:58 base_handler:203] 'cython_function_or_method' object has no attribute 'fromstring'
    Traceback (most recent call last):
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 176, in _run_task
        return self._run_func(function, response, task)
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 155, in _run_func
        ret = function(*arguments[:len(args) - 1])
      File "<ScholasticStorybook>", line 227, in singleBookCallback
      File "<ScholasticStorybook>", line 19, in htmlToString
    AttributeError: 'cython_function_or_method' object has no attribute 'fromstring'
不过另外又报错:
[E 181011 13:33:22 base_handler:203] syntax error: line 1, column 0
    Traceback (most recent call last):
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 176, in _run_task
        return self._run_func(function, response, task)
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 155, in _run_func
        ret = function(*arguments[:len(args) - 1])
      File "<ScholasticStorybook>", line 232, in singleBookCallback
      File "<ScholasticStorybook>", line 23, in htmlToString
      File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
        parser.feed(text)
      File "<string>", line None
    xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
python html to string
Decode HTML entities in Python string? – Stack Overflow
好像python有内置的html的库,和相关的HTMLParser
web scraping – Converting html to text with Python – Stack Overflow
还是用的bs
python – Convert HTML entities to Unicode and vice versa – Stack Overflow
Extracting text from HTML file using Python – Stack Overflow
其推荐用:html2text
Decoding HTML Entities to Text in Python – fredericiana
EscapingHtml – Python Wiki
html — HyperText Markup Language support — Python 3.7.1rc1 documentation
19.1. HTMLParser — Simple HTML and XHTML parser — Python 2.7.15 documentation
算了,还是用bs吧,毕竟效果够好
默认安装BeautifulSoup会去安装BeautifulSoup 3,所以此处报错,不给安装:
➜  crawler_scholastic_storybook git:(master) ✗ pipenv install BeautifulSoup
Installing BeautifulSoup...
Looking in indexes: 
https://mirrors.ustc.edu.cn/pypi/web/simple
Collecting BeautifulSoup
  Downloading 
https://mirrors.ustc.edu.cn/pypi/web/packages/1e/ee/295988deca1a5a7accd783d0dfe14524867e31abb05b6c0eeceee49c759d/BeautifulSoup-3.2.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/46/2hjxz38n22n3ypp_5f6_p__00000gn/T/pip-install-dxfdripk/BeautifulSoup/setup.py", line 22
        print "Unit tests have failed!"
                                      ^
    SyntaxError: Missing parentheses in call to 'print'. Did you mean print(int "Unit tests have failed!")?

    ----------------------------------------

Error:  An error occurred while installing BeautifulSoup!
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/46/2hjxz38n22n3ypp_5f6_p__00000gn/T/pip-install-dxfdripk/BeautifulSoup/
You are using pip version 18.0, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

This is likely caused by a bug in BeautifulSoup. Report this to its maintainers.
所以去直接换成bs4:
➜  crawler_scholastic_storybook git:(master) ✗ pipenv install bs4
Installing bs4...
Looking in indexes: 
https://mirrors.ustc.edu.cn/pypi/web/simple
Collecting bs4
  Downloading 
https://mirrors.ustc.edu.cn/pypi/web/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
  Downloading 
https://mirrors.ustc.edu.cn/pypi/web/packages/21/0a/47fdf541c97fd9b6a610cb5fd518175308a7cc60569962e776ac52420387/beautifulsoup4-4.6.3-py3-none-any.whl
 (90kB)
Building wheels for collected packages: bs4
  Running setup.py bdist_wheel for bs4: started
  Running setup.py bdist_wheel for bs4: finished with status 'done'
  Stored in directory: /Users/crifan/Library/Caches/pipenv/wheels/d8/e6/2f/a8e9e4058de6bf1a3d0cd64e23ba5fba27e75dc282e47a5077
Successfully built bs4
Installing collected packages: beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.6.3 bs4-0.0.1

Adding bs4 to Pipfile's [packages]...
Pipfile.lock (225a5b) out of date, updating to (4a06ee)...
Locking [dev-packages] dependencies...
Locking [packages] dependencies...
Updated Pipfile.lock (4a06ee)!
Installing dependencies from Pipfile.lock (4a06ee)...
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 27/27 — 00:00
然后去用试试
from bs4 import BeautifulSoup

def htmlToString(htmlText):
    soup = BeautifulSoup(htmlText)
    print("soup=%s" % soup)
    pureText = soup.text
    print("pureText=%s" % pureText)
    return pureText
输出:
其中:
的确是可以得到text,但是缺少了:
把br替换为 换行
python html to text with new line
Python 3: Write newlines to HTML – Stack Overflow
python – how to create the new line character – Stack Overflow
python – Converting HTML to plain text while preserving line breaks – Stack Overflow
是我要问的,但是没有合适的解
Reading a text file with new lines in html with python – Stack Overflow
python html to text br
python – Using beautifulsoup to extract text between line breaks (e.g. <br /> tags) – Stack Overflow
html – python how to extract text after br? – Stack Overflow
先去试试把
<br>
<br/>
<br />
换成\n试试看
代码:
from bs4 import BeautifulSoup

def htmlToString(htmlText, retainNewLine=True):
    if retainNewLine:
        htmlText = htmlText.replace("<br>", '\n')
        htmlText = htmlText.replace("<br/>", '\n')
        htmlText = htmlText.replace("<br />", '\n')

    print("htmlText=%s" % htmlText)
    soup = BeautifulSoup(htmlText)
    print("soup=%s" % soup)
    pureText = soup.text
    print("pureText=%s" % pureText)
    return pureText
是可以到达到我要的效果的:
descriptionText=Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?<br />With colorful illustrations from Anne Kennedy and lively text from Susan Schade, <i>Riff Raff Sails the High Cheese</i> strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.<br /><i>Riff Raff Sails the High Cheese</i> is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.

htmlText=Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?
With colorful illustrations from Anne Kennedy and lively text from Susan Schade, <i>Riff Raff Sails the High Cheese</i> strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.
<i>Riff Raff Sails the High Cheese</i> is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.

soup=<html><body><p>Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?
With colorful illustrations from Anne Kennedy and lively text from Susan Schade, <i>Riff Raff Sails the High Cheese</i> strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.
<i>Riff Raff Sails the High Cheese</i> is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.</p></body></html>

pureText=Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?
With colorful illustrations from Anne Kennedy and lively text from Susan Schade, Riff Raff Sails the High Cheese strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.
Riff Raff Sails the High Cheese is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.
【总结】
此处最终还是选择了(之前觉得有点重量级)的BeautifulSoup,去从:
带<br>和<i>等tag的html的字符串:
Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?<br />With colorful illustrations from Anne Kennedy and lively text from Susan Schade, <i>Riff Raff Sails the High Cheese</i> strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.<br /><i>Riff Raff Sails the High Cheese</i> is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.
在安装bs
pipenv install bs4
后,去用代码:
from bs4 import BeautifulSoup

def htmlToString(htmlText, retainNewLine=True):
    if retainNewLine:
        htmlText = htmlText.replace("
", '\n')
        htmlText = htmlText.replace("<br/>", '\n')
        htmlText = htmlText.replace("<br />", '\n')

    print("htmlText=%s" % htmlText)
    soup = BeautifulSoup(htmlText)
    print("soup=%s" % soup)
    pureText = soup.text
    print("pureText=%s" % pureText)
    return pureText
得到我们要的,带换行的字符串:
Shiver me whiskers! Someone has stolen the mice pirates’ most valuable loot: a great big chunk of cheese! Captain Riff Raff and the gang set out to retrieve the stolen booty, but can the cheese be seized?
With colorful illustrations from Anne Kennedy and lively text from Susan Schade, Riff Raff Sails the High Cheese strengthens reading skills for beginning readers and buccaneers. Mice pirates and young readers use rhyming words and simple wordplay to solve the mystery of the missing cheese.
Riff Raff Sails the High Cheese is a Level Two I Can Read book, geared for kids who read on their own but still need a little help.
【后记】
看到:
https://stackoverflow.com/questions/14694482/converting-html-to-text-with-python
“from bs4 import Beautifulsoup
soup = Beautifulsoup(text)
print(soup.get_text(‘\n’))”
但是有人评论:
“In also places newlines in the middle of sentences if you have e.g. “<p>That’s <strong>not</strong> what I want</p>””
还是自己去看看官网文档:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#get-text
自己去试试get_text()
果然如前面所说的:
  • 直接用get_text():pureText = soup.get_text()
    • -》只能得到文本,但是没有br变成换行
  • 用get_text(‘\n’):pureText = soup.get_text(‘\n’)
    • -》会把所有的tag标签前后都换行
    • 导致<i>Riff Raff Sails the High Cheese</i>,也会变成单独的一行
    • 不是我们希望看到的
所以算了,还是用之前的办法吧。
【后记2】
后来注意到此处有警告:
<ScholasticStorybook>:23: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 23 of the file <ScholasticStorybook>. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
所以参考:
BeautifulSoup Parser Warning · Issue #49 · ckreibich/scholar.py
指定解析器为lxml:
soup = BeautifulSoup(htmlText, "lxml")
即可消除警告。

转载请注明:在路上 » 【已解决】PySpider中PyQuery中把得到的html的text转换为带换行的纯文本字符串

发表我的评论
取消评论

表情

Hi,您需要填写昵称和邮箱!

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址
86 queries in 0.172 seconds, using 22.19MB memory