【部分解决】PySPider出错：TypeError: Object of type ‘ObjectId’ is not JSON serializable

折腾：

【记录】用PySpider去爬取scholastic的绘本书籍数据

期间，去真正Run批量爬取，结果看到输出的log中出错：

[I 181016 09:10:38 result_worker:33] result ScholasticStorybook:b9571bf852d2a10a2f14a999e5f8c51b 
https://www.scholastic.com/content/scholastic/books2/house-of-robots-by-chris-grabenstein
 -> {'originUrl': '
https://www.sch
[E 181016 09:10:38 result_worker:63] Object of type 'ObjectId' is not JSON serializable
    Traceback (most recent call last):
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/result/result_worker.py", line 54, in run
        self.on_result(task, result)
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/result/result_worker.py", line 38, in on_result
        result=result
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/database/sqlite/resultdb.py", line 58, in save
        return self._replace(tablename, **self._stringify(obj))
      File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/database/sqlite/resultdb.py", line 44, in _stringify
        data['result'] = json.dumps(data['result'])
      File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 231, in dumps
        return _default_encoder.encode(obj)
      File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py", line 199, in encode
        chunks = self.iterencode(o, _one_shot=True)
      File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py", line 257, in iterencode
        return _iterencode(o, 0)
      File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py", line 180, in default
        o.__class__.__name__)
    TypeError: Object of type 'ObjectId' is not JSON serializable

以为出错的url是：

https://www.scholastic.com/content/scholastic/books2/house-of-robots-by-chris-grabenstein

后来发现不是。

然后以为是ObjectId，是之前接触到的MongoDB中的变量类型

所以去把代码改为：

class ResultMongo(object):
...
    def on_result(self, result):
        """save result to mongodb"""
        print("ResultMongo on_result: result=%s" % result)
        respResult = None
        if result:
            respResult = self.collection.insert(result)
            print("respResult=%s" % respResult) # respResult=5bc45fad7f4d3847b78e8c69
        # return respResult

即：不去保存和返回mongodb的insert返回的ObjectId类型的变量了。

结果问题依旧。

怀疑是：

class Handler(BaseHandler):
    mongo = ResultMongo()
    print("mongo=%s" % mongo)
    。。。
    def on_result(self, result):
        print("PySpider on_result: result=%s" % result)
        self.mongo.on_result(result) # 执行插入数据的操作
        super(Handler, self).on_result(result) # 调用原有的数据存储

中

super(Handler, self).on_result(result)

的问题，都想要去注释掉呢，反正其实也用不到，数据都已保存到MongoDB了。

然后对于此处错误，开始以为只有一处出错，后来看log才发现：

是多处都出错了

-》证明不是某个url的问题

-》基本上确定就是：

super(Handler, self).on_result(result)

的问题

-》估计是json里面嵌套的值，此处PySPider中无法直接保存，而出错的。

-》打算去注释掉。

先不这么做，先去注释掉要保存的数据中的recommendations：

因为recommendations是对象的列表而不是普通变量类型的列表

-》而其他要保存的字段都是普通的类型，应该不会出现无法保存的问题。

然后再去运行看看：

如果没有出现此处问题，就说明之前猜测是对的。

竟然还是出错，问题依旧：

pyspider TypeError: Object of type ‘ObjectId’ is not JSON serializable

pyspider result_worker TypeError Object of type not JSON serializable

python – TypeError: ObjectId(”) is not JSON serializable – Stack Overflow

tornado&mongo：Object of type ‘ObjectId’ is not JSON serializable – weixin_42581501的博客 – CSDN博客

python – Serializing class instance to JSON – Stack Overflow

pyspider爬虫框架无法返回datetime对象的问题 – 阿超的博客 – CSDN博客

确定就是PySPider中的result_worker的报的错

且此处是不支持ObjectId

-》而ObjectId本身是pymongo的类型

-》所以要去搞清楚，此处保存的数据中，到底哪里包含了：

ObjectId

而此处感觉能和ObjectId有关系的，就只有这一处，所以强制转换为str吧：

    def on_result(self, result):
        """save result to mongodb"""
        print("ResultMongo on_result: result=%s" % result)
        respResult = None
        if result:
            respResult = self.collection.insert(result)
            print("type(respResult)=%s" % type(respResult))
            respResult = str(respResult)
            print("type(respResult)=%s" % type(respResult))
            print("respResult=%s" % respResult) # respResult=5bc45fad7f4d3847b78e8c69
        return respResult

先去调试确保输出是str：

type(respResult)=<class 'bson.objectid.ObjectId'>
type(respResult)=<class 'str'>
respResult=5bc54e0bbfaa44fcce305d8d

然后再去批量爬取，结果：

问题依旧。

去看Results：

http://0.0.0.0:5000/results?project=ScholasticStorybook

果然是空的：

都没有保存成功。

感觉：

super(Handler, self).on_result(result) # 调用原有的数据存储

的写法，难道有问题？

先去加上提调试代码：

        for eachValue in respDict.values():
            print("eachValue=%s, type(eachValue)=%s", eachValue, type(eachValue))

看看保存数据的类型是否全是普通类型

eachValue=
https://www.scholastic.com/content/scholastic/books2/house-of-robots-by-chris-grabenstein
, type(eachValue)=<class 'str'>
eachValue=
https://www.scholastic.com/teachers/books/house-of-robots-by-chris-grabenstein/
, type(eachValue)=<class 'str'>
eachValue=House of Robots, type(eachValue)=<class 'str'>
eachValue=
https://www.scholastic.com/content5/media/products/49/9780545912549_mres.jpg
, type(eachValue)=<class 'str'>
eachValue=['Chris Grabenstein', 'James Patterson'], type(eachValue)=<class 'list'>
eachValue=['Juliana Neufeld'], type(eachValue)=<class 'list'>
eachValue=, type(eachValue)=<class 'str'>
eachValue=0, type(eachValue)=<class 'int'>
eachValue=['3-5', '6-8'], type(eachValue)=<class 'list'>
eachValue=T, type(eachValue)=<class 'str'>
eachValue=750L, type(eachValue)=<class 'str'>
eachValue=, type(eachValue)=<class 'str'>
eachValue=50, type(eachValue)=<class 'str'>
eachValue=Fiction, type(eachValue)=<class 'str'>
eachValue=It's never been easy for Sammy Hayes-Rodriguez to fit in, so he's dreading the day when his genius mom insists he bring her newest invention to school: a walking, talking robot he calls E - for "Error." Sammy's no stranger to robots; his house is full of them. But this one not only thinks it's Sammy's brother; it's actually even nerdier than Sammy. Will E be Sammy's one-way ticket to Loserville? Or will he prove to the world that it's cool to be square? It's a roller-coaster ride for Sammy to discover the amazing secret E holds that could change his family forever, if all goes well on the trial run!, type(eachValue)=<class 'str'>
eachValue=336, type(eachValue)=<class 'int'>
eachValue=9780545912549, type(eachValue)=<class 'str'>
eachValue=['Fitting In', 'Inventors and Inventions', 'Middle School', 'Siblings'], type(eachValue)=<class 'list'>
eachValue=[{'url': '
https://www.scholastic.com/content/scholastic/books2/my-sister-the-vampire-11-vampire-school-dropout-by-sienna-mer
', 'title': 'Vampire School Dropout?'}, {'url': '
https://www.scholastic.com/content/scholastic/books2/middle-school-my-brother-is-a-big-fat-liar-by-james-patterson
', 'title': 'My Brother Is a Big, Fat Liar'}, {'url': '
https://www.scholastic.com/content/scholastic/books2/candy-apple-11-the-sister-switch-by-jane-b-mason
', 'title': 'The Sister Switch'}], type(eachValue)=<class 'list'>

好像是没问题的：

除了recommendations外，

都是str或int，或str的list，都是普通变量，没有ObjectId

算了，还是：

要么注释掉：super(Handler, self).on_result(result) # 调用原有的数据存储

要么想办法找到正确的写法？

先去找找是否有更好的写法

pyspider result worker

pyspider result worker on_result

pyspider on_result

Working with Results – pyspider

实现Pyspider爬虫结果的自定义ResultWorker – 简书

请问,怎么自定义resultdb和resultworker,教程里面写的好模糊. – Google Groups

demo.pyspider.org 部署经验 | Binuxの杂货铺

https://binux.blog/2016/05/deployment-of-demopyspiderorg/

pyspider process和result部分源码分析 – 简书

How to save result? · Issue #79 · binux/pyspider

“super(Handler, self).on_result(result)”

是作者自己这么写的，说明没问题。

Pyspider操作指南 | 思维之海

http://skvel.tk/Pyspider操作指南/

算了，去注释掉：

    def on_result(self, result):
        print("PySpider on_result: result=%s" % result)
        self.mongo.on_result(result) # 执行插入数据的操作
        # super(Handler, self).on_result(result) # 调用原有的数据存储

结果：

终于没有错误了，但是Results中也不会有数据保存了：

【总结】

此处PySpider保存的json字典中，没有特殊的值的类型，都是普通的str，int，str的list等，

尤其是没有（pymongo的）ObjectId

并且：（实际上是没关系，但是以防万一）我ResultMongo的on_result中的（真正是）ObjectId的变量，也去转为str了。

但是结果用：

from pyspider.libs.base_handler import *

import re
import json
# import html
import lxml
from bs4 import BeautifulSoup

from urllib.parse import quote_plus
from pymongo import MongoClient

class ResultMongo(object):

    def __init__(self):
        print("ResultMongo __init__")
        self.client = createMongoClient()
        print("self.client=%s" % self.client)

        self.db = self.client[MONGODB_DB_NAME]
        print("self.db=%s" % self.db) # self.db=Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'Scholastic')

        self.collection = self.db[MONGODB_COLLECTION_NAME]
        print("self.collection=%s" % self.collection) # self.collection=Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'Scholastic'), 'Storybook')

    def __del__(self):
        print("ResultMongo __del__")
        self.client.close()

    def on_result(self, result):
        """save result to mongodb"""
        print("ResultMongo on_result: result=%s" % result)
        respResult = None
        if result:
            respResult = self.collection.insert(result)
            print("type(respResult)=%s" % type(respResult))
            respResult = str(respResult)
            print("type(respResult)=%s" % type(respResult))
            print("respResult=%s" % respResult) # respResult=5bc45fad7f4d3847b78e8c69
        return respResult

class Handler(BaseHandler):
    mongo = ResultMongo()
    print("mongo=%s" % mongo)

        # for debug
        for eachValue in respDict.values():
            print("eachValue=%s, type(eachValue)=%s" % (eachValue, type(eachValue)))
        ...
        return respDict

    def on_result(self, result):
        print("PySpider on_result: result=%s" % result)
        self.mongo.on_result(result) # 执行插入数据的操作
        super(Handler, self).on_result(result) # 调用原有的数据存储

但是竟然竟然报错：

TypeError: Object of type ‘ObjectId’ is not JSON serializable

所以很是诡异。

最后没办法，只有去注释掉PySpider中的保存数据：

# super(Handler, self).on_result(result)

而规避此问题。

-》由此，当然PySpider中webui点击Results的话，也是看不到结果，是空的了。

TODO：

如果以后有机会和时间，再去深入研究，为何出现这么奇怪的问题，找到根本原因。

【后记1】

后来开始真正批量爬取时，又出现此错误了：

所以看来是其他方面的问题，不是此处的问题。

转载请注明：在路上 » 【部分解决】PySPider出错：TypeError: Object of type ‘ObjectId’ is not JSON serializable

Post Views: 1,522

与本文相关的文章