Python Web Scraping with Scrapy (Part 2)

liuian 2024-12-07 14:59 38 views

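This article covers two techniques.
First: how to handle paginated data;
Second: how to build a db pipeline that takes the scraped data and writes it to a database.

Part 1: Handling pagination

The previous article showed how to extract the data we want from a page, but only from a single page. How do we crawl the data when it is spread across many pages?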

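page = 1
start_urls = []  # this attribute name is defined by the Scrapy framework and must not be renamed
while page < 7:  # set the upper bound to match the number of pages you want to crawl
    print("the page is:", page)
    url = 'http://lab.scrapyd.cn/page/' + str(page)  # build the URL for this page
    start_urls.append(url)  # append it to the list defined above
    page += 1  # advance the counter so the loop can terminate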
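
The loop above can also be written as a single expression. This is a minimal sketch, assuming the same URL pattern and the same six pages; the helper name build_page_urls is hypothetical and not part of Scrapy.

```python
def build_page_urls(base_url, first_page, last_page):
    """Return one URL per page number, first_page..last_page inclusive."""
    return [base_url + str(page) for page in range(first_page, last_page + 1)]


# Same six pages as the while loop: /page/1 ... /page/6
start_urls = build_page_urls('http://lab.scrapyd.cn/page/', 1, 6)
print(start_urls)
```

Keeping the page range in one call makes it easier to adjust when the site grows or shrinks.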


def parse(self, response):
    # Each quote sits in its own <div class="quote post"> block
    for quote in response.xpath('//div[@class="quote post"]'):
        item = LabItem()  # create a fresh item per quote instead of reusing one object
        # The leading "." keeps each query relative to the current quote block
        item["title"] = quote.xpath('.//span[@class="text"]/text()').get()
        item["author"] = quote.xpath('.//small[@class="author"]/text()').get()
        yield item

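Scrapy iterates over the URLs in the start_urls list, sends a request to each one, and parses the page data once the response comes back. The relevant source code is shown in the figure below.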
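
As a rough mental model of that traversal, the sketch below walks a start_urls list, creates one request per URL, and hands each response to parse. It is not Scrapy's actual source, which also handles scheduling, deduplication and asynchronous downloads. FakeRequest, MiniSpider, run and fake_fetch are hypothetical stand-ins invented for this illustration, and no network access takes place.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request: a URL plus the callback for its response."""
    url: str
    callback: Callable


class MiniSpider:
    """Toy model of a spider; not Scrapy's real base class."""
    start_urls: List[str] = [
        'http://lab.scrapyd.cn/page/1',
        'http://lab.scrapyd.cn/page/2',
    ]

    def start_requests(self) -> Iterator[FakeRequest]:
        # One request per entry in start_urls, each routed to parse()
        for url in self.start_urls:
            yield FakeRequest(url=url, callback=self.parse)

    def parse(self, response: str) -> dict:
        # A real spider would extract data here; the toy just echoes its input
        return {"parsed": response}


def fake_fetch(url: str) -> str:
    """Replace the network download so the sketch runs offline."""
    return "<html>body of " + url + "</html>"


def run(spider: MiniSpider, fetch: Callable[[str], str]) -> List[dict]:
    """Fetch every start request and pass the result to its callback."""
    return [req.callback(fetch(req.url)) for req in spider.start_requests()]


results = run(MiniSpider(), fake_fetch)
print(results)
```

The point of the model is only the shape of the flow: list of URLs in, one callback invocation per response out.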

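Part 2: Processing data with a db pipeline

Step 1: Create the database and table, as shown in the figure below.

Step 2: Create the sqlitePipeline class and configure settings.py

The sqlitePipeline class code is as follows: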
import sqlite3


class sqlitePipeline(object):

    def open_spider(self, spider):
        print("Callback when the spider starts: open_spider")
        self.conn = sqlite3.connect("test.db")
        self.cur = self.conn.cursor()
        # Create the target table on the first run
        self.table = '''
        CREATE TABLE IF NOT EXISTS scrapy0725(
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            `author` varchar(255) DEFAULT NULL,
            `title` varchar(2000) DEFAULT NULL
        );
        '''
        self.cur.execute(self.table)

    def process_item(self, item, spider):
        print("Processing an extracted item ==============")
        content = dict(item)
        # ? placeholders let sqlite3 bind the values, so quotes in the text
        # are stored intact and cannot break the SQL statement
        insert_sql = "INSERT INTO scrapy0725 (author, title) VALUES (?, ?)"
        self.cur.execute(insert_sql, (content['author'], content['title']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Print every stored row as a quick check before closing
        sql = "SELECT * FROM scrapy0725"
        result = self.cur.execute(sql)
        for res in result:
            print(res)
        self.cur.close()
        self.conn.close()
        print("Callback when the spider finishes: close_spider")
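
Quote text often contains apostrophes, so how the values reach the INSERT statement matters. The sketch below shows sqlite3's ? placeholders on their own, outside of Scrapy. It assumes the same table layout as the pipeline; the in-memory database and the sample item are made up for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind; the article uses test.db
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE IF NOT EXISTS scrapy0725("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "author varchar(255) DEFAULT NULL, "
    "title varchar(2000) DEFAULT NULL)"
)

# Sample item invented for illustration; note the apostrophe in the title
item = {"author": "Example Author", "title": "It's a test quote"}

# sqlite3 binds each ? to the matching tuple element, so no manual escaping is needed
cur.execute(
    "INSERT INTO scrapy0725 (author, title) VALUES (?, ?)",
    (item["author"], item["title"]),
)
conn.commit()

# Read the row back to confirm the apostrophe survived
stored = cur.execute("SELECT author, title FROM scrapy0725").fetchall()
print(stored)
conn.close()
```

Binding values this way keeps the stored text identical to what was scraped, whereas stripping quotes by hand changes the data.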


settings.py is configured as follows:
ITEM_PIPELINES = {
   'lab.pipelines.sqlitePipeline':500,
   # 'lab.pipelines.FilePipeline': 300,
}
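
The number attached to each pipeline sets its position in the chain: Scrapy passes items through lower values first, and the values are conventionally kept in the 0 to 1000 range. The sketch below shows the resulting order under the assumption that both pipelines were enabled (FilePipeline is commented out in the settings above).

```python
# Hypothetical settings with both pipelines enabled
ITEM_PIPELINES = {
    'lab.pipelines.sqlitePipeline': 500,
    'lab.pipelines.FilePipeline': 300,
}

# Sort by priority value to see the order in which items would flow
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With these values an item would reach FilePipeline before sqlitePipeline.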

Step 3: Run the command scrapy crawl labs

Step 4: Query the database to check that the inserts succeeded, as shown in the figure below.
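
The same check can be done from Python when no database browser is at hand. This is a minimal sketch, assuming the table name scrapy0725 from the pipeline. The helper summarize_table is hypothetical, and an in-memory database with two made-up rows stands in for test.db so that the example runs on its own.

```python
import sqlite3


def summarize_table(conn, limit=5):
    """Return (row_count, first_rows) for the scrapy0725 table."""
    cur = conn.cursor()
    count = cur.execute("SELECT COUNT(*) FROM scrapy0725").fetchone()[0]
    rows = cur.execute(
        "SELECT id, author, title FROM scrapy0725 ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    cur.close()
    return count, rows


# Stand-in data; against the real crawl, use sqlite3.connect("test.db") instead
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scrapy0725("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, author TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO scrapy0725 (author, title) VALUES (?, ?)",
    [("Author A", "First quote"), ("Author B", "Second quote")],
)
conn.commit()

count, rows = summarize_table(conn)
print(count, rows)
conn.close()
```

A row count of zero after a crawl usually means the pipeline was never enabled in the settings or the spider yielded no items.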

Summary:

That is all for this series on Python + Scrapy crawlers for now. Scrapy crawls quite efficiently, so go ahead and try it yourself.
