Today we use scrapy to write our first small Python crawler. It does one very simple thing: it gets the title and URL of the page being crawled.

1. Preparation
Press Win + R, type cmd, and press Enter to open a command prompt window.

cd /d D:\


Create a directory

mkdir python-project


Enter the directory

cd python-project

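2. Install scrapy

If you have not installed the conda tool yet, see this installation guide:

Installing Anaconda on Windows 10 and configuring domestic download mirrors_installing scrapy and creating a project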

conda install scrapy

If the installation fails, use the following command instead:

conda install -c conda-forge scrapy

3. Create the project

python -m scrapy startproject csdcb

4. Enter the project

cd csdcb

5. Create the spider

python -m scrapy genspider FistScrapy csdcb.com

6. Edit settings.py and add the following configuration

ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'zh-CN,zh;q=0.9',
}


Uncomment the following three lines

ITEM_PIPELINES = {
   'csdcb.pipelines.CsdcbPipeline': 300,
}
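
Setting ROBOTSTXT_OBEY = False means scrapy no longer consults a site's robots.txt before requesting a page. To get a feel for what that check involves, here is a minimal sketch that uses Python's standard urllib.robotparser module rather than scrapy's own middleware. The rules, the example.com URLs, and the helper name is_allowed are all assumptions made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, made up for illustration only.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /s",
]


def is_allowed(url, user_agent="*", rules=EXAMPLE_RULES):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from a list of lines instead of downloading them.
    parser.parse(rules)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("https://example.com/s?wd=test"))
    print(is_allowed("https://example.com/index.html"))
```

Under these made-up rules any path starting with /s is refused, which is the kind of request a crawler that obeys robots.txt would skip.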


7. File layout
└─csdcb
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
           FistScrapy.py
           __init__.py

8. Write the code

FistScrapy.py

# -*- coding: utf-8 -*-
import scrapy
from csdcb.items import CsdcbItem
class FistscrapySpider(scrapy.Spider):
    name = 'FistScrapy'
    allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=%E9%B1%BC%E7%AB%BF&rsv_pq=c401b1fa00059748&rsv_t=61fbB9BLqd84vika0t7yRD%2BUBTjc8LokKaBV0cxWxvshpxl0vmtphaj9RPA&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=19&rsv_sug1=20&rsv_sug7=100&rsv_sug2=0&inputT=10447&rsv_sug4=12075']
    def parse(self, response):
        print("===========start FistScrapy.py ============== ")
        torrent = CsdcbItem()
        torrent['url'] = response.url
        # get() returns the text of the first matching <title>, or None if there is none
        torrent['title'] = response.xpath("//title/text()").get()
        print(torrent)
        print("=========== end FistScrapy.py ============== ")
        return torrent
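
The XPath expression //title/text() used in parse selects the text inside the page's title element. The sketch below shows the same idea outside scrapy, using the standard library's xml.etree.ElementTree as a stand-in for scrapy's selector. It only works on well-formed markup, and the sample page and the helper name extract_title are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for this illustration.
SAMPLE_PAGE = """<html>
  <head><title>Example page</title></head>
  <body><p>Hello</p></body>
</html>"""


def extract_title(markup):
    """Return the text of the first <title> element, or None if absent."""
    root = ET.fromstring(markup)
    # ".//title" is ElementTree's limited-XPath counterpart of "//title".
    return root.findtext(".//title")


if __name__ == "__main__":
    print(extract_title(SAMPLE_PAGE))
```

Real pages are often not well-formed, which is one reason to rely on scrapy's own selector inside a spider.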

items.py
 

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class CsdcbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    # pass

pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
class CsdcbPipeline(object):
    def process_item(self, item, spider):
        print("==============item========start============")
        print(item)
        print("==============item========end============")
        return item
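
The pipeline above only prints each item. As a possible next step, here is a minimal sketch of a pipeline that writes every item to a file, one JSON object per line. The class name JsonLinesPipeline and the output file name items.jl are assumptions; open_spider, close_spider, and process_item are the hook methods scrapy calls on a pipeline.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write every item as one JSON object per line."""

    def __init__(self, path="items.jl"):
        self.path = path  # output file name is an assumption
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for scrapy Items as well as plain dicts.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

To try it, the class would go into pipelines.py and be registered in ITEM_PIPELINES next to CsdcbPipeline, for example as 'csdcb.pipelines.JsonLinesPipeline' with a number such as 400 (both the entry and the number are assumptions).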

9. Run

python -m scrapy crawl FistScrapy


 

10. Further scrapy learning resources

scrapy Chinese tutorial site: http://www.scrapyd.cn/doc/185.html

Last modified: 2020-01-08 00:58:31