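Today we write our first small Python crawler with scrapy. It does the simplest possible job: it fetches one page and records that page's title and URL.
1. Preparation
Press Win + R, type cmd, and press Enter to open a command prompt window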
cd /d D:\
Create a directory
mkdir python-project
Enter the directory
cd python-project
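2. Install scrapy
If you have not installed conda yet, see this installation guide:
Installing Anaconda on Windows 10 and configuring a domestic download mirror; installing scrapy and creating a project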
conda install scrapy
If the installation fails, install from the conda-forge channel instead
conda install -c conda-forge scrapy
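Before moving on, it can help to confirm that the install landed in the same Python interpreter you will use for the remaining steps. A minimal sketch using only the standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "scrapy" is the import name of the package installed above.
    print("scrapy installed:", is_installed("scrapy"))
```

If this prints False, the command prompt is probably using a different Python than the one conda installed into.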
3. Create the project
python -m scrapy startproject csdcb
4. Enter the project
cd csdcb
5. Create the spider
python -m scrapy genspider FistScrapy csdcb.com
6. Edit settings.py and add the following configuration
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
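As far as I understand it, the entries in DEFAULT_REQUEST_HEADERS are only defaults: a header that a request already sets for itself is left alone. The sketch below imitates that behaviour with plain dictionaries. `apply_default_headers` is a hypothetical helper written for this illustration and is not part of scrapy; unlike real HTTP headers, it compares header names case-sensitively.

```python
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}


def apply_default_headers(request_headers, defaults):
    """Return a copy of request_headers with missing entries filled from defaults."""
    merged = dict(request_headers)
    for name, value in defaults.items():
        # setdefault only writes when the request did not set this header itself.
        merged.setdefault(name, value)
    return merged


if __name__ == "__main__":
    # The request's own Accept-Language wins; Accept comes from the defaults.
    print(apply_default_headers({'Accept-Language': 'en'}, DEFAULT_REQUEST_HEADERS))
```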
Uncomment the following three lines
ITEM_PIPELINES = {
    'csdcb.pipelines.CsdcbPipeline': 300,
}
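The number 300 is the pipeline's order value. When several pipelines are enabled, items pass through them from the lowest number to the highest, and the values are conventionally kept between 0 and 1000. The sketch below only shows the resulting order; `csdcb.pipelines.SaveToFilePipeline` is a made-up second pipeline that does not exist in this project.

```python
ITEM_PIPELINES = {
    'csdcb.pipelines.SaveToFilePipeline': 800,  # hypothetical, for illustration only
    'csdcb.pipelines.CsdcbPipeline': 300,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by order value, lowest (runs first) to highest."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    for path in pipeline_order(ITEM_PIPELINES):
        print(path)
```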
7. File layout
└─csdcb
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            FistScrapy.py
            __init__.py
8. Write the code
FistScrapy.py
# -*- coding: utf-8 -*-
import scrapy

from csdcb.items import CsdcbItem


class FistscrapySpider(scrapy.Spider):
    name = 'FistScrapy'
    # Follow-up requests to domains outside this list are filtered out.
    allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=%E9%B1%BC%E7%AB%BF&rsv_pq=c401b1fa00059748&rsv_t=61fbB9BLqd84vika0t7yRD%2BUBTjc8LokKaBV0cxWxvshpxl0vmtphaj9RPA&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=19&rsv_sug1=20&rsv_sug7=100&rsv_sug2=0&inputT=10447&rsv_sug4=12075']

    def parse(self, response):
        print("===========start FistScrapy.py ============== ")
        torrent = CsdcbItem()
        # URL of the page that was actually fetched
        torrent['url'] = response.url
        # Text of every <title> element; a page normally has exactly one
        titles = response.xpath("//title/text()").extract()
        for title in titles:
            torrent['title'] = title
        print(torrent)
        print("=========== end FistScrapy.py ============== ")
        return torrent
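The XPath `//title/text()` selects the text inside every `<title>` element of the page. To see what that amounts to without launching a crawl, here is a rough stand-in built on the standard library's `html.parser` rather than on scrapy's selectors. `TitleParser` and `extract_titles` are names invented for this sketch, and it is only meant for plain, well-formed markup.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text content of each <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears between <title> and </title> is kept.
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    """Return a list with the text of every <title> element in `html`."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


if __name__ == "__main__":
    page = "<html><head><title>Fishing rods</title></head><body></body></html>"
    print(extract_titles(page))
```

Like the `extract()` call in the spider, this returns a list, which is why the spider loops over `titles` before assigning to the item.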
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class CsdcbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
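A scrapy.Item is used like a dictionary, but only the fields declared on the class may be assigned; assigning any other key raises a KeyError. The stand-in below mimics that one rule with a plain `dict` subclass so it can be tried without scrapy. `FieldCheckedItem` is a hypothetical class for illustration and is not how scrapy implements items.

```python
class FieldCheckedItem(dict):
    """Dictionary that only accepts a fixed set of keys (stand-in for scrapy.Item)."""

    # Same field names as CsdcbItem in items.py
    fields = ("url", "title", "description")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


if __name__ == "__main__":
    item = FieldCheckedItem()
    item["url"] = "https://www.baidu.com/"
    try:
        item["author"] = "nobody"  # not declared, so this is rejected
    except KeyError as exc:
        print("rejected:", exc)
    print(item)
```

This check is why a misspelled field name in the spider fails immediately instead of silently producing an item with the wrong key.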
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class CsdcbPipeline(object):
    def process_item(self, item, spider):
        print("==============item========start============")
        print(item)
        print("==============item========end============")
        return item
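A pipeline is an ordinary class with a `process_item` method, so it can do more than print. As a small sketch of a next step, the class below collapses stray whitespace in the title before passing the item on. `TitleCleanupPipeline` is a hypothetical name; to use something like it in the project, it would have to live in pipelines.py and be listed in ITEM_PIPELINES.

```python
class TitleCleanupPipeline:
    """Collapse runs of whitespace in item['title'] and trim both ends."""

    def process_item(self, item, spider):
        title = item.get('title')
        if title is not None:
            # split() with no arguments drops leading, trailing and repeated whitespace.
            item['title'] = ' '.join(title.split())
        # Returning the item hands it to the next pipeline, if any.
        return item


if __name__ == "__main__":
    # A plain dict stands in for CsdcbItem so the sketch runs without scrapy.
    sample = {'url': 'https://example.com/', 'title': '  Fishing \n rods  '}
    print(TitleCleanupPipeline().process_item(sample, spider=None))
```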
9. Run
python -m scrapy crawl FistScrapy
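The run above only prints the item to the console. scrapy's `-o` option can also write the scraped items to a file, for example `python -m scrapy crawl FistScrapy -o items.json`. The sketch below reads such a file back in; it assumes the file name `items.json` and that the file comes from a single run, and the helper names `load_items` and `summarize` are made up for this example.

```python
import json
from pathlib import Path


def load_items(path):
    """Load the list of scraped items from a JSON feed file."""
    with Path(path).open(encoding='utf-8') as fh:
        return json.load(fh)


def summarize(items):
    """Return one 'title <url>' line per item."""
    return [f"{item.get('title', '')} <{item.get('url', '')}>" for item in items]


if __name__ == "__main__":
    feed = Path('items.json')  # assumed output file name
    if feed.exists():
        for line in summarize(load_items(feed)):
            print(line)
    else:
        print("items.json not found; run the crawl with -o items.json first")
```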
10. To learn more about scrapy, here is a recommended site
scrapy Chinese tutorial site: http://www.scrapyd.cn/doc/185.html
Last modified on 2020-01-08 00:58:31

