一、Common Commands
- Install: pip install scrapy
- Create a project: scrapy startproject myproject
- Generate a spider: scrapy genspider mydomain mydomain.com
- Run a spider: scrapy crawl mydomain
二、Basic Features
1、Running a Spider from a Script
import subprocess

def run_scrapy_spider(spider_name):
    try:
        # check=True makes a non-zero exit code raise CalledProcessError
        subprocess.run(['scrapy', 'crawl', spider_name], check=True)
        print(f"Spider {spider_name} finished.")
    except subprocess.CalledProcessError as e:
        print(f"Error running spider: {e}")

# Run the spider; replace 'your_spider_name' with your spider's name
run_scrapy_spider('your_spider_name')
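Scrapy also offers an in-process API for this. A minimal sketch using CrawlerProcess, assuming it is executed from the project directory so that get_project_settings() picks up your settings.py, and 'your_spider_name' is again a placeholder:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py and run the spider inside the current process
process = CrawlerProcess(get_project_settings())
process.crawl('your_spider_name')  # the spider's `name` attribute
process.start()  # blocks until the crawl finishes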
2、Matching Multiple URL Types in One Spider
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'my_spider'
    # Initial seed URLs
    start_urls = ['http://example.com/', 'http://example.com/posts']

    # rules must be a tuple; if there is only one Rule, the trailing comma
    # must not be omitted, otherwise it is not a tuple and Scrapy errors out
    rules = (
        # Crawl rule for the first page type;
        # follow=True keeps following links extracted from matched pages
        Rule(LinkExtractor(allow=r'/details/\d+'), callback='parse_details', follow=True),
        # Crawl rule for the second page type
        Rule(LinkExtractor(allow=r'/posts/\d+'), callback='parse_post', follow=True),
    )

    def parse_details(self, response):
        # Parse the first page type
        pass

    def parse_post(self, response):
        # Parse the second page type
        pass
3、Handling Multiple Item Types in One Pipeline
# FirstItem and SecondItem are placeholder item classes defined in your items.py
from myproject.items import FirstItem, SecondItem

class MultipleItemPipeline:
    def process_item(self, item, spider):
        # Map each item class to its handler
        rules = {
            FirstItem: self.handle_first_item,
            SecondItem: self.handle_second_item,
        }
        for item_class, handler in rules.items():
            if isinstance(item, item_class):
                handler(item)
                break
        return item

    def handle_first_item(self, item):
        # Process FirstItem
        pass

    def handle_second_item(self, item):
        # Process SecondItem
        pass
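The pipeline only takes effect after it is enabled in settings.py. A minimal sketch, assuming the class lives in myproject/pipelines.py:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MultipleItemPipeline': 300,  # lower numbers run earlier
}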
4、Setting the User-Agent
A fixed User-Agent can be set with the USER_AGENT option in settings.py; to rotate a random User-Agent per request, write a downloader middleware, as sketched below.
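A minimal sketch of such a middleware; the class name, module path, and user-agent strings are placeholders:

import random

class RandomUserAgentMiddleware:
    # Placeholder list; fill in real browser User-Agent strings
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
    ]

    def process_request(self, request, spider):
        # Assign a random User-Agent to every outgoing request
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None  # continue normal request processing

# settings.py: enable the middleware (path assumes it is in myproject/middlewares.py)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}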
5、Common Settings Options
# Log level (DEBUG / INFO / WARNING / ERROR / CRITICAL)
LOG_LEVEL = 'CRITICAL'
# Proxy: Scrapy has no built-in HTTP_PROXY setting; the built-in HttpProxyMiddleware
# takes the proxy from each request's meta['proxy'] (or from the system proxy
# environment variables), as sketched below
# Robots protocol: whether to obey robots.txt
ROBOTSTXT_OBEY = False
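A minimal sketch of per-request proxy configuration inside a spider; the spider name, URL, and proxy address are placeholders:

import scrapy

class ProxySpider(scrapy.Spider):
    name = 'proxy_spider'

    def start_requests(self):
        # meta['proxy'] is picked up by the built-in HttpProxyMiddleware
        yield scrapy.Request(
            'http://example.com/',
            meta={'proxy': 'http://127.0.0.1:8888'},  # placeholder proxy address
            callback=self.parse,
        )

    def parse(self, response):
        pass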