How to Scrape WHOIS Information with Scrapy

January 8, 2023 | Category: 【Tech】

Reference: http://www.xiaoxiaoguo.cn/python/scrapy-chinaz.html

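【Requirements】

The registries of some domain suffixes do not provide a WHOIS server and offer only a web-based WHOIS lookup. The following is an attempt to scrape WHOIS information with Scrapy.

1. Understand XPath, which is needed to extract the relevant content.
2. Understand items and pipelines, which make it easier to store and process the scraped content.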

【Setup】

Create a new Scrapy project, for example one named whois_web. Note that project names must begin with a letter and contain only letters, numbers and underscores.

Run the following command:

scrapy startproject whois_web

Output:

New Scrapy project 'whois_web', using template directory '/home/klaudius/.local/lib/python3.8/site-packages/scrapy/templates/project', created in:
    /home/klaudius/whois_web

You can start your first spider with:
    cd whois_web
    scrapy genspider example example.com
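
The naming rule noted above can be checked before running scrapy startproject. The following is a minimal sketch whose regular expression simply mirrors the wording of the note. The helper name is_valid_project_name is hypothetical, and this is not Scrapy's own check, which may apply further restrictions.

```python
import re

# Mirrors the note: starts with a letter, then only letters, numbers and underscores.
# Assumption: ASCII letters only; Scrapy itself may apply additional checks.
PROJECT_NAME_RE = re.compile(r'[A-Za-z][A-Za-z0-9_]*')


def is_valid_project_name(name):
    """Return True if name matches the rule quoted in the note."""
    return PROJECT_NAME_RE.fullmatch(name) is not None


print(is_valid_project_name('whois_web'))   # True
print(is_valid_project_name('whois-web'))   # False (hyphen is not allowed)
```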

By default, the project structure looks like this:

whois_web
├── whois_web
│   ├── __init__.py
│   ├── __init__.pyo
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __init__.pyo
└── scrapy.cfg

scrapy.cfg is the configuration file for the whole project; it does not need to be modified here.
The whois_web/whois_web/spiders/ directory holds the spider programs.

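【Analysis】

Analyze the target website with two goals:
1. Work out how to handle the interaction when querying multiple domains
2. Find the XPath of the relevant information

For example, the public WHOIS server for .TO returns only DNS server information, while the web query also returns date information.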

https://www.tonic.to/whois?register.to
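
For the first goal, the example URL above suggests that the domain is passed directly after whois? as the query string. The following is a minimal sketch that builds one query URL per domain, assuming this pattern holds for every .to domain. The function name build_whois_urls and the domain example.to are hypothetical.

```python
def build_whois_urls(domains, base='https://www.tonic.to/whois?'):
    """Turn domain names into web WHOIS query URLs, skipping blank entries."""
    urls = []
    for domain in domains:
        domain = domain.strip()
        if domain:
            urls.append(base + domain)
    return urls


# 'register.to' is the example from this article; 'example.to' is a placeholder.
for url in build_whois_urls(['register.to', ' example.to\n', '']):
    print(url)
```

A list built this way could serve as the start_urls of a spider, so each domain is fetched as a separate request.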

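Open the register.to query URL in Google Chrome, right-click the page and choose Inspect.
Then right-click the WHOIS data to be scraped and choose Copy > Copy XPath.

This shows that all of the content sits under /html/body/pre. The corresponding content is: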

<pre>Domain:               register.to
Created on:           Tue Dec 18 15:28:03 2012
Last edited on:       Wed Jul 13 10:14:22 2022
Expires on:           Tue Dec 18 15:28:03 2131
Primary host add:     162.159.8.97
Primary host name:    NS11.REGISTER.TO
Secondary host add:   162.159.9.187
Secondary host name:  NS12.REGISTER.TO

END
</pre>
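
Since the whole record is plain text inside a single pre element, it can be split into fields after extraction. The following is a minimal sketch using only the standard library. The function name parse_whois_text is hypothetical, and the date format is inferred from the sample above. In a Scrapy callback the text could probably be obtained with something like response.xpath('/html/body/pre/text()').get(), but that step is not shown here.

```python
from datetime import datetime

# Sample text copied from the <pre> block shown above.
SAMPLE = """Domain:               register.to
Created on:           Tue Dec 18 15:28:03 2012
Last edited on:       Wed Jul 13 10:14:22 2022
Expires on:           Tue Dec 18 15:28:03 2131
Primary host add:     162.159.8.97
Primary host name:    NS11.REGISTER.TO
Secondary host add:   162.159.9.187
Secondary host name:  NS12.REGISTER.TO

END
"""

DATE_FIELDS = ('Created on', 'Last edited on', 'Expires on')


def parse_whois_text(text):
    """Parse 'Key:   value' lines into a dict; date fields become datetime objects."""
    record = {}
    for line in text.splitlines():
        # Split at the first colon only, because time values contain colons too.
        key, sep, value = line.partition(':')
        if not sep:
            continue  # blank lines and the END marker have no colon
        key, value = key.strip(), value.strip()
        if key in DATE_FIELDS:
            # Assumes English day and month names (the default C locale).
            value = datetime.strptime(value, '%a %b %d %H:%M:%S %Y')
        record[key] = value
    return record


record = parse_whois_text(SAMPLE)
print(record['Domain'])            # register.to
print(record['Expires on'].year)   # 2131
```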

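【Code】

Here is the code. Note that the code below targets whois.chinaz.com and uses the project name chinaz (see the imports and settings), not whois_web.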

1. whois_web_spider.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import scrapy
from chinaz.items import ChinazItem


class ChinazSpider(scrapy.Spider):
    name = "chinaz"
    # allowed_domains = ["chinaz.com"]

    # Read the domains to query from the file 'url' (one per line)
    # and turn each one into a whois.chinaz.com lookup URL.
    with open('url') as f:
        start_urls = ['http://whois.chinaz.com/' + line.strip()
                      for line in f if line.strip()]

    @staticmethod
    def clean_str(data):
        # Strip whitespace (including '\r') and drop empty strings.
        data = [t.strip() for t in data]
        return [t for t in data if t != '']

    def parse(self, response):
        # URL form: http://whois.chinaz.com/<domain>
        domain = response.url.split('/')[3].strip()
        regists = []
        dates = []
        dnss = []
        contects = []
        status = []

        # Registrar, date and DNS fields share the same container class.
        for sel in response.xpath('//ul/li/div[@class="fr WhLeList-right"]'):
            regist = sel.xpath('./div[@class="block ball"]/span/text()').extract()
            date = sel.xpath('./span/text()').extract()
            dns = sel.xpath('./text()').extract()
            if regist:
                regists.append(self.clean_str(regist))
            if date:
                dates.append(self.clean_str(date))
            if dns:
                dnss.append(self.clean_str(dns))

        # Contact information.
        for sel in response.xpath('//ul/li/div[@class="fr WhLeList-right block ball lh24"]'):
            contect = sel.xpath('./span/text()').extract()
            if contect:
                contects.append(self.clean_str(contect))

        # Domain status.
        for sel in response.xpath('//ul/li/div[@class="fr WhLeList-right clearfix"]'):
            state = sel.xpath('./p[@class="lh30 pr tip-sh"]/span/text()').extract()
            if state:
                status.append(self.clean_str(state))

        dates = [' '.join(t) for t in dates]
        # When four date entries are found, drop the first one.
        if len(dates) == 4:
            dates.pop(0)

        item = ChinazItem()
        item['regist'] = [' '.join(t) for t in regists]
        item['date'] = dates
        item['dns'] = ' '.join([' '.join(t) for t in dnss])
        item['contect'] = [' '.join(t) for t in contects]
        item['domain'] = domain
        item['status'] = ' '.join([' '.join(t) for t in status])
        return item

2. items.py

import scrapy
class ChinazItem(scrapy.Item):
    regist = scrapy.Field()
    date = scrapy.Field()
    dns = scrapy.Field()
    contect = scrapy.Field()
    domain = scrapy.Field()
    status = scrapy.Field()

3. pipelines.py

class ChinazPipeline:

    def process_item(self, item, spider):
        # The contact field may be missing; fall back to a single space.
        contect = item['contect'][0] if item['contect'] else ' '
        # An empty 'regist' or fewer than two 'date' entries still raises IndexError.
        line = [item['regist'][0], item['date'][0], item['date'][1],
                item['dns'], contect, item['domain'], item['status']]
        # Append one tab-separated record per domain.
        with open('/tmp/domain_new.txt', 'a', encoding='utf-8') as fp:
            fp.write('\t'.join(line) + '\n')
        return item

4. Enable this setting in settings.py

ITEM_PIPELINES = {
    'chinaz.pipelines.ChinazPipeline': 300,
}

Once all of the steps above are done, start crawling.

Go to the project directory and run:

cd chinaz
scrapy crawl chinaz

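【Troubleshooting】

Error: list index out of range

This error usually has one of two causes.
1. The index is beyond the end of the list.
2. The list is empty and contains no elements at all.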
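
Both causes can be guarded against by checking the list before indexing it. The following is a minimal sketch; the helper name get_or_default and the sample date string are hypothetical and not part of the code above.

```python
def get_or_default(values, index, default=' '):
    """Return values[index], or default when the list is empty or too short."""
    try:
        return values[index]
    except IndexError:
        return default


dates = ['2012-12-18']                  # only one entry, so dates[1] would raise IndexError
print(get_or_default(dates, 0))         # 2012-12-18
print(repr(get_or_default(dates, 1)))   # ' '
print(repr(get_or_default([], 0)))      # ' '
```

A helper like this could replace direct accesses such as item['date'][1] in the pipeline, at the cost of writing a placeholder instead of failing loudly when the page layout changes.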

Reference: https://blog.csdn.net/qq_43082153/article/details/108579168