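Reference: http://www.xiaoxiaoguo.cn/python/scrapy-chinaz.html
【Requirements】
The registries for some domain suffixes do not run a WHOIS server and only offer a web-based WHOIS lookup. These notes try to scrape that WHOIS information with Scrapy.
1. Understand XPath, which is needed to locate the content to scrape.
2. Understand items and pipelines, which make it easier to store and process the scraped content.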
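To make the XPath prerequisite concrete, here is a minimal sketch that uses only the Python standard library, so it runs without Scrapy installed. The page content is made up for illustration, and xml.etree.ElementTree supports only a subset of XPath, whereas Scrapy selectors accept full XPath 1.0 expressions such as /html/body/pre.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a lookup page (made up for this sketch).
PAGE = """<html>
  <body>
    <h1>WHOIS lookup</h1>
    <pre>Domain: example.to
Created on: Mon Jan 01 00:00:00 2001</pre>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Path relative to <html>; the ElementTree counterpart of /html/body/pre.
pre = root.find("./body/pre")
print(pre.text)

# ".//pre" matches <pre> elements at any depth below the root.
all_pre = root.findall(".//pre")
print(len(all_pre))
```

In a Scrapy callback the equivalent lookup would be written against the response object instead of an ElementTree root.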
【Setup】
Create a new Scrapy project, for example one named whois_web, by running the command below.
Note: Project names must begin with a letter and contain only letters, numbers and underscores
scrapy startproject whois_web
Output:
New Scrapy project 'whois_web', using template directory '/home/klaudius/.local/lib/python3.8/site-packages/scrapy/templates/project', created in:
    /home/klaudius/whois_web

You can start your first spider with:
    cd whois_web
    scrapy genspider example example.com
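The naming rule quoted in the note can be checked before running the command. The sketch below encodes the rule exactly as it is worded; is_valid_project_name is a hypothetical helper written for these notes, not Scrapy's own validation code, which may differ in detail.

```python
import re

# Assumption: this pattern follows the wording of the note
# (a letter first, then letters, digits or underscores).
_NAME_RULE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_project_name(name: str) -> bool:
    """Return True if name starts with a letter and contains only
    letters, digits and underscores."""
    return _NAME_RULE.fullmatch(name) is not None


for candidate in ("whois_web", "whois-web", "2whois"):
    print(candidate, is_valid_project_name(candidate))
```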
By default the project is laid out as follows:
whois_web
├── whois_web
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
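scrapy.cfg is the configuration file for the whole project; it does not need to be changed here.
The whois_web/whois_web/spiders/ directory holds the spider programs.
【Analysis】
Analyzing the target site has two goals:
1. Work out how to submit lookups for multiple domains
2. Find the XPath of the information to extract
For example, the public WHOIS server for .TO returns only DNS server information, whereas the web lookup also returns date information: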
https://www.tonic.to/whois?register.to
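For the first goal, the example URL suggests that a lookup is simply the domain appended after a question mark. The sketch below builds one URL per domain on that assumption and recovers the domain from a URL again. build_lookup_url and domain_from_url are hypothetical helper names, lowercasing the input is an assumption, and example.to is a placeholder domain.

```python
from urllib.parse import urlsplit

# Base of the lookup URL, taken from the register.to example.
LOOKUP_BASE = "https://www.tonic.to/whois"


def build_lookup_url(domain: str) -> str:
    """Build the lookup URL for one domain (assumed form: base?domain)."""
    # Assumption: the site accepts lowercase domain names.
    return f"{LOOKUP_BASE}?{domain.strip().lower()}"


def domain_from_url(url: str) -> str:
    """Recover the domain from a lookup URL; it is the whole query string."""
    return urlsplit(url).query


# One domain per entry, as it might be read from a text file.
domains = ["register.to", " Example.TO\n", ""]
start_urls = [build_lookup_url(d) for d in domains if d.strip()]

for url in start_urls:
    print(url, "->", domain_from_url(url))
```

A list built this way could serve as a spider's start_urls, and domain_from_url could be applied to response.url inside the parse callback.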
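Open the register.to lookup URL above in Google Chrome, then right-click the page and choose Inspect.
In the inspector, right-click the WHOIS data to be scraped and choose Copy > Copy XPath.
This shows that all of the content sits under /html/body/pre, which contains: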
<pre>
Domain: register.to
Created on: Tue Dec 18 15:28:03 2012
Last edited on: Wed Jul 13 10:14:22 2022
Expires on: Tue Dec 18 15:28:03 2131
Primary host add: 162.159.8.97
Primary host name: NS11.REGISTER.TO
Secondary host add: 162.159.9.187
Secondary host name: NS12.REGISTER.TO
END
</pre>
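Once the text of the pre element has been extracted, it still has to be split into fields. The following is a minimal sketch that assumes the page puts one "label: value" pair on each line and ends with an END marker, as the sample suggests. parse_whois_text is a hypothetical helper name, and the date format string is inferred from the sample only.

```python
from datetime import datetime

# Text of /html/body/pre for register.to, one field per line (assumed layout).
PRE_TEXT = """Domain: register.to
Created on: Tue Dec 18 15:28:03 2012
Last edited on: Wed Jul 13 10:14:22 2022
Expires on: Tue Dec 18 15:28:03 2131
Primary host add: 162.159.8.97
Primary host name: NS11.REGISTER.TO
Secondary host add: 162.159.9.187
Secondary host name: NS12.REGISTER.TO
END
"""


def parse_whois_text(text: str) -> dict:
    """Split 'label: value' lines into a dict; lines without a colon are skipped."""
    record = {}
    for line in text.splitlines():
        # Split at the first colon only, so times such as 15:28:03 stay intact.
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    return record


record = parse_whois_text(PRE_TEXT)

# Assumption: dates always use this layout; %a and %b expect English names.
created = datetime.strptime(record["Created on"], "%a %b %d %H:%M:%S %Y")

print(record["Domain"], created.isoformat())
```

Splitting at the first colon only is the key design choice here, because the timestamps themselves contain colons.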
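【Code】
Here is the code. It targets whois.chinaz.com and uses chinaz as the package and spider name; in a project created as whois_web, the import path would be whois_web.items and the pipeline path whois_web.pipelines.
1. whois_web_spider.py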
#!/usr/bin/env python3
import scrapy
from chinaz.items import ChinazItem


class DmozSpider(scrapy.Spider):
    name = "chinaz"
    # allowed_domains = ["chinaz.com"]

    # Build the start URLs from the file "url" (one domain per line).
    start_urls = []
    with open('url') as f:
        for line in f:
            line = line.strip()
            if line:
                start_urls.append('http://whois.chinaz.com/' + line)

    @staticmethod
    def clean_str(data):
        # Strip surrounding whitespace (including \r and \n) and drop empty strings.
        data = [t.strip() for t in data]
        return [t for t in data if t != '']

    def parse(self, response):
        # The queried domain is the first path segment of the request URL.
        domain = response.url.split('/')[3].strip()
        regists = []
        dates = []
        dnss = []
        contects = []
        status = []

        # Registrar, dates and DNS servers share the same container class.
        for sel in response.xpath('//ul/li/div[@class="fr WhLeList-right"]'):
            regist = sel.xpath('./div[@class="block ball"]/span/text()').extract()
            date = sel.xpath('./span/text()').extract()
            dns = sel.xpath('./text()').extract()
            if regist:
                regists.append(self.clean_str(regist))
            if date:
                dates.append(self.clean_str(date))
            if dns:
                dnss.append(self.clean_str(dns))

        # Contact information.
        for sel in response.xpath('//ul/li/div[@class="fr WhLeList-right block ball lh24"]'):
            contect = sel.xpath('./span/text()').extract()
            if contect:
                contects.append(self.clean_str(contect))

        # Domain status.
        for sel in response.xpath('//ul/li/div[@class="fr WhLeList-right clearfix"]'):
            state = sel.xpath('./p[@class="lh30 pr tip-sh"]/span/text()').extract()
            if state:
                status.append(self.clean_str(state))

        dates = [' '.join(t) for t in dates]
        # When four date entries are found, the first one is discarded.
        if len(dates) == 4:
            dates.remove(dates[0])

        item = ChinazItem()
        item['regist'] = [' '.join(t) for t in regists]
        item['date'] = dates
        item['dns'] = ' '.join([' '.join(t) for t in dnss])
        item['contect'] = [' '.join(t) for t in contects]
        item['domain'] = domain
        item['status'] = ' '.join([' '.join(t) for t in status])
        return item
2. items.py
import scrapy


class ChinazItem(scrapy.Item):
    regist = scrapy.Field()
    date = scrapy.Field()
    dns = scrapy.Field()
    contect = scrapy.Field()
    domain = scrapy.Field()
    status = scrapy.Field()
3. pipelines.py
class ChinazPipeline(object):
    def process_item(self, item, spider):
        try:
            line = [item['regist'][0], item['date'][0], item['date'][1], item['dns'],
                    item['contect'][0], item['domain'], item['status']]
        except IndexError:
            # Fallback for a missing contact only: if 'regist' is empty or 'date'
            # has fewer than two entries, the IndexError is raised again here.
            line = [item['regist'][0], item['date'][0], item['date'][1], item['dns'],
                    ' ', item['domain'], item['status']]
        line = ' '.join(line) + '\n'
        # Append one line per item to the output file.
        with open('/tmp/domain_new.txt', 'a') as fp:
            fp.write(line)
        return item
4. Enable this setting in settings.py
ITEM_PIPELINES = {
    'chinaz.pipelines.ChinazPipeline': 300,
}
Once all of the above is in place, start crawling.
Go into the project directory and run:
cd chinaz
scrapy crawl chinaz
【Troubleshooting】
Error: list index out of range
This error has two main causes:
1. The index is beyond the end of the list.
2. The list is empty and has no elements at all.
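Both causes can be avoided by checking the list length before indexing. The sketch below shows one way to do that; get_or_default is a hypothetical helper, and the sample item is made up to resemble a page where no registrar or contact was found.

```python
def get_or_default(values, index, default=" "):
    """Return values[index], or default when the list is empty or too short."""
    if -len(values) <= index < len(values):
        return values[index]
    return default


# Made-up item: the 'regist' and 'contect' lists came back empty.
item = {
    "regist": [],
    "date": ["2012-12-18", "2022-07-13"],
    "contect": [],
}

fields = [
    get_or_default(item["regist"], 0),
    get_or_default(item["date"], 0),
    get_or_default(item["date"], 1),
    get_or_default(item["contect"], 0),
]
print(fields)
```

Using a placeholder keeps the number of columns in the output line constant, at the cost of hiding the fact that a field was missing, so logging the affected domain may also be worthwhile.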
Reference: https://blog.csdn.net/qq_43082153/article/details/108579168