How to Scrape a WHOIS Server Dictionary with Python

June 13, 2018 | Category: [Technology]

Reference: http://www.cnblogs.com/zhangmengqin/p/9167358.html

1. Different domain suffixes may or may not share the same WHOIS server.
2. Different WHOIS servers may or may not use the same response format.

Title: WHOIS servers for all kinds of domain suffixes worldwide
URL: https://www.iana.org/domains/root/db

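Goal: use a Python crawler to scrape the WHOIS server data and store it locally as a dictionary. At query time, match the domain suffix against the dictionary keys to get the corresponding WHOIS server, then query that server.
Tool: BeautifulSoup. Its advantage is that you do not have to write regular expressions yourself; you only need to follow its syntax.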
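The lookup idea described above can be sketched in a few lines. This is only an illustration: the table contents, the name `find_whois_server`, and the `example-tld` entry are assumptions made for the example, and the real table would be filled from the IANA pages.

```python
# Illustrative table only; the real one is built by scraping the IANA pages.
WHOIS_SERVERS = {
    "com": "whois.verisign-grs.com",
    "example-tld": "",  # hypothetical suffix with no WHOIS server listed
}


def find_whois_server(domain, servers):
    """Return the WHOIS server for the domain's last label, or None."""
    # Take the text after the last dot, so "www.example.com" maps to "com".
    suffix = domain.strip().rstrip(".").rsplit(".", 1)[-1].lower()
    # Treat an empty entry the same as a missing one.
    return servers.get(suffix) or None
```

Keying on the last label only is a simplification: it ignores registries that run separate servers for second-level suffixes.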

1. Scrape the list of domain suffixes

import requests
from bs4 import BeautifulSoup

iurl = 'https://www.iana.org/domains/root/db'
res = requests.get(iurl, timeout=600)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
# Each suffix on the page sits in a <span class="domain tld"> element.
for tag in soup.find_all('span', class_='domain tld'):
	d_suffix = tag.get_text()
	print(d_suffix)

2. Scrape the WHOIS server for each domain suffix

import re
import time

import requests
from bs4 import BeautifulSoup

iurl = 'https://www.iana.org/domains/root/db'
res = requests.get(iurl, timeout=600)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
# Compile the pattern once; it captures the text after the "WHOIS Server:" label.
retxt = re.compile(r'<b>WHOIS Server:</b> (.*?)\n')
for tag in soup.find_all('span', class_='domain tld'):
	d_suffix = tag.get_text().strip()
	print(d_suffix)
	# ".com" -> "com"
	n_suffix = d_suffix.split('.')[1]
	new_url = iurl + '/' + n_suffix
	server = ''
	try:
		res2 = requests.get(new_url, timeout=600)
		res2.encoding = 'utf-8'
		arr = retxt.findall(res2.text)
		if len(arr) > 0:
			server = arr[0].strip()
		print(server)
		# Pause between requests to avoid hammering the IANA site.
		time.sleep(1)
	except Exception as e:
		# Not only timeouts end up here, so print the actual error.
		print('Request failed:', e)
	# Append one "suffix:server" line per suffix; the server is empty if none was found.
	with open('suffixList.txt', 'a', encoding='utf-8') as my_file:
		my_file.write(n_suffix + ':' + server + '\n')
print('Scraping finished!')

This script takes a long time to run, so you may want to leave it running in the background:

nohup python servers.py &
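The extraction step of the scraping script can be isolated and tried on a small HTML fragment without touching the network. The sketch below is a slightly more tolerant variant of the script's regular expression, not the original one; the function name `extract_whois_server` and the sample fragment are assumptions made for illustration.

```python
import re

# Same label as in the scraping script, but whitespace after the label is
# optional and the capture stops at the first space or tag.
WHOIS_RE = re.compile(r"<b>WHOIS Server:</b>\s*([^\s<]+)")


def extract_whois_server(html):
    """Return the WHOIS host named in a delegation page, or '' if absent."""
    match = WHOIS_RE.search(html)
    return match.group(1) if match else ""


# Hypothetical fragment shaped like the text the script's pattern expects.
SAMPLE = "<p><b>WHOIS Server:</b> whois.verisign-grs.com\n</p>"
```

Returning an empty string for a missing server matches what the script writes to suffixList.txt when nothing is found.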

3. Enter a domain with any suffix and query its WHOIS information

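import socket

# Assumption: the dictionary is built from suffixList.txt written in step 2,
# where each line has the form "suffix:server".
dictionary = {}
with open('suffixList.txt', encoding='utf-8') as f:
	for line in f:
		suffix, _, server = line.strip().partition(':')
		if server:
			dictionary['.' + suffix] = server

temp = input('Enter the domain name to query: ').strip().lower()
# Use the last label as the suffix, so "www.example.com" maps to ".com".
r_suf = '.' + temp.rsplit('.', 1)[-1]
print(r_suf)

whois_server = dictionary.get(r_suf)
print(whois_server)

if whois_server is None:
	print('No WHOIS server found for suffix ' + r_suf)
else:
	# WHOIS is plain text over TCP port 43: send the query followed by CRLF,
	# then read until the server closes the connection.
	s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
	s.settimeout(10)
	s.connect((whois_server, 43))
	s.sendall((temp + '\r\n').encode())
	response = b''
	while True:
		data = s.recv(4096)
		if not data:
			break
		response += data
	s.close()
	print(response.decode('utf-8', errors='replace'))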
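Because different WHOIS servers may use different response formats, any code that reads the response has to be tolerant. The sketch below assumes only that useful lines look like "Key: Value" and that lines starting with "%", "#" or ">>>" are comments or notices; the function name `parse_whois_response` and the sample text are made up for illustration, and servers with other layouts would need their own handling.

```python
def parse_whois_response(text):
    """Collect 'Key: Value' lines into a dict of lowercase key -> list of values."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the comment/notice lines assumed above.
        if not line or line.startswith(("%", "#", ">>>")):
            continue
        key, sep, value = line.partition(":")
        if not sep or not value.strip():
            continue
        # Keys such as "Name Server" can repeat, so keep every value.
        fields.setdefault(key.strip().lower(), []).append(value.strip())
    return fields


# Made-up sample response, not the output of any particular server.
SAMPLE = """\
% This is a comment line
Domain Name: EXAMPLE.COM
Name Server: A.IANA-SERVERS.NET
Name Server: B.IANA-SERVERS.NET
"""
```

Calling `parse_whois_response(SAMPLE)["name server"]` returns both name servers in the order they appear.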