Fetching Problems with urllib3

November 24, 2014

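Preface

  Yesterday, while I was typing urllib in ptipython, code completion suggested a Python package named urllib3, probably installed as a dependency of some other application. I searched for it with pip, found its GitHub page, and read the official introduction, which sounded impressive: it describes the package as a high-performance, thread-safe library that can reuse TCP connections.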

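Test

  Starting from the official benchmark test, I added a urllib2 case. To play to urllib3's connection reuse, I deliberately picked a set of URLs on a single host so that urllib3 could reuse its TCP connection as much as possible. The test demo is as follows: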

    #!/usr/bin/env python
    # Python 2 script: urllib2 does not exist on Python 3.
    from __future__ import print_function
    import sys
    import time
    import urllib
    import urllib2

    # Also look for urllib3 in the parent directory.
    sys.path.append('../')
    import urllib3


    # URLs to download. Doesn't matter as long as they're from the same host, so we
    # can take advantage of connection re-using.
    TO_DOWNLOAD = [
        'http://www.gandong.com/xml/132.xml',
        'http://www.gandong.com/xml/135.xml',
        'http://www.gandong.com/xml/134.xml',
        'http://www.gandong.com/xml/136.xml',
        'http://www.gandong.com/xml/137.xml',
        'http://www.gandong.com/xml/138.xml',
        'http://www.gandong.com/xml/139.xml',
        'http://www.gandong.com/xml/151.xml',
        'http://www.gandong.com/xml/152.xml',
        'http://www.gandong.com/xml/153.xml',
        'http://www.gandong.com/xml/154.xml',
        'http://www.gandong.com/xml/156.xml',
        'http://www.gandong.com/xml/158.xml',
        'http://www.gandong.com/xml/163.xml',
        'http://www.gandong.com/xml/164.xml',
    ]


    def urllib_get(url_list):
        assert url_list
        for url in url_list:
            now = time.time()
            # urlopen() returns once the headers arrive; the body is not read.
            r = urllib.urlopen(url)
            elapsed = time.time() - now
            print("Got in %0.3fs: %s" % (elapsed, url))


    def urllib2_get(url_list):
        assert url_list
        for url in url_list:
            now = time.time()
            # Same as above: only the time to open the response is measured.
            r = urllib2.urlopen(url)
            elapsed = time.time() - now
            print("Got in %0.3fs: %s" % (elapsed, url))


    def pool_get(url_list):
        assert url_list
        # One PoolManager for all requests, so connections can be reused.
        pool = urllib3.PoolManager()
        for url in url_list:
            now = time.time()
            # preload_content defaults to True, so the body is read here.
            r = pool.request('GET', url, assert_same_host=False)
            elapsed = time.time() - now
            print("Got in %0.3fs: %s" % (elapsed, url))


    if __name__ == '__main__':
        print("Running pool_get ...")
        now = time.time()
        pool_get(TO_DOWNLOAD)
        pool_elapsed = time.time() - now

        print("Running urllib_get ...")
        now = time.time()
        urllib_get(TO_DOWNLOAD)
        urllib_elapsed = time.time() - now

        print("Running urllib2_get ...")
        now = time.time()
        urllib2_get(TO_DOWNLOAD)
        urllib2_elapsed = time.time() - now

        print("Completed pool_get in %0.3fs" % pool_elapsed)
        print("Completed urllib_get in %0.3fs" % urllib_elapsed)
        # Report urllib2's own elapsed time, not urllib_elapsed.
        print("Completed urllib2_get in %0.3fs" % urllib2_elapsed)

First run:

    Completed pool_get in 0.504s
    Completed urllib_get in 0.323s
    Completed urllib2_get in 0.323s

Second run:

    Completed pool_get in 0.317s
    Completed urllib_get in 0.457s
    Completed urllib2_get in 0.457s
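
The two runs above rank pool_get and urllib_get in opposite order, so a single total per library is a noisy measure. Below is a minimal sketch of one way to steady the numbers, assuming Python 3: repeat each fetch function several times and compare the minimum and the median. The names `time_runs`, `summarize`, and `fake_get` are hypothetical and not part of the original script; `fake_get` sleeps instead of touching the network.

```python
import statistics
import time


def time_runs(fetch, url_list, repeats=5):
    """Call fetch(url_list) `repeats` times; return each run's wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch(url_list)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (min, median, max) of the samples, in seconds."""
    return min(samples), statistics.median(samples), max(samples)


if __name__ == "__main__":
    # Hypothetical stand-in for a real fetch function: no network involved.
    def fake_get(url_list):
        time.sleep(0.001 * len(url_list))

    low, mid, high = summarize(time_runs(fake_get, ["a", "b", "c"]))
    print("min %0.3fs  median %0.3fs  max %0.3fs" % (low, mid, high))
```

To use it on real downloads, one would pass a fetch function (for example a Python 3 port of pool_get) in place of `fake_get`. The minimum is usually the least affected by network jitter, while the median shows what a typical run looks like.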

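  My network was stable during the test, yet urllib3 was slower than urllib in the first run (0.504s versus 0.323s) and faster in the second (0.317s versus 0.457s). The urllib2 figure matches the urllib figure exactly in both runs because both lines were printed from urllib_elapsed, so these results say nothing about urllib2 itself. I repeated the test several times and the picture stayed roughly the same. In this test, then, urllib3 did not show the high performance its official description promises. My guess is that very few TCP connections were actually reused; in any case there was no clear gain in speed.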
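
Whether connections are really being reused can be checked directly rather than inferred from timings. The sketch below is not urllib3: it is a stand-in that uses only the Python 3 standard library (`http.server` and `http.client`) to start a local server, count how many TCP connections it accepts, and compare one keep-alive connection against a new connection per request. `CountingHandler` and `count_connections` are hypothetical names chosen for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps the connection open by default
    connections = 0  # TCP connections accepted so far

    def setup(self):
        # setup() runs once per accepted TCP connection, not once per request.
        super().setup()
        CountingHandler.connections += 1

    def do_GET(self):
        body = b"<xml/>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def count_connections(reuse, n_requests=5):
    """Send n_requests GETs to a local server; return how many TCP connections it saw."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        shared = http.client.HTTPConnection(host, port) if reuse else None
        for i in range(n_requests):
            conn = shared if shared is not None else http.client.HTTPConnection(host, port)
            conn.request("GET", "/xml/%d.xml" % i)
            conn.getresponse().read()  # drain the body so the socket can be reused
            if shared is None:
                conn.close()
        if shared is not None:
            shared.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


if __name__ == "__main__":
    print("reused connection:", count_connections(reuse=True))
    print("new connection per request:", count_connections(reuse=False))
```

Against a local server the round trip is too short for reuse to show up in timings, which is why the sketch counts connections instead. Over a real network, each avoided TCP handshake saves roughly one round trip per request.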

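Summary

  Compared with urllib, urllib3 brought no speed improvement in my test and was slower in the first run. Its official feature list also mentions secure connections. I did not test that, but if the description is accurate, urllib3 may be a better fit for tasks such as simulated logins to school websites. For basic web fetching, urllib was still the quicker choice in my tests. That said, urllib3's design has its merits: its classes are new-style classes, its naming is clearer and more intuitive, and requests return a response object.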