Tools
This is the first crawler I've written: it scrapes images from the lifeofpix website that have been downloaded more than 500 times. Its functionality is very simple and the code is short, so it illustrates the basic principles and workflow of a web crawler. I used Python 3.4.6, with the requests library for HTTP and the BeautifulSoup library for data extraction.
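All of the snippets below assume these imports at the top of the script (each module is used somewhere in the code that follows):

```python
import os       # create the images folder, check for filename collisions
import re       # strip non-digits from the like/download/view labels
import shutil   # copy the HTTP response stream to a file

import requests                 # fetch pages and images
from bs4 import BeautifulSoup   # parse HTML
```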
Preparation
The crawler needs to masquerade as a browser by setting a User-Agent header:
```python
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) '
                         'Gecko/20100101 Firefox/52.0'}
```
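To confirm the header is actually being sent, one option (not part of the original script) is to bounce a request off the public httpbin.org echo service:

```python
# httpbin.org echoes back the headers it received, User-Agent included.
print(requests.get('https://httpbin.org/headers', headers=headers).json())
```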
First, create the folder that will hold the images, using the os module:
```python
path = os.getcwd()
path = os.path.join(path, 'images')
if not os.path.exists(path):
    os.mkdir(path)
```
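As an aside, on Python 3.2 and later the two steps collapse into one call; `exist_ok=True` makes `os.makedirs` silent when the directory already exists:

```python
# Equivalent one-liner on Python 3.2+.
os.makedirs(os.path.join(os.getcwd(), 'images'), exist_ok=True)
```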
Processing a page works in two steps: fetch the page's HTML with the requests module, then extract the needed information based on tag features, typically with the BeautifulSoup library:
```python
html = requests.get(url, headers=headers).content
soup = BeautifulSoup(html, 'lxml')
```
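To illustrate the extraction step in isolation, here is the same pattern run on a hand-written snippet; the class names mirror the ones used in the functions below and are not re-verified against the live site:

```python
snippet = ('<div class="total">12</div>'
           '<a class="clickarea overlay" href="/photo/x/">x</a>')
soup = BeautifulSoup(snippet, 'lxml')
print(soup.find('div', attrs={'class': 'total'}).getText())           # 12
print(soup.find('a', attrs={'class': 'clickarea overlay'})['href'])   # /photo/x/
```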
Handling pagination
Looking at the home page, it turns out to be equivalent to page 1, and the total page count is shown at the bottom; page n lives at http://www.lifeofpix.com/page/n/. The function that processes the pages one by one:
```python
def home_page(url, pages=10):
    html = requests.get(url, headers=headers).content
    soup = BeautifulSoup(html, 'lxml')

    # Read the total page count from the page footer and cap the request at it.
    pages_total = int(soup.find('div', attrs={'class': 'total'}).getText())
    if pages > pages_total:
        pages = pages_total

    for i in range(1, pages + 1):   # honor the capped page count
        page_url = HOME_URL + 'page/' + str(i) + '/'
        page_n(page_url)
```
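One thing the snippet leaves implicit: `HOME_URL` is never defined here. Based on the page URL pattern above, a plausible entry point looks like this (the constant's exact value is inferred, not taken from the original code):

```python
HOME_URL = 'http://www.lifeofpix.com/'   # inferred from the page URL pattern above

if __name__ == '__main__':
    home_page(HOME_URL, pages=10)
```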
Processing a single page
A single page lists its images top to bottom; only by clicking through to an image's detail page can you see its statistics: likes, downloads, and views. The function below inspects every image on a page, filters for those with more than 500 downloads, and downloads them:
```python
def page_n(url):
    html = requests.get(url, headers=headers).content
    soup = BeautifulSoup(html, 'lxml')
    image_info_total = soup.find_all('a', attrs={'class': 'clickarea overlay'})
    for item in image_info_total:
        image_info_url = item['href']
        if 'lifeofpix' in image_info_url:
            image_url, likes, downloads, views = image_info(image_info_url)
            if downloads > 500:
                filename = get_filename(image_info_url, image_url)
                download_image(image_url, filename)
                print('  ', filename, 'saved to disk')
```
Getting image information
Extract the image's download link and its like, download, and view counts:
```python
def image_info(url):
    html = requests.get(url, headers=headers).content
    soup = BeautifulSoup(html, 'lxml')
    image_url = soup.find('img', attrs={'id': 'pic'})['src']
    image_data = soup.find('div', attrs={'class': 'col-md-3 col-md-offset-1 data'})

    # Default to 0 so the return does not fail if a counter is missing.
    image_likes = image_downloads = image_views = 0
    for div in image_data.find_all('div'):
        image_detail = div.getText()
        if 'like' in image_detail:
            image_likes = int(re.sub(r'\D', '', image_detail))
        if 'download' in image_detail:
            image_downloads = int(re.sub(r'\D', '', image_detail))
        if 'view' in image_detail:
            image_views = int(re.sub(r'\D', '', image_detail))
    return image_url, image_likes, image_downloads, image_views
```
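The `re.sub(r'\D', '', ...)` calls strip every non-digit character, so a counter label collapses to the bare number (the label text below is made up for illustration):

```python
print(int(re.sub(r'\D', '', '1,024 downloads')))  # 1024
```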
Saving images
First, construct the filename:
```python
def get_filename(info_url, down_url):
    prefix = info_url.split('/')[-2]   # photo slug from the info page URL
    suffix = down_url.split('.')[-1]   # file extension, without the dot
    filename = 'images/' + prefix + '.' + suffix

    # Append a counter until the name no longer collides with an existing file.
    num = 0
    while os.path.exists(filename):
        filename = 'images/' + prefix + str(num) + '.' + suffix
        num = num + 1
    return filename
```
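For illustration (hypothetical URLs), the slug comes from the second-to-last path segment of the info URL and the extension from the download URL:

```python
# Hypothetical URLs, for illustration only.
name = get_filename('http://www.lifeofpix.com/photo/sunset/',
                    'http://www.lifeofpix.com/files/sunset-large.jpg')
# -> 'images/sunset.jpg', or 'images/sunset0.jpg' if that file already exists
```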
Saving the image file uses the shutil module:
```python
def download_image(down_url, filename):
    response = requests.get(down_url, stream=True)
    with open(filename, 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
```
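`response.raw` bypasses requests' content decoding and exposes the raw socket stream, which works for plain image responses but is easy to get wrong. A slightly hardened sketch of my own (not from the original code) adds a timeout, raises on HTTP errors, and streams in chunks:

```python
def download_image_safe(down_url, filename):
    # Stream in chunks instead of copying the raw socket; fail loudly on HTTP errors.
    response = requests.get(down_url, stream=True, timeout=30)
    response.raise_for_status()
    with open(filename, 'wb') as out_file:
        for chunk in response.iter_content(chunk_size=8192):
            out_file.write(chunk)
```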
And with that, a simple crawler is complete; the full code is available here. Note that the crawler may break if the site is redesigned; adjust it to match as needed.