This is the first crawler I implemented: it scrapes images from lifeofpix that have more than 500 downloads. Its functionality is very simple and the code is small, but it demonstrates the basic principles and workflow of a web crawler. I used Python 3.4.6, with the requests library for HTTP and the BeautifulSoup library for data extraction.
Preparation
The crawler needs to masquerade as a browser by setting a User-Agent:
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) \
Gecko/20100101 Firefox/52.0'
}
```
First create the folder that will hold the images, using the os module:
```python
path = os.getcwd()
path = os.path.join(path, 'images')
if not os.path.exists(path):
    os.mkdir(path)
```
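An equivalent, slightly more compact variant uses `os.makedirs` with `exist_ok=True` (available since Python 3.2, so fine on 3.4.6), which skips the explicit existence check:

```python
import os

# Build the target path relative to the current working directory.
path = os.path.join(os.getcwd(), 'images')

# exist_ok=True makes this a no-op if the folder already exists,
# so the call is safe to repeat.
os.makedirs(path, exist_ok=True)
```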
The page-processing flow is: first fetch the page's HTML with the requests module, then extract the needed information based on tag features, typically with the BeautifulSoup library:
```python
html = requests.get(url, headers=headers).content
soup = BeautifulSoup(html, 'lxml')
```
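To illustrate the extraction step in isolation, here is a minimal sketch that parses an inline HTML fragment (invented for the demo) instead of a live response, using the stdlib `'html.parser'` backend so no lxml install is needed:

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a fetched page.
html = ('<div class="total">12</div>'
        '<a class="clickarea overlay" href="/photo/x/">pic</a>')

soup = BeautifulSoup(html, 'html.parser')

# Locate elements by tag name and attributes, as the crawler does.
total = soup.find('div', attrs={'class': 'total'}).getText()
link = soup.find('a', attrs={'class': 'clickarea overlay'})['href']
```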
Pagination
Looking at the site, the home page is equivalent to page 1, and the total number of pages is shown at the bottom; page n has the URL http://www.lifeofpix.com/page/n/. The function that processes each page in turn:
```python
def home_page(url, pages=10):
    html = requests.get(url, headers=headers).content
    soup = BeautifulSoup(html, 'lxml')
    # Total page count shown at the bottom of the home page
    pages_total = int(soup.find('div', attrs={'class': 'total'}).getText())
    if pages > pages_total:
        pages = pages_total
    for i in range(1, pages + 1):
        page_url = HOME_URL + 'page/' + str(i) + '/'
        page_n(page_url)
```
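The page URLs the loop builds are plain string concatenations; assuming `HOME_URL` is `'http://www.lifeofpix.com/'` (inferred from the URL pattern above), the first few pages come out as:

```python
# Base URL constant, assumed to match the site's page URL pattern.
HOME_URL = 'http://www.lifeofpix.com/'

# Build the URLs for the first three pages, as home_page() does.
page_urls = [HOME_URL + 'page/' + str(i) + '/' for i in range(1, 4)]
```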
Processing a single page
Each page lists images top to bottom; clicking an image opens its detail page, which shows the like count (likes), download count (downloads), and view count (views). The function below checks every image on a page, filters out those with more than 500 downloads, and saves them:
```python
def page_n(url):
    html = requests.get(url, headers=headers).content
    soup = BeautifulSoup(html, 'lxml')
    image_info_total = soup.find_all('a', attrs={'class': 'clickarea overlay'})
    for item in image_info_total:
        image_info_url = item['href']
        if 'lifeofpix' in image_info_url:
            url, likes, downloads, views = image_info(image_info_url)
            if downloads > 500:
                filename = get_filename(image_info_url, url)
                download_image(url, filename)
                print(' ', filename, 'saved to disk')
```
Fetching image information
Get the image's download link, like count, download count, and view count:
```python
def image_info(url):
    html = requests.get(url, headers=headers).content
    soup = BeautifulSoup(html, 'lxml')
    image_url = soup.find('img', attrs={'id': 'pic'})['src']
    image_data = soup.find('div', attrs={'class': 'col-md-3 col-md-offset-1 data'})
    for div in image_data.find_all('div'):
        image_detail = div.getText()
        if 'like' in image_detail:
            image_likes = int(re.sub(r'\D', '', image_detail))
        if 'download' in image_detail:
            image_downloads = int(re.sub(r'\D', '', image_detail))
        if 'view' in image_detail:
            image_views = int(re.sub(r'\D', '', image_detail))
    return image_url, image_likes, image_downloads, image_views
```
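The `re.sub(r'\D', '', ...)` trick strips every non-digit character, so a label like "1,234 downloads" reduces to the bare number. A quick sketch with invented sample strings:

```python
import re

def count_from(text):
    # Remove everything that is not a digit, then parse the remainder.
    return int(re.sub(r'\D', '', text))

# Hypothetical labels in the shape the detail page uses.
likes = count_from('87 likes')
downloads = count_from('1,234 downloads')
```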
Saving images
First construct the filename:
```python
def get_filename(info_url, down_url):
    prefix = info_url.split('/')[-2]
    suffix = down_url.split('.')[-1]
    filename = 'images/' + prefix + '.' + suffix
    num = 0
    # Append a counter if a file with this name already exists
    while os.path.exists(filename):
        filename = 'images/' + prefix + str(num) + '.' + suffix
        num = num + 1
    return filename
```
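The prefix/suffix extraction relies on simple string splitting; with a made-up pair of URLs in the shapes `get_filename()` expects, it behaves like this:

```python
# Invented example URLs (hypothetical, for illustration only).
info_url = 'http://www.lifeofpix.com/photo/morning-fog/'
down_url = 'http://www.lifeofpix.com/images/morning-fog.jpg'

prefix = info_url.split('/')[-2]   # last path segment before the trailing slash
suffix = down_url.split('.')[-1]   # file extension after the final dot
filename = 'images/' + prefix + '.' + suffix
```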
Saving the image file uses the shutil module:
```python
def download_image(down_url, filename):
    response = requests.get(down_url, stream=True)
    with open(filename, 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
```
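`shutil.copyfileobj` streams `response.raw` to disk in chunks rather than loading the whole image into memory, and the same call works on any readable file-like object. A network-free sketch of the copy step, with a `BytesIO` buffer standing in for the raw response and a throwaway demo filename:

```python
import io
import shutil

# Stand-in for response.raw: any readable binary file-like object works.
src = io.BytesIO(b'fake image bytes')

# Stream the source to disk in chunks, as download_image() does.
with open('demo.bin', 'wb') as out_file:
    shutil.copyfileobj(src, out_file)
```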
That completes a simple crawler; the full code is here. Note that the crawler may stop working if the site is redesigned; adjust it to match the actual page structure as needed.