Python Crawler: Downloading Images from Lifeofpix

Tools

This is the first crawler I implemented. It scrapes the lifeofpix site and downloads every image with more than 500 downloads. The functionality is deliberately simple and the code is small, but it demonstrates the basic principles and workflow of a web crawler. I used Python 3.4.6, the requests library for HTTP, and the BeautifulSoup library for data extraction.

Preparation

The crawler should disguise itself as a browser by setting a User-Agent:

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) '
                  'Gecko/20100101 Firefox/52.0'
}

First create the folder that will hold the images, using the os module:

path = os.getcwd()
path = os.path.join(path, 'images')
if not os.path.exists(path):
    os.mkdir(path)

The page-processing flow is: fetch the page's HTML with the requests module, then extract the needed information based on tag features, typically with the BeautifulSoup library:

html = requests.get(url, headers=headers).content
soup = BeautifulSoup(html, 'lxml')
# further processing...
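This fetch-then-parse flow can be tried in isolation with a hard-coded HTML snippet standing in for a downloaded page (the snippet below is illustrative, not the site's real markup; html.parser is used so the sketch has no lxml dependency):

```python
from bs4 import BeautifulSoup

# A hard-coded snippet standing in for the HTML returned by requests
html = '''
<div class="total">12</div>
<a class="clickarea overlay" href="http://www.lifeofpix.com/photo/bouquet/">x</a>
'''

soup = BeautifulSoup(html, 'html.parser')
# Extract elements by tag name and attributes, as the crawler does
pages_total = int(soup.find('div', attrs={'class': 'total'}).getText())
links = [a['href'] for a in soup.find_all('a', attrs={'class': 'clickarea overlay'})]
print(pages_total)  # 12
print(links)        # ['http://www.lifeofpix.com/photo/bouquet/']
```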

Pagination

The home page turns out to be equivalent to page 1, and the total page count is shown at the bottom; page n lives at http://www.lifeofpix.com/page/n/. The function that walks through the pages in order:

def home_page(url, pages=10):
    html = requests.get(url, headers=headers).content
    soup = BeautifulSoup(html, 'lxml')
    # Read the total page count from the page footer
    pages_total = int(soup.find('div', attrs={'class': 'total'}).getText())
    if pages > pages_total:
        pages = pages_total

    # Loop over the clamped page count, not pages_total
    for i in range(1, pages + 1):
        page_url = HOME_URL + 'page/' + str(i) + '/'
        page_n(page_url)
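The page-URL construction can be checked on its own; HOME_URL here is an assumed constant matching the base address described above:

```python
# Assumed base URL, matching the page pattern described above
HOME_URL = 'http://www.lifeofpix.com/'

# Build the URLs for the first three listing pages
page_urls = [HOME_URL + 'page/' + str(i) + '/' for i in range(1, 4)]
print(page_urls[0])  # http://www.lifeofpix.com/page/1/
```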

Processing a single page

Images are stacked top to bottom on each listing page; clicking an image opens a detail page with its statistics: likes, downloads, and views. The following function inspects every image on a page, filters out those with more than 500 downloads, and downloads them:

def page_n(url):
    html = requests.get(url, headers=headers).content
    soup = BeautifulSoup(html, 'lxml')
    image_info_total = soup.find_all('a', attrs={'class': 'clickarea overlay'})
    for item in image_info_total:
        image_info_url = item['href']
        if 'lifeofpix' in image_info_url:
            # Fetch the image's detail page for its statistics
            image_url, likes, downloads, views = image_info(image_info_url)
            # Download the image if it meets the threshold
            if downloads > 500:
                filename = get_filename(image_info_url, image_url)
                download_image(image_url, filename)
                print(' ', filename, 'saved to disk')

Extracting image information

Extract the image's download link and its like, download, and view counts:

def image_info(url):
    html = requests.get(url, headers=headers).content
    soup = BeautifulSoup(html, 'lxml')
    image_url = soup.find('img', attrs={'id': 'pic'})['src']
    image_data = soup.find('div', attrs={'class': 'col-md-3 col-md-offset-1 data'})
    # Default to 0 so a missing stat cannot leave a name unbound
    image_likes = image_downloads = image_views = 0
    for div in image_data.find_all('div'):
        image_detail = div.getText()
        # Counts above 1 use the plural form, but the singular is possible too,
        # so match on the stem; re.sub strips like(s)/download(s)/view(s),
        # leaving only the digits, which are then converted to int
        if 'like' in image_detail:
            image_likes = int(re.sub(r'\D', '', image_detail))
        if 'download' in image_detail:
            image_downloads = int(re.sub(r'\D', '', image_detail))
        if 'view' in image_detail:
            image_views = int(re.sub(r'\D', '', image_detail))
    return image_url, image_likes, image_downloads, image_views
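The digit-extraction trick works regardless of singular or plural wording, since re.sub with \D simply removes every non-digit character; a quick standalone check (count_from_text is a hypothetical helper, not part of the crawler):

```python
import re

def count_from_text(text):
    # Strip every non-digit character, then parse what remains
    return int(re.sub(r'\D', '', text))

# Works for both singular and plural labels
print(count_from_text('1 download'))     # 1
print(count_from_text('512 downloads'))  # 512
```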

Saving images

First, construct the filename:

# Image detail URL:   http://www.lifeofpix.com/photo/bouquet/
# Image download URL: http://www.lifeofpix.com/wp-content/uploads/2017/04/summer.jpg
# Resulting filename: bouquet.jpg
def get_filename(info_url, down_url):
    prefix = info_url.split('/')[-2]
    suffix = down_url.split('.')[-1]
    filename = 'images/' + prefix + '.' + suffix
    # If a file with this name already exists, append a numeric tag
    num = 0
    while os.path.exists(filename):
        filename = 'images/' + prefix + str(num) + '.' + suffix
        num = num + 1
    return filename
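The naming logic can be exercised without touching the disk by swapping os.path.exists for membership in a set of already-used names (make_filename is a hypothetical stand-in for get_filename above):

```python
def make_filename(info_url, down_url, existing):
    # Same construction as get_filename, but checks a set instead of the filesystem
    prefix = info_url.split('/')[-2]
    suffix = down_url.split('.')[-1]
    filename = 'images/' + prefix + '.' + suffix
    num = 0
    while filename in existing:
        filename = 'images/' + prefix + str(num) + '.' + suffix
        num += 1
    return filename

info = 'http://www.lifeofpix.com/photo/bouquet/'
down = 'http://www.lifeofpix.com/wp-content/uploads/2017/04/summer.jpg'
print(make_filename(info, down, set()))                   # images/bouquet.jpg
print(make_filename(info, down, {'images/bouquet.jpg'}))  # images/bouquet0.jpg
```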

Writing the image file to disk uses the shutil module:

def download_image(down_url, filename):
    # stream=True avoids loading the whole image into memory at once;
    # the same headers keep the browser disguise consistent
    response = requests.get(down_url, headers=headers, stream=True)
    with open(filename, 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response

And with that, a simple crawler is complete; the full code is here. Note that the crawler may stop working if the site is redesigned, in which case adjust it to match the live markup.