A Pixiv Daily Ranking Crawler

Principle and Implementation of a Pixiv Daily Ranking Crawler

Principle

It boils down to two words: packet capture.

First, open Fiddler and browse the Pixiv daily ranking as usual, as shown below:

At that point you could simply throw BeautifulSoup at the page and parse the HTML directly. However, I later noticed a capture like the one below, a request that is sent when the ranking loads items 51-100.

It is standard JSON, so no HTML parsing is needed at all. Opening contents, everything is laid out clearly: ID, page count, title, image URL, author, tags and so on; a pile of useful information is already right there. Using it directly is both more efficient than parsing the HTML and richer in information.

The request parameters are also simple: mode=daily does not need to change; p=2 looks like pagination, where 1 means items 1-50, 2 means items 51-100, and so on; format=json is fixed as well. The only parameter that needs some thought is tt=382e...de.
Searching the capture history for 382e...de highlights the daily-ranking page fetched earlier, which means this string can be obtained straight from the page itself.

Searching the page source, it is not hard to find a line that reads pixiv.context.token = "382e...de";

That makes things easy: one regular expression pulls it out. The regex looks roughly like pixiv\.context\.token\s*=\s*"(\w+)"; and after matching, group(1) gives the token.
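As a minimal sketch of this step (get_tt is a hypothetical helper, not part of the full program at the end; any desktop User-Agent should do):

import re
import requests

UA = 'Mozilla/5.0'  # placeholder browser UA string

def get_tt():
    """Fetch the daily ranking page and pull pixiv.context.token out of the HTML."""
    html = requests.get('https://www.pixiv.net/ranking.php?mode=daily',
                        headers={'User-Agent': UA}, timeout=15).text
    match = re.search(r'pixiv\.context\.token\s*=\s*"(\w+)";', html)
    if not match:
        raise RuntimeError('token not found in ranking page')
    return match.group(1)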

Verifying the parameter p confirms the guess, and it also turns out that switching the day only requires changing the date parameter.
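Putting the parameters together, a hedged sketch of fetching one page of the ranking JSON (fetch_ranking_page is a hypothetical helper; the parameter names and headers mirror the captured request and the full program below, and it reuses UA and get_tt from the sketch above):

def fetch_ranking_page(tt, date='20190117', page=1):
    """Request one 50-item page of the daily ranking as JSON."""
    params = {'mode': 'daily', 'date': date, 'p': page, 'format': 'json', 'tt': tt}
    headers = {'User-Agent': UA,
               'X-Requested-With': 'XMLHttpRequest',
               'Referer': 'https://www.pixiv.net/ranking.php?mode=daily'}
    rep = requests.get('https://www.pixiv.net/ranking.php',
                       params=params, headers=headers, timeout=15)
    rep.raise_for_status()
    return rep.json()

# e.g. print id/title/thumbnail of items 51-100
for item in fetch_ranking_page(get_tt(), page=2)['contents']:
    print(item['illust_id'], item['title'], item['url'])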

Next, the images themselves. Let's see what the capture shows when fetching the URL from the screenshot above.

Yes, that works. But note: the Referer header on the left must not be omitted, otherwise you get this:

A rather hostile 403~
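So downloading an image needs a Referer pointing back at pixiv. A minimal sketch, continuing the helpers above (download_image is hypothetical; the Referer value is the same one the full program below uses):

def download_image(url, illust_id, dst_path):
    """Download one image; pixiv's image host returns 403 without a Referer."""
    referer = ('https://www.pixiv.net/member_illust.php'
               '?mode=medium&illust_id=%d' % illust_id)
    rep = requests.get(url, headers={'User-Agent': UA, 'Referer': referer}, timeout=15)
    rep.raise_for_status()
    with open(dst_path, 'wb') as f:
        f.write(rep.content)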

So a thumbnail URL obtained from the JSON data looks like this, as shown above:
https://i.pximg.net/c/240x480/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg

It feels a bit small, alas. Clicking into it gives a clearer view (the earlier screenshot was swapped out because opening it popped up a "you are viewing sensitive images" notice, emm).

At this point the URL becomes:
https://i.pximg.net/c/600x600/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg

See the pattern? Replace the ? in /c/?x?/img-master/... with larger values and you get a sharper image~

Of course, if you register an account and log in, the image gets even bigger:

and the URL is then:
https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg

Clicking to view the original image turns the URL into the one below (no screenshot for this step):
https://i.pximg.net/img-original/img/2019/01/17/23/28/48/72712034_p0.png

To sum up, the image URL formats are pretty much figured out; how large the values after /c/ can go is left for you to discover.
Thumbnail (work out the ? values yourself; 240x480 and 600x600 have already appeared above):
https://i.pximg.net/c/?x?/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg
Large image (larger than 1000px):
https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg
Original image:
https://i.pximg.net/img-original/img/2019/01/17/23/28/48/72712034_p0.png
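For illustration, a rough sketch of deriving the other variants from a master URL by string substitution; to_thumbnail and to_original are hypothetical helpers (not part of the program below), and the original's extension is only a guess, as noted further down:

import re

def to_thumbnail(master_url, size='600x600'):
    """Insert a /c/<size>/ prefix to get a smaller thumbnail."""
    return re.sub(r'(https?://[^/]+)(?:/c/\d+x\d+)?/img-master/',
                  r'\1/c/%s/img-master/' % size, master_url)

def to_original(master_url, ext='jpg'):
    """Guess the original-image URL; the real extension may be jpg or png."""
    url = re.sub(r'/c/\d+x\d+/', '/', master_url)         # drop the size prefix if any
    url = url.replace('/img-master/', '/img-original/')   # switch to the original path
    return re.sub(r'_master\d+\.(?:jpg|png)$', '.%s' % ext, url)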

Replacing the image URL from the JSON is just a matter of one regex; the code is below (this snippet is also included in the full program at the end).
It takes two parameters. One is url, the value obtained from the JSON above, e.g. https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg.
The other is page, which selects which image of a multi-page post to crawl; the first image is 0.

import re
from warnings import warn


def replace_url(url, page):
    # multi-page posts: ..._p<N>_master1200.jpg, swap in the requested page number
    url_pattern = re.compile(r'(?P<schemas>https?)://(?P<host>([^./]+\.)+[^./]+)(/c/\d+x\d+)?'
                             r'(?P<path_prefix>/img-master/img(/\d+){6}/\d+_p)\d+'
                             r'(?P<path_postfix>_(master|square)\d+\.(jpg|png)).*')
    match = re.match(url_pattern, url)
    if match:
        schemas = match.group('schemas')
        host = match.group('host')
        path_prefix = match.group('path_prefix')
        path_postfix = match.group('path_postfix')
        return '%s://%s%s%d%s' % (schemas, host, path_prefix, page, path_postfix)
    # single-image posts without the _p<N> part: only page 0 makes sense
    url_pattern = re.compile(r'(?P<schemas>https?)://(?P<host>([^./]+\.)+[^./]+)(/c/\d+x\d+)?'
                             r'(?P<path_prefix>/img-master/img(/\d+){6}/\d+)'
                             r'(?P<path_postfix>_(master|square)\d+\.(jpg|png)).*')
    match = re.match(url_pattern, url)
    if match:
        schemas = match.group('schemas')
        host = match.group('host')
        path_prefix = match.group('path_prefix')
        path_postfix = match.group('path_postfix')
        if page != 0:
            warn('A non-pageable image url detected, your page should be 0 constantly, but got %d' % page)
        return '%s://%s%s%s' % (schemas, host, path_prefix, path_postfix)

    raise ValueError('The url "%s" could not match any replacement rules' % url)

Only while actually crawling did I notice that some images lack the _p0 in the middle, leaving just 72712034_master1200.jpg. Watch out for that, or it will trip you up before you know it.
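For example, both URL forms run through replace_url like this (the expected outputs follow from the two rules above):

# multi-page post: pick page 2 and drop the /c/240x480 thumbnail prefix
print(replace_url('https://i.pximg.net/c/240x480/img-master/img/'
                  '2019/01/17/23/28/48/72712034_p0_master1200.jpg', 2))
# -> https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p2_master1200.jpg

# single-image post without _p0: only page 0 is valid
print(replace_url('https://i.pximg.net/c/240x480/img-master/img/'
                  '2019/01/17/23/28/48/72712034_master1200.jpg', 0))
# -> https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_master1200.jpg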

A trickier point is that originals can be in PNG format, which is hard to know without being logged in, so extra time gets spent probing whether the original is a jpg or a png.
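One simple workaround is to try both extensions and keep whichever the server accepts. A minimal sketch, reusing UA and the Referer trick from the earlier sketches; find_original is a hypothetical helper, not part of the program below:

import re
import requests

def find_original(original_url, illust_id):
    """Try .jpg first, then .png; return whichever URL the server accepts, or None."""
    referer = ('https://www.pixiv.net/member_illust.php'
               '?mode=medium&illust_id=%d' % illust_id)
    for ext in ('jpg', 'png'):
        url = re.sub(r'\.(?:jpg|png)$', '.' + ext, original_url)
        rep = requests.get(url, headers={'User-Agent': UA, 'Referer': referer},
                           stream=True, timeout=15)
        try:
            if rep.status_code == 200:
                return url
        finally:
            rep.close()
    return None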

Program

A long chunk of Python code, compatible with Python 3.5 and 3.6.
When writing a multithreaded crawler, don't fuss too much over how pretty the code looks overall.
It crawls the large images (not the originals) of the whole daily ranking, the Top 100, uses a Privoxy + Shadowsocks proxy by default (you have to configure the proxy yourself; set proxy = None if you don't need one), and downloads with 5 threads by default.
You need to change save_path to point at your own storage. The code creates 1000 folders and shards images by the trailing digits of their id, e.g. save_path\777 holds only images whose id ends in 777, as sketched just below.
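Concretely, the destination path is computed the same way the worker thread in the program below does it; save_path here is just a placeholder for your own disk:

import os

save_path = '/path/to/PixivRanking'   # set this yourself
illust_id, illust_page = 72712034, 0  # example values from the URLs above
dst_path = os.path.join(save_path, str(illust_id % 1000), '%dp%d.jpg' % (illust_id, illust_page))
# -> /path/to/PixivRanking/34/72712034p0.jpg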

I deploy this code on a Raspberry Pi. To speed things up it does quite a lot of in-memory caching, so it eats about 500 MB of RAM; the daily update takes roughly 20 minutes (about 10 minutes crawling and 10 minutes updating the database).
So far the dataset is about 246 GB and contains roughly 693k files.

Database tables (an example query over them follows this list):
user: users; stores the user id, name, and avatar URL.
illust_series: the series a post belongs to, specified by the author at upload time; stores the series id, creator's user id, title, caption, number of posts in the series, creation time, and the series URL.
illust: illustrations; stores the title, post date, image URL, illust_type (unknown), book_style (unknown), page count, content type (e.g. original, violent, sexually suggestive, etc.), series id (null if none), id, width and height (for multi-page posts these refer to the first page), user id, rating count, view count, upload timestamp, and attr (the string corresponding to the content type).
tag: illustration tags; stores the tag id (autoincrement) and the tag name.
illust_tags: the illustration-tag relation; one illustration maps to many tags and one tag to many illustrations; stores tag id and illustration id.
illust_rank: ranking information for an illustration; stores the illustration id, date, current rank, and yesterday's rank.
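As a quick illustration of how these tables fit together, here is a sketch of pulling one day's top entries together with their authors (just an example against the schema created by DatabaseGenerator below; the stored date format depends on how sqlite3 serializes the Python date, so the filter value is illustrative):

import sqlite3

conn = sqlite3.connect('database.db')  # the file produced by DatabaseGenerator below
rows = conn.execute(
    "select r.rank, i.title, u.user_name, i.view_count "
    "from illust_rank r "
    "join illust i on i.illust_id = r.illust_id "
    "join user u on u.user_id = i.user_id "
    "where r.date = ? order by r.rank limit 10",
    ('2019-01-17',))
for rank, title, user_name, view_count in rows:
    print(rank, title, user_name, view_count)
conn.close()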

import requests
from datetime import datetime, timedelta
import threading
import json
import re
from hashlib import md5
import os
from warnings import warn
import sqlite3
import numpy as np
import pickle
from tqdm import tqdm


ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) ' \
     'Chrome/53.0.2785.143 Safari/537.36'
# path to save images from pixiv
save_path = '/share/disk/ML-TRAINING-SET/PixivRanking'
# save_path = 'd:/ML-TRAINING-SET/PixivRanking'
# path to save ranking data cache
cache_path = os.path.join(save_path, '.cache')
# path to generate sqlite database
db_path = os.path.join(save_path, 'database.db')
# proxy, for those who could not access pixiv directly
proxy = {'https': 'https://localhost:8118'}


def calc_md5(str_data):
    hash_obj = md5()
    hash_obj.update(str_data.encode('utf8'))
    return hash_obj.hexdigest()


def create_dir(path):
    parent = os.path.abspath(path)
    dir_to_create = []
    while not os.path.exists(parent):
        dir_to_create.append(parent)
        parent = os.path.abspath(os.path.join(parent, '..'))
    dir_to_create = dir_to_create[::-1]
    for dir_path in dir_to_create:
        os.mkdir(dir_path)
        print('Directory %s created' % dir_path)


class FileCacher:
    def __init__(self):
        self._cache_files = dict()
        self._lock = threading.RLock()

    def add_cache_dir(self, directory, create_dir_if_not_exist=True):
        with self._lock:
            path = os.path.abspath(directory)
            if os.path.exists(path):
                files = set(os.listdir(path))
            else:
                files = set()
                if create_dir_if_not_exist:
                    create_dir(path)
            self._cache_files[path] = files

    def append_file(self, file_path):
        with self._lock:
            dir_path = os.path.abspath(os.path.join(file_path, '..'))
            files = self._cache_files.get(dir_path, None)
            if files is None:
                warn('%s is not in the cached directory, calling add_cache_dir implicitly' % dir_path)
                self.add_cache_dir(dir_path, True)
                files = self._cache_files.get(dir_path, None)
                assert files is not None
            file_name = os.path.basename(file_path)
            self._cache_files[dir_path].add(file_name)

    def remove_file(self, file_path):
        with self._lock:
            dir_path = os.path.abspath(os.path.join(file_path, '..'))
            files = self._cache_files.get(dir_path, None)
            if files is None:
                warn('%s is not in the cached directory, calling add_cache_dir implicitly' % dir_path)
                self.add_cache_dir(dir_path, True)
                files = self._cache_files.get(dir_path, None)
                assert files is not None
            file_name = os.path.basename(file_path)
            self._cache_files[dir_path].remove(file_name)

    def exist_file(self, file_path):
        with self._lock:
            dir_path = os.path.abspath(os.path.join(file_path, '..'))
            files = self._cache_files.get(dir_path, None)
            if files is None:
                warn('%s is not in the cached directory, calling add_cache_dir implicitly' % dir_path)
                self.add_cache_dir(dir_path, True)
                files = self._cache_files.get(dir_path, None)
            file_name = os.path.basename(file_path)
            return file_name in files

    def exist_dir_in_cache(self, dir_path):
        with self._lock:
            return self._cache_files.get(os.path.abspath(dir_path), None) is not None

    def validate_dir(self, dir_path):
        with self._lock:
            dir_path = os.path.abspath(dir_path)
            files = self._cache_files.get(dir_path, None)
            if files is not None:
                files = set(files)
                actual_files = set(os.listdir(dir_path))
                same_file_count = len(files.intersection(actual_files))
                is_same = len(files) == same_file_count and len(actual_files) == same_file_count
                if not is_same:
                    warn('cache inconsistency detected in directory %s, cleared all cache' % dir_path)
                    self._cache_files[dir_path] = actual_files

    def save(self, file_path):
        with self._lock:
            if not self.exist_file(file_path):
                self.append_file(file_path)
            with open(file_path, 'wb') as f:
                pickle.dump(self._cache_files, f)

    def load(self, file_path, validate_on_load=True):
        with self._lock:
            if os.path.exists(file_path):
                with open(file_path, 'rb') as f:
                    self._cache_files = pickle.load(f)
                if validate_on_load:
                    cache_dirs = list(self._cache_files)
                    print('validating files')
                    for cache_dir in tqdm(cache_dirs, ascii=True):
                        self.validate_dir(cache_dir)
                    print('done')


global_file_cache = FileCacher()


class Cacher:
    def __init__(self, path):
        self._path = path
        # create dir if not exists
        create_dir(self._path)

    def __getitem__(self, item):
        if type(item) != str:
            item = str(item)
        path = os.path.join(self._path, calc_md5(item))
        if not global_file_cache.exist_file(path):
            raise KeyError('Item not exists')
        with open(path, 'rb') as f:
            return f.read()

    def __setitem__(self, key, value):
        if type(key) != str:
            key = str(key)
        path = os.path.join(self._path, calc_md5(key))
        if type(value) == str:
            value = bytes(value, 'utf8')
        elif type(value) != bytes:
            raise TypeError('value should be string or bytes')
        with open(path, 'wb') as f:
            f.write(value)
        global_file_cache.append_file(path)

    def get(self, item, default_item=None):
        try:
            return self.__getitem__(item)
        except KeyError:
            return default_item


class Crawler:
    def __init__(self, save_path_=None, cache_path_=None, nums_thread=5, begin_date=None,
                 max_page=2, max_buffer_size=3000):
        self._num_threads = nums_thread
        self._main_thd = None
        self._main_thd_started = threading.Event()
        self._fetch_finished = None
        self._max_page = max_page
        if begin_date is None or type(begin_date) != datetime:
            begin_date = datetime.fromordinal(datetime.now().date().toordinal()) - timedelta(days=2)
        self._date = begin_date
        self._page = 1
        if not save_path_:
            save_path_ = save_path
        self._save_path = save_path_
        self._cache = Cacher(cache_path_ if cache_path_ else cache_path)
        # handling abort event
        self._abort_event = threading.Event()
        self._abort_wait = []

        # handling variable buffer for main thread
        self._buffer_data = []
        self._buffer_lock = threading.RLock()
        self._buffer_empty = threading.Event()  # an event telling main thread to fetch more data
        self._buffer_empty.set()
        self._max_buffer_size = max_buffer_size

        # creating directory
        for i in range(1000):
            dst_path = os.path.join(save_path, str(i))
            create_dir(dst_path)
            if not global_file_cache.exist_dir_in_cache(dst_path):
                global_file_cache.add_cache_dir(dst_path)

    def _main_thd_cb(self):
        self._abort_wait = []
        self._abort_event.clear()
        self._fetch_finished = False

        try:
            # fetch ranking page
            print('Fetching ranking page (html mode)')
            # external loop for handling retrying
            while not self._abort_event.is_set():
                suc = False
                req = None
                while not suc:
                    if self._abort_event.is_set():
                        return
                    try:
                        req = requests.get('https://www.pixiv.net/ranking.php?mode=daily',
                                           headers={'User-Agent': ua}, proxies=proxy, timeout=15)
                        suc = True
                    except Exception as ex:
                        warn(str(ex))
                rep = req.content.decode('utf8')
                # handling non-200
                if req.status_code != 200:
                    print('HTTP Get failed with response code %d, retry in 0.5s' % req.status_code)
                    # wait 0.5s
                    if self._abort_event.wait(0.5):
                        break
                    # retry the ranking page request
                    continue
                # parse tt
                pattern = re.compile(r'pixiv\.context\.token\s*=\s*"(?P<tt>\w+)";')
                match_result = re.finditer(pattern, rep)
                try:
                    match_result = next(match_result)
                except StopIteration:
                    match_result = None
                if not match_result:
                    print('Could not get tt from html, exited')
                    self._main_thd_started.set()
                    return
                self._tt = match_result.group('tt')
                break
            print('Got tt = "%s"' % self._tt)

            # starting parallel download thread here
            for _ in range(self._num_threads):
                event_to_wait = threading.Event()
                self._abort_wait.append(event_to_wait)
                worker = threading.Thread(target=self._worker_thd_cb, args=(event_to_wait,))
                worker.start()
            self._main_thd_started.set()

            headers = {'X-Requested-With': 'XMLHttpRequest',
                       'Referer': 'https://www.pixiv.net/ranking.php?mode=daily'}
            while self._buffer_empty.wait():
                if self._abort_event.is_set():
                    break

                # fetch from cacher
                key = '%s-p%d' % (self._date.strftime('%Y%m%d'), self._page)
                result = self._cache.get(key)
                if not result:
                    with self._buffer_lock:
                        print('Fetching ranking page(json mode), date=%s, page=%d, buffer=%d/%d' %
                              (str(self._date.date()), self._page, len(self._buffer_data), self._max_buffer_size))
                    params = {'mode': 'daily', 'date': self._date.strftime('%Y%m%d'), 'p': self._page,
                              'format': 'json', 'tt': self._tt}
                    suc = False
                    req = None
                    while not suc:
                        if self._abort_event.is_set():
                            return
                        try:
                            req = requests.get('https://www.pixiv.net/ranking.php', params=params, headers=headers,
                                               proxies=proxy, timeout=15)
                            suc = True
                        except Exception as ex:
                            warn(str(ex))
                    rep = req.content.decode('utf8')
                    # terminated state
                    if req.status_code == 404:
                        break
                    # append to cacher
                    self._cache[key] = rep
                    result = rep
                else:
                    result = result.decode('utf8')

                json_data = json.loads(result)
                buffer_data = self._parse_data(json_data)

                # append to buffer
                with self._buffer_lock:
                    self._buffer_data += buffer_data
                    # check buffer size
                    if len(self._buffer_data) >= self._max_buffer_size:
                        self._buffer_empty.clear()

                # next page
                self._page += 1

                if self._page > self._max_page:
                    self._page = 1
                    self._date -= timedelta(days=1)

        finally:
            print('main thd exited')
            self._fetch_finished = True
            for item in self._abort_wait:
                item.wait()

    def _parse_data(self, data):
        ret_data = []
        if data.get('contents', None):
            contents = data['contents']
            for content in contents:
                url = content['url']
                ranking_date = self._date
                ranking_page = self._page
                illust_id = int(content['illust_id'])
                illust_page_count = int(content['illust_page_count'])
                for page in range(illust_page_count):
                    single_illust_url = self._replace_url(url, page)
                    ret_data.append({'date': ranking_date, 'page': ranking_page,
                                     'illust_id': illust_id, 'illust_page': page,
                                     'url': single_illust_url})
        return ret_data

    @staticmethod
    def _replace_url(url, page):
        url_pattern = re.compile(r'(?P<schemas>https?)://(?P<host>([^./]+\.)+[^./]+)(/c/\d+x\d+)?'
                                 r'(?P<path_prefix>/img-master/img(/\d+){6}/\d+_p)\d+'
                                 r'(?P<path_postfix>_(master|square)\d+\.(jpg|png)).*')
        match = re.match(url_pattern, url)
        if match:
            schemas = match.group('schemas')
            host = match.group('host')
            path_prefix = match.group('path_prefix')
            path_postfix = match.group('path_postfix')
            return '%s://%s%s%d%s' % (schemas, host, path_prefix, page, path_postfix)
        url_pattern = re.compile(r'(?P<schemas>https?)://(?P<host>([^./]+\.)+[^./]+)(/c/\d+x\d+)?'
                                 r'(?P<path_prefix>/img-master/img(/\d+){6}/\d+)'
                                 r'(?P<path_postfix>_(master|square)\d+\.(jpg|png)).*')
        match = re.match(url_pattern, url)
        if match:
            schemas = match.group('schemas')
            host = match.group('host')
            path_prefix = match.group('path_prefix')
            path_postfix = match.group('path_postfix')
            if page != 0:
                warn('A non-pageable image url detected, your page should be 0 constantly, but got %d' % page)
            return '%s://%s%s%s' % (schemas, host, path_prefix, path_postfix)

        raise ValueError('The url "%s" could not match any replacement rules' % url)

    def _worker_thd_cb(self, thd_wait_event):
        try:
            while not self._abort_event.is_set():
                buffer_item = None
                with self._buffer_lock:
                    if len(self._buffer_data) > 0:
                        buffer_item = self._buffer_data[0]
                        self._buffer_data = self._buffer_data[1:]
                    if len(self._buffer_data) < self._max_buffer_size:
                        self._buffer_empty.set()

                # fetch failed, wait more time
                if buffer_item is None:
                    if self._fetch_finished or self._abort_event.wait(0.1):
                        break
                    continue

                # unpacking value
                date = buffer_item['date']
                page = buffer_item['page']
                illust_id = buffer_item['illust_id']
                illust_page = buffer_item['illust_page']
                url = buffer_item['url']

                # download file here
                dst_path = os.path.join(save_path, str(illust_id % 1000), '%dp%d.jpg' % (illust_id, illust_page))
                if not global_file_cache.exist_file(dst_path):
                    print('Downloading [%s #%d] [%d p%d] %s' % (date.strftime('%Y%m%d'), page,
                                                                illust_id, illust_page, url))
                    suc = False
                    while not suc:
                        try:
                            req = requests.get(url, headers={'Referer': 'https://www.pixiv.net/member_illust.php'
                                                                        '?mode=medium&illust_id=%d' % illust_id},
                                               timeout=15)
                            if req.status_code != 200:
                                warn('Error while downloading %d p%d : HTTP %d' %
                                     (illust_id, illust_page, req.status_code))
                                break

                            image = req.content
                            with open(dst_path, 'wb') as f:
                                f.write(image)
                            global_file_cache.append_file(dst_path)

                            suc = True
                        except Exception as ex:
                            print(ex)
        finally:
            thd_wait_event.set()
            print('thd exited')

    def start(self):
        self.abort()
        self._main_thd = threading.Thread(target=self._main_thd_cb)
        self._main_thd.start()

    def abort(self):
        self._abort_event.set()
        self.wait()

    def wait(self):
        if self._main_thd:
            self._main_thd_started.wait()
            for item in self._abort_wait:
                item.wait()


class DatabaseGenerator:
    # flags for illust_content_type
    ILLUST_CONTENT_TYPE_SEXUAL = 1
    ILLUST_CONTENT_TYPE_LO = 2
    ILLUST_CONTENT_TYPE_GROTESQUE = 4
    ILLUST_CONTENT_TYPE_VIOLENT = 8
    ILLUST_CONTENT_TYPE_HOMOSEXUAL = 16
    ILLUST_CONTENT_TYPE_DRUG = 32
    ILLUST_CONTENT_TYPE_THOUGHTS = 64
    ILLUST_CONTENT_TYPE_ANTISOCIAL = 128
    ILLUST_CONTENT_TYPE_RELIGION = 256
    ILLUST_CONTENT_TYPE_ORIGINAL = 512
    ILLUST_CONTENT_TYPE_FURRY = 1024
    ILLUST_CONTENT_TYPE_BL = 2048
    ILLUST_CONTENT_TYPE_YURI = 4096

    def __init__(self, path_to_save=None, cacher_path=None, max_page=2):
        self._cacher = Cacher(cacher_path if cacher_path else cache_path)
        if not path_to_save:
            path_to_save = db_path
        with open(path_to_save, 'w'):
            pass
        self._conn = sqlite3.connect(path_to_save)
        self._cursor = self._conn.cursor()
        self._max_page = max_page

        self._initialize()
        self._user_id_set = set()
        self._tag_id_dict = dict()
        self._rank_set = set()
        self._illust_id_set = set()
        self._illust_series_id_set = set()

    def _initialize(self):
        # initialize tables
        csr = self._cursor
        csr.execute("create table user (user_id bigint primary key, user_name varchar(255) not null,"
                    "profile_img varchar(255) not null)")
        csr.execute("create table illust_series (illust_series_id integer primary key, "
                    "illust_series_user_id bigint not null, illust_series_title varchar(255) not null,"
                    "illust_series_caption text(16383), illust_series_content_count integer not null,"
                    "illust_series_create_datetime datetime not null, page_url varchar(255) not null,"
                    "foreign key (illust_series_user_id) references user(user_id))")
        csr.execute("create table illust (title varchar(255), date datetime, url varchar(255), illust_type integer,"
                    "illust_book_style integer, illust_page_count integer, illust_content_type integer not null, "
                    "illust_series_id integer, illust_id bigint primary key, width integer not null, "
                    "height integer not null, user_id bigint not null, rating_count integer not null, "
                    "view_count integer not null, illust_upload_timestamp datetime not null, attr varchar(255),"
                    "foreign key (user_id) references user, foreign key (illust_series_id) references illust_series)")
        csr.execute("create table tag (tag_id integer primary key autoincrement, name varchar(255) not null unique)")
        csr.execute("create table illust_tags (illust_id bigint not null, tag_id integer not null,"
                    "foreign key (illust_id) references illust, foreign key (tag_id) references tag)")
        csr.execute("create table illust_rank (illust_id bigint not null, date datetime not null, "
                    "rank integer not null, yes_rank integer not null, foreign key (illust_id) references illust)")
        # indices to accelerate date-based query
        csr.execute("create index illust_date on illust(date)")
        csr.execute("create index illust_rank_date on illust_rank(date, rank)")
        self._conn.commit()

    def start(self):
        cur_date = datetime.now().date() - timedelta(days=2)
        cur_page = 1

        key = '%s-p%d' % (cur_date.strftime('%Y%m%d'), cur_page)
        data = self._cacher.get(key)
        while data:
            data = data.decode("utf8")
            # print('Parsing %s' % key)
            json_data = json.loads(data)
            try:
                contents = json_data['contents']
            except KeyError:
                break

            for item in contents:
                self._parse(item, cur_date)

            # next
            cur_page += 1
            if cur_page > self._max_page:
                cur_date -= timedelta(days=1)
                cur_page = 1
            key = '%s-p%d' % (cur_date.strftime('%Y%m%d'), cur_page)
            data = self._cacher.get(key)

        self._conn.commit()

    def _parse(self, json_obj, ranking_date):
        title = json_obj['title']
        date = json_obj['date']
        tags = json_obj['tags']
        url = json_obj['url']
        illust_type = json_obj['illust_type']
        illust_book_style = json_obj['illust_book_style']
        illust_page_count = json_obj['illust_page_count']
        user_name = json_obj['user_name']
        profile_img = json_obj['profile_img']
        illust_content_type = json_obj['illust_content_type']
        illust_series = json_obj['illust_series']
        illust_id = json_obj['illust_id']
        width = json_obj['width']
        height = json_obj['height']
        user_id = json_obj['user_id']
        rank = json_obj['rank']
        # hint: yes_rank is not YES! rank!, it's just the rank of yesterday, don't be fooled XD
        yes_rank = json_obj['yes_rank']
        rating_count = json_obj['rating_count']
        view_count = json_obj['view_count']
        illust_upload_timestamp = json_obj['illust_upload_timestamp']
        attr = json_obj['attr']
        # converting illust_content_type
        flag_illust_content_type = 0
        if illust_content_type['sexual'] != 0:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_SEXUAL
        if illust_content_type['lo']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_LO
        if illust_content_type['grotesque']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_GROTESQUE
        if illust_content_type['violent']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_VIOLENT
        if illust_content_type['homosexual']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_HOMOSEXUAL
        if illust_content_type['drug']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_DRUG
        if illust_content_type['thoughts']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_THOUGHTS
        if illust_content_type['antisocial']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_ANTISOCIAL
        if illust_content_type['religion']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_RELIGION
        if illust_content_type['original']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_ORIGINAL
        if illust_content_type['furry']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_FURRY
        if illust_content_type['bl']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_BL
        if illust_content_type['yuri']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_YURI
        # querying user data
        csr = self._cursor
        if not self._user_id_set.issuperset({user_id}):
            csr.execute("insert into user(user_id, user_name, profile_img) values (?, ?, ?)",
                        (user_id, user_name, profile_img))
            self._user_id_set.add(user_id)
        # handling illust_series
        if type(illust_series) != bool:
            illust_series_id = illust_series['illust_series_id']
            illust_series_user_id = illust_series['illust_series_user_id']
            illust_series_title = illust_series['illust_series_title']
            illust_series_caption = illust_series['illust_series_caption']
            illust_series_content_count = illust_series['illust_series_content_count']
            illust_series_create_datetime = illust_series['illust_series_create_datetime']
            page_url = illust_series['page_url']
            if not self._illust_series_id_set.issuperset({illust_series_id}):
                csr.execute("insert into illust_series(illust_series_id, illust_series_user_id, "
                            "illust_series_title, illust_series_caption, illust_series_content_count, "
                            "illust_series_create_datetime, page_url) values (?, ?, ?, ?, ?, ?, ?)",
                            (illust_series_id, illust_series_user_id, illust_series_title, illust_series_caption,
                             illust_series_content_count, illust_series_create_datetime, page_url))
                self._illust_series_id_set.add(illust_series_id)
            illust_series = illust_series_id
        else:
            illust_series = None
        # tags
        for tag in tags:
            if self._tag_id_dict.get(tag, None):
                tag_id = self._tag_id_dict[tag]
            else:
                csr.execute("insert into tag(name) values (?)", (tag,))
                tag_id = len(self._tag_id_dict) + 1
                self._tag_id_dict[tag] = tag_id
            csr.execute("insert into illust_tags(illust_id, tag_id) values (?, ?)", (illust_id, tag_id))
        # converting date
        reg_ptn = re.compile('(\\d+)年(\\d+)月(\\d+)日\\s(\\d+):(\\d+)')
        match = re.match(reg_ptn, date)
        if match:
            date_year, date_month, date_day, date_hour, date_minute = (int(match.group(x)) for x in range(1, 6))
            date = datetime(date_year, date_month, date_day, date_hour, date_minute)
        illust_upload_timestamp = datetime.fromtimestamp(illust_upload_timestamp)
        if not self._illust_id_set.issuperset({illust_id}):
            csr.execute("insert into illust(title, date, url, illust_type, illust_book_style, illust_page_count, "
                        "illust_content_type, illust_series_id, illust_id, width, height, user_id, rating_count, "
                        "view_count, illust_upload_timestamp, attr) "
                        "values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
                        (title, date, url, illust_type, illust_book_style, illust_page_count, flag_illust_content_type,
                         illust_series, illust_id, width, height, user_id, rating_count, view_count,
                         illust_upload_timestamp, attr))
            self._illust_id_set.add(illust_id)

        # the set stores (illust_id, date, rank, yes_rank) tuples, so check membership of the whole tuple
        if not self._rank_set.issuperset({(illust_id, ranking_date, rank, yes_rank)}):
            csr.execute("insert into illust_rank(illust_id, date, rank, yes_rank) values (?, ?, ?, ?)",
                        (illust_id, ranking_date, rank, yes_rank))
            self._rank_set.add((illust_id, ranking_date, rank, yes_rank))


if __name__ == '__main__':
    global_file_cache.load(os.path.join(cache_path, 'index'))
    print('Crawler starting')
    a = Crawler()
    a.start()
    a.wait()
    global_file_cache.save(os.path.join(cache_path, 'index'))
    print('Database generator starting')
    a = DatabaseGenerator()
    a.start()
    global_file_cache.save(os.path.join(cache_path, 'index'))

Afterword

I labeled some of the images this crawler collected and trained a Faster R-CNN anime-face detector on them. Combined with the OpenCV-based approach mentioned in the anime face detection post, it can filter CascadeClassifier's rather dreadful detections down to results that are almost 100% correct, and those nearly error-free results can then be used to experiment with all kinds of GANs.

The advantage of CascadeClassifier is that the faces it detects are fairly uniform in shape; of course, that is also its weakness, and because of it more than half of the results get filtered out (painful).

A follow-up post will cover processing the results produced by Faster R-CNN and CascadeClassifier: computing the IoU between the two sets of detections, matching bounding boxes by IoU, and then cropping and scaling, along with the code.
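As a taste of that box-matching step, here is a minimal IoU sketch (not the code from the follow-up post; the (x, y, w, h) box format is an assumption):

def iou(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h); a hypothetical helper for illustration."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    inter_w = max(0, min(ax1 + aw, bx1 + bw) - max(ax1, bx1))
    inter_h = max(0, min(ay1 + ah, by1 + bh) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0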