Running StyleGAN on an Anime Face Dataset

Preface#

The previous two posts covered crawling images from Pixiv and preprocessing them. This post is really just me grumbling a little about the actual training run.

The GAN in question is StyleGAN, from a fairly recent paper: A Style-Based Generator Architecture for Generative Adversarial Networks (arXiv 1812.04948).

NVlabs has open-sourced the official implementation; just clone it, follow the instructions, and it runs.

Hardware requirements#

If you only train at 128x128, the requirements are nowhere near as extreme as the official figures. My rough estimate of the minimum configuration:

RAM: at least 32 GB
GPU: at least 8 GB of VRAM; the more FLOPS the better
CPU: anything reasonable, as long as it doesn't hold the GPU back

Estimated minimum time to finish the scheduled 25,000 kimg (25 million images) on a 1080 Ti: 12 days
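That figure roughly checks out against the sec/kimg numbers in the training log further down. A back-of-the-envelope sketch (the values come from my own run, so treat it as an estimate, not a schedule):

# Rough sanity check of the "12 days" estimate, using sec/kimg values
# observed in the training log below (approximate numbers).
total_kimg = 25000          # scheduled training length: 25000 kimg
done_kimg = 5306            # how far my run got in roughly 32 hours
sec_per_kimg = 54           # sec/kimg once lod reaches 0 (full 128x128 resolution)

remaining_sec = (total_kimg - done_kimg) * sec_per_kimg
print('remaining: %.1f days' % (remaining_sec / 86400.0))           # ~12.3 days
print('total:     %.1f days' % ((remaining_sec + 32 * 3600) / 86400.0))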

Naturally, I didn't finish the run. Setting aside the few dozen yuan of electricity, 16 GB of RAM simply couldn't cope: towards the end the machine was basically swapping nonstop, and letting it keep going would have been miserable. So after saving the weights, I hit Ctrl+C without hesitation.

Results#

Most of what the generator produces already looks great (better than I can draw, at least, and better than some traditional GANs I've trained). See the figure above, which is only a crop of the full grid.

The full-size image:

I will definitely read the paper at some point, but I can't say when. (Procrastinating...)

Training output

dnnlib: Running training.training_loop.training_loop() on localhost...
Streaming data using training.dataset.TFRecordDataset...
Dataset shape = [3, 128, 128]
Dynamic range = [0, 255]
Label size = 0
Constructing networks...

G Params OutputShape WeightShape
--- --- --- ---
latents_in - (?, 512) -
labels_in - (?, 0) -
lod - () -
dlatent_avg - (512,) -
G_mapping/latents_in - (?, 512) -
G_mapping/labels_in - (?, 0) -
G_mapping/PixelNorm - (?, 512) -
G_mapping/Dense0 262656 (?, 512) (512, 512)
G_mapping/Dense1 262656 (?, 512) (512, 512)
G_mapping/Dense2 262656 (?, 512) (512, 512)
G_mapping/Dense3 262656 (?, 512) (512, 512)
G_mapping/Dense4 262656 (?, 512) (512, 512)
G_mapping/Dense5 262656 (?, 512) (512, 512)
G_mapping/Dense6 262656 (?, 512) (512, 512)
G_mapping/Dense7 262656 (?, 512) (512, 512)
G_mapping/Broadcast - (?, 12, 512) -
G_mapping/dlatents_out - (?, 12, 512) -
Truncation - (?, 12, 512) -
G_synthesis/dlatents_in - (?, 12, 512) -
G_synthesis/4x4/Const 534528 (?, 512, 4, 4) (512,)
G_synthesis/4x4/Conv 2885632 (?, 512, 4, 4) (3, 3, 512, 512)
G_synthesis/ToRGB_lod5 1539 (?, 3, 4, 4) (1, 1, 512, 3)
G_synthesis/8x8/Conv0_up 2885632 (?, 512, 8, 8) (3, 3, 512, 512)
G_synthesis/8x8/Conv1 2885632 (?, 512, 8, 8) (3, 3, 512, 512)
G_synthesis/ToRGB_lod4 1539 (?, 3, 8, 8) (1, 1, 512, 3)
G_synthesis/Upscale2D - (?, 3, 8, 8) -
G_synthesis/Grow_lod4 - (?, 3, 8, 8) -
G_synthesis/16x16/Conv0_up 2885632 (?, 512, 16, 16) (3, 3, 512, 512)
G_synthesis/16x16/Conv1 2885632 (?, 512, 16, 16) (3, 3, 512, 512)
G_synthesis/ToRGB_lod3 1539 (?, 3, 16, 16) (1, 1, 512, 3)
G_synthesis/Upscale2D_1 - (?, 3, 16, 16) -
G_synthesis/Grow_lod3 - (?, 3, 16, 16) -
G_synthesis/32x32/Conv0_up 2885632 (?, 512, 32, 32) (3, 3, 512, 512)
G_synthesis/32x32/Conv1 2885632 (?, 512, 32, 32) (3, 3, 512, 512)
G_synthesis/ToRGB_lod2 1539 (?, 3, 32, 32) (1, 1, 512, 3)
G_synthesis/Upscale2D_2 - (?, 3, 32, 32) -
G_synthesis/Grow_lod2 - (?, 3, 32, 32) -
G_synthesis/64x64/Conv0_up 1442816 (?, 256, 64, 64) (3, 3, 512, 256)
G_synthesis/64x64/Conv1 852992 (?, 256, 64, 64) (3, 3, 256, 256)
G_synthesis/ToRGB_lod1 771 (?, 3, 64, 64) (1, 1, 256, 3)
G_synthesis/Upscale2D_3 - (?, 3, 64, 64) -
G_synthesis/Grow_lod1 - (?, 3, 64, 64) -
G_synthesis/128x128/Conv0_up 426496 (?, 128, 128, 128) (3, 3, 256, 128)
G_synthesis/128x128/Conv1 279040 (?, 128, 128, 128) (3, 3, 128, 128)
G_synthesis/ToRGB_lod0 387 (?, 3, 128, 128) (1, 1, 128, 3)
G_synthesis/Upscale2D_4 - (?, 3, 128, 128) -
G_synthesis/Grow_lod0 - (?, 3, 128, 128) -
G_synthesis/images_out - (?, 3, 128, 128) -
G_synthesis/lod - () -
G_synthesis/noise0 - (1, 1, 4, 4) -
G_synthesis/noise1 - (1, 1, 4, 4) -
G_synthesis/noise2 - (1, 1, 8, 8) -
G_synthesis/noise3 - (1, 1, 8, 8) -
G_synthesis/noise4 - (1, 1, 16, 16) -
G_synthesis/noise5 - (1, 1, 16, 16) -
G_synthesis/noise6 - (1, 1, 32, 32) -
G_synthesis/noise7 - (1, 1, 32, 32) -
G_synthesis/noise8 - (1, 1, 64, 64) -
G_synthesis/noise9 - (1, 1, 64, 64) -
G_synthesis/noise10 - (1, 1, 128, 128) -
G_synthesis/noise11 - (1, 1, 128, 128) -
images_out - (?, 3, 128, 128) -
--- --- --- ---
Total 25843858


D Params OutputShape WeightShape
--- --- --- ---
images_in - (?, 3, 128, 128) -
labels_in - (?, 0) -
lod - () -
FromRGB_lod0 512 (?, 128, 128, 128) (1, 1, 3, 128)
128x128/Conv0 147584 (?, 128, 128, 128) (3, 3, 128, 128)
128x128/Conv1_down 295168 (?, 256, 64, 64) (3, 3, 128, 256)
Downscale2D - (?, 3, 64, 64) -
FromRGB_lod1 1024 (?, 256, 64, 64) (1, 1, 3, 256)
Grow_lod0 - (?, 256, 64, 64) -
64x64/Conv0 590080 (?, 256, 64, 64) (3, 3, 256, 256)
64x64/Conv1_down 1180160 (?, 512, 32, 32) (3, 3, 256, 512)
Downscale2D_1 - (?, 3, 32, 32) -
FromRGB_lod2 2048 (?, 512, 32, 32) (1, 1, 3, 512)
Grow_lod1 - (?, 512, 32, 32) -
32x32/Conv0 2359808 (?, 512, 32, 32) (3, 3, 512, 512)
32x32/Conv1_down 2359808 (?, 512, 16, 16) (3, 3, 512, 512)
Downscale2D_2 - (?, 3, 16, 16) -
FromRGB_lod3 2048 (?, 512, 16, 16) (1, 1, 3, 512)
Grow_lod2 - (?, 512, 16, 16) -
16x16/Conv0 2359808 (?, 512, 16, 16) (3, 3, 512, 512)
16x16/Conv1_down 2359808 (?, 512, 8, 8) (3, 3, 512, 512)
Downscale2D_3 - (?, 3, 8, 8) -
FromRGB_lod4 2048 (?, 512, 8, 8) (1, 1, 3, 512)
Grow_lod3 - (?, 512, 8, 8) -
8x8/Conv0 2359808 (?, 512, 8, 8) (3, 3, 512, 512)
8x8/Conv1_down 2359808 (?, 512, 4, 4) (3, 3, 512, 512)
Downscale2D_4 - (?, 3, 4, 4) -
FromRGB_lod5 2048 (?, 512, 4, 4) (1, 1, 3, 512)
Grow_lod4 - (?, 512, 4, 4) -
4x4/MinibatchStddev - (?, 513, 4, 4) -
4x4/Conv 2364416 (?, 512, 4, 4) (3, 3, 513, 512)
4x4/Dense0 4194816 (?, 512) (8192, 512)
4x4/Dense1 513 (?, 1) (512, 1)
scores_out - (?, 1) -
--- --- --- ---
Total 22941313

Building TensorFlow graph...
Setting up snapshot image grid...
Setting up run dir...
Training...

tick 1 kimg 140.3 lod 4.00 minibatch 128 time 3m 05s sec/tick 143.2 sec/kimg 1.02 maintenance 41.7 gpumem 3.0
network-snapshot-000140 time 3m 35s fid50k 374.5350
tick 2 kimg 280.6 lod 4.00 minibatch 128 time 9m 31s sec/tick 163.3 sec/kimg 1.16 maintenance 222.9 gpumem 3.4
tick 3 kimg 420.9 lod 4.00 minibatch 128 time 12m 08s sec/tick 156.9 sec/kimg 1.12 maintenance 0.4 gpumem 3.4
tick 4 kimg 561.2 lod 4.00 minibatch 128 time 14m 47s sec/tick 157.8 sec/kimg 1.12 maintenance 0.4 gpumem 3.4
tick 5 kimg 681.5 lod 3.87 minibatch 128 time 20m 12s sec/tick 324.6 sec/kimg 2.70 maintenance 0.4 gpumem 4.5
tick 6 kimg 801.8 lod 3.66 minibatch 128 time 26m 49s sec/tick 396.3 sec/kimg 3.29 maintenance 0.6 gpumem 4.5
tick 7 kimg 922.1 lod 3.46 minibatch 128 time 33m 20s sec/tick 391.1 sec/kimg 3.25 maintenance 0.6 gpumem 4.5
tick 8 kimg 1042.4 lod 3.26 minibatch 128 time 40m 00s sec/tick 399.4 sec/kimg 3.32 maintenance 0.5 gpumem 4.5
tick 9 kimg 1162.8 lod 3.06 minibatch 128 time 46m 36s sec/tick 394.9 sec/kimg 3.28 maintenance 0.6 gpumem 4.5
tick 10 kimg 1283.1 lod 3.00 minibatch 128 time 53m 11s sec/tick 395.0 sec/kimg 3.28 maintenance 0.6 gpumem 4.5
network-snapshot-001283 time 3m 50s fid50k 302.4119
tick 11 kimg 1403.4 lod 3.00 minibatch 128 time 1h 03m 35s sec/tick 391.7 sec/kimg 3.26 maintenance 231.8 gpumem 4.5
tick 12 kimg 1523.7 lod 3.00 minibatch 128 time 1h 10m 04s sec/tick 388.5 sec/kimg 3.23 maintenance 0.5 gpumem 4.5
tick 13 kimg 1644.0 lod 3.00 minibatch 128 time 1h 16m 32s sec/tick 387.8 sec/kimg 3.22 maintenance 0.6 gpumem 4.5
tick 14 kimg 1764.4 lod 3.00 minibatch 128 time 1h 23m 02s sec/tick 388.6 sec/kimg 3.23 maintenance 0.6 gpumem 4.5
tick 15 kimg 1864.4 lod 2.89 minibatch 64 time 1h 37m 36s sec/tick 873.9 sec/kimg 8.73 maintenance 0.6 gpumem 4.7
tick 16 kimg 1964.5 lod 2.73 minibatch 64 time 1h 56m 27s sec/tick 1130.0 sec/kimg 11.29 maintenance 1.1 gpumem 4.7
tick 17 kimg 2064.6 lod 2.56 minibatch 64 time 2h 15m 18s sec/tick 1129.7 sec/kimg 11.29 maintenance 1.0 gpumem 4.7
tick 18 kimg 2164.7 lod 2.39 minibatch 64 time 2h 34m 11s sec/tick 1132.3 sec/kimg 11.31 maintenance 1.0 gpumem 4.7
tick 19 kimg 2264.8 lod 2.23 minibatch 64 time 2h 53m 04s sec/tick 1132.1 sec/kimg 11.31 maintenance 1.0 gpumem 4.7
tick 20 kimg 2364.9 lod 2.06 minibatch 64 time 3h 11m 57s sec/tick 1132.2 sec/kimg 11.31 maintenance 1.0 gpumem 4.7
network-snapshot-002364 time 4m 33s fid50k 297.8871
tick 21 kimg 2465.0 lod 2.00 minibatch 64 time 3h 35m 39s sec/tick 1146.3 sec/kimg 11.45 maintenance 275.5 gpumem 4.7
tick 22 kimg 2565.1 lod 2.00 minibatch 64 time 3h 54m 34s sec/tick 1133.9 sec/kimg 11.33 maintenance 1.0 gpumem 4.7
tick 23 kimg 2665.2 lod 2.00 minibatch 64 time 4h 13m 31s sec/tick 1135.8 sec/kimg 11.35 maintenance 1.0 gpumem 4.7
tick 24 kimg 2765.3 lod 2.00 minibatch 64 time 4h 32m 02s sec/tick 1109.7 sec/kimg 11.09 maintenance 1.0 gpumem 4.7
tick 25 kimg 2865.4 lod 2.00 minibatch 64 time 4h 50m 52s sec/tick 1129.1 sec/kimg 11.28 maintenance 1.0 gpumem 4.7
tick 26 kimg 2965.5 lod 2.00 minibatch 64 time 5h 09m 32s sec/tick 1118.7 sec/kimg 11.18 maintenance 1.1 gpumem 4.7
tick 27 kimg 3045.5 lod 1.92 minibatch 32 time 5h 38m 21s sec/tick 1728.0 sec/kimg 21.60 maintenance 1.0 gpumem 4.9
tick 28 kimg 3125.5 lod 1.79 minibatch 32 time 6h 16m 31s sec/tick 2288.1 sec/kimg 28.60 maintenance 2.3 gpumem 4.9
tick 29 kimg 3205.5 lod 1.66 minibatch 32 time 6h 54m 45s sec/tick 2291.4 sec/kimg 28.64 maintenance 2.5 gpumem 4.9
tick 30 kimg 3285.5 lod 1.52 minibatch 32 time 7h 33m 34s sec/tick 2326.7 sec/kimg 29.08 maintenance 2.8 gpumem 4.9
network-snapshot-003285 time 6m 02s fid50k 243.0076
tick 31 kimg 3365.5 lod 1.39 minibatch 32 time 8h 17m 27s sec/tick 2266.9 sec/kimg 28.34 maintenance 366.3 gpumem 4.9
tick 32 kimg 3445.5 lod 1.26 minibatch 32 time 8h 55m 02s sec/tick 2251.5 sec/kimg 28.14 maintenance 2.8 gpumem 4.9
tick 33 kimg 3525.5 lod 1.12 minibatch 32 time 9h 32m 35s sec/tick 2250.6 sec/kimg 28.13 maintenance 2.7 gpumem 4.9
tick 34 kimg 3605.5 lod 1.00 minibatch 32 time 10h 10m 04s sec/tick 2246.4 sec/kimg 28.08 maintenance 2.7 gpumem 4.9
tick 35 kimg 3685.5 lod 1.00 minibatch 32 time 10h 46m 35s sec/tick 2188.4 sec/kimg 27.35 maintenance 2.6 gpumem 4.9
tick 36 kimg 3765.5 lod 1.00 minibatch 32 time 11h 23m 07s sec/tick 2189.2 sec/kimg 27.37 maintenance 2.6 gpumem 4.9
tick 37 kimg 3845.5 lod 1.00 minibatch 32 time 11h 59m 38s sec/tick 2188.1 sec/kimg 27.35 maintenance 2.7 gpumem 4.9
tick 38 kimg 3925.5 lod 1.00 minibatch 32 time 12h 36m 08s sec/tick 2188.1 sec/kimg 27.35 maintenance 2.6 gpumem 4.9
tick 39 kimg 4005.5 lod 1.00 minibatch 32 time 13h 12m 40s sec/tick 2189.1 sec/kimg 27.36 maintenance 2.7 gpumem 4.9
tick 40 kimg 4085.5 lod 1.00 minibatch 32 time 13h 49m 12s sec/tick 2188.8 sec/kimg 27.36 maintenance 2.6 gpumem 4.9
network-snapshot-004085 time 5m 49s fid50k 116.0598
tick 41 kimg 4165.5 lod 1.00 minibatch 32 time 14h 31m 32s sec/tick 2188.0 sec/kimg 27.35 maintenance 352.6 gpumem 4.9
tick 42 kimg 4225.5 lod 0.96 minibatch 16 time 15h 11m 38s sec/tick 2403.0 sec/kimg 40.03 maintenance 2.6 gpumem 5.1
tick 43 kimg 4285.6 lod 0.86 minibatch 16 time 16h 09m 48s sec/tick 3480.7 sec/kimg 57.98 maintenance 9.6 gpumem 5.1
tick 44 kimg 4345.6 lod 0.76 minibatch 16 time 17h 06m 03s sec/tick 3368.3 sec/kimg 56.11 maintenance 7.1 gpumem 5.1
tick 45 kimg 4405.6 lod 0.66 minibatch 16 time 18h 02m 18s sec/tick 3366.8 sec/kimg 56.08 maintenance 8.3 gpumem 5.1
tick 46 kimg 4465.7 lod 0.56 minibatch 16 time 18h 57m 51s sec/tick 3323.2 sec/kimg 55.36 maintenance 9.6 gpumem 5.1
tick 47 kimg 4525.7 lod 0.46 minibatch 16 time 19h 53m 22s sec/tick 3323.7 sec/kimg 55.37 maintenance 7.3 gpumem 5.1
tick 48 kimg 4585.7 lod 0.36 minibatch 16 time 20h 50m 32s sec/tick 3424.0 sec/kimg 57.04 maintenance 5.6 gpumem 5.1
tick 49 kimg 4645.8 lod 0.26 minibatch 16 time 21h 47m 58s sec/tick 3436.1 sec/kimg 57.24 maintenance 9.7 gpumem 5.1
tick 50 kimg 4705.8 lod 0.16 minibatch 16 time 22h 45m 04s sec/tick 3418.3 sec/kimg 56.94 maintenance 7.8 gpumem 5.1
network-snapshot-004705 time 8m 55s fid50k 27.1257
tick 51 kimg 4765.8 lod 0.06 minibatch 16 time 23h 51m 22s sec/tick 3426.3 sec/kimg 57.07 maintenance 552.1 gpumem 5.1
tick 52 kimg 4825.9 lod 0.00 minibatch 16 time 1d 00h 46m sec/tick 3292.1 sec/kimg 54.84 maintenance 5.0 gpumem 5.1
tick 53 kimg 4885.9 lod 0.00 minibatch 16 time 1d 01h 40m sec/tick 3234.1 sec/kimg 53.87 maintenance 5.2 gpumem 5.1
tick 54 kimg 4945.9 lod 0.00 minibatch 16 time 1d 02h 35m sec/tick 3291.5 sec/kimg 54.83 maintenance 6.6 gpumem 5.1
tick 55 kimg 5006.0 lod 0.00 minibatch 16 time 1d 03h 30m sec/tick 3280.6 sec/kimg 54.65 maintenance 5.1 gpumem 5.1
tick 56 kimg 5066.0 lod 0.00 minibatch 16 time 1d 04h 24m sec/tick 3272.8 sec/kimg 54.52 maintenance 8.7 gpumem 5.1
tick 57 kimg 5126.0 lod 0.00 minibatch 16 time 1d 05h 18m sec/tick 3234.0 sec/kimg 53.87 maintenance 5.0 gpumem 5.1
tick 58 kimg 5186.0 lod 0.00 minibatch 16 time 1d 06h 12m sec/tick 3250.0 sec/kimg 54.14 maintenance 5.4 gpumem 5.1
tick 59 kimg 5246.1 lod 0.00 minibatch 16 time 1d 07h 07m sec/tick 3245.4 sec/kimg 54.06 maintenance 33.8 gpumem 5.1
tick 60 kimg 5306.1 lod 0.00 minibatch 16 time 1d 08h 01m sec/tick 3232.7 sec/kimg 53.85 maintenance 6.1 gpumem 5.1
network-snapshot-005306 time 9m 33s fid50k 14.5950
tick 61 kimg 5366.1 lod 0.00 minibatch 16 time 1d 09h 06m sec/tick 3283.7 sec/kimg 54.70 maintenance 596.8 gpumem 5.1
tick 62 kimg 5426.2 lod 0.00 minibatch 16 time 1d 10h 00m sec/tick 3244.0 sec/kimg 54.04 maintenance 15.0 gpumem 5.1
tick 63 kimg 5486.2 lod 0.00 minibatch 16 time 1d 10h 54m sec/tick 3234.6 sec/kimg 53.88 maintenance 9.4 gpumem 5.1
tick 64 kimg 5546.2 lod 0.00 minibatch 16 time 1d 11h 48m sec/tick 3236.9 sec/kimg 53.92 maintenance 7.0 gpumem 5.1
tick 65 kimg 5606.3 lod 0.00 minibatch 16 time 1d 12h 44m sec/tick 3316.2 sec/kimg 55.24 maintenance 9.7 gpumem 5.1
tick 66 kimg 5666.3 lod 0.00 minibatch 16 time 1d 13h 38m sec/tick 3233.3 sec/kimg 53.86 maintenance 5.7 gpumem 5.1
tick 67 kimg 5726.3 lod 0.00 minibatch 16 time 1d 14h 32m sec/tick 3232.1 sec/kimg 53.84 maintenance 5.5 gpumem 5.1
tick 68 kimg 5786.4 lod 0.00 minibatch 16 time 1d 15h 26m sec/tick 3247.6 sec/kimg 54.10 maintenance 5.6 gpumem 5.1
tick 69 kimg 5846.4 lod 0.00 minibatch 16 time 1d 16h 20m sec/tick 3250.5 sec/kimg 54.15 maintenance 16.9 gpumem 5.1
tick 70 kimg 5906.4 lod 0.00 minibatch 16 time 1d 17h 14m sec/tick 3232.9 sec/kimg 53.85 maintenance 6.6 gpumem 5.1

Pixiv Daily Ranking Crawler

How the Pixiv daily ranking crawler works, and its implementation#

How it works#

It boils down to two words: packet capture.

First, open Fiddler and browse the Pixiv daily ranking as usual, as shown below:

At that point you could simply throw BeautifulSoup at it and parse the HTML. But I later noticed a captured request like the one below, sent when the ranking page loads items 51-100.

It is plain JSON, so no HTML parsing is needed at all. Open contents and everything is laid out: ID, page count, title, image URL, author, tags, and a pile of other useful fields are already there. Using this endpoint directly is both more efficient and more complete than parsing the HTML.

The request parameters are simple too: mode=daily stays as it is; p=2 looks like pagination (1 = items 1-50, 2 = items 51-100, and so on); format=json is also fixed. The only one that needs figuring out is tt=382e...de.
Searching the capture history for 382e...de highlights the earlier daily-ranking page, which means this string can be obtained straight from the page itself.

Searching the page source, it is easy to spot a line reading pixiv.context.token = "382e...de";

That makes things easy: one regex pulls it out. The pattern looks roughly like pixiv\.context\.token\s*=\s*"(\w+)";, and after matching, group(1) gives you the token.

Verifying the p parameter confirms it behaves as expected, and it also turns out that changing the day only requires changing the date parameter.
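Putting those pieces together, a minimal sketch of the whole exchange might look like this (it leaves out the User-Agent, proxy and retry handling that the full crawler at the end of this post uses):

import re
import requests

RANKING_URL = 'https://www.pixiv.net/ranking.php'

# 1. Fetch the ranking page and pull the tt token out with the regex above.
html = requests.get(RANKING_URL, params={'mode': 'daily'}, timeout=15).text
tt = re.search(r'pixiv\.context\.token\s*=\s*"(\w+)";', html).group(1)

# 2. Ask the same endpoint for JSON: p selects the block (1 = ranks 1-50, 2 = 51-100),
#    date selects the day, and the request has to look like an AJAX call.
params = {'mode': 'daily', 'date': '20190117', 'p': 2, 'format': 'json', 'tt': tt}
headers = {'X-Requested-With': 'XMLHttpRequest', 'Referer': RANKING_URL + '?mode=daily'}
data = requests.get(RANKING_URL, params=params, headers=headers, timeout=15).json()

for item in data['contents']:
    print(item['rank'], item['illust_id'], item['title'], item['url'])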

Next, the images. Take the URL from the figure above and see what the captured request looks like.

Yep, that's the one. Note that the Referer header on the left must not be left out, otherwise you get this:

A thoroughly hostile 403~

So a thumbnail URL taken from the JSON data looks like this, as shown in the figure above:
https://i.pximg.net/c/240x480/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg
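Downloading it therefore only needs a Referer pointing back at a Pixiv page. A minimal sketch (the Referer format mirrors the one used in the full crawler below):

import requests

url = 'https://i.pximg.net/c/240x480/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg'
illust_id = 72712034

# Without a Referer header i.pximg.net answers 403; with one, the image comes through.
headers = {'Referer': 'https://www.pixiv.net/member_illust.php?mode=medium&illust_id=%d' % illust_id}
resp = requests.get(url, headers=headers, timeout=15)
print(resp.status_code)  # expect 200

with open('72712034_p0.jpg', 'wb') as f:
    f.write(resp.content)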

It feels a bit small, sadly. Clicking through gives a clearer view (the earlier screenshot was swapped out, because clicking in popped up a "you are viewing sensitive content" notice, erm).

At that point the URL becomes:
https://i.pximg.net/c/600x600/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg

See the pattern? Replace the ? values in /c/?x?/img-master/... with larger numbers and you get a sharper image~

Of course, if you register an account and log in, the image gets larger still:

The URL is then:
https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg

Click through to the original image and the URL turns into the one below (screenshots of the process omitted):
https://i.pximg.net/img-original/img/2019/01/17/23/28/48/72712034_p0.png

To sum up, the URL format is more or less figured out; exactly which sizes /c/ accepts is left for you to discover.
Thumbnail (experiment with the ? values yourself; 240x480 and 600x600 already appear above):
https://i.pximg.net/c/?x?/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg
Large image (over 1000 px):
https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg
Original:
https://i.pximg.net/img-original/img/2019/01/17/23/28/48/72712034_p0.png
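Going from the large-image URL to a thumbnail of a given size is then a one-line substitution. A small sketch (which sizes the server actually accepts is, as noted above, up to you to experiment with):

import re

def thumbnail_url(master_url, width, height):
    """Insert /c/<w>x<h>/ in front of /img-master/ to request a resized thumbnail."""
    return re.sub(r'^(https://i\.pximg\.net)(/img-master/)',
                  r'\1/c/%dx%d\2' % (width, height), master_url)

print(thumbnail_url(
    'https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg', 600, 600))
# https://i.pximg.net/c/600x600/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg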

Rewriting the image URL from the JSON is likewise just one regex; the code is below (it is also part of the full program at the end).
It takes two parameters. One is url, the URL obtained from the JSON above, e.g. https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg;
the other is page, which selects which image of a multi-image post to fetch; the first image is 0.

import re
from warnings import warn


def replace_url(url, page):
    # Pattern for pageable posts: .../img/<date>/<id>_p<n>_master1200.jpg (optionally behind /c/<w>x<h>/)
    url_pattern = re.compile(r'(?P<schemas>https?)://(?P<host>([^./]+\.)+[^./]+)(/c/\d+x\d+)?'
                             r'(?P<path_prefix>/img-master/img(/\d+){6}/\d+_p)\d+'
                             r'(?P<path_postfix>_(master|square)\d+\.(jpg|png)).*')
    match = re.match(url_pattern, url)
    if match:
        schemas = match.group('schemas')
        host = match.group('host')
        path_prefix = match.group('path_prefix')
        path_postfix = match.group('path_postfix')
        return '%s://%s%s%d%s' % (schemas, host, path_prefix, page, path_postfix)
    # Pattern for posts whose URL carries no _p<n> page marker at all
    url_pattern = re.compile(r'(?P<schemas>https?)://(?P<host>([^./]+\.)+[^./]+)(/c/\d+x\d+)?'
                             r'(?P<path_prefix>/img-master/img(/\d+){6}/\d+)'
                             r'(?P<path_postfix>_(master|square)\d+\.(jpg|png)).*')
    match = re.match(url_pattern, url)
    if match:
        schemas = match.group('schemas')
        host = match.group('host')
        path_prefix = match.group('path_prefix')
        path_postfix = match.group('path_postfix')
        if page != 0:
            warn('A non-pageable image url detected, your page should be 0 constantly, but got %d' % page)
        return '%s://%s%s%s' % (schemas, host, path_prefix, path_postfix)

    raise ValueError('The url "%s" could not match any replacement rules' % url)
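For example, feeding it the thumbnail URL from earlier and asking for the third image (page index 2) drops the /c/240x480 part and swaps in the page number:

url = 'https://i.pximg.net/c/240x480/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg'
print(replace_url(url, 2))
# https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p2_master1200.jpg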

Only while crawling did I discover that some images have no _p0 in the middle at all, leaving just 72712034_master1200.jpg. Watch out for this too, or it will trip you up when you least expect it.

Another annoyance is that some originals are in PNG format, which is hard to tell when not logged in, so extra time gets spent probing for jpg versus png.

The program#

A long chunk of Python code, compatible with Python 3.5 and 3.6.
When writing a multithreaded crawler, don't agonize over how pretty the code looks overall.
It crawls the large images (not the originals) of the entire daily ranking, Top 100. By default it goes through a privoxy + shadowsocks proxy (configure the proxy yourself, or set proxy = None if you don't need one) and downloads with 5 threads.
You need to edit save_path to point at your own storage location. The code creates 1000 folders and shards images by the last three digits of the illustration ID; for example, the save_path\777 folder holds every image whose ID ends in 777.
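The sharding is simply illust_id % 1000, the same computation the downloader below uses to build its destination path:

import os

save_path = '/share/disk/ML-TRAINING-SET/PixivRanking'  # same setting as in the code below
illust_id, page = 72712034, 0

dst = os.path.join(save_path, str(illust_id % 1000), '%dp%d.jpg' % (illust_id, page))
print(dst)  # /share/disk/ML-TRAINING-SET/PixivRanking/34/72712034p0.jpg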

I run the code on a Raspberry Pi. To speed things up it does quite a lot of in-memory caching, so it eats about 500 MB of RAM; the daily update only takes around 20 minutes (10 minutes crawling, 10 minutes updating the database).
So far the dataset is 246 GB and contains 693k files.

Database tables:
user: users; stores the user id, name and avatar URL
illust_series: series that posts belong to, as specified by the author at upload time; stores the series id, creator's user id, title, caption, number of posts in the series, creation time and the series URL
illust: illustrations; stores the title, post date, image URL, illust_type (meaning unknown), book_style (meaning unknown), page count, content type (original, violent, sexually suggestive, etc.), series ID (null if none), id, width and height (of the first page for multi-image posts), user id, rating count, view count, upload timestamp and attr (the string form of the content type)
tag: illustration tags; stores the tag id (an autoincrement column) and the tag name
illust_tags: the illustration-tag relation (an illustration has many tags and a tag has many illustrations); stores tag id and illustration id
illust_rank: ranking information; stores the illustration id, date, current rank and yesterday's rank
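As a quick sketch of how the tables join up (the schema is exactly what the code below creates; the date literal is only an example), here is a query for one day's top 10 together with author and tags:

import sqlite3

conn = sqlite3.connect('/share/disk/ML-TRAINING-SET/PixivRanking/database.db')  # db_path from the code below
rows = conn.execute(
    "select r.rank, i.illust_id, i.title, u.user_name, group_concat(t.name, ', ') "
    "from illust_rank r "
    "join illust i on i.illust_id = r.illust_id "
    "join user u on u.user_id = i.user_id "
    "left join illust_tags it on it.illust_id = i.illust_id "
    "left join tag t on t.tag_id = it.tag_id "
    "where r.date = '2019-01-17' "
    "group by r.rank, i.illust_id, i.title, u.user_name "
    "order by r.rank limit 10")
for rank, illust_id, title, user_name, tags in rows:
    print(rank, illust_id, title, user_name, tags)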

import requests
from datetime import datetime, timedelta
import threading
import json
import re
from hashlib import md5
import os
from warnings import warn
import sqlite3
import numpy as np
import pickle
from tqdm import tqdm


ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) ' \
'Chrome/53.0.2785.143 Safari/537.36'
# path to save images from pixiv
save_path = '/share/disk/ML-TRAINING-SET/PixivRanking'
# save_path = 'd:/ML-TRAINING-SET/PixivRanking'
# path to save ranking data cache
cache_path = os.path.join(save_path, '.cache')
# path to generate sqlite database
db_path = os.path.join(save_path, 'database.db')
# proxy, for those who could not access pixiv directly
proxy = {'https': 'https://localhost:8118'}


def calc_md5(str_data):
    hash_obj = md5()
    hash_obj.update(str_data.encode('utf8'))
    return hash_obj.hexdigest()


def create_dir(path):
    parent = os.path.abspath(path)
    dir_to_create = []
    while not os.path.exists(parent):
        dir_to_create.append(parent)
        parent = os.path.abspath(os.path.join(parent, '..'))
    dir_to_create = dir_to_create[::-1]
    for dir_path in dir_to_create:
        os.mkdir(dir_path)
        print('Directory %s created' % dir_path)


class FileCacher:
    def __init__(self):
        self._cache_files = dict()
        self._lock = threading.RLock()

    def add_cache_dir(self, directory, create_dir_if_not_exist=True):
        with self._lock:
            path = os.path.abspath(directory)
            if os.path.exists(path):
                files = set(os.listdir(path))
            else:
                files = set()
                if create_dir_if_not_exist:
                    create_dir(path)
            self._cache_files[path] = files

    def append_file(self, file_path):
        with self._lock:
            dir_path = os.path.abspath(os.path.join(file_path, '..'))
            files = self._cache_files.get(dir_path, None)
            if files is None:
                warn('%s is not in the cached directory, calling add_cache_dir implicitly' % dir_path)
                self.add_cache_dir(dir_path, True)
                files = self._cache_files.get(dir_path, None)
                assert files is not None
            file_name = os.path.basename(file_path)
            self._cache_files[dir_path].add(file_name)

    def remove_file(self, file_path):
        with self._lock:
            dir_path = os.path.abspath(os.path.join(file_path, '..'))
            files = self._cache_files.get(dir_path, None)
            if files is None:
                warn('%s is not in the cached directory, calling add_cache_dir implicitly' % dir_path)
                self.add_cache_dir(dir_path, True)
                files = self._cache_files.get(dir_path, None)
                assert files is not None
            file_name = os.path.basename(file_path)
            self._cache_files[dir_path].remove(file_name)

    def exist_file(self, file_path):
        with self._lock:
            dir_path = os.path.abspath(os.path.join(file_path, '..'))
            files = self._cache_files.get(dir_path, None)
            if files is None:
                warn('%s is not in the cached directory, calling add_cache_dir implicitly' % dir_path)
                self.add_cache_dir(dir_path, True)
                files = self._cache_files.get(dir_path, None)
            file_name = os.path.basename(file_path)
            return file_name in files

    def exist_dir_in_cache(self, dir_path):
        with self._lock:
            return self._cache_files.get(os.path.abspath(dir_path), None) is not None

    def validate_dir(self, dir_path):
        with self._lock:
            dir_path = os.path.abspath(dir_path)
            files = self._cache_files.get(dir_path, None)
            if files is not None:
                files = set(files)
                actual_files = set(os.listdir(dir_path))
                same_file_count = len(files.intersection(actual_files))
                is_same = len(files) == same_file_count and len(actual_files) == same_file_count
                if not is_same:
                    warn('cache inconsistency detected in directory %s, cleared all cache' % dir_path)
                    self._cache_files[dir_path] = actual_files

    def save(self, file_path):
        with self._lock:
            if not self.exist_file(file_path):
                self.append_file(file_path)
            with open(file_path, 'wb') as f:
                pickle.dump(self._cache_files, f)

    def load(self, file_path, validate_on_load=True):
        with self._lock:
            if os.path.exists(file_path):
                with open(file_path, 'rb') as f:
                    self._cache_files = pickle.load(f)
                if validate_on_load:
                    cache_dirs = list(self._cache_files)
                    print('validating files')
                    for cache_dir in tqdm(cache_dirs, ascii=True):
                        self.validate_dir(cache_dir)
                    print('done')


global_file_cache = FileCacher()


class Cacher:
    def __init__(self, path):
        self._path = path
        # create dir if not exists
        create_dir(self._path)

    def __getitem__(self, item):
        if type(item) != str:
            item = str(item)
        path = os.path.join(self._path, calc_md5(item))
        if not global_file_cache.exist_file(path):
            raise KeyError('Item not exists')
        with open(path, 'rb') as f:
            return f.read()

    def __setitem__(self, key, value):
        if type(key) != str:
            key = str(key)
        path = os.path.join(self._path, calc_md5(key))
        if type(value) == str:
            value = bytes(value, 'utf8')
        elif type(value) != bytes:
            raise TypeError('value should be string or bytes')
        with open(path, 'wb') as f:
            f.write(value)
        global_file_cache.append_file(path)

    def get(self, item, default_item=None):
        try:
            return self.__getitem__(item)
        except KeyError:
            return default_item


class Crawler:
    def __init__(self, save_path_=None, cache_path_=None, nums_thread=5, begin_date=None,
                 max_page=2, max_buffer_size=3000):
        self._num_threads = nums_thread
        self._main_thd = None
        self._main_thd_started = threading.Event()
        self._fetch_finished = None
        self._max_page = max_page
        if begin_date is None or type(begin_date) != datetime:
            begin_date = datetime.fromordinal(datetime.now().date().toordinal()) - timedelta(days=2)
        self._date = begin_date
        self._page = 1
        if not save_path_:
            save_path_ = save_path
        self._save_path = save_path_
        self._cache = Cacher(cache_path_ if cache_path_ else cache_path)
        # handling abort event
        self._abort_event = threading.Event()
        self._abort_wait = []

        # handling variable buffer for main thread
        self._buffer_data = []
        self._buffer_lock = threading.RLock()
        self._buffer_empty = threading.Event()  # an event telling main thread to fetch more data
        self._buffer_empty.set()
        self._max_buffer_size = max_buffer_size

        # creating directory
        for i in range(1000):
            dst_path = os.path.join(save_path, str(i))
            create_dir(dst_path)
            if not global_file_cache.exist_dir_in_cache(dst_path):
                global_file_cache.add_cache_dir(dst_path)

    def _main_thd_cb(self):
        self._abort_wait = []
        self._abort_event.clear()
        self._fetch_finished = False

        try:
            # fetch ranking page
            print('Fetching ranking page (html mode)')
            # external loop for handling retrying
            while not self._abort_event.is_set():
                suc = False
                req = None
                while not suc:
                    if self._abort_event.is_set():
                        return
                    try:
                        req = requests.get('https://www.pixiv.net/ranking.php?mode=daily',
                                           headers={'User-Agent': ua}, proxies=proxy, timeout=15)
                        suc = True
                    except Exception as ex:
                        warn(str(ex))
                rep = req.content.decode('utf8')
                # handling non-200
                if req.status_code != 200:
                    print('HTTP Get failed with response code %d, retry in 0.5s' % req.status_code)
                    # wait 0.5s
                    if self._abort_event.wait(0.5):
                        break
                # parse tt
                pattern = re.compile(r'pixiv\.context\.token\s*=\s*"(?P<tt>\w+)";')
                match_result = re.finditer(pattern, rep)
                try:
                    match_result = next(match_result)
                except StopIteration:
                    match_result = None
                if not match_result:
                    print('Could not get tt from html, exited')
                    self._main_thd_started.set()
                    return
                self._tt = match_result.group('tt')
                break
            print('Got tt = "%s"' % self._tt)

            # starting parallel download thread here
            for _ in range(self._num_threads):
                event_to_wait = threading.Event()
                self._abort_wait.append(event_to_wait)
                worker = threading.Thread(target=self._worker_thd_cb, args=(event_to_wait,))
                worker.start()
            self._main_thd_started.set()

            headers = {'X-Requested-With': 'XMLHttpRequest',
                       'Referer': 'https://www.pixiv.net/ranking.php?mode=daily'}
            while self._buffer_empty.wait():
                if self._abort_event.is_set():
                    break

                # fetch from cacher
                key = '%s-p%d' % (self._date.strftime('%Y%m%d'), self._page)
                result = self._cache.get(key)
                if not result:
                    with self._buffer_lock:
                        print('Fetching ranking page(json mode), date=%s, page=%d, buffer=%d/%d' %
                              (str(self._date.date()), self._page, len(self._buffer_data), self._max_buffer_size))
                    params = {'mode': 'daily', 'date': self._date.strftime('%Y%m%d'), 'p': self._page,
                              'format': 'json', 'tt': self._tt}
                    suc = False
                    req = None
                    while not suc:
                        if self._abort_event.is_set():
                            return
                        try:
                            req = requests.get('https://www.pixiv.net/ranking.php', params=params, headers=headers,
                                               proxies=proxy, timeout=15)
                            suc = True
                        except Exception as ex:
                            warn(str(ex))
                    rep = req.content.decode('utf8')
                    # terminated state
                    if req.status_code == 404:
                        break
                    # append to cacher
                    self._cache[key] = rep
                    result = rep
                else:
                    result = result.decode('utf8')

                json_data = json.loads(result)
                buffer_data = self._parse_data(json_data)

                # append to buffer
                with self._buffer_lock:
                    self._buffer_data += buffer_data
                    # check buffer size
                    if len(self._buffer_data) >= self._max_buffer_size:
                        self._buffer_empty.clear()

                # next page
                self._page += 1

                if self._page > self._max_page:
                    self._page = 1
                    self._date -= timedelta(days=1)

        finally:
            print('main thd exited')
            self._fetch_finished = True
            for item in self._abort_wait:
                item.wait()

    def _parse_data(self, data):
        ret_data = []
        if data.get('contents', None):
            contents = data['contents']
            for content in contents:
                url = content['url']
                ranking_date = self._date
                ranking_page = self._page
                illust_id = int(content['illust_id'])
                illust_page_count = int(content['illust_page_count'])
                for page in range(illust_page_count):
                    single_illust_url = self._replace_url(url, page)
                    ret_data.append({'date': ranking_date, 'page': ranking_page,
                                     'illust_id': illust_id, 'illust_page': page,
                                     'url': single_illust_url})
        return ret_data

    @staticmethod
    def _replace_url(url, page):
        url_pattern = re.compile(r'(?P<schemas>https?)://(?P<host>([^./]+\.)+[^./]+)(/c/\d+x\d+)?'
                                 r'(?P<path_prefix>/img-master/img(/\d+){6}/\d+_p)\d+'
                                 r'(?P<path_postfix>_(master|square)\d+\.(jpg|png)).*')
        match = re.match(url_pattern, url)
        if match:
            schemas = match.group('schemas')
            host = match.group('host')
            path_prefix = match.group('path_prefix')
            path_postfix = match.group('path_postfix')
            return '%s://%s%s%d%s' % (schemas, host, path_prefix, page, path_postfix)
        url_pattern = re.compile(r'(?P<schemas>https?)://(?P<host>([^./]+\.)+[^./]+)(/c/\d+x\d+)?'
                                 r'(?P<path_prefix>/img-master/img(/\d+){6}/\d+)'
                                 r'(?P<path_postfix>_(master|square)\d+\.(jpg|png)).*')
        match = re.match(url_pattern, url)
        if match:
            schemas = match.group('schemas')
            host = match.group('host')
            path_prefix = match.group('path_prefix')
            path_postfix = match.group('path_postfix')
            if page != 0:
                warn('A non-pageable image url detected, your page should be 0 constantly, but got %d' % page)
            return '%s://%s%s%s' % (schemas, host, path_prefix, path_postfix)

        raise ValueError('The url "%s" could not match any replacement rules' % url)

    def _worker_thd_cb(self, thd_wait_event):
        try:
            while not self._abort_event.is_set():
                buffer_item = None
                with self._buffer_lock:
                    if len(self._buffer_data) > 0:
                        buffer_item = self._buffer_data[0]
                        self._buffer_data = self._buffer_data[1:]
                    if len(self._buffer_data) < self._max_buffer_size:
                        self._buffer_empty.set()

                # fetch failed, wait more time
                if buffer_item is None:
                    if self._fetch_finished or self._abort_event.wait(0.1):
                        break
                    continue

                # unpacking value
                date = buffer_item['date']
                page = buffer_item['page']
                illust_id = buffer_item['illust_id']
                illust_page = buffer_item['illust_page']
                url = buffer_item['url']

                # download file here
                dst_path = os.path.join(save_path, str(illust_id % 1000), '%dp%d.jpg' % (illust_id, illust_page))
                if not global_file_cache.exist_file(dst_path):
                    print('Downloading [%s #%d] [%d p%d] %s' % (date.strftime('%Y%m%d'), page,
                                                                illust_id, illust_page, url))
                    suc = False
                    while not suc:
                        try:
                            req = requests.get(url, headers={'Referer': 'https://www.pixiv.net/member_illust.php'
                                                                        '?mode=medium&illust_id=%d' % illust_id},
                                               timeout=15)
                            if req.status_code != 200:
                                warn('Error while downloading %d p%d : HTTP %d' %
                                     (illust_id, illust_page, req.status_code))
                                break

                            image = req.content
                            with open(dst_path, 'wb') as f:
                                f.write(image)
                            global_file_cache.append_file(dst_path)

                            suc = True
                        except Exception as ex:
                            print(ex)
        finally:
            thd_wait_event.set()
            print('thd exited')

    def start(self):
        self.abort()
        self._main_thd = threading.Thread(target=self._main_thd_cb)
        self._main_thd.start()

    def abort(self):
        self._abort_event.set()
        self.wait()

    def wait(self):
        if self._main_thd:
            self._main_thd_started.wait()
            for item in self._abort_wait:
                item.wait()


class DatabaseGenerator:
    # flags for illust_content_type
    ILLUST_CONTENT_TYPE_SEXUAL = 1
    ILLUST_CONTENT_TYPE_LO = 2
    ILLUST_CONTENT_TYPE_GROTESQUE = 4
    ILLUST_CONTENT_TYPE_VIOLENT = 8
    ILLUST_CONTENT_TYPE_HOMOSEXUAL = 16
    ILLUST_CONTENT_TYPE_DRUG = 32
    ILLUST_CONTENT_TYPE_THOUGHTS = 64
    ILLUST_CONTENT_TYPE_ANTISOCIAL = 128
    ILLUST_CONTENT_TYPE_RELIGION = 256
    ILLUST_CONTENT_TYPE_ORIGINAL = 512
    ILLUST_CONTENT_TYPE_FURRY = 1024
    ILLUST_CONTENT_TYPE_BL = 2048
    ILLUST_CONTENT_TYPE_YURI = 4096

    def __init__(self, path_to_save=None, cacher_path=None, max_page=2):
        self._cacher = Cacher(cacher_path if cacher_path else cache_path)
        if not path_to_save:
            path_to_save = db_path
        with open(path_to_save, 'w'):
            pass
        self._conn = sqlite3.connect(path_to_save)
        self._cursor = self._conn.cursor()
        self._max_page = max_page

        self._initialize()
        self._user_id_set = set()
        self._tag_id_dict = dict()
        self._rank_set = set()
        self._illust_id_set = set()
        self._illust_series_id_set = set()

    def _initialize(self):
        # initialize tables
        csr = self._cursor
        csr.execute("create table user (user_id bigint primary key, user_name varchar(255) not null,"
                    "profile_img varchar(255) not null)")
        csr.execute("create table illust_series (illust_series_id integer primary key, "
                    "illust_series_user_id bigint not null, illust_series_title varchar(255) not null,"
                    "illust_series_caption text(16383), illust_series_content_count integer not null,"
                    "illust_series_create_datetime datetime not null, page_url varchar(255) not null,"
                    "foreign key (illust_series_user_id) references user(user_id))")
        csr.execute("create table illust (title varchar(255), date datetime, url varchar(255), illust_type integer,"
                    "illust_book_style integer, illust_page_count integer, illust_content_type integer not null, "
                    "illust_series_id integer, illust_id bigint primary key, width integer not null, "
                    "height integer not null, user_id bigint not null, rating_count integer not null, "
                    "view_count integer not null, illust_upload_timestamp datetime not null, attr varchar(255),"
                    "foreign key (user_id) references user, foreign key (illust_series_id) references illust_series)")
        csr.execute("create table tag (tag_id integer primary key autoincrement, name varchar(255) not null unique)")
        csr.execute("create table illust_tags (illust_id bigint not null, tag_id integer not null,"
                    "foreign key (illust_id) references illust, foreign key (tag_id) references tag)")
        csr.execute("create table illust_rank (illust_id bigint not null, date datetime not null, "
                    "rank integer not null, yes_rank integer not null, foreign key (illust_id) references illust)")
        # indices to accelerate date-based query
        csr.execute("create index illust_date on illust(date)")
        csr.execute("create index illust_rank_date on illust_rank(date, rank)")
        self._conn.commit()

    def start(self):
        cur_date = datetime.now().date() - timedelta(days=2)
        cur_page = 1

        key = '%s-p%d' % (cur_date.strftime('%Y%m%d'), cur_page)
        data = self._cacher.get(key)
        while data:
            data = data.decode("utf8")
            # print('Parsing %s' % key)
            json_data = json.loads(data)
            try:
                contents = json_data['contents']
            except KeyError:
                break

            for item in contents:
                self._parse(item, cur_date)

            # next
            cur_page += 1
            if cur_page > self._max_page:
                cur_date -= timedelta(days=1)
                cur_page = 1
            key = '%s-p%d' % (cur_date.strftime('%Y%m%d'), cur_page)
            data = self._cacher.get(key)

        self._conn.commit()

    def _parse(self, json_obj, ranking_date):
        title = json_obj['title']
        date = json_obj['date']
        tags = json_obj['tags']
        url = json_obj['url']
        illust_type = json_obj['illust_type']
        illust_book_style = json_obj['illust_book_style']
        illust_page_count = json_obj['illust_page_count']
        user_name = json_obj['user_name']
        profile_img = json_obj['profile_img']
        illust_content_type = json_obj['illust_content_type']
        illust_series = json_obj['illust_series']
        illust_id = json_obj['illust_id']
        width = json_obj['width']
        height = json_obj['height']
        user_id = json_obj['user_id']
        rank = json_obj['rank']
        # hint: yes_rank is not YES! rank!, it's just the rank of yesterday, don't be treated XD
        yes_rank = json_obj['yes_rank']
        rating_count = json_obj['rating_count']
        view_count = json_obj['view_count']
        illust_upload_timestamp = json_obj['illust_upload_timestamp']
        attr = json_obj['attr']
        # converting illust_content_type
        flag_illust_content_type = 0
        if illust_content_type['sexual'] != 0:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_SEXUAL
        if illust_content_type['lo']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_LO
        if illust_content_type['grotesque']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_GROTESQUE
        if illust_content_type['violent']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_VIOLENT
        if illust_content_type['homosexual']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_HOMOSEXUAL
        if illust_content_type['drug']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_DRUG
        if illust_content_type['thoughts']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_THOUGHTS
        if illust_content_type['antisocial']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_ANTISOCIAL
        if illust_content_type['religion']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_RELIGION
        if illust_content_type['original']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_ORIGINAL
        if illust_content_type['furry']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_FURRY
        if illust_content_type['bl']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_BL
        if illust_content_type['yuri']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_YURI
        # querying user data
        csr = self._cursor
        if not self._user_id_set.issuperset({user_id}):
            csr.execute("insert into user(user_id, user_name, profile_img) values (?, ?, ?)",
                        (user_id, user_name, profile_img))
            self._user_id_set.add(user_id)
        # handling illust_series
        if type(illust_series) != bool:
            illust_series_id = illust_series['illust_series_id']
            illust_series_user_id = illust_series['illust_series_user_id']
            illust_series_title = illust_series['illust_series_title']
            illust_series_caption = illust_series['illust_series_caption']
            illust_series_content_count = illust_series['illust_series_content_count']
            illust_series_create_datetime = illust_series['illust_series_create_datetime']
            page_url = illust_series['page_url']
            if not self._illust_series_id_set.issuperset({illust_series_id}):
                csr.execute("insert into illust_series(illust_series_id, illust_series_user_id, "
                            "illust_series_title, illust_series_caption, illust_series_content_count, "
                            "illust_series_create_datetime, page_url) values (?, ?, ?, ?, ?, ?, ?)",
                            (illust_series_id, illust_series_user_id, illust_series_title, illust_series_caption,
                             illust_series_content_count, illust_series_create_datetime, page_url))
                self._illust_series_id_set.add(illust_series_id)
            illust_series = illust_series_id
        else:
            illust_series = None
        # tags
        for tag in tags:
            if self._tag_id_dict.get(tag, None):
                tag_id = self._tag_id_dict[tag]
            else:
                csr.execute("insert into tag(name) values (?)", (tag,))
                tag_id = len(self._tag_id_dict) + 1
                self._tag_id_dict[tag] = tag_id
            csr.execute("insert into illust_tags(illust_id, tag_id) values (?, ?)", (illust_id, tag_id))
        # converting date
        reg_ptn = re.compile('(\\d+)年(\\d+)月(\\d+)日\\s(\\d+):(\\d+)')
        match = re.match(reg_ptn, date)
        if match:
            date_year, date_month, date_day, date_hour, date_minute = (int(match.group(x)) for x in range(1, 6))
            date = datetime(date_year, date_month, date_day, date_hour, date_minute)
        illust_upload_timestamp = datetime.fromtimestamp(illust_upload_timestamp)
        if not self._illust_id_set.issuperset({illust_id}):
            csr.execute("insert into illust(title, date, url, illust_type, illust_book_style, illust_page_count, "
                        "illust_content_type, illust_series_id, illust_id, width, height, user_id, rating_count, "
                        "view_count, illust_upload_timestamp, attr) "
                        "values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
                        (title, date, url, illust_type, illust_book_style, illust_page_count, flag_illust_content_type,
                         illust_series, illust_id, width, height, user_id, rating_count, view_count,
                         illust_upload_timestamp, attr))
            self._illust_id_set.add(illust_id)

        if not self._rank_set.issuperset((illust_id, ranking_date, rank, yes_rank)):
            csr.execute("insert into illust_rank(illust_id, date, rank, yes_rank) values (?, ?, ?, ?)",
                        (illust_id, ranking_date, rank, yes_rank))
            self._rank_set.add((illust_id, ranking_date, rank, yes_rank))


if __name__ == '__main__':
    global_file_cache.load(os.path.join(cache_path, 'index'))
    print('Crawler starting')
    a = Crawler()
    a.start()
    a.wait()
    global_file_cache.save(os.path.join(cache_path, 'index'))
    print('Database generator starting')
    a = DatabaseGenerator()
    a.start()
    global_file_cache.save(os.path.join(cache_path, 'index'))

Afterword#

I labelled some of the images this crawler collected and trained a Faster R-CNN anime face detector on them. Combined with the OpenCV-based method mentioned in the anime face detection post, it can filter the frankly dreadful detections of OpenCV's CascadeClassifier down to results that are almost 100% correct, and those nearly error-free results can then be used to experiment with all kinds of GANs.

CascadeClassifier's advantage is that the faces it detects are fairly uniform; of course that is also its weakness, and because of it well over half of the results get filtered away (painful).

A later post will cover processing the outputs of Faster R-CNN and CascadeClassifier: computing the IoU between the two sets of boxes, matching boxes by IoU, and then cropping and rescaling, along with the code.
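The heart of that matching step is nothing more than an IoU computation between two boxes. A minimal sketch (the (x1, y1, x2, y2) box format and the 0.5 threshold here are assumptions for illustration only):

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)


def filter_cascade_boxes(cascade_boxes, rcnn_boxes, threshold=0.5):
    """Keep a CascadeClassifier box only if some Faster R-CNN box agrees with it."""
    return [box for box in cascade_boxes
            if any(iou(box, rbox) >= threshold for rbox in rcnn_boxes)]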

Extracting and Playing Azur Lane's Live2D Assets

Azur Lane Live2D extraction and playback#

This post is purely the product of having too much free time (the odd one out on this blog). It leans on prefare's article; I simply followed that tutorial myself, and it worked remarkably well.

0x00 Required tools#

Extracting Unity assets

Live2D playback

I went with the latter, since it lets me meddle a little.

0x01 Extracting the Unity assets#

First, download the Azur Lane APK from the Bilibili site and install it on a phone or emulator. Log in however you like, as a guest or with your own Bilibili account, go to the main screen, and open Settings -> Resources -> Live2D resource update in the top-right corner to download all the Live2D assets. Once the download finishes, you can quit the game.

Grab any file manager, find Android/data/com.bilibili.azurlane/files/AssetBundles/live2d on the SD card, copy out everything inside, and save it to your PC, as shown below.

Run UABE's AssetBundleExtractor and open the files above one by one via File->Open (the first one is aidang_2, for example). It will ask whether to decompress; choose yes, then enter any file name to save; overwriting the original file is also fine. If that sounds tedious, there is a one-click Python script later in this post; just run that instead.

Then simply drag the decompressed files onto the exe prefare wrote. (For a while my brain short-circuited and I thought the exe had to be opened first before dragging; only after reading the code did I realise you just drag the file onto it in File Explorer.)

Once that step finishes, a live2d folder is produced, with one subfolder per Live2D model. Open aidang_2, for example, and you will find aidang_2.moc3, aidang_2.model3.json and aidang_2.physics3.json, plus two folders, motions and textures.

These files are exactly what the next step needs.

0x02 Viewing the Live2D models#

The simplest, most direct way: download the Live2D Cubism Viewer from the official Cubism site, open it, and drag the moc3 file or the model3.json file onto the window to view the model straight away. Double-click a file under motions on the left to play the corresponding animation.

And that's it.

In motion it looks like this:

If you have a bolder idea (say, embedding that animation window into the UI of a program of your own), you will have to get your hands dirty.

0x03 Hacking the native demo#

Goal: a playback window that does not depend on Unity and that I can control myself. Really I just want to use it as a plugin in a little desktop-background program I wrote.

Step 1, download CubismCore: on https://live2d.github.io/#native, click Download Cubism 3 SDK for Native beta.

Step 2, download GLEW: from http://glew.sourceforge.net, download "Binaries Windows 32-bit and 64-bit".

Step 3, download GLFW: some features need the latest version (3.3.0), but only 3.2.1 is available prebuilt, so you have to build it yourself.

  1. GitHub copy-and-paste: run git clone https://github.com/glfw/glfw
  2. A textbook CMake build: I usually set the build folder to .../glfw/build and CMAKE_INSTALL_PREFIX to .../glfw/build/install, open VS, build the INSTALL target, done.

Step 4, download CubismNativeSamples:

  1. git clone --recursive https://github.com/Live2D/CubismNativeSamples
  2. Run a textbook CMake configure on the Cubism Native Framework
  3. ↑ which blows up, because the two lines include_directories("${FRAMEWORK_GLFW_PATH}") and include_directories("${FRAMEWORK_GLEW_PATH}") in the CMake script cannot find those paths

Step 5, build it as a project of my own and let CMake go hang:

  1. After a look at the code it seemed manageable: open VS, create an empty C++ project, name it CubismBuild, and compile the thing yourself.
  2. Copy the src folder from Framework into the project folder (the one with the .vcxproj). Right-click the CubismBuild project, Add -> Existing Item, and add every source file under src except those under Rendering; from Rendering, only add the source files under OpenGL plus the two CubismRenderer files.
  3. Add #define CSM_TARGET_WIN_GL as the first line of CubismRenderer_OpenGLES2.hpp, manually defining the macro that CMake would have set.
  4. Unpack CubismCore and GLEW.
  5. Copy all the files under ...\CubismNativeSamples\Samples\OpenGL\Demo\proj.win.cmake\Demo into the project folder as well, and add them to the project.
  6. Copy ...\CubismNativeSamples\Samples\OpenGL\thirdParty\stb\include\stb_image.h over and add it to the project too.
  7. Adjust the build settings: right-click the project, open Properties, go to VC++ Directories, and add the paths of the copied src, the unpacked Core and GLEW, and the GLFW you built:
    Include Directories

    Reference Directories / Library Directories
  8. Then under Linker -> Input -> Additional Dependencies, add Live2DCubismCore_MDd.lib, opengl32.lib, glu32.lib, glew32.lib and glfw3.lib.
  9. Just build. When running, copy the GLEW DLL next to the exe, or you will get a missing-DLL complaint.
  10. Change ResourcesPath in LAppDefine.cpp to the absolute path of CubismNativeSamples\Samples\Res and you're done (separate the path with /, not \, and keep the trailing /).

It's time to start hacking.

With the source in hand, stepping through it once in the debugger is basically enough to see which code does what.

The power and gear buttons are useless; remove them.
Just edit LAppView.

Make the window background transparent and hide the title bar.
Just edit LAppDelegate. This is also why GLFW 3.3.0 is required: 3.2.1 has no API for changing the window background.

To get the window handle,
add this to LAppDelegate.cpp:

#define GLFW_EXPOSE_NATIVE_WIN32
#include <GLFW/glfw3native.h>

Calling glfwGetWin32Window(GLFWwindow *) then returns an hwnd, and a WinAPI call to make one of your own windows its parent lets you do whatever you want with it.

Setting the window size:
just edit LAppDefine.cpp.

Changing how the model is controlled:
just edit LAppModel.

Add a bit of command-line parsing on top of main, and that's roughly it. (The second line of output is the GLFW window's hwnd, which can be used later to embed the window.)
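As a sketch of that embedding step: assuming the hacked player really does print the hwnd (as a decimal number) on its second output line as described, another process can launch it and reparent its window with the Win32 SetParent call. The -d flag is the one used in section 0x04 below.

import ctypes
import subprocess

def embed_player(model_dir, parent_hwnd):
    """Launch the modified player and reparent its window under parent_hwnd."""
    proc = subprocess.Popen([r'.\player\CubismBuild.exe', '-d', model_dir],
                            stdout=subprocess.PIPE, universal_newlines=True)
    proc.stdout.readline()                      # first output line: ignore
    hwnd = int(proc.stdout.readline().strip())  # second output line: the GLFW window's hwnd
    ctypes.windll.user32.SetParent(hwnd, parent_hwnd)  # Win32 SetParent
    return proc, hwnd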

0x04 Assembly-line it, free your hands#

Sounds like a hassle, right? Lots of things to download? In fact all that is really needed are UABE and prefare's AzurLaneLive2DExtract, and a single Python script can drive the whole thing.

Download the script (it bundles UABE, AzurLaneLive2DExtract and the hacked native viewer).

What you have to do:

Copy the Unity assets from your phone/emulator to your PC yourself.
Open process.py and change the path at the top to that folder's path.

Run: python process.py
It automatically decompresses the Unity assets and extracts the Live2D files.

Viewing the Live2D models:

  1. Use the Viewer that ships with Cubism 3
  2. Or, on the command line, run .\player\CubismBuild.exe -d <folder containing the model>
    More parameters (only a few were changed, really; run .\player\CubismBuild.exe to list them)
    (the VC++ 2017 runtime may be required)

0xff References#

The hacked-up code is too embarrassing even for me to look at, so it will not be open-sourced (hides face).

Finally: moderate gaming stimulates the brain; gaming addiction harms the body (physically).