Running StyleGAN on an Anime Face Dataset

Preface#

The previous two posts covered crawling images from Pixiv and preprocessing them. This post is really just me grumbling a little about the actual training run.

The GAN in question is StyleGAN, from a fairly recent paper: A Style-Based Generator Architecture for Generative Adversarial Networks (arXiv 1812.04948).

NVlabs has open-sourced the official implementation; just clone it, follow the instructions, and it runs.

Hardware requirements#

If you only train at 128x128, the requirements are nowhere near as extreme as the official figures. My rough estimate of the minimum configuration:

RAM: at least 32 GB
GPU: at least 8 GB of VRAM; the more FLOPS the better
CPU: anything reasonable, as long as it doesn't hold the GPU back

Estimated minimum time to finish the scheduled 25,000 kimg (25 million images) on a 1080 Ti: 12 days
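That figure roughly checks out against the sec/kimg numbers in the training log further down. A back-of-the-envelope sketch (the values come from my own run, so treat it as an estimate, not a schedule):

# Rough sanity check of the "12 days" estimate, using sec/kimg values
# observed in the training log below (approximate numbers).
total_kimg = 25000          # scheduled training length: 25000 kimg
done_kimg = 5306            # how far my run got in roughly 32 hours
sec_per_kimg = 54           # sec/kimg once lod reaches 0 (full 128x128 resolution)

remaining_sec = (total_kimg - done_kimg) * sec_per_kimg
print('remaining: %.1f days' % (remaining_sec / 86400.0))           # ~12.3 days
print('total:     %.1f days' % ((remaining_sec + 32 * 3600) / 86400.0))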

Naturally, I didn't finish the run. Setting aside the few dozen yuan of electricity, 16 GB of RAM simply couldn't cope: towards the end the machine was basically swapping nonstop, and letting it keep going would have been miserable. So after saving the weights, I hit Ctrl+C without hesitation.

Results#

Most of what the generator produces already looks great (better than I can draw, at least, and better than some traditional GANs I've trained). See the figure above, which is only a crop of the full grid.

The full-size image:

I will definitely read the paper at some point, but I can't say when. (Procrastinating...)

Training output

dnnlib: Running training.training_loop.training_loop() on localhost...
Streaming data using training.dataset.TFRecordDataset...
Dataset shape = [3, 128, 128]
Dynamic range = [0, 255]
Label size = 0
Constructing networks...

G Params OutputShape WeightShape
--- --- --- ---
latents_in - (?, 512) -
labels_in - (?, 0) -
lod - () -
dlatent_avg - (512,) -
G_mapping/latents_in - (?, 512) -
G_mapping/labels_in - (?, 0) -
G_mapping/PixelNorm - (?, 512) -
G_mapping/Dense0 262656 (?, 512) (512, 512)
G_mapping/Dense1 262656 (?, 512) (512, 512)
G_mapping/Dense2 262656 (?, 512) (512, 512)
G_mapping/Dense3 262656 (?, 512) (512, 512)
G_mapping/Dense4 262656 (?, 512) (512, 512)
G_mapping/Dense5 262656 (?, 512) (512, 512)
G_mapping/Dense6 262656 (?, 512) (512, 512)
G_mapping/Dense7 262656 (?, 512) (512, 512)
G_mapping/Broadcast - (?, 12, 512) -
G_mapping/dlatents_out - (?, 12, 512) -
Truncation - (?, 12, 512) -
G_synthesis/dlatents_in - (?, 12, 512) -
G_synthesis/4x4/Const 534528 (?, 512, 4, 4) (512,)
G_synthesis/4x4/Conv 2885632 (?, 512, 4, 4) (3, 3, 512, 512)
G_synthesis/ToRGB_lod5 1539 (?, 3, 4, 4) (1, 1, 512, 3)
G_synthesis/8x8/Conv0_up 2885632 (?, 512, 8, 8) (3, 3, 512, 512)
G_synthesis/8x8/Conv1 2885632 (?, 512, 8, 8) (3, 3, 512, 512)
G_synthesis/ToRGB_lod4 1539 (?, 3, 8, 8) (1, 1, 512, 3)
G_synthesis/Upscale2D - (?, 3, 8, 8) -
G_synthesis/Grow_lod4 - (?, 3, 8, 8) -
G_synthesis/16x16/Conv0_up 2885632 (?, 512, 16, 16) (3, 3, 512, 512)
G_synthesis/16x16/Conv1 2885632 (?, 512, 16, 16) (3, 3, 512, 512)
G_synthesis/ToRGB_lod3 1539 (?, 3, 16, 16) (1, 1, 512, 3)
G_synthesis/Upscale2D_1 - (?, 3, 16, 16) -
G_synthesis/Grow_lod3 - (?, 3, 16, 16) -
G_synthesis/32x32/Conv0_up 2885632 (?, 512, 32, 32) (3, 3, 512, 512)
G_synthesis/32x32/Conv1 2885632 (?, 512, 32, 32) (3, 3, 512, 512)
G_synthesis/ToRGB_lod2 1539 (?, 3, 32, 32) (1, 1, 512, 3)
G_synthesis/Upscale2D_2 - (?, 3, 32, 32) -
G_synthesis/Grow_lod2 - (?, 3, 32, 32) -
G_synthesis/64x64/Conv0_up 1442816 (?, 256, 64, 64) (3, 3, 512, 256)
G_synthesis/64x64/Conv1 852992 (?, 256, 64, 64) (3, 3, 256, 256)
G_synthesis/ToRGB_lod1 771 (?, 3, 64, 64) (1, 1, 256, 3)
G_synthesis/Upscale2D_3 - (?, 3, 64, 64) -
G_synthesis/Grow_lod1 - (?, 3, 64, 64) -
G_synthesis/128x128/Conv0_up 426496 (?, 128, 128, 128) (3, 3, 256, 128)
G_synthesis/128x128/Conv1 279040 (?, 128, 128, 128) (3, 3, 128, 128)
G_synthesis/ToRGB_lod0 387 (?, 3, 128, 128) (1, 1, 128, 3)
G_synthesis/Upscale2D_4 - (?, 3, 128, 128) -
G_synthesis/Grow_lod0 - (?, 3, 128, 128) -
G_synthesis/images_out - (?, 3, 128, 128) -
G_synthesis/lod - () -
G_synthesis/noise0 - (1, 1, 4, 4) -
G_synthesis/noise1 - (1, 1, 4, 4) -
G_synthesis/noise2 - (1, 1, 8, 8) -
G_synthesis/noise3 - (1, 1, 8, 8) -
G_synthesis/noise4 - (1, 1, 16, 16) -
G_synthesis/noise5 - (1, 1, 16, 16) -
G_synthesis/noise6 - (1, 1, 32, 32) -
G_synthesis/noise7 - (1, 1, 32, 32) -
G_synthesis/noise8 - (1, 1, 64, 64) -
G_synthesis/noise9 - (1, 1, 64, 64) -
G_synthesis/noise10 - (1, 1, 128, 128) -
G_synthesis/noise11 - (1, 1, 128, 128) -
images_out - (?, 3, 128, 128) -
--- --- --- ---
Total 25843858


D Params OutputShape WeightShape
--- --- --- ---
images_in - (?, 3, 128, 128) -
labels_in - (?, 0) -
lod - () -
FromRGB_lod0 512 (?, 128, 128, 128) (1, 1, 3, 128)
128x128/Conv0 147584 (?, 128, 128, 128) (3, 3, 128, 128)
128x128/Conv1_down 295168 (?, 256, 64, 64) (3, 3, 128, 256)
Downscale2D - (?, 3, 64, 64) -
FromRGB_lod1 1024 (?, 256, 64, 64) (1, 1, 3, 256)
Grow_lod0 - (?, 256, 64, 64) -
64x64/Conv0 590080 (?, 256, 64, 64) (3, 3, 256, 256)
64x64/Conv1_down 1180160 (?, 512, 32, 32) (3, 3, 256, 512)
Downscale2D_1 - (?, 3, 32, 32) -
FromRGB_lod2 2048 (?, 512, 32, 32) (1, 1, 3, 512)
Grow_lod1 - (?, 512, 32, 32) -
32x32/Conv0 2359808 (?, 512, 32, 32) (3, 3, 512, 512)
32x32/Conv1_down 2359808 (?, 512, 16, 16) (3, 3, 512, 512)
Downscale2D_2 - (?, 3, 16, 16) -
FromRGB_lod3 2048 (?, 512, 16, 16) (1, 1, 3, 512)
Grow_lod2 - (?, 512, 16, 16) -
16x16/Conv0 2359808 (?, 512, 16, 16) (3, 3, 512, 512)
16x16/Conv1_down 2359808 (?, 512, 8, 8) (3, 3, 512, 512)
Downscale2D_3 - (?, 3, 8, 8) -
FromRGB_lod4 2048 (?, 512, 8, 8) (1, 1, 3, 512)
Grow_lod3 - (?, 512, 8, 8) -
8x8/Conv0 2359808 (?, 512, 8, 8) (3, 3, 512, 512)
8x8/Conv1_down 2359808 (?, 512, 4, 4) (3, 3, 512, 512)
Downscale2D_4 - (?, 3, 4, 4) -
FromRGB_lod5 2048 (?, 512, 4, 4) (1, 1, 3, 512)
Grow_lod4 - (?, 512, 4, 4) -
4x4/MinibatchStddev - (?, 513, 4, 4) -
4x4/Conv 2364416 (?, 512, 4, 4) (3, 3, 513, 512)
4x4/Dense0 4194816 (?, 512) (8192, 512)
4x4/Dense1 513 (?, 1) (512, 1)
scores_out - (?, 1) -
--- --- --- ---
Total 22941313

Building TensorFlow graph...
Setting up snapshot image grid...
Setting up run dir...
Training...

tick 1 kimg 140.3 lod 4.00 minibatch 128 time 3m 05s sec/tick 143.2 sec/kimg 1.02 maintenance 41.7 gpumem 3.0
network-snapshot-000140 time 3m 35s fid50k 374.5350
tick 2 kimg 280.6 lod 4.00 minibatch 128 time 9m 31s sec/tick 163.3 sec/kimg 1.16 maintenance 222.9 gpumem 3.4
tick 3 kimg 420.9 lod 4.00 minibatch 128 time 12m 08s sec/tick 156.9 sec/kimg 1.12 maintenance 0.4 gpumem 3.4
tick 4 kimg 561.2 lod 4.00 minibatch 128 time 14m 47s sec/tick 157.8 sec/kimg 1.12 maintenance 0.4 gpumem 3.4
tick 5 kimg 681.5 lod 3.87 minibatch 128 time 20m 12s sec/tick 324.6 sec/kimg 2.70 maintenance 0.4 gpumem 4.5
tick 6 kimg 801.8 lod 3.66 minibatch 128 time 26m 49s sec/tick 396.3 sec/kimg 3.29 maintenance 0.6 gpumem 4.5
tick 7 kimg 922.1 lod 3.46 minibatch 128 time 33m 20s sec/tick 391.1 sec/kimg 3.25 maintenance 0.6 gpumem 4.5
tick 8 kimg 1042.4 lod 3.26 minibatch 128 time 40m 00s sec/tick 399.4 sec/kimg 3.32 maintenance 0.5 gpumem 4.5
tick 9 kimg 1162.8 lod 3.06 minibatch 128 time 46m 36s sec/tick 394.9 sec/kimg 3.28 maintenance 0.6 gpumem 4.5
tick 10 kimg 1283.1 lod 3.00 minibatch 128 time 53m 11s sec/tick 395.0 sec/kimg 3.28 maintenance 0.6 gpumem 4.5
network-snapshot-001283 time 3m 50s fid50k 302.4119
tick 11 kimg 1403.4 lod 3.00 minibatch 128 time 1h 03m 35s sec/tick 391.7 sec/kimg 3.26 maintenance 231.8 gpumem 4.5
tick 12 kimg 1523.7 lod 3.00 minibatch 128 time 1h 10m 04s sec/tick 388.5 sec/kimg 3.23 maintenance 0.5 gpumem 4.5
tick 13 kimg 1644.0 lod 3.00 minibatch 128 time 1h 16m 32s sec/tick 387.8 sec/kimg 3.22 maintenance 0.6 gpumem 4.5
tick 14 kimg 1764.4 lod 3.00 minibatch 128 time 1h 23m 02s sec/tick 388.6 sec/kimg 3.23 maintenance 0.6 gpumem 4.5
tick 15 kimg 1864.4 lod 2.89 minibatch 64 time 1h 37m 36s sec/tick 873.9 sec/kimg 8.73 maintenance 0.6 gpumem 4.7
tick 16 kimg 1964.5 lod 2.73 minibatch 64 time 1h 56m 27s sec/tick 1130.0 sec/kimg 11.29 maintenance 1.1 gpumem 4.7
tick 17 kimg 2064.6 lod 2.56 minibatch 64 time 2h 15m 18s sec/tick 1129.7 sec/kimg 11.29 maintenance 1.0 gpumem 4.7
tick 18 kimg 2164.7 lod 2.39 minibatch 64 time 2h 34m 11s sec/tick 1132.3 sec/kimg 11.31 maintenance 1.0 gpumem 4.7
tick 19 kimg 2264.8 lod 2.23 minibatch 64 time 2h 53m 04s sec/tick 1132.1 sec/kimg 11.31 maintenance 1.0 gpumem 4.7
tick 20 kimg 2364.9 lod 2.06 minibatch 64 time 3h 11m 57s sec/tick 1132.2 sec/kimg 11.31 maintenance 1.0 gpumem 4.7
network-snapshot-002364 time 4m 33s fid50k 297.8871
tick 21 kimg 2465.0 lod 2.00 minibatch 64 time 3h 35m 39s sec/tick 1146.3 sec/kimg 11.45 maintenance 275.5 gpumem 4.7
tick 22 kimg 2565.1 lod 2.00 minibatch 64 time 3h 54m 34s sec/tick 1133.9 sec/kimg 11.33 maintenance 1.0 gpumem 4.7
tick 23 kimg 2665.2 lod 2.00 minibatch 64 time 4h 13m 31s sec/tick 1135.8 sec/kimg 11.35 maintenance 1.0 gpumem 4.7
tick 24 kimg 2765.3 lod 2.00 minibatch 64 time 4h 32m 02s sec/tick 1109.7 sec/kimg 11.09 maintenance 1.0 gpumem 4.7
tick 25 kimg 2865.4 lod 2.00 minibatch 64 time 4h 50m 52s sec/tick 1129.1 sec/kimg 11.28 maintenance 1.0 gpumem 4.7
tick 26 kimg 2965.5 lod 2.00 minibatch 64 time 5h 09m 32s sec/tick 1118.7 sec/kimg 11.18 maintenance 1.1 gpumem 4.7
tick 27 kimg 3045.5 lod 1.92 minibatch 32 time 5h 38m 21s sec/tick 1728.0 sec/kimg 21.60 maintenance 1.0 gpumem 4.9
tick 28 kimg 3125.5 lod 1.79 minibatch 32 time 6h 16m 31s sec/tick 2288.1 sec/kimg 28.60 maintenance 2.3 gpumem 4.9
tick 29 kimg 3205.5 lod 1.66 minibatch 32 time 6h 54m 45s sec/tick 2291.4 sec/kimg 28.64 maintenance 2.5 gpumem 4.9
tick 30 kimg 3285.5 lod 1.52 minibatch 32 time 7h 33m 34s sec/tick 2326.7 sec/kimg 29.08 maintenance 2.8 gpumem 4.9
network-snapshot-003285 time 6m 02s fid50k 243.0076
tick 31 kimg 3365.5 lod 1.39 minibatch 32 time 8h 17m 27s sec/tick 2266.9 sec/kimg 28.34 maintenance 366.3 gpumem 4.9
tick 32 kimg 3445.5 lod 1.26 minibatch 32 time 8h 55m 02s sec/tick 2251.5 sec/kimg 28.14 maintenance 2.8 gpumem 4.9
tick 33 kimg 3525.5 lod 1.12 minibatch 32 time 9h 32m 35s sec/tick 2250.6 sec/kimg 28.13 maintenance 2.7 gpumem 4.9
tick 34 kimg 3605.5 lod 1.00 minibatch 32 time 10h 10m 04s sec/tick 2246.4 sec/kimg 28.08 maintenance 2.7 gpumem 4.9
tick 35 kimg 3685.5 lod 1.00 minibatch 32 time 10h 46m 35s sec/tick 2188.4 sec/kimg 27.35 maintenance 2.6 gpumem 4.9
tick 36 kimg 3765.5 lod 1.00 minibatch 32 time 11h 23m 07s sec/tick 2189.2 sec/kimg 27.37 maintenance 2.6 gpumem 4.9
tick 37 kimg 3845.5 lod 1.00 minibatch 32 time 11h 59m 38s sec/tick 2188.1 sec/kimg 27.35 maintenance 2.7 gpumem 4.9
tick 38 kimg 3925.5 lod 1.00 minibatch 32 time 12h 36m 08s sec/tick 2188.1 sec/kimg 27.35 maintenance 2.6 gpumem 4.9
tick 39 kimg 4005.5 lod 1.00 minibatch 32 time 13h 12m 40s sec/tick 2189.1 sec/kimg 27.36 maintenance 2.7 gpumem 4.9
tick 40 kimg 4085.5 lod 1.00 minibatch 32 time 13h 49m 12s sec/tick 2188.8 sec/kimg 27.36 maintenance 2.6 gpumem 4.9
network-snapshot-004085 time 5m 49s fid50k 116.0598
tick 41 kimg 4165.5 lod 1.00 minibatch 32 time 14h 31m 32s sec/tick 2188.0 sec/kimg 27.35 maintenance 352.6 gpumem 4.9
tick 42 kimg 4225.5 lod 0.96 minibatch 16 time 15h 11m 38s sec/tick 2403.0 sec/kimg 40.03 maintenance 2.6 gpumem 5.1
tick 43 kimg 4285.6 lod 0.86 minibatch 16 time 16h 09m 48s sec/tick 3480.7 sec/kimg 57.98 maintenance 9.6 gpumem 5.1
tick 44 kimg 4345.6 lod 0.76 minibatch 16 time 17h 06m 03s sec/tick 3368.3 sec/kimg 56.11 maintenance 7.1 gpumem 5.1
tick 45 kimg 4405.6 lod 0.66 minibatch 16 time 18h 02m 18s sec/tick 3366.8 sec/kimg 56.08 maintenance 8.3 gpumem 5.1
tick 46 kimg 4465.7 lod 0.56 minibatch 16 time 18h 57m 51s sec/tick 3323.2 sec/kimg 55.36 maintenance 9.6 gpumem 5.1
tick 47 kimg 4525.7 lod 0.46 minibatch 16 time 19h 53m 22s sec/tick 3323.7 sec/kimg 55.37 maintenance 7.3 gpumem 5.1
tick 48 kimg 4585.7 lod 0.36 minibatch 16 time 20h 50m 32s sec/tick 3424.0 sec/kimg 57.04 maintenance 5.6 gpumem 5.1
tick 49 kimg 4645.8 lod 0.26 minibatch 16 time 21h 47m 58s sec/tick 3436.1 sec/kimg 57.24 maintenance 9.7 gpumem 5.1
tick 50 kimg 4705.8 lod 0.16 minibatch 16 time 22h 45m 04s sec/tick 3418.3 sec/kimg 56.94 maintenance 7.8 gpumem 5.1
network-snapshot-004705 time 8m 55s fid50k 27.1257
tick 51 kimg 4765.8 lod 0.06 minibatch 16 time 23h 51m 22s sec/tick 3426.3 sec/kimg 57.07 maintenance 552.1 gpumem 5.1
tick 52 kimg 4825.9 lod 0.00 minibatch 16 time 1d 00h 46m sec/tick 3292.1 sec/kimg 54.84 maintenance 5.0 gpumem 5.1
tick 53 kimg 4885.9 lod 0.00 minibatch 16 time 1d 01h 40m sec/tick 3234.1 sec/kimg 53.87 maintenance 5.2 gpumem 5.1
tick 54 kimg 4945.9 lod 0.00 minibatch 16 time 1d 02h 35m sec/tick 3291.5 sec/kimg 54.83 maintenance 6.6 gpumem 5.1
tick 55 kimg 5006.0 lod 0.00 minibatch 16 time 1d 03h 30m sec/tick 3280.6 sec/kimg 54.65 maintenance 5.1 gpumem 5.1
tick 56 kimg 5066.0 lod 0.00 minibatch 16 time 1d 04h 24m sec/tick 3272.8 sec/kimg 54.52 maintenance 8.7 gpumem 5.1
tick 57 kimg 5126.0 lod 0.00 minibatch 16 time 1d 05h 18m sec/tick 3234.0 sec/kimg 53.87 maintenance 5.0 gpumem 5.1
tick 58 kimg 5186.0 lod 0.00 minibatch 16 time 1d 06h 12m sec/tick 3250.0 sec/kimg 54.14 maintenance 5.4 gpumem 5.1
tick 59 kimg 5246.1 lod 0.00 minibatch 16 time 1d 07h 07m sec/tick 3245.4 sec/kimg 54.06 maintenance 33.8 gpumem 5.1
tick 60 kimg 5306.1 lod 0.00 minibatch 16 time 1d 08h 01m sec/tick 3232.7 sec/kimg 53.85 maintenance 6.1 gpumem 5.1
network-snapshot-005306 time 9m 33s fid50k 14.5950
tick 61 kimg 5366.1 lod 0.00 minibatch 16 time 1d 09h 06m sec/tick 3283.7 sec/kimg 54.70 maintenance 596.8 gpumem 5.1
tick 62 kimg 5426.2 lod 0.00 minibatch 16 time 1d 10h 00m sec/tick 3244.0 sec/kimg 54.04 maintenance 15.0 gpumem 5.1
tick 63 kimg 5486.2 lod 0.00 minibatch 16 time 1d 10h 54m sec/tick 3234.6 sec/kimg 53.88 maintenance 9.4 gpumem 5.1
tick 64 kimg 5546.2 lod 0.00 minibatch 16 time 1d 11h 48m sec/tick 3236.9 sec/kimg 53.92 maintenance 7.0 gpumem 5.1
tick 65 kimg 5606.3 lod 0.00 minibatch 16 time 1d 12h 44m sec/tick 3316.2 sec/kimg 55.24 maintenance 9.7 gpumem 5.1
tick 66 kimg 5666.3 lod 0.00 minibatch 16 time 1d 13h 38m sec/tick 3233.3 sec/kimg 53.86 maintenance 5.7 gpumem 5.1
tick 67 kimg 5726.3 lod 0.00 minibatch 16 time 1d 14h 32m sec/tick 3232.1 sec/kimg 53.84 maintenance 5.5 gpumem 5.1
tick 68 kimg 5786.4 lod 0.00 minibatch 16 time 1d 15h 26m sec/tick 3247.6 sec/kimg 54.10 maintenance 5.6 gpumem 5.1
tick 69 kimg 5846.4 lod 0.00 minibatch 16 time 1d 16h 20m sec/tick 3250.5 sec/kimg 54.15 maintenance 16.9 gpumem 5.1
tick 70 kimg 5906.4 lod 0.00 minibatch 16 time 1d 17h 14m sec/tick 3232.9 sec/kimg 53.85 maintenance 6.6 gpumem 5.1

Pixiv Daily Ranking Crawler

How the Pixiv daily ranking crawler works, and its implementation#

How it works#

It boils down to two words: packet capture.

First, open Fiddler and browse the Pixiv daily ranking as usual, as shown below:

At that point you could simply throw BeautifulSoup at it and parse the HTML. But I later noticed a captured request like the one below, sent when the ranking page loads items 51-100.

It is plain JSON, so no HTML parsing is needed at all. Open contents and everything is laid out: ID, page count, title, image URL, author, tags, and a pile of other useful fields are already there. Using this endpoint directly is both more efficient and more complete than parsing the HTML.

The request parameters are simple too: mode=daily stays as it is; p=2 looks like pagination (1 = items 1-50, 2 = items 51-100, and so on); format=json is also fixed. The only one that needs figuring out is tt=382e...de.
Searching the capture history for 382e...de highlights the earlier daily-ranking page, which means this string can be obtained straight from the page itself.

Searching the page source, it is easy to spot a line reading pixiv.context.token = "382e...de";

That makes things easy: one regex pulls it out. The pattern looks roughly like pixiv\.context\.token\s*=\s*"(\w+)";, and after matching, group(1) gives you the token.

Verifying the p parameter confirms it behaves as expected, and it also turns out that changing the day only requires changing the date parameter.
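Putting those pieces together, a minimal sketch of the whole exchange might look like this (it leaves out the User-Agent, proxy and retry handling that the full crawler at the end of this post uses):

import re
import requests

RANKING_URL = 'https://www.pixiv.net/ranking.php'

# 1. Fetch the ranking page and pull the tt token out with the regex above.
html = requests.get(RANKING_URL, params={'mode': 'daily'}, timeout=15).text
tt = re.search(r'pixiv\.context\.token\s*=\s*"(\w+)";', html).group(1)

# 2. Ask the same endpoint for JSON: p selects the block (1 = ranks 1-50, 2 = 51-100),
#    date selects the day, and the request has to look like an AJAX call.
params = {'mode': 'daily', 'date': '20190117', 'p': 2, 'format': 'json', 'tt': tt}
headers = {'X-Requested-With': 'XMLHttpRequest', 'Referer': RANKING_URL + '?mode=daily'}
data = requests.get(RANKING_URL, params=params, headers=headers, timeout=15).json()

for item in data['contents']:
    print(item['rank'], item['illust_id'], item['title'], item['url'])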

Next, the images. Take the URL from the figure above and see what the captured request looks like.

Yep, that's the one. Note that the Referer header on the left must not be left out, otherwise you get this:

A thoroughly hostile 403~

So a thumbnail URL taken from the JSON data looks like this, as shown in the figure above:
https://i.pximg.net/c/240x480/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg
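Downloading it therefore only needs a Referer pointing back at a Pixiv page. A minimal sketch (the Referer format mirrors the one used in the full crawler below):

import requests

url = 'https://i.pximg.net/c/240x480/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg'
illust_id = 72712034

# Without a Referer header i.pximg.net answers 403; with one, the image comes through.
headers = {'Referer': 'https://www.pixiv.net/member_illust.php?mode=medium&illust_id=%d' % illust_id}
resp = requests.get(url, headers=headers, timeout=15)
print(resp.status_code)  # expect 200

with open('72712034_p0.jpg', 'wb') as f:
    f.write(resp.content)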

It feels a bit small, sadly. Clicking through gives a clearer view (the earlier screenshot was swapped out, because clicking in popped up a "you are viewing sensitive content" notice, erm).

At that point the URL becomes:
https://i.pximg.net/c/600x600/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg

See the pattern? Replace the ? values in /c/?x?/img-master/... with larger numbers and you get a sharper image~

Of course, if you register an account and log in, the image gets larger still:

The URL is then:
https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg

Click through to the original image and the URL turns into the one below (screenshots of the process omitted):
https://i.pximg.net/img-original/img/2019/01/17/23/28/48/72712034_p0.png

To sum up, the URL format is more or less figured out; exactly which sizes /c/ accepts is left for you to discover.
Thumbnail (experiment with the ? values yourself; 240x480 and 600x600 already appear above):
https://i.pximg.net/c/?x?/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg
Large image (over 1000 px):
https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg
Original:
https://i.pximg.net/img-original/img/2019/01/17/23/28/48/72712034_p0.png
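Going from the large-image URL to a thumbnail of a given size is then a one-line substitution. A small sketch (which sizes the server actually accepts is, as noted above, up to you to experiment with):

import re

def thumbnail_url(master_url, width, height):
    """Insert /c/<w>x<h>/ in front of /img-master/ to request a resized thumbnail."""
    return re.sub(r'^(https://i\.pximg\.net)(/img-master/)',
                  r'\1/c/%dx%d\2' % (width, height), master_url)

print(thumbnail_url(
    'https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg', 600, 600))
# https://i.pximg.net/c/600x600/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg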

Rewriting the image URL from the JSON is likewise just one regex; the code is below (it is also part of the full program at the end).
It takes two parameters. One is url, the URL obtained from the JSON above, e.g. https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg;
the other is page, which selects which image of a multi-image post to fetch; the first image is 0.

import re
from warnings import warn


def replace_url(url, page):
    # Pattern for pageable posts: .../img/<date>/<id>_p<n>_master1200.jpg (optionally behind /c/<w>x<h>/)
    url_pattern = re.compile(r'(?P<schemas>https?)://(?P<host>([^./]+\.)+[^./]+)(/c/\d+x\d+)?'
                             r'(?P<path_prefix>/img-master/img(/\d+){6}/\d+_p)\d+'
                             r'(?P<path_postfix>_(master|square)\d+\.(jpg|png)).*')
    match = re.match(url_pattern, url)
    if match:
        schemas = match.group('schemas')
        host = match.group('host')
        path_prefix = match.group('path_prefix')
        path_postfix = match.group('path_postfix')
        return '%s://%s%s%d%s' % (schemas, host, path_prefix, page, path_postfix)
    # Pattern for posts whose URL carries no _p<n> page marker at all
    url_pattern = re.compile(r'(?P<schemas>https?)://(?P<host>([^./]+\.)+[^./]+)(/c/\d+x\d+)?'
                             r'(?P<path_prefix>/img-master/img(/\d+){6}/\d+)'
                             r'(?P<path_postfix>_(master|square)\d+\.(jpg|png)).*')
    match = re.match(url_pattern, url)
    if match:
        schemas = match.group('schemas')
        host = match.group('host')
        path_prefix = match.group('path_prefix')
        path_postfix = match.group('path_postfix')
        if page != 0:
            warn('A non-pageable image url detected, your page should be 0 constantly, but got %d' % page)
        return '%s://%s%s%s' % (schemas, host, path_prefix, path_postfix)

    raise ValueError('The url "%s" could not match any replacement rules' % url)
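For example, feeding it the thumbnail URL from earlier and asking for the third image (page index 2) drops the /c/240x480 part and swaps in the page number:

url = 'https://i.pximg.net/c/240x480/img-master/img/2019/01/17/23/28/48/72712034_p0_master1200.jpg'
print(replace_url(url, 2))
# https://i.pximg.net/img-master/img/2019/01/17/23/28/48/72712034_p2_master1200.jpg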

Only while crawling did I discover that some images have no _p0 in the middle at all, leaving just 72712034_master1200.jpg. Watch out for this too, or it will trip you up when you least expect it.

Another annoyance is that some originals are in PNG format, which is hard to tell when not logged in, so extra time gets spent probing for jpg versus png.

The program#

A long chunk of Python code, compatible with Python 3.5 and 3.6.
When writing a multithreaded crawler, don't agonize over how pretty the code looks overall.
It crawls the large images (not the originals) of the entire daily ranking, Top 100. By default it goes through a privoxy + shadowsocks proxy (configure the proxy yourself, or set proxy = None if you don't need one) and downloads with 5 threads.
You need to edit save_path to point at your own storage location. The code creates 1000 folders and shards images by the last three digits of the illustration ID; for example, the save_path\777 folder holds every image whose ID ends in 777.
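The sharding is simply illust_id % 1000, the same computation the downloader below uses to build its destination path:

import os

save_path = '/share/disk/ML-TRAINING-SET/PixivRanking'  # same setting as in the code below
illust_id, page = 72712034, 0

dst = os.path.join(save_path, str(illust_id % 1000), '%dp%d.jpg' % (illust_id, page))
print(dst)  # /share/disk/ML-TRAINING-SET/PixivRanking/34/72712034p0.jpg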

I run the code on a Raspberry Pi. To speed things up it does quite a lot of in-memory caching, so it eats about 500 MB of RAM; the daily update only takes around 20 minutes (10 minutes crawling, 10 minutes updating the database).
So far the dataset is 246 GB and contains 693k files.

Database tables:
user: users; stores the user id, name and avatar URL
illust_series: series that posts belong to, as specified by the author at upload time; stores the series id, creator's user id, title, caption, number of posts in the series, creation time and the series URL
illust: illustrations; stores the title, post date, image URL, illust_type (meaning unknown), book_style (meaning unknown), page count, content type (original, violent, sexually suggestive, etc.), series ID (null if none), id, width and height (of the first page for multi-image posts), user id, rating count, view count, upload timestamp and attr (the string form of the content type)
tag: illustration tags; stores the tag id (an autoincrement column) and the tag name
illust_tags: the illustration-tag relation (an illustration has many tags and a tag has many illustrations); stores tag id and illustration id
illust_rank: ranking information; stores the illustration id, date, current rank and yesterday's rank
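As a quick sketch of how the tables join up (the schema is exactly what the code below creates; the date literal is only an example), here is a query for one day's top 10 together with author and tags:

import sqlite3

conn = sqlite3.connect('/share/disk/ML-TRAINING-SET/PixivRanking/database.db')  # db_path from the code below
rows = conn.execute(
    "select r.rank, i.illust_id, i.title, u.user_name, group_concat(t.name, ', ') "
    "from illust_rank r "
    "join illust i on i.illust_id = r.illust_id "
    "join user u on u.user_id = i.user_id "
    "left join illust_tags it on it.illust_id = i.illust_id "
    "left join tag t on t.tag_id = it.tag_id "
    "where r.date = '2019-01-17' "
    "group by r.rank, i.illust_id, i.title, u.user_name "
    "order by r.rank limit 10")
for rank, illust_id, title, user_name, tags in rows:
    print(rank, illust_id, title, user_name, tags)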

import requests
from datetime import datetime, timedelta
import threading
import json
import re
from hashlib import md5
import os
from warnings import warn
import sqlite3
import numpy as np
import pickle
from tqdm import tqdm


ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) ' \
'Chrome/53.0.2785.143 Safari/537.36'
# path to save images from pixiv
save_path = '/share/disk/ML-TRAINING-SET/PixivRanking'
# save_path = 'd:/ML-TRAINING-SET/PixivRanking'
# path to save ranking data cache
cache_path = os.path.join(save_path, '.cache')
# path to generate sqlite database
db_path = os.path.join(save_path, 'database.db')
# proxy, for those who could not access pixiv directly
proxy = {'https': 'https://localhost:8118'}


def calc_md5(str_data):
    hash_obj = md5()
    hash_obj.update(str_data.encode('utf8'))
    return hash_obj.hexdigest()


def create_dir(path):
    parent = os.path.abspath(path)
    dir_to_create = []
    while not os.path.exists(parent):
        dir_to_create.append(parent)
        parent = os.path.abspath(os.path.join(parent, '..'))
    dir_to_create = dir_to_create[::-1]
    for dir_path in dir_to_create:
        os.mkdir(dir_path)
        print('Directory %s created' % dir_path)


class FileCacher:
    def __init__(self):
        self._cache_files = dict()
        self._lock = threading.RLock()

    def add_cache_dir(self, directory, create_dir_if_not_exist=True):
        with self._lock:
            path = os.path.abspath(directory)
            if os.path.exists(path):
                files = set(os.listdir(path))
            else:
                files = set()
                if create_dir_if_not_exist:
                    create_dir(path)
            self._cache_files[path] = files

    def append_file(self, file_path):
        with self._lock:
            dir_path = os.path.abspath(os.path.join(file_path, '..'))
            files = self._cache_files.get(dir_path, None)
            if files is None:
                warn('%s is not in the cached directory, calling add_cache_dir implicitly' % dir_path)
                self.add_cache_dir(dir_path, True)
                files = self._cache_files.get(dir_path, None)
                assert files is not None
            file_name = os.path.basename(file_path)
            self._cache_files[dir_path].add(file_name)

    def remove_file(self, file_path):
        with self._lock:
            dir_path = os.path.abspath(os.path.join(file_path, '..'))
            files = self._cache_files.get(dir_path, None)
            if files is None:
                warn('%s is not in the cached directory, calling add_cache_dir implicitly' % dir_path)
                self.add_cache_dir(dir_path, True)
                files = self._cache_files.get(dir_path, None)
                assert files is not None
            file_name = os.path.basename(file_path)
            self._cache_files[dir_path].remove(file_name)

    def exist_file(self, file_path):
        with self._lock:
            dir_path = os.path.abspath(os.path.join(file_path, '..'))
            files = self._cache_files.get(dir_path, None)
            if files is None:
                warn('%s is not in the cached directory, calling add_cache_dir implicitly' % dir_path)
                self.add_cache_dir(dir_path, True)
                files = self._cache_files.get(dir_path, None)
            file_name = os.path.basename(file_path)
            return file_name in files

    def exist_dir_in_cache(self, dir_path):
        with self._lock:
            return self._cache_files.get(os.path.abspath(dir_path), None) is not None

    def validate_dir(self, dir_path):
        with self._lock:
            dir_path = os.path.abspath(dir_path)
            files = self._cache_files.get(dir_path, None)
            if files is not None:
                files = set(files)
                actual_files = set(os.listdir(dir_path))
                same_file_count = len(files.intersection(actual_files))
                is_same = len(files) == same_file_count and len(actual_files) == same_file_count
                if not is_same:
                    warn('cache inconsistency detected in directory %s, cleared all cache' % dir_path)
                    self._cache_files[dir_path] = actual_files

    def save(self, file_path):
        with self._lock:
            if not self.exist_file(file_path):
                self.append_file(file_path)
            with open(file_path, 'wb') as f:
                pickle.dump(self._cache_files, f)

    def load(self, file_path, validate_on_load=True):
        with self._lock:
            if os.path.exists(file_path):
                with open(file_path, 'rb') as f:
                    self._cache_files = pickle.load(f)
                if validate_on_load:
                    cache_dirs = list(self._cache_files)
                    print('validating files')
                    for cache_dir in tqdm(cache_dirs, ascii=True):
                        self.validate_dir(cache_dir)
                    print('done')


global_file_cache = FileCacher()


class Cacher:
    def __init__(self, path):
        self._path = path
        # create dir if not exists
        create_dir(self._path)

    def __getitem__(self, item):
        if type(item) != str:
            item = str(item)
        path = os.path.join(self._path, calc_md5(item))
        if not global_file_cache.exist_file(path):
            raise KeyError('Item not exists')
        with open(path, 'rb') as f:
            return f.read()

    def __setitem__(self, key, value):
        if type(key) != str:
            key = str(key)
        path = os.path.join(self._path, calc_md5(key))
        if type(value) == str:
            value = bytes(value, 'utf8')
        elif type(value) != bytes:
            raise TypeError('value should be string or bytes')
        with open(path, 'wb') as f:
            f.write(value)
        global_file_cache.append_file(path)

    def get(self, item, default_item=None):
        try:
            return self.__getitem__(item)
        except KeyError:
            return default_item


class Crawler:
    def __init__(self, save_path_=None, cache_path_=None, nums_thread=5, begin_date=None,
                 max_page=2, max_buffer_size=3000):
        self._num_threads = nums_thread
        self._main_thd = None
        self._main_thd_started = threading.Event()
        self._fetch_finished = None
        self._max_page = max_page
        if begin_date is None or type(begin_date) != datetime:
            begin_date = datetime.fromordinal(datetime.now().date().toordinal()) - timedelta(days=2)
        self._date = begin_date
        self._page = 1
        if not save_path_:
            save_path_ = save_path
        self._save_path = save_path_
        self._cache = Cacher(cache_path_ if cache_path_ else cache_path)
        # handling abort event
        self._abort_event = threading.Event()
        self._abort_wait = []

        # handling variable buffer for main thread
        self._buffer_data = []
        self._buffer_lock = threading.RLock()
        self._buffer_empty = threading.Event()  # an event telling main thread to fetch more data
        self._buffer_empty.set()
        self._max_buffer_size = max_buffer_size

        # creating directory
        for i in range(1000):
            dst_path = os.path.join(save_path, str(i))
            create_dir(dst_path)
            if not global_file_cache.exist_dir_in_cache(dst_path):
                global_file_cache.add_cache_dir(dst_path)

    def _main_thd_cb(self):
        self._abort_wait = []
        self._abort_event.clear()
        self._fetch_finished = False

        try:
            # fetch ranking page
            print('Fetching ranking page (html mode)')
            # external loop for handling retrying
            while not self._abort_event.is_set():
                suc = False
                req = None
                while not suc:
                    if self._abort_event.is_set():
                        return
                    try:
                        req = requests.get('https://www.pixiv.net/ranking.php?mode=daily',
                                           headers={'User-Agent': ua}, proxies=proxy, timeout=15)
                        suc = True
                    except Exception as ex:
                        warn(str(ex))
                rep = req.content.decode('utf8')
                # handling non-200
                if req.status_code != 200:
                    print('HTTP Get failed with response code %d, retry in 0.5s' % req.status_code)
                    # wait 0.5s
                    if self._abort_event.wait(0.5):
                        break
                # parse tt
                pattern = re.compile(r'pixiv\.context\.token\s*=\s*"(?P<tt>\w+)";')
                match_result = re.finditer(pattern, rep)
                try:
                    match_result = next(match_result)
                except StopIteration:
                    match_result = None
                if not match_result:
                    print('Could not get tt from html, exited')
                    self._main_thd_started.set()
                    return
                self._tt = match_result.group('tt')
                break
            print('Got tt = "%s"' % self._tt)

            # starting parallel download thread here
            for _ in range(self._num_threads):
                event_to_wait = threading.Event()
                self._abort_wait.append(event_to_wait)
                worker = threading.Thread(target=self._worker_thd_cb, args=(event_to_wait,))
                worker.start()
            self._main_thd_started.set()

            headers = {'X-Requested-With': 'XMLHttpRequest',
                       'Referer': 'https://www.pixiv.net/ranking.php?mode=daily'}
            while self._buffer_empty.wait():
                if self._abort_event.is_set():
                    break

                # fetch from cacher
                key = '%s-p%d' % (self._date.strftime('%Y%m%d'), self._page)
                result = self._cache.get(key)
                if not result:
                    with self._buffer_lock:
                        print('Fetching ranking page(json mode), date=%s, page=%d, buffer=%d/%d' %
                              (str(self._date.date()), self._page, len(self._buffer_data), self._max_buffer_size))
                    params = {'mode': 'daily', 'date': self._date.strftime('%Y%m%d'), 'p': self._page,
                              'format': 'json', 'tt': self._tt}
                    suc = False
                    req = None
                    while not suc:
                        if self._abort_event.is_set():
                            return
                        try:
                            req = requests.get('https://www.pixiv.net/ranking.php', params=params, headers=headers,
                                               proxies=proxy, timeout=15)
                            suc = True
                        except Exception as ex:
                            warn(str(ex))
                    rep = req.content.decode('utf8')
                    # terminated state
                    if req.status_code == 404:
                        break
                    # append to cacher
                    self._cache[key] = rep
                    result = rep
                else:
                    result = result.decode('utf8')

                json_data = json.loads(result)
                buffer_data = self._parse_data(json_data)

                # append to buffer
                with self._buffer_lock:
                    self._buffer_data += buffer_data
                    # check buffer size
                    if len(self._buffer_data) >= self._max_buffer_size:
                        self._buffer_empty.clear()

                # next page
                self._page += 1

                if self._page > self._max_page:
                    self._page = 1
                    self._date -= timedelta(days=1)

        finally:
            print('main thd exited')
            self._fetch_finished = True
            for item in self._abort_wait:
                item.wait()

    def _parse_data(self, data):
        ret_data = []
        if data.get('contents', None):
            contents = data['contents']
            for content in contents:
                url = content['url']
                ranking_date = self._date
                ranking_page = self._page
                illust_id = int(content['illust_id'])
                illust_page_count = int(content['illust_page_count'])
                for page in range(illust_page_count):
                    single_illust_url = self._replace_url(url, page)
                    ret_data.append({'date': ranking_date, 'page': ranking_page,
                                     'illust_id': illust_id, 'illust_page': page,
                                     'url': single_illust_url})
        return ret_data

    @staticmethod
    def _replace_url(url, page):
        url_pattern = re.compile(r'(?P<schemas>https?)://(?P<host>([^./]+\.)+[^./]+)(/c/\d+x\d+)?'
                                 r'(?P<path_prefix>/img-master/img(/\d+){6}/\d+_p)\d+'
                                 r'(?P<path_postfix>_(master|square)\d+\.(jpg|png)).*')
        match = re.match(url_pattern, url)
        if match:
            schemas = match.group('schemas')
            host = match.group('host')
            path_prefix = match.group('path_prefix')
            path_postfix = match.group('path_postfix')
            return '%s://%s%s%d%s' % (schemas, host, path_prefix, page, path_postfix)
        url_pattern = re.compile(r'(?P<schemas>https?)://(?P<host>([^./]+\.)+[^./]+)(/c/\d+x\d+)?'
                                 r'(?P<path_prefix>/img-master/img(/\d+){6}/\d+)'
                                 r'(?P<path_postfix>_(master|square)\d+\.(jpg|png)).*')
        match = re.match(url_pattern, url)
        if match:
            schemas = match.group('schemas')
            host = match.group('host')
            path_prefix = match.group('path_prefix')
            path_postfix = match.group('path_postfix')
            if page != 0:
                warn('A non-pageable image url detected, your page should be 0 constantly, but got %d' % page)
            return '%s://%s%s%s' % (schemas, host, path_prefix, path_postfix)

        raise ValueError('The url "%s" could not match any replacement rules' % url)

    def _worker_thd_cb(self, thd_wait_event):
        try:
            while not self._abort_event.is_set():
                buffer_item = None
                with self._buffer_lock:
                    if len(self._buffer_data) > 0:
                        buffer_item = self._buffer_data[0]
                        self._buffer_data = self._buffer_data[1:]
                    if len(self._buffer_data) < self._max_buffer_size:
                        self._buffer_empty.set()

                # fetch failed, wait more time
                if buffer_item is None:
                    if self._fetch_finished or self._abort_event.wait(0.1):
                        break
                    continue

                # unpacking value
                date = buffer_item['date']
                page = buffer_item['page']
                illust_id = buffer_item['illust_id']
                illust_page = buffer_item['illust_page']
                url = buffer_item['url']

                # download file here
                dst_path = os.path.join(save_path, str(illust_id % 1000), '%dp%d.jpg' % (illust_id, illust_page))
                if not global_file_cache.exist_file(dst_path):
                    print('Downloading [%s #%d] [%d p%d] %s' % (date.strftime('%Y%m%d'), page,
                                                                illust_id, illust_page, url))
                    suc = False
                    while not suc:
                        try:
                            req = requests.get(url, headers={'Referer': 'https://www.pixiv.net/member_illust.php'
                                                                        '?mode=medium&illust_id=%d' % illust_id},
                                               timeout=15)
                            if req.status_code != 200:
                                warn('Error while downloading %d p%d : HTTP %d' %
                                     (illust_id, illust_page, req.status_code))
                                break

                            image = req.content
                            with open(dst_path, 'wb') as f:
                                f.write(image)
                            global_file_cache.append_file(dst_path)

                            suc = True
                        except Exception as ex:
                            print(ex)
        finally:
            thd_wait_event.set()
            print('thd exited')

    def start(self):
        self.abort()
        self._main_thd = threading.Thread(target=self._main_thd_cb)
        self._main_thd.start()

    def abort(self):
        self._abort_event.set()
        self.wait()

    def wait(self):
        if self._main_thd:
            self._main_thd_started.wait()
            for item in self._abort_wait:
                item.wait()


class DatabaseGenerator:
    # flags for illust_content_type
    ILLUST_CONTENT_TYPE_SEXUAL = 1
    ILLUST_CONTENT_TYPE_LO = 2
    ILLUST_CONTENT_TYPE_GROTESQUE = 4
    ILLUST_CONTENT_TYPE_VIOLENT = 8
    ILLUST_CONTENT_TYPE_HOMOSEXUAL = 16
    ILLUST_CONTENT_TYPE_DRUG = 32
    ILLUST_CONTENT_TYPE_THOUGHTS = 64
    ILLUST_CONTENT_TYPE_ANTISOCIAL = 128
    ILLUST_CONTENT_TYPE_RELIGION = 256
    ILLUST_CONTENT_TYPE_ORIGINAL = 512
    ILLUST_CONTENT_TYPE_FURRY = 1024
    ILLUST_CONTENT_TYPE_BL = 2048
    ILLUST_CONTENT_TYPE_YURI = 4096

    def __init__(self, path_to_save=None, cacher_path=None, max_page=2):
        self._cacher = Cacher(cacher_path if cacher_path else cache_path)
        if not path_to_save:
            path_to_save = db_path
        with open(path_to_save, 'w'):
            pass
        self._conn = sqlite3.connect(path_to_save)
        self._cursor = self._conn.cursor()
        self._max_page = max_page

        self._initialize()
        self._user_id_set = set()
        self._tag_id_dict = dict()
        self._rank_set = set()
        self._illust_id_set = set()
        self._illust_series_id_set = set()

    def _initialize(self):
        # initialize tables
        csr = self._cursor
        csr.execute("create table user (user_id bigint primary key, user_name varchar(255) not null,"
                    "profile_img varchar(255) not null)")
        csr.execute("create table illust_series (illust_series_id integer primary key, "
                    "illust_series_user_id bigint not null, illust_series_title varchar(255) not null,"
                    "illust_series_caption text(16383), illust_series_content_count integer not null,"
                    "illust_series_create_datetime datetime not null, page_url varchar(255) not null,"
                    "foreign key (illust_series_user_id) references user(user_id))")
        csr.execute("create table illust (title varchar(255), date datetime, url varchar(255), illust_type integer,"
                    "illust_book_style integer, illust_page_count integer, illust_content_type integer not null, "
                    "illust_series_id integer, illust_id bigint primary key, width integer not null, "
                    "height integer not null, user_id bigint not null, rating_count integer not null, "
                    "view_count integer not null, illust_upload_timestamp datetime not null, attr varchar(255),"
                    "foreign key (user_id) references user, foreign key (illust_series_id) references illust_series)")
        csr.execute("create table tag (tag_id integer primary key autoincrement, name varchar(255) not null unique)")
        csr.execute("create table illust_tags (illust_id bigint not null, tag_id integer not null,"
                    "foreign key (illust_id) references illust, foreign key (tag_id) references tag)")
        csr.execute("create table illust_rank (illust_id bigint not null, date datetime not null, "
                    "rank integer not null, yes_rank integer not null, foreign key (illust_id) references illust)")
        # indices to accelerate date-based query
        csr.execute("create index illust_date on illust(date)")
        csr.execute("create index illust_rank_date on illust_rank(date, rank)")
        self._conn.commit()

    def start(self):
        cur_date = datetime.now().date() - timedelta(days=2)
        cur_page = 1

        key = '%s-p%d' % (cur_date.strftime('%Y%m%d'), cur_page)
        data = self._cacher.get(key)
        while data:
            data = data.decode("utf8")
            # print('Parsing %s' % key)
            json_data = json.loads(data)
            try:
                contents = json_data['contents']
            except KeyError:
                break

            for item in contents:
                self._parse(item, cur_date)

            # next
            cur_page += 1
            if cur_page > self._max_page:
                cur_date -= timedelta(days=1)
                cur_page = 1
            key = '%s-p%d' % (cur_date.strftime('%Y%m%d'), cur_page)
            data = self._cacher.get(key)

        self._conn.commit()

    def _parse(self, json_obj, ranking_date):
        title = json_obj['title']
        date = json_obj['date']
        tags = json_obj['tags']
        url = json_obj['url']
        illust_type = json_obj['illust_type']
        illust_book_style = json_obj['illust_book_style']
        illust_page_count = json_obj['illust_page_count']
        user_name = json_obj['user_name']
        profile_img = json_obj['profile_img']
        illust_content_type = json_obj['illust_content_type']
        illust_series = json_obj['illust_series']
        illust_id = json_obj['illust_id']
        width = json_obj['width']
        height = json_obj['height']
        user_id = json_obj['user_id']
        rank = json_obj['rank']
        # hint: yes_rank is not YES! rank!, it's just the rank of yesterday, don't be treated XD
        yes_rank = json_obj['yes_rank']
        rating_count = json_obj['rating_count']
        view_count = json_obj['view_count']
        illust_upload_timestamp = json_obj['illust_upload_timestamp']
        attr = json_obj['attr']
        # converting illust_content_type
        flag_illust_content_type = 0
        if illust_content_type['sexual'] != 0:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_SEXUAL
        if illust_content_type['lo']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_LO
        if illust_content_type['grotesque']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_GROTESQUE
        if illust_content_type['violent']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_VIOLENT
        if illust_content_type['homosexual']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_HOMOSEXUAL
        if illust_content_type['drug']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_DRUG
        if illust_content_type['thoughts']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_THOUGHTS
        if illust_content_type['antisocial']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_ANTISOCIAL
        if illust_content_type['religion']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_RELIGION
        if illust_content_type['original']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_ORIGINAL
        if illust_content_type['furry']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_FURRY
        if illust_content_type['bl']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_BL
        if illust_content_type['yuri']:
            flag_illust_content_type |= self.ILLUST_CONTENT_TYPE_YURI
        # querying user data
        csr = self._cursor
        if not self._user_id_set.issuperset({user_id}):
            csr.execute("insert into user(user_id, user_name, profile_img) values (?, ?, ?)",
                        (user_id, user_name, profile_img))
            self._user_id_set.add(user_id)
        # handling illust_series
        if type(illust_series) != bool:
            illust_series_id = illust_series['illust_series_id']
            illust_series_user_id = illust_series['illust_series_user_id']
            illust_series_title = illust_series['illust_series_title']
            illust_series_caption = illust_series['illust_series_caption']
            illust_series_content_count = illust_series['illust_series_content_count']
            illust_series_create_datetime = illust_series['illust_series_create_datetime']
            page_url = illust_series['page_url']
            if not self._illust_series_id_set.issuperset({illust_series_id}):
                csr.execute("insert into illust_series(illust_series_id, illust_series_user_id, "
                            "illust_series_title, illust_series_caption, illust_series_content_count, "
                            "illust_series_create_datetime, page_url) values (?, ?, ?, ?, ?, ?, ?)",
                            (illust_series_id, illust_series_user_id, illust_series_title, illust_series_caption,
                             illust_series_content_count, illust_series_create_datetime, page_url))
                self._illust_series_id_set.add(illust_series_id)
            illust_series = illust_series_id
        else:
            illust_series = None
        # tags
        for tag in tags:
            if self._tag_id_dict.get(tag, None):
                tag_id = self._tag_id_dict[tag]
            else:
                csr.execute("insert into tag(name) values (?)", (tag,))
                tag_id = len(self._tag_id_dict) + 1
                self._tag_id_dict[tag] = tag_id
            csr.execute("insert into illust_tags(illust_id, tag_id) values (?, ?)", (illust_id, tag_id))
        # converting date
        reg_ptn = re.compile('(\\d+)年(\\d+)月(\\d+)日\\s(\\d+):(\\d+)')
        match = re.match(reg_ptn, date)
        if match:
            date_year, date_month, date_day, date_hour, date_minute = (int(match.group(x)) for x in range(1, 6))
            date = datetime(date_year, date_month, date_day, date_hour, date_minute)
        illust_upload_timestamp = datetime.fromtimestamp(illust_upload_timestamp)
        if not self._illust_id_set.issuperset({illust_id}):
            csr.execute("insert into illust(title, date, url, illust_type, illust_book_style, illust_page_count, "
                        "illust_content_type, illust_series_id, illust_id, width, height, user_id, rating_count, "
                        "view_count, illust_upload_timestamp, attr) "
                        "values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
                        (title, date, url, illust_type, illust_book_style, illust_page_count, flag_illust_content_type,
                         illust_series, illust_id, width, height, user_id, rating_count, view_count,
                         illust_upload_timestamp, attr))
            self._illust_id_set.add(illust_id)

        if not self._rank_set.issuperset((illust_id, ranking_date, rank, yes_rank)):
            csr.execute("insert into illust_rank(illust_id, date, rank, yes_rank) values (?, ?, ?, ?)",
                        (illust_id, ranking_date, rank, yes_rank))
            self._rank_set.add((illust_id, ranking_date, rank, yes_rank))


if __name__ == '__main__':
    global_file_cache.load(os.path.join(cache_path, 'index'))
    print('Crawler starting')
    a = Crawler()
    a.start()
    a.wait()
    global_file_cache.save(os.path.join(cache_path, 'index'))
    print('Database generator starting')
    a = DatabaseGenerator()
    a.start()
    global_file_cache.save(os.path.join(cache_path, 'index'))

Afterword#

I labelled some of the images this crawler collected and trained a Faster R-CNN anime face detector on them. Combined with the OpenCV-based method mentioned in the anime face detection post, it can filter the frankly dreadful detections of OpenCV's CascadeClassifier down to results that are almost 100% correct, and those nearly error-free results can then be used to experiment with all kinds of GANs.

CascadeClassifier's advantage is that the faces it detects are fairly uniform; of course that is also its weakness, and because of it well over half of the results get filtered away (painful).

A later post will cover processing the outputs of Faster R-CNN and CascadeClassifier: computing the IoU between the two sets of boxes, matching boxes by IoU, and then cropping and rescaling, along with the code.
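The heart of that matching step is nothing more than an IoU computation between two boxes. A minimal sketch (the (x1, y1, x2, y2) box format and the 0.5 threshold here are assumptions for illustration only):

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)


def filter_cascade_boxes(cascade_boxes, rcnn_boxes, threshold=0.5):
    """Keep a CascadeClassifier box only if some Faster R-CNN box agrees with it."""
    return [box for box in cascade_boxes
            if any(iou(box, rbox) >= threshold for rbox in rcnn_boxes)]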

Extracting and Playing Azur Lane's Live2D Assets

Azur Lane Live2D extraction and playback#

This post is purely the product of having too much free time (the odd one out on this blog). It leans on prefare's article; I simply followed that tutorial myself, and it worked remarkably well.

0x00 Required tools#

Extracting Unity assets

Live2D playback

I went with the latter, since it lets me meddle a little.

0x01 Extracting the Unity assets#

First, download the Azur Lane APK from the Bilibili site and install it on a phone or emulator. Log in however you like, as a guest or with your own Bilibili account, go to the main screen, and open Settings -> Resources -> Live2D resource update in the top-right corner to download all the Live2D assets. Once the download finishes, you can quit the game.

Grab any file manager, find Android/data/com.bilibili.azurlane/files/AssetBundles/live2d on the SD card, copy out everything inside, and save it to your PC, as shown below.

Run UABE's AssetBundleExtractor and open the files above one by one via File->Open (the first one is aidang_2, for example). It will ask whether to decompress; choose yes, then enter any file name to save; overwriting the original file is also fine. If that sounds tedious, there is a one-click Python script later in this post; just run that instead.

Then simply drag the decompressed files onto the exe prefare wrote. (For a while my brain short-circuited and I thought the exe had to be opened first before dragging; only after reading the code did I realise you just drag the file onto it in File Explorer.)

Once that step finishes, a live2d folder is produced, with one subfolder per Live2D model. Open aidang_2, for example, and you will find aidang_2.moc3, aidang_2.model3.json and aidang_2.physics3.json, plus two folders, motions and textures.

These files are exactly what the next step needs.

0x02 Viewing the Live2D models#

The simplest, most direct way: download the Live2D Cubism Viewer from the official Cubism site, open it, and drag the moc3 file or the model3.json file onto the window to view the model straight away. Double-click a file under motions on the left to play the corresponding animation.

And that's it.

In motion it looks like this:

If you have a bolder idea (say, embedding that animation window into the UI of a program of your own), you will have to get your hands dirty.

0x03 Hacking the native demo#

Goal: a playback window that does not depend on Unity and that I can control myself. Really I just want to use it as a plugin in a little desktop-background program I wrote.

Step 1, download CubismCore: on https://live2d.github.io/#native, click Download Cubism 3 SDK for Native beta.

Step 2, download GLEW: from http://glew.sourceforge.net, download "Binaries Windows 32-bit and 64-bit".

Step 3, download GLFW: some features need the latest version (3.3.0), but only 3.2.1 is available prebuilt, so you have to build it yourself.

  1. GitHub copy-and-paste: run git clone https://github.com/glfw/glfw
  2. A textbook CMake build: I usually set the build folder to .../glfw/build and CMAKE_INSTALL_PREFIX to .../glfw/build/install, open VS, build the INSTALL target, done.

Step 4, download CubismNativeSamples:

  1. git clone --recursive https://github.com/Live2D/CubismNativeSamples
  2. Run a textbook CMake configure on the Cubism Native Framework
  3. ↑ which blows up, because the two lines include_directories("${FRAMEWORK_GLFW_PATH}") and include_directories("${FRAMEWORK_GLEW_PATH}") in the CMake script cannot find those paths

Step 5, build it as a project of my own and let CMake go hang:

  1. After a look at the code it seemed manageable: open VS, create an empty C++ project, name it CubismBuild, and compile the thing yourself.
  2. Copy the src folder from Framework into the project folder (the one with the .vcxproj). Right-click the CubismBuild project, Add -> Existing Item, and add every source file under src except those under Rendering; from Rendering, only add the source files under OpenGL plus the two CubismRenderer files.
  3. Add #define CSM_TARGET_WIN_GL as the first line of CubismRenderer_OpenGLES2.hpp, manually defining the macro that CMake would have set.
  4. Unpack CubismCore and GLEW.
  5. Copy all the files under ...\CubismNativeSamples\Samples\OpenGL\Demo\proj.win.cmake\Demo into the project folder as well, and add them to the project.
  6. Copy ...\CubismNativeSamples\Samples\OpenGL\thirdParty\stb\include\stb_image.h over and add it to the project too.
  7. Adjust the build settings: right-click the project, open Properties, go to VC++ Directories, and add the paths of the copied src, the unpacked Core and GLEW, and the GLFW you built:
    Include Directories

    Reference Directories / Library Directories
  8. Then under Linker -> Input -> Additional Dependencies, add Live2DCubismCore_MDd.lib, opengl32.lib, glu32.lib, glew32.lib and glfw3.lib.
  9. Just build. When running, copy the GLEW DLL next to the exe, or you will get a missing-DLL complaint.
  10. Change ResourcesPath in LAppDefine.cpp to the absolute path of CubismNativeSamples\Samples\Res and you're done (separate the path with /, not \, and keep the trailing /).

It's time to start hacking.

With the source in hand, stepping through it once in the debugger is basically enough to see which code does what.

The power and gear buttons are useless; remove them.
Just edit LAppView.

Make the window background transparent and hide the title bar.
Just edit LAppDelegate. This is also why GLFW 3.3.0 is required: 3.2.1 has no API for changing the window background.

To get the window handle,
add this to LAppDelegate.cpp:

#define GLFW_EXPOSE_NATIVE_WIN32
#include <GLFW/glfw3native.h>

Calling glfwGetWin32Window(GLFWwindow *) then returns an hwnd, and a WinAPI call to make one of your own windows its parent lets you do whatever you want with it.

Setting the window size:
just edit LAppDefine.cpp.

Changing how the model is controlled:
just edit LAppModel.

Add a bit of command-line parsing on top of main, and that's roughly it. (The second line of output is the GLFW window's hwnd, which can be used later to embed the window.)
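As a sketch of that embedding step: assuming the hacked player really does print the hwnd (as a decimal number) on its second output line as described, another process can launch it and reparent its window with the Win32 SetParent call. The -d flag is the one used in section 0x04 below.

import ctypes
import subprocess

def embed_player(model_dir, parent_hwnd):
    """Launch the modified player and reparent its window under parent_hwnd."""
    proc = subprocess.Popen([r'.\player\CubismBuild.exe', '-d', model_dir],
                            stdout=subprocess.PIPE, universal_newlines=True)
    proc.stdout.readline()                      # first output line: ignore
    hwnd = int(proc.stdout.readline().strip())  # second output line: the GLFW window's hwnd
    ctypes.windll.user32.SetParent(hwnd, parent_hwnd)  # Win32 SetParent
    return proc, hwnd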

0x04 Assembly-line it, free your hands#

Sounds like a hassle, right? Lots of things to download? In fact all that is really needed are UABE and prefare's AzurLaneLive2DExtract, and a single Python script can drive the whole thing.

Download the script (it bundles UABE, AzurLaneLive2DExtract and the hacked native viewer).

What you have to do:

Copy the Unity assets from your phone/emulator to your PC yourself.
Open process.py and change the path at the top to that folder's path.

Run: python process.py
It automatically decompresses the Unity assets and extracts the Live2D files.

Viewing the Live2D models:

  1. Use the Viewer that ships with Cubism 3
  2. Or, on the command line, run .\player\CubismBuild.exe -d <folder containing the model>
    More parameters (only a few were changed, really; run .\player\CubismBuild.exe to list them)
    (the VC++ 2017 runtime may be required)

0xff References#

The hacked-up code is too embarrassing even for me to look at, so it will not be open-sourced (hides face).

Finally: moderate gaming stimulates the brain; gaming addiction harms the body (physically).