BERT Inference Benchmark

Date: 2019/10/18 Categories: Work Tags: BERT DeepLearningInference



Some Rough Results

seq_len = 64
batch 1/2/4, latency ~2ms
2 instances reduce throughput: 44.2s
1 instance: 40.8s

batch size 4/8
1 instance: 29.9s

batch size 32/64
20ms latency, 2 instances: 23.8s

batch size 32/64
10ms latency, 1 instance: 24.4s

seq_len=64: the M40's peak QPS is roughly 420

seq_len=40: 16.4s on the M40, roughly 609 QPS
client_batch=64 raises it to 653
max_batch_size=256 raises it to 688
max_batch_size=2048 lowers performance
max_batch_size=128 with client_batch_size=32 -> 680

The experiments above used the TensorRT BERT model on a single M40 GPU; the load was generated by a multi-threaded python3 + pycurl client.
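
For reference, here is a minimal sketch of a multi-threaded python3 + pycurl load generator in the spirit of the one used above; the endpoint URL, request body, thread count, and request count are placeholders, not the exact client used in these runs.

import json
import threading
import time
from io import BytesIO

import pycurl

# Placeholder endpoint and payload; the real client posted BERT inputs to TRTIS.
URL = "http://127.0.0.1:8000/api/infer/model"
N_THREADS = 8            # client-side concurrency
REQS_PER_THREAD = 1000   # requests issued by each thread
PAYLOAD = json.dumps({"batch_size": 1})  # placeholder request body

def worker():
    buf = BytesIO()
    c = pycurl.Curl()                      # one handle per thread, reused across requests
    c.setopt(pycurl.URL, URL)
    c.setopt(pycurl.POST, 1)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    for _ in range(REQS_PER_THREAD):
        c.setopt(pycurl.POSTFIELDS, PAYLOAD)
        c.perform()                        # blocking request
    c.close()

start = time.time()
threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print("%.1f req/s" % (N_THREADS * REQS_PER_THREAD / elapsed))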

The experiments below use a P40. batch_size is fixed at 1, with 1 to 6 instances on a single P40 card; format: [instance_num] time (qps)

[1] 38.4s (260) -> [2] 30.6s (326) -> [4] 25.4s (393) -> [6] 25.7s (389)

Plan

The plan is to compare the performance of TensorRT Inference Server and tfserving, focusing mainly on latency at small batch sizes and mainly on throughput at large batch sizes.

Method

Converting a model with TensorRT requires a fixed batch size, so for convenience we generate one profile per candidate batch size at conversion time, with batch sizes [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]. TRTIS loads these profiles as separate models at startup, and wrk is then used to stress test each endpoint.

The tfserving model is tested in a similar way, except that the model conversion and per-batch-size deployment steps can be skipped.
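
For reference, a rough sketch of a single request against the tfserving REST endpoint is shown below; the input tensor names (input_ids, input_mask, segment_ids) and the seq_len are assumptions and must match the exported SavedModel's serving signature.

import json
import urllib.request

SEQ_LEN = 64  # assumed, matching the benchmarks above

# Tensor names below are assumptions; adjust them to the SavedModel signature.
payload = {
    "instances": [{
        "input_ids": [101] + [0] * (SEQ_LEN - 1),
        "input_mask": [1] * SEQ_LEN,
        "segment_ids": [0] * SEQ_LEN,
    }]
}

req = urllib.request.Request(
    "http://127.0.0.1:8501/v1/models/bert:predict",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))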

P40

TensorRT Inference Server

Maximum Throughput

# 2 instances, `server_batch_size=256`

!wrk -t2 -c512 --latency -s script.lua -d60s http://127.0.0.1:8000/api/infer/model

Running 1m test @ http://127.0.0.1:8000/api/infer/model
  2 threads and 512 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   604.11ms   68.59ms   1.04s    71.18%
    Req/Sec   397.83    220.43     2.21k    83.29%
  Latency Distribution
     50%  574.35ms
     75%  684.85ms
     90%  688.70ms
     99%  804.29ms
  44192 requests in 1.00m, 17.70MB read
Requests/sec:    735.80
Transfer/sec:    301.79KB

Online (Batch = 1)

Running 1m test @ http://127.0.0.1:8000/api/infer/model
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.98ms   92.04us   6.90ms   96.53%
    Req/Sec   251.74      3.23   264.00     85.28%
  Latency Distribution
     50%    3.97ms
     75%    4.00ms
     90%    4.04ms
     99%    4.17ms
  15076 requests in 1.00m, 6.03MB read
Requests/sec:    250.98
Transfer/sec:    102.76KB

Online Light Batching

server_batch_size = 8, instances = 2, server_latency = 4000us, connections = 16
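
These knobs presumably map onto the model's config.pbtxt; a sketch of what that config might look like follows, where the input/output tensor names and shapes are assumptions, and server_latency is taken to mean the dynamic batcher's max_queue_delay_microseconds.

name: "model"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  { name: "input_ids", data_type: TYPE_INT32, dims: [ 64 ] },
  { name: "segment_ids", data_type: TYPE_INT32, dims: [ 64 ] },
  { name: "input_mask", data_type: TYPE_INT32, dims: [ 64 ] }
]
output [
  { name: "logits", data_type: TYPE_FP32, dims: [ 2 ] }
]
instance_group [
  { count: 2, kind: KIND_GPU }
]
dynamic_batching {
  preferred_batch_size: [ 8 ]
  max_queue_delay_microseconds: 4000
}

The wrk results for this setting follow.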

Running 1m test @ http://127.0.0.1:8000/api/infer/model
  2 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    14.25ms  157.41us  15.97ms   85.05%
    Req/Sec   281.67      5.19   323.00     91.92%
  Latency Distribution
     50%   14.23ms
     75%   14.30ms
     90%   14.38ms
     99%   14.92ms
  33728 requests in 1.00m, 13.50MB read
Requests/sec:    561.24
Transfer/sec:    230.02KB

TensorFlow Serving

No Batching

!wrk -t2 -c4 --latency -s .script.lua -d60s http://127.0.0.1:8501/v1/models/bert:predict
Running 1m test @ http://127.0.0.1:8501/v1/models/bert:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    28.75ms    2.31ms  38.84ms   74.52%
    Req/Sec    69.55      7.17    90.00     47.24%
  Latency Distribution
     50%   28.52ms
     75%   30.14ms
     90%   31.67ms
     99%   34.35ms
  8351 requests in 1.00m, 1.62MB read
Requests/sec:    139.05
Transfer/sec:     27.57KB

Maximum Throughput

max_batch_size { value: 256 }
batch_timeout_micros { value: 100000 }
max_enqueued_batches { value: 512 }
num_batch_threads { value: 24 }
!wrk -t4 -c1024 --latency --timeout 10s -s .script.lua -d60s http://127.0.0.1:8501/v1/models/bert:predict
Running 1m test @ http://127.0.0.1:8501/v1/models/bert:predict
  4 threads and 1024 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.28s   232.69ms   2.92s    93.23%
    Req/Sec   133.66    166.76     1.61k    95.34%
  Latency Distribution
     50%    2.33s
     75%    2.33s
     90%    2.33s
     99%    2.36s
  26061 requests in 1.00m, 5.02MB read
Requests/sec:    433.91
Transfer/sec:     85.60KB