BERT Inference Benchmark
Date: 2019/10/18 Categories: Work Tags: BERT DeepLearningInference
Some Rough Results
seq_len = 64
batch 1/2/4, 2ms latency: 2 instances lower throughput (44.2s); 1 instance: 40.8s
batch size 4/8: 1 instance 29.9s
batch size 32/64, 20ms latency: 2 instances 23.8s
batch size 32/64, 10ms latency: 1 instance 24.4s
seq_len=64: the M40 tops out at roughly 420 QPS
seq_len=40: M40 finishes in 16.4s, roughly 609 QPS
client_batch = 64 raises this to 653
max_batch_size=256 raises it further to 688
max_batch_size=2048 degrades performance
max_batch_size=128 with client_batch_size=32 -> 680
All of the above used a single M40 GPU with TensorRT BERT; the load was driven by a multithreaded Python 3 + pycurl client (a rough sketch of such a client follows at the end of this section).
The runs below were on a P40, with batch_size fixed at 1 and 1 to 6 instances on a single P40 card; entries are [instance_num] time (QPS):
[1] 38.4s (260) -> [2] 30.6s (326) -> [4] 25.4s (393) -> [6] 25.7s (389)
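For reference, a minimal sketch of the kind of multithreaded Python 3 + pycurl client used for the runs above. The endpoint URL, payload layout, and thread/request counts are assumptions for illustration; the real request body depends on the serving API and the model's input tensors.

import json
import threading
from io import BytesIO

import pycurl

URL = "http://127.0.0.1:8000/api/infer/model"  # assumed endpoint
NUM_THREADS = 8                                 # assumed client concurrency
REQUESTS_PER_THREAD = 200                       # assumed workload per thread

# Hypothetical payload; the real body must match the serving API and the
# model's input tensor names/shapes (here: a dummy seq_len=64 input).
PAYLOAD = json.dumps({"input_ids": [0] * 64})

def worker():
    # Each thread reuses one curl handle and issues its share of requests.
    c = pycurl.Curl()
    for _ in range(REQUESTS_PER_THREAD):
        buf = BytesIO()
        c.setopt(pycurl.URL, URL)
        c.setopt(pycurl.HTTPHEADER, ["Content-Type: application/json"])
        c.setopt(pycurl.POSTFIELDS, PAYLOAD)
        c.setopt(pycurl.WRITEDATA, buf)
        c.perform()
    c.close()

threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()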
Plan
The plan is to compare the performance of TensorRT Inference Server and TensorFlow Serving: at small batch sizes the main concern is latency, at large batch sizes the main concern is throughput.
Method
Converting a model with TensorRT requires fixing the batch size, so for convenience we generate one profile per candidate batch size during conversion, with batch sizes [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]. trtis loads all of these profiles at startup as separate models, and the resulting endpoints are then load-tested with wrk (see the sketch below). TensorFlow Serving is handled the same way, except that the conversion step and the per-batch-size deployments can be skipped.
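To make the sweep concrete, here is a minimal sketch that runs wrk once per candidate batch size and scrapes Requests/sec from its output. The per-batch-size model naming (model_bs{N}) and the wrk arguments are assumptions for illustration, not the exact commands used in the runs below.

import re
import subprocess

BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
BASE = "http://127.0.0.1:8000/api/infer"   # trtis HTTP endpoint used above

for bs in BATCH_SIZES:
    # Hypothetical naming scheme: one registered model per converted profile.
    url = f"{BASE}/model_bs{bs}"
    out = subprocess.run(
        ["wrk", "-t2", "-c64", "--latency", "-s", "script.lua", "-d60s", url],
        capture_output=True, text=True, check=True,
    ).stdout
    # Scrape throughput from wrk's summary line.
    qps = re.search(r"Requests/sec:\s+([\d.]+)", out).group(1)
    print(f"batch_size={bs}: {qps} req/s")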
P40
TensorRT Inference Server
Maximum Throughput
# 2 instances, `server_batch_size=256`
!wrk -t2 -c512 --latency -s script.lua -d60s http://127.0.0.1:8000/api/infer/model
Running 1m test @ http://127.0.0.1:8000/api/infer/model
  2 threads and 512 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   604.11ms   68.59ms    1.04s   71.18%
    Req/Sec   397.83    220.43      2.21k   83.29%
  Latency Distribution
     50%  574.35ms
     75%  684.85ms
     90%  688.70ms
     99%  804.29ms
  44192 requests in 1.00m, 17.70MB read
Requests/sec:    735.80
Transfer/sec:    301.79KB
Online (Batch = 1)
Running 1m test @ http://127.0.0.1:8000/api/infer/model
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.98ms   92.04us    6.90ms   96.53%
    Req/Sec   251.74      3.23    264.00     85.28%
  Latency Distribution
     50%    3.97ms
     75%    4.00ms
     90%    4.04ms
     99%    4.17ms
  15076 requests in 1.00m, 6.03MB read
Requests/sec:    250.98
Transfer/sec:    102.76KB
Online Light Batching
server_batch_size = 8, instances = 2, server_latency = 4000us, connections = 16
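For context, these settings correspond to the model's trtis configuration (instance count plus dynamic batching). The config.pbtxt below is only a sketch: the model name and the tensorrt_plan platform are assumptions; the field values mirror the parameters above.

# sketch of config.pbtxt -- model name and platform are assumed
name: "model"
platform: "tensorrt_plan"
max_batch_size: 8
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: [ 8 ]
  max_queue_delay_microseconds: 4000
}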
Running 1m test @ http://127.0.0.1:8000/api/infer/model
  2 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    14.25ms  157.41us   15.97ms   85.05%
    Req/Sec   281.67      5.19    323.00     91.92%
  Latency Distribution
     50%   14.23ms
     75%   14.30ms
     90%   14.38ms
     99%   14.92ms
  33728 requests in 1.00m, 13.50MB read
Requests/sec:    561.24
Transfer/sec:    230.02KB
TensorFlow Serving
No Batching
!wrk -t2 -c4 --latency -s .script.lua -d60s http://127.0.0.1:8501/v1/models/bert:predict
Running 1m test @ http://127.0.0.1:8501/v1/models/bert:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    28.75ms    2.31ms   38.84ms   74.52%
    Req/Sec    69.55      7.17     90.00     47.24%
  Latency Distribution
     50%   28.52ms
     75%   30.14ms
     90%   31.67ms
     99%   34.35ms
  8351 requests in 1.00m, 1.62MB read
Requests/sec:    139.05
Transfer/sec:     27.57KB
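For reference, what .script.lua posts here is a standard TF Serving REST predict request. A minimal Python equivalent is sketched below; the BERT input tensor names (input_ids, input_mask, segment_ids) and seq_len = 64 are assumptions and must match the exported SavedModel signature.

import requests

SEQ_LEN = 64  # matches the seq_len used in the runs above

# Assumed BERT input tensor names; adjust to the SavedModel signature.
body = {
    "instances": [{
        "input_ids":   [0] * SEQ_LEN,
        "input_mask":  [1] * SEQ_LEN,
        "segment_ids": [0] * SEQ_LEN,
    }]
}

resp = requests.post("http://127.0.0.1:8501/v1/models/bert:predict", json=body)
resp.raise_for_status()
print(resp.json())  # {"predictions": [...]}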
Maximum Throughput
max_batch_size { value: 256 }
batch_timeout_micros { value: 100000 }
max_enqueued_batches { value: 512 }
num_batch_threads { value: 24 }
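These parameters are not part of the model itself; they are passed to the model server as a batching parameters file. A plausible launch command (model and file paths assumed) looks like:

tensorflow_model_server --rest_api_port=8501 \
    --model_name=bert --model_base_path=/models/bert \
    --enable_batching=true --batching_parameters_file=/models/batching.conf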
!wrk -t4 -c1024 --latency --timeout 10s -s .script.lua -d60s http://127.0.0.1:8501/v1/models/bert:predict
Running 1m test @ http://127.0.0.1:8501/v1/models/bert:predict
  4 threads and 1024 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.28s   232.69ms    2.92s    93.23%
    Req/Sec   133.66    166.76      1.61k    95.34%
  Latency Distribution
     50%    2.33s
     75%    2.33s
     90%    2.33s
     99%    2.36s
  26061 requests in 1.00m, 5.02MB read
Requests/sec:    433.91
Transfer/sec:     85.60KB