README

This notebook demonstrates how to use the Inference Engine module in OpenVINO to run inference with a generated IR.

Limitations encountered so far:

  1. The batch size cannot be changed when using BERT
  2. The input/output names are not meaningful (this can be worked around with a mapping; see the sketch after this list)
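
Both points are illustrated by the minimal sketch below. The batch_size / reshape calls are the standard IENetwork knobs for the batch dimension (per point 1 they do not work for this BERT IR), and the alias names in the mapping are purely hypothetical, only showing how opaque tensor names could be given readable names:

from openvino.inference_engine import IENetwork

net = IENetwork(model='./models/bert/1/frozen_model.xml',
                weights='./models/bert/1/frozen_model.bin')

# Standard ways to change the batch dimension; per point 1, they fail for this BERT IR.
try:
    net.batch_size = 2                      # plain batch override
    # net.reshape({'input_ids': [2, 48]})   # or reshape a specific input
except Exception as e:
    print('changing the batch size failed:', e)

# Point 2: map the IR's tensor names to the names the serving code expects
# (the aliases on the right are illustrative only).
name_map = {
    'input_ids': 'token_ids',
    'input_mask': 'attention_mask',
    'segment_ids': 'token_type_ids',
    'Softmax': 'probabilities',
}
outputs_by_alias = {name_map[k]: v for k, v in net.outputs.items()}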

However, according to the OpenVINO documentation, the CPU is an inherently latency-oriented device: forcing a larger batch size to gain throughput degrades the latency of typical requests. OpenVINO therefore internally splits the machine's resources into multiple execution streams and runs requests in parallel, increasing throughput while keeping the batch size at 1. [link]

One way to increase computational efficiency is batching, which combines many (potentially tens) of input images to achieve optimal throughput. However, high batch size also comes with a latency penalty. So, for more real-time oriented usages, lower batch sizes (as low as a single input) are used

The example below shows how to run inference with the BERT model and OpenVINO, using the Python API.

In [151]:
from __future__ import print_function
from openvino.inference_engine import IENetwork, IECore
In [147]:
net = IENetwork(model='./models/bert/1/frozen_model.xml',
                weights='./models/bert/1/frozen_model.bin')
In [148]:
ie = IECore()
extension_path = '/opt/intel/openvino_2019.2.242/deployment_tools/inference_engine/lib/intel64/libcpu_extension_avx2.so'
ie.add_extension(extension_path, 'CPU')
In [152]:
def check_model(ie, net):
    supported_layers = ie.query_network(net, 'CPU')
    not_supported_layers = [l for l in net.layers.keys() if l not in supported_layers]
    assert len(not_supported_layers) == 0, 'cannot run because of unsupported layers'
    print("layers are all supported")
check_model(ie, net)
layers are all supported
In [7]:
net.outputs
Out[7]:
{'Softmax': <openvino.inference_engine.ie_api.OutputInfo at 0x7f0bc690cf30>}

A config dict is passed to load_network below; see the CPU Plugin documentation for the available options.

Throughput Mode for CPU says:

Unlike most accelerators, CPU is perceived as an inherently latency-oriented device. Since 2018 R5 release, the Inference Engine introduced the "throughput" mode, which allows the Inference Engine to efficiently run multiple inference requests on the CPU simultaneously, greatly improving the throughput.

Internally, the execution resources are split/pinned into execution "streams". Using this feature gains much better performance for the networks that originally are not scaled well with a number of threads (for example, lightweight topologies). This is especially pronounced for the many-core server machines.
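
As a quick reference for the config used below, the CPU Plugin documentation lists three forms for CPU_THROUGHPUT_STREAMS; the sketch only enumerates them (the explicit count '4' is an arbitrary example):

# Values accepted by CPU_THROUGHPUT_STREAMS (passed as strings in the Python config dict):
config_numa = {'CPU_THROUGHPUT_STREAMS': 'CPU_THROUGHPUT_NUMA'}  # one stream per NUMA node
config_auto = {'CPU_THROUGHPUT_STREAMS': 'CPU_THROUGHPUT_AUTO'}  # plugin picks the count for peak throughput
config_four = {'CPU_THROUGHPUT_STREAMS': '4'}                    # an explicit number of streams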

In [159]:
print('load network to the plugin')
config = {
    # A typical Xeon server has two NUMA nodes, so this splits the machine's resources
    # into two streams and two requests can run concurrently.
    # In experiments, the latency of an individual request rises slightly.
    'CPU_THROUGHPUT_STREAMS': 'CPU_THROUGHPUT_NUMA',
    # CPU_THROUGHPUT_AUTO gives the highest throughput and the worst latency.
}
req_num = 2  # the number of inference requests is likewise set to 2
exec_net = ie.load_network(net, 'CPU', config, req_num)
load network to the plugin
In [160]:
import numpy as np
In [161]:
a = np.ones([1, 48])  # dummy input: batch size 1, sequence length 48
In [162]:
net.inputs
Out[162]:
{'input_ids': <openvino.inference_engine.ie_api.InputInfo at 0x7f0bacd76170>,
 'input_mask': <openvino.inference_engine.ie_api.InputInfo at 0x7f0bacd76210>,
 'segment_ids': <openvino.inference_engine.ie_api.InputInfo at 0x7f0bacd76670>}
In [163]:
net.outputs
Out[163]:
{'Softmax': <openvino.inference_engine.ie_api.OutputInfo at 0x7f0bac539f30>}
In [178]:
%%time
out = exec_net.infer(inputs={
    'input_ids': a,
    'input_mask': a,
    'segment_ids': a
})
CPU times: user 476 ms, sys: 8 ms, total: 484 ms
Wall time: 40.7 ms
In [165]:
out.keys()
Out[165]:
dict_keys(['Softmax'])
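
The synchronous infer call above keeps only one of the two requests busy. Below is a minimal sketch of driving both requests with the asynchronous API (start_async / wait from openvino.inference_engine), reusing the dummy input a; the per-request feeds are identical here purely for illustration:

feed = {'input_ids': a, 'input_mask': a, 'segment_ids': a}

# Kick off one async inference per request so both NUMA streams are busy.
for request_id in range(req_num):
    exec_net.start_async(request_id=request_id, inputs=feed)

# Collect the results; wait(-1) blocks until the corresponding request finishes.
results = []
for request in exec_net.requests:
    request.wait(-1)
    results.append(request.outputs['Softmax'])

print(len(results), results[0].shape)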