This notebook demonstrates how to use the Inference Engine module in OpenVINO to run inference on a generated IR.
One limitation encountered so far is that the batch size has to stay at 1. However, according to the OpenVINO documentation, the CPU is a latency-oriented device, and forcing a larger batch size to gain throughput degrades latency in the typical case. OpenVINO therefore splits the machine's resources internally into multiple "streams" and executes requests in parallel, which increases throughput while keeping the batch size at 1 [link]
One way to increase computational efficiency is batching, which combines many (potentially tens) of input images to achieve optimal throughput. However, high batch size also comes with a latency penalty. So, for more real-time oriented usages, lower batch sizes (as low as a single input) are used
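As a contrast to the batching approach mentioned in the quote above, here is a minimal sketch of the two options. It assumes a placeholder IR (model.xml / model.bin) and an illustrative stream/request count, not the BERT model used later in this notebook:

from openvino.inference_engine import IENetwork, IECore

ie_demo = IECore()
net_demo = IENetwork(model='model.xml', weights='model.bin')  # placeholder IR

# Option 1: batching. IENetwork.batch_size is writable, so many inputs can be
# folded into a single request -- throughput improves, but every input now
# waits for the whole batch, so latency gets worse.
# net_demo.batch_size = 32
# exec_demo = ie_demo.load_network(net_demo, 'CPU')

# Option 2: streams. Keep batch_size = 1 and let the CPU plugin split its
# resources into execution streams, one in-flight request per stream.
exec_demo = ie_demo.load_network(net_demo, 'CPU',
                                 config={'CPU_THROUGHPUT_STREAMS': 'CPU_THROUGHPUT_AUTO'},
                                 num_requests=4)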
The example below demonstrates how to run inference on a BERT model with OpenVINO, using the Python API.
from __future__ import print_function
from openvino.inference_engine import IENetwork, IECore
# read the IR (topology .xml plus weights .bin)
net = IENetwork(model='./models/bert/1/frozen_model.xml',
                weights='./models/bert/1/frozen_model.bin')
ie = IECore()
# register the CPU extension library that implements layers missing from the core CPU plugin
extension_path = '/opt/intel/openvino_2019.2.242/deployment_tools/inference_engine/lib/intel64/libcpu_extension_avx2.so'
ie.add_extension(extension_path, 'CPU')
def check_model(ie, net):
    # query_network returns the layers the CPU plugin can execute;
    # any remaining layer would make load_network fail
    supported_layers = ie.query_network(net, 'CPU')
    not_supported_layers = [l for l in net.layers.keys() if l not in supported_layers]
    assert len(not_supported_layers) == 0, 'cannot run because of unsupported layers'
    print("layers are all supported")

check_model(ie, net)
net.outputs
A config dictionary is passed to load_network here; refer to the CPU Plugin documentation for the specific settings.
Unlike most accelerators, CPU is perceived as an inherently latency-oriented device. Since 2018 R5 release, the Inference Engine introduced the "throughput" mode, which allows the Inference Engine to efficiently run multiple inference requests on the CPU simultaneously, greatly improving the throughput.
Internally, the execution resources are split/pinned into execution "streams". Using this feature gains much better performance for the networks that originally are not scaled well with a number of threads (for example, lightweight topologies). This is especially pronounced for the many-core server machines.
print('load network to the plugin')
config = {
    # A typical XEON server has two NUMA nodes, so this splits the machine's
    # resources into two streams and two requests can run concurrently.
    # In experiments, the latency of a single request rises slightly.
    'CPU_THROUGHPUT_STREAMS': 'CPU_THROUGHPUT_NUMA',
    # Setting this to CPU_THROUGHPUT_AUTO gives the maximum throughput and the worst latency.
}
req_num = 2  # also set the number of infer requests to 2, one per stream
exec_net = ie.load_network(net, 'CPU', config=config, num_requests=req_num)
import numpy as np
a = np.ones([1, 48])  # dummy input of shape [batch, seq_len], reused for all three BERT inputs below
net.inputs
net.outputs
%%time
out = exec_net.infer(inputs={
    'input_ids': a,
    'input_mask': a,
    'segment_ids': a,
})
out.keys()
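The synchronous exec_net.infer call above exercises only one request at a time, even though the network was loaded with two streams and two infer requests. Below is a minimal sketch of driving both requests in parallel with the asynchronous API, reusing the exec_net, req_num, and a defined above (the feed variable name is just for illustration):

# submit one batch-1 request per stream via the asynchronous API
feed = {'input_ids': a, 'input_mask': a, 'segment_ids': a}

for request_id in range(req_num):
    exec_net.start_async(request_id=request_id, inputs=feed)

# wait for every request to finish and collect its output blobs
results = []
for request_id in range(req_num):
    exec_net.requests[request_id].wait(-1)  # -1 blocks until this request completes
    results.append(exec_net.requests[request_id].outputs)

print(len(results), list(results[0].keys()))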