Experiment 1

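  • Date: 2019/4/29
  • Goal: measure the share of queries that the current tf-idf method fails to recall, to estimate (an upper bound on) the gain from introducing a deep model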
  • Data: the full raw query log for 2019/04/28 provided by frankzhong, 5287122 queries in total, not deduplicated
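  • Parameters: Python with geventhttpclient; a coroutine pool of size 48 and 48 HTTP connections
  • Load: the client uses a single core at ~25% load with <50 MB of memory; the 230 server is fully loaded and some requests time out
  • Code version: pyqa:commit 234fa1f16c2e1af47e5f2c0fbe97844fa29f5b58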
In [16]:
from geventhttpclient import HTTPClient
import ujson
from itertools import islice
import gzip
import requests, io
from tqdm import tqdm_notebook as tqdm
import gevent.pool
import time
import numpy as np
import matplotlib.pyplot as plot
In [2]:
CONCCURRENCY = 48
In [3]:
url = 'http://10.229.146.230:1234'
http = HTTPClient.from_url(url, concurrency=CONCCURRENCY)
In [4]:
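def classify(query):
    # Query the service; the response body is a JSON list of objects with a 'score' field.
    r = http.post('/api?truncate=0.7', body=ujson.dumps({'query': query}))
    obj = ujson.load(r)
    # 2 = hit (score >= 0.9), 1 = near hit (0.7 <= score < 0.9), 0 = miss
    if any(i['score'] >= 0.9 for i in obj):
        return 2
    if any(i['score'] >= 0.7 for i in obj):
        return 1
    return 0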
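The bucketing rule in classify can be checked without the service. The sketch below repeats the same thresholds on hand-written responses; the helper name `bucket` and the sample score lists are made up for illustration, and only the `score` field and the 0.9 / 0.7 cut-offs come from the cell above.

```python
def bucket(matches, hit=0.9, near=0.7):
    """Return 2 for a hit, 1 for a near hit, 0 for a miss.

    `matches` is assumed to be a list of dicts with a 'score' key,
    mirroring what classify() reads from the service response.
    """
    scores = [m["score"] for m in matches]
    if any(s >= hit for s in scores):
        return 2
    if any(s >= near for s in scores):
        return 1
    return 0


# Hypothetical responses, not taken from the real query log.
hit_case = bucket([{"score": 0.95}, {"score": 0.72}])
near_case = bucket([{"score": 0.7}])
miss_case = bucket([])
print(hit_case, near_case, miss_case)
```

An empty response counts as a miss, which matches the behaviour of the `any(...)` checks in classify.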
In [5]:
latency = []
In [6]:
def cb(query):
    # Time a single request and tally the bucket it falls into.
    start = time.time()
    x = classify(query)
    end = time.time()
    latency.append((end - start) * 1000)  # ms
    results[x] += 1
In [7]:
x=!wc -l /data/querylog_20190428.txt
size = int(x[0].split()[0])
In [8]:
print 'size:', size
size: 5287122
In [9]:
pool = gevent.pool.Pool(CONCCURRENCY)
In [10]:
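results = {2:0, 1:0, 0:0}
with open('/data/querylog_20190428.txt') as src:
    for line in tqdm(src, total=size):
        query = line.strip()
        pool.wait_available()  # block until a pool slot is free
        pool.spawn(cb, query)
pool.join()  # wait for in-flight requests to finish before reading results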

In [14]:
total = sum(results.values())
print 'hit: ', results[2]/float(total)
print 'subhit:', results[1]/float(total)
hit:  0.0123820850626
subhit: 0.112151068814
In [33]:
latency = np.array(sorted(latency))
In [47]:
np.percentile(latency, [1, 5, 25, 50, 75, 95, 99])
Out[47]:
array([ 28.4436202 ,  39.90721703,  71.16413116, 100.26097298,
       138.61584663, 209.37634706, 276.471138  ])

Experiment 1 results

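  • Hits (score ≥ 0.9): 1.24% of queries
  • Near hits (score ≥ 0.7 and < 0.9): 11.22% of queries
  • Latency is shown in the percentile output above, in milliseconds; the 95th percentile (209 ms) already exceeds the 200 ms time budget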
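As a quick check of the latency observation, the sketch below compares the reported percentiles with the 200 ms budget. The numbers are copied from the Out[47] cell; the variable names are illustrative.

```python
# Percentiles and latencies (ms) copied from the Out[47] cell.
percentiles = [1, 5, 25, 50, 75, 95, 99]
latency_ms = [28.4436202, 39.90721703, 71.16413116, 100.26097298,
              138.61584663, 209.37634706, 276.471138]
BUDGET_MS = 200.0  # time budget stated in the results

# Keep only the percentiles whose latency exceeds the budget.
over_budget = [(p, ms) for p, ms in zip(percentiles, latency_ms) if ms > BUDGET_MS]
first_over = over_budget[0][0]

for p, ms in over_budget:
    print("p%d = %.1f ms exceeds the %.0f ms budget" % (p, ms, BUDGET_MS))
```

The 75th percentile is still within budget, so the crossover lies somewhere between p75 and p95; locating it more precisely would need the raw latency array.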

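Conclusion: if a deep model recovered every near hit, recall could rise by at most 11.22/1.24 ≈ 9 times the current hit rate, i.e. an upper bound of roughly 900%.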
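The arithmetic behind this bound can be reproduced from the unrounded fractions printed by the In [14] cell. This is a minimal sketch; treating every near hit as recoverable is the assumption that makes the figure an upper bound.

```python
# Fractions printed by the In [14] cell.
hit = 0.0123820850626     # score >= 0.9
subhit = 0.112151068814   # 0.7 <= score < 0.9

# Best case: every near hit becomes a hit.
relative_gain = subhit / hit
best_case_hit = hit + subhit

print("relative gain upper bound: %.2fx" % relative_gain)
print("best-case hit rate: %.2f%%" % (best_case_hit * 100))
```

Even in this best case, most queries would still fall below the 0.7 threshold.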