Experiment 1

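Experiment date: 2019/4/29
Goal: measure the share of queries that the current tf-idf method fails to recall, in order to estimate (an upper bound on) the gain from introducing a deep model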
Data: the full raw query log for 2019/04/28 provided by frankzhong (file querylog_20190428.txt), 5287122 queries in total, not deduplicated
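Setup: Python with geventhttpclient; a coroutine pool of size 48 and 48 HTTP connections
Load: the client uses a single core at ~25% load with <50 MB of memory; the 230 server (10.229.146.230) is fully loaded, and some requests time out
Code version: pyqa: commit 234fa1f16c2e1af47e5f2c0fbe97844fa29f5b58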

from geventhttpclient import HTTPClient
import ujson
from itertools import islice
import gzip
import requests, io
from tqdm import tqdm_notebook as tqdm
import gevent.pool
import time
import numpy as np
import matplotlib.pyplot as plot

CONCCURRENCY = 48

url = 'http://10.229.146.230:1234'
http = HTTPClient.from_url(url, concurrency=CONCCURRENCY)

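def classify(query):
    # Query the service and bucket the result by score:
    # 2 = hit (some score >= 0.9), 1 = near hit (some score >= 0.7), 0 = miss.
    r = http.post('/api?truncate=0.7', body=ujson.dumps({'query': query}))
    obj = ujson.load(r)
    if any(i['score'] >= 0.9 for i in obj):
        return 2
    if any(i['score'] >= 0.7 for i in obj):
        return 1
    return 0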
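
The thresholding in classify can be checked without the HTTP service. The following is a minimal sketch, assuming the response is a list of candidates that each carry a `score` in [0, 1]; the helper name `bucket_scores` and its parameters are hypothetical and not part of pyqa.

```python
def bucket_scores(scores, hit=0.9, near=0.7):
    """Bucket a list of similarity scores (hypothetical helper).

    Returns 2 if any score reaches the hit threshold, 1 if any score
    reaches the near-hit threshold, and 0 otherwise. Checking the best
    score is equivalent to the any() tests used in classify.
    """
    best = max(scores) if scores else 0.0
    if best >= hit:
        return 2
    if best >= near:
        return 1
    return 0


# Example calls with made-up scores, for illustration only.
print(bucket_scores([0.95, 0.72]))
print(bucket_scores([0.75]))
print(bucket_scores([]))
```

Keeping the bucketing separate from the request makes it easy to try other thresholds on saved responses without rerunning the full query log.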

latency = []

def cb(query):
    start = time.time()
    x = classify(query)
    end = time.time()
    latency.append((end-start)*1000) #ms
    results[x] += 1

x=!wc -l /data/querylog_20190428.txt
size = int(x[0].split()[0])

print 'size:', size

size: 5287122

pool = gevent.pool.Pool(CONCCURRENCY)

results = {2:0, 1:0, 0:0}
src = open('/data/querylog_20190428.txt')
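for line in tqdm(src, total=size):
    query = line.strip()
    pool.wait_available()  # block until a coroutine slot is free
    pool.spawn(cb, query)
pool.join()  # wait for in-flight requests so every query is counted
src.close()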

total = sum(results.values())
print 'hit: ', results[2]/float(total)
print 'subhit:', results[1]/float(total)

hit:  0.0123820850626
subhit: 0.112151068814

latency = np.array(sorted(latency))

np.percentile(latency, [1, 5, 25, 50, 75, 95, 99])

array([ 28.4436202 ,  39.90721703,  71.16413116, 100.26097298,
       138.61584663, 209.37634706, 276.471138  ])

Experiment 1 results

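Hits (similarity score >= 0.9): 1.24% of queries
Near hits (similarity score >= 0.7 but < 0.9): 11.22% of queries
Latency percentiles are listed above, in milliseconds. The 95th-percentile latency (about 209 ms) already exceeds the 200 ms time budget.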
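
As a quick check of the latency claim, the sketch below compares the percentiles printed earlier against the 200 ms budget. The values are copied from the np.percentile output above; the constant names are illustrative and not part of the original notebook.

```python
# Percentile levels and the latencies (ms) reported by np.percentile above.
PERCENTILES = [1, 5, 25, 50, 75, 95, 99]
LATENCY_MS = [28.4436202, 39.90721703, 71.16413116, 100.26097298,
              138.61584663, 209.37634706, 276.471138]
BUDGET_MS = 200.0  # time budget stated in the results

# Keep the percentile levels whose latency exceeds the budget.
over_budget = [p for p, ms in zip(PERCENTILES, LATENCY_MS) if ms > BUDGET_MS]
print(over_budget)
```

Only the 95th and 99th percentiles exceed the budget, while the 75th percentile (about 139 ms) stays within it. These latencies were measured with 48 concurrent requests against a fully loaded server.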

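Conclusion: if a deep model could turn every near hit into a hit, recall would rise by at most 11.22 / 1.24 ≈ 9.05 times its current value, i.e. an upper bound of roughly 900% on the possible improvement.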
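
The arithmetic behind this bound can be reproduced from the unrounded ratios printed earlier. This is a minimal sketch, assuming the best case in which every near hit becomes a hit; the variable names are illustrative.

```python
# Ratios printed by the notebook above.
hit = 0.0123820850626     # share of queries with a score >= 0.9
subhit = 0.112151068814   # share of queries with a score in [0.7, 0.9)

# Relative gain if every near hit were converted into a hit.
relative_gain = subhit / hit
# Hit rate that would result in that best case.
recall_ceiling = hit + subhit

print('relative gain: %.2fx' % relative_gain)
print('recall ceiling: %.2f%%' % (recall_ceiling * 100))
```

The bound only covers queries that already score at least 0.7. Queries counted as misses are left out of it.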