API returns repeated sentences too often #91

Closed
XuLin404 opened this issue Jul 15, 2020 · 5 comments
@XuLin404

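When I use the API without specifying a category, the sentences it returns are very likely to repeat. Refreshing once every 60 s, the same sentence comes up three or four times within an hour.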

@issue-label-bot

Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.57. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

@issue-label-bot issue-label-bot bot added the bug Something isn't working label Jul 15, 2020
@greenhat616 greenhat616 added this to the v1.5.x milestone Jul 16, 2020
@freejishu
Member

freejishu commented Jul 16, 2020

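Let me first explain the logic here.
If no specific type is submitted, a sentence is chosen by the following rules:
1. Draw a random number from 1 to N (N is the number of types); the type it maps to is the type for this response.
2. Draw a random number from 1 to x (x is the number of sentences under that type); return the sentence it maps to.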
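The two rules above can be sketched as follows. This is only an illustration of the described selection order, not the project's actual code; the `SENTENCES_BY_TYPE` data and the `pick_sentence` name are hypothetical.

```python
import random

# Hypothetical data: type names and sentences are made up for illustration.
SENTENCES_BY_TYPE = {
    "a": ["a1", "a2", "a3", "a4"],
    "b": ["b1", "b2"],
    "c": ["c1"],
}


def pick_sentence(sentences_by_type, rng=random):
    """Two-stage pick: uniform over types, then uniform within the chosen type."""
    types = list(sentences_by_type)
    # Rule 1: every type is equally likely, regardless of how many sentences it holds.
    chosen_type = rng.choice(types)
    # Rule 2: every sentence inside the chosen type is equally likely.
    sentence = rng.choice(sentences_by_type[chosen_type])
    return chosen_type, sentence


if __name__ == "__main__":
    print(pick_sentence(SENTENCES_BY_TYPE))
```

Note that in this sketch the single sentence in type "c" is returned every time that type is drawn, which is the kind of repetition the issue describes.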
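Now, hitting the API in a loop with 20 threads, about 80,000 times, gave this result:
[image]
The standard deviation is 106.8, which is within an acceptable range.
With the algorithm above, the probability of getting one particular sentence works out to 1/(N·x). This gives the less popular types, which have fewer sentences, more exposure, but in extreme cases (for example, drawing the same type several times in a row) the output feels repetitive.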
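To make the 1/(N·x) figure concrete, here is a small worked calculation. The numbers (12 types, 5 and 5000 sentences per type) are assumptions chosen for illustration and are not the service's real counts; the function names are hypothetical too.

```python
from fractions import Fraction


def sentence_probability(num_types, sentences_in_type):
    """P(one specific sentence) = (1/N) * (1/x) = 1/(N*x)."""
    return Fraction(1, num_types * sentences_in_type)


def expected_hits(num_requests, num_types, sentences_in_type):
    """Expected number of times one specific sentence appears in num_requests draws."""
    return num_requests * sentence_probability(num_types, sentences_in_type)


# Assumed numbers: N = 12 types; a small type holds 5 sentences, a large one 5000.
N = 12
REQUESTS_PER_HOUR = 60  # one refresh every 60 s, as in the original report

small = expected_hits(REQUESTS_PER_HOUR, N, 5)
large = expected_hits(REQUESTS_PER_HOUR, N, 5000)

if __name__ == "__main__":
    print("sentence in the small type:", small)
    print("sentence in the large type:", float(large))
```

Under these assumed numbers, each sentence in the small type is expected about once per hour, while a sentence in the large type is expected once per thousand hours, so seeing a small-type sentence several times in an hour is not surprising.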
@greenhat616 I suggest replacing N with an individual weight assigned to each category, to allocate resources better.
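A per-category weight could look like the sketch below. The thread does not say how the weights would be chosen; defaulting to the number of sentences in each type (which makes every sentence equally likely overall) is an assumption made here, and `pick_sentence_weighted` is a hypothetical name.

```python
import random

# Hypothetical data: type names and sentences are made up for illustration.
SENTENCES_BY_TYPE = {
    "a": ["a1", "a2", "a3", "a4"],
    "b": ["b1", "b2"],
    "c": ["c1"],
}


def pick_sentence_weighted(sentences_by_type, weights=None, rng=random):
    """Pick a type according to per-type weights, then a sentence uniformly within it."""
    types = list(sentences_by_type)
    if weights is None:
        # Assumed default: weight each type by its size, so all sentences are equally likely.
        type_weights = [len(sentences_by_type[t]) for t in types]
    else:
        type_weights = [weights[t] for t in types]
    chosen_type = rng.choices(types, weights=type_weights, k=1)[0]
    return chosen_type, rng.choice(sentences_by_type[chosen_type])


if __name__ == "__main__":
    # Explicit weights: still favour small types a little, but less than a uniform pick.
    print(pick_sentence_weighted(SENTENCES_BY_TYPE, weights={"a": 3, "b": 2, "c": 1}))
```

Passing explicit weights keeps some extra exposure for small types while reducing how often their few sentences repeat.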

@greenhat616 greenhat616 added need_discussion and removed bug Something isn't working need_confirmation labels Jul 16, 2020
@greenhat616
Member

greenhat616 commented Jul 16, 2020

I think this needs a more in-depth discussion, so this issue will stay open for a while.

@greenhat616 greenhat616 modified the milestones: v1.5.x, 1.6.x Aug 17, 2020
@greenhat616 greenhat616 pinned this issue Dec 15, 2020
@greenhat616 greenhat616 removed this from the 1.6.x milestone May 21, 2021
@zhang33ya

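Draw a random number from 1 to x (x is the number of sentences under that type); return the sentence it maps to. -----
1. For a random request, the type can still be chosen by the random function, but x can simply follow the storage order in the database (say, sorted by uuid or some other order), which effectively reduces repetition. (To the user, sentences already sorted by uuid look random anyway.)
2. Build a cache list keyed by the requesting IP so that 1,000 requests never repeat. After 1,000 requests, rebuild the list of already-served sentences.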
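The second suggestion might be sketched like this: an in-memory history per client key that is cleared once it reaches the limit. The class name, the in-memory dictionary, and the use of a plain string as the client key are all assumptions for illustration; a real service would need to bound and expire this state.

```python
import random


class NoRepeatPicker:
    """Serve sentences so that one client sees no repeat within `history_limit` requests."""

    def __init__(self, sentences, history_limit=1000, rng=None):
        self.sentences = list(sentences)
        if not self.sentences:
            raise ValueError("at least one sentence is required")
        # The history can never be longer than the number of available sentences.
        self.history_limit = min(history_limit, len(self.sentences))
        self.rng = rng or random.Random()
        self.seen = {}  # client key (e.g. an IP string) -> set of served indices

    def pick(self, client_key):
        seen = self.seen.setdefault(client_key, set())
        if len(seen) >= self.history_limit:
            # Limit reached: start a fresh history for this client.
            seen.clear()
        # O(n) scan for unserved sentences; simple, but not tuned for large pools.
        candidates = [i for i in range(len(self.sentences)) if i not in seen]
        idx = self.rng.choice(candidates)
        seen.add(idx)
        return self.sentences[idx]


if __name__ == "__main__":
    picker = NoRepeatPicker([f"sentence {i}" for i in range(10)], history_limit=10)
    print([picker.pick("203.0.113.7") for _ in range(10)])
```

Keying on IP means clients behind a shared address share one history, and the dictionary grows with the number of distinct clients, so this trades memory for the no-repeat guarantee.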

@greenhat616 greenhat616 closed this as not planned Apr 11, 2023
@greenhat616
Member

greenhat616 commented Apr 12, 2023

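Draw a random number from 1 to x (x is the number of sentences under that type); return the sentence it maps to. -----
1. For a random request, the type can still be chosen by the random function, but x can simply follow the storage order in the database (say, sorted by uuid or some other order), which effectively reduces repetition. (To the user, sentences already sorted by uuid look random anyway.)
2. Build a cache list keyed by the requesting IP so that 1,000 requests never repeat. After 1,000 requests, rebuild the list of already-served sentences.

@ZhangSanSanLa If it is convenient for you, perhaps you could try putting together a PR?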
