RAG Theory - Part 2 - Evaluation (2)
0. Preface
This article is the second installment in the RAG evaluation series; it supplements some evaluation methods commonly used by RAG approaches.
1. Additional Evaluation Methods
Language Model Evaluation
- Perplexity
  - Representative method: In-Context RALM[2]
  - $\text{perplexity}=\left(\prod_{i=1}^n\frac{1}{p(w_i\mid w_1,\ldots,w_{i-1})}\right)^{\frac{1}{n}}$
  - Taking the logarithm gives exactly the language model's cross-entropy loss: $\mathcal{L}=-\frac{1}{n}\sum_{i=1}^n\log p(w_i\mid w_1,\ldots,w_{i-1})$, so $\text{perplexity}=e^{\mathcal{L}}$
  - The WikiText-103[4] training set is typically used as the retrieval corpus, and perplexity is measured on the WikiText-103 test set
  - Tested on models such as GPT-2, GPT-Neo, GPT-J, OPT, and LLaMA
- BPB (bits per UTF-8 encoded byte)
  - Representative method: REPLUG[1]
  - $\text{BPB}=(L_T/L_B)\log_2(e^{\mathcal{L}})=(L_T/L_B)\,\mathcal{L}/\ln 2$, where $L_T$ is the length in tokens and $L_B$ is the length in UTF-8 bytes; a short code sketch after this list shows how both perplexity and BPB are derived from the same loss $\mathcal{L}$
  - BPB is the evaluation metric recommended by the Pile[3], so it is generally measured on the Pile test set
  - Tested on GPT-2 and GPT-3 family models
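To make the relationship between cross-entropy, perplexity, and BPB concrete, the following minimal sketch computes all three for a single string. It uses Hugging Face `transformers` with GPT-2 purely for illustration; it is not the evaluation pipeline of In-Context RALM or REPLUG (no retrieval, and no sliding-window scoring over a full test corpus).

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works; GPT-2 is used here only because it is small and public.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Retrieval-augmented generation conditions a language model on retrieved documents."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels=input_ids, the model returns the mean token-level
    # cross-entropy L = -(1/n) * sum_i log p(w_i | w_1 ... w_{i-1}).
    out = model(**enc, labels=enc["input_ids"])

loss = out.loss.item()                       # cross-entropy L, in nats per token
perplexity = math.exp(loss)                  # perplexity = e^L

n_tokens = enc["input_ids"].numel() - 1      # L_T: number of predicted positions
n_bytes = len(text.encode("utf-8"))          # L_B: UTF-8 byte length of the text
bpb = (n_tokens / n_bytes) * loss / math.log(2)  # BPB = (L_T/L_B) * L / ln 2

print(f"cross-entropy={loss:.3f}  perplexity={perplexity:.2f}  BPB={bpb:.3f}")
```

In real benchmark runs the loss is aggregated over the whole test set (typically with a sliding context window), and for in-context RAG methods such as In-Context RALM the retrieved documents are prepended to the context before scoring.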
Downstream Task Evaluation
- Specialized-knowledge QA
  - Representative method: REPLUG[1]
  - Evaluated by accuracy on multiple-choice questions
  - Dataset: MMLU[5]
- Open-domain QA
  - Representative method: REPLUG[1]
  - Evaluated by answer accuracy (exact match against the gold answers); a minimal sketch of both accuracy metrics follows this list
  - Datasets: Natural Questions (NQ)[6], TriviaQA[7]
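For reference, here is a minimal sketch of the two accuracy metrics. The question format, the `score_fn` interface, and the normalization rules are illustrative assumptions, not REPLUG's actual pipeline: multiple-choice accuracy takes the option the language model scores highest, and open-domain QA uses SQuAD-style normalized exact match against any gold answer.

```python
import re
import string

def normalize(s: str) -> str:
    """SQuAD-style answer normalization: lowercase, drop punctuation,
    drop English articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """NQ/TriviaQA-style EM: correct if the prediction matches any
    gold answer after normalization."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

def multiple_choice_accuracy(questions, score_fn) -> float:
    """MMLU-style accuracy. `score_fn(question, option)` is assumed to
    return the LM's score (e.g. log-likelihood) for that option given
    the question plus any retrieved context; the argmax is the prediction."""
    correct = 0
    for q in questions:
        scores = [score_fn(q["question"], opt) for opt in q["options"]]
        pred = scores.index(max(scores))
        correct += int(pred == q["answer_idx"])
    return correct / len(questions)

# Toy usage with a dummy scorer standing in for a real LM.
qs = [{"question": "2 + 2 = ?", "options": ["3", "4", "5"], "answer_idx": 1}]
print(multiple_choice_accuracy(qs, lambda q, opt: 1.0 if opt == "4" else 0.0))  # 1.0
print(exact_match("The Eiffel Tower!", ["eiffel tower"]))                      # True
```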
2. References
[1] Shi W, Min S, Yasunaga M, et al. REPLUG: Retrieval-augmented black-box language models[J]. arXiv preprint arXiv:2301.12652, 2023.
[2] Ram O, Levine Y, Dalmedigos I, et al. In-context retrieval-augmented language models[J]. arXiv preprint arXiv:2302.00083, 2023.
[3] Gao L, Biderman S, Black S, et al. The Pile: An 800GB dataset of diverse text for language modeling[J]. arXiv preprint arXiv:2101.00027, 2020.
[4] Merity S, Xiong C, Bradbury J, et al. Pointer sentinel mixture models[J]. arXiv preprint arXiv:1609.07843, 2016.
[5] Hendrycks D, Burns C, Basart S, et al. Measuring Massive Multitask Language Understanding[C]//International Conference on Learning Representations. 2021.
[6] Kwiatkowski T, Palomaki J, Redfield O, et al. Natural Questions: A Benchmark for Question Answering Research[J]. Transactions of the Association for Computational Linguistics, 2019, 7: 452-466.
[7] Joshi M, Choi E, Weld D S, et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017: 1601-1611.