RAG Theory: Evaluation (Part 2)

Posted by BUAADreamer on 2024-01-16

0. Preface

This is the second post in the RAG evaluation series; it supplements the evaluation methods commonly used for RAG approaches.

1. Additional Evaluation Methods

Language Model Evaluation

  1. Perplexity
    • Representative method: ICRALM [2]
    • $\mathrm{PPL}=\left(\prod_{i=1}^n\frac{1}{p(w_i\mid w_1,\dots,w_{i-1})}\right)^{\frac{1}{n}}$
    • Taking the logarithm yields exactly the language model's cross-entropy loss, $\mathcal{L}=-\frac{1}{n}\sum_{i=1}^n\log p(w_i\mid w_1,\dots,w_{i-1})$, so $\mathrm{PPL}=e^{\mathcal{L}}$
    • Typically the WikiText [4] training set serves as the retrieval corpus, and perplexity is measured on the WikiText test set (a minimal sketch follows this list)
    • Tested on models such as GPT-2, GPT-3, ChatGPT, and LLaMA
  2. BPB (bits per UTF-8 encoded byte)
    • Representative method: REPLUG [1]
    • $\mathrm{BPB}=\frac{L_T}{L_B}\log_2\left(e^{\mathcal{L}}\right)=\frac{L_T}{L_B}\cdot\frac{\mathcal{L}}{\ln 2}$, where $L_T$ is the token count and $L_B$ the UTF-8 byte count
    • BPB is the evaluation metric recommended by the Pile [3], so it is usually measured on the Pile test set (see the second sketch below)
    • Tested on models such as GPT-2, GPT-3, ChatGPT, and LLaMA
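
To make the perplexity computation concrete, here is a minimal sketch using GPT-2 via Hugging Face transformers. It assumes any retrieved passages have already been prepended to the evaluated text; the model name and example string are illustrative, not the ICRALM setup.

```python
# Minimal perplexity sketch: score a text chunk with a causal LM and
# report PPL = exp(L). Assumes retrieved context is already prepended.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def cross_entropy(text: str) -> float:
    """Mean next-token cross-entropy L, in nats per token."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=input_ids makes the model return the shifted
        # next-token cross-entropy averaged over the sequence
        return model(input_ids=ids, labels=ids).loss.item()

loss = cross_entropy("Retrieval-augmented language models condition on retrieved passages.")
print(f"L = {loss:.3f} nats/token, PPL = {math.exp(loss):.2f}")
```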
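
BPB is then just a unit conversion of the same loss. A self-contained sketch with hypothetical counts:

```python
# BPB from cross-entropy: BPB = (L_T / L_B) * L / ln 2, with L in
# nats per token, L_T the token count, L_B the UTF-8 byte count.
import math

def bits_per_byte(loss: float, n_tokens: int, n_bytes: int) -> float:
    return (n_tokens / n_bytes) * loss / math.log(2)

# Hypothetical counts: 1000 tokens spanning 4200 UTF-8 bytes,
# at L = 3.0 nats/token.
print(f"BPB = {bits_per_byte(3.0, 1000, 4200):.3f}")  # 1.030
```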

Downstream Task Evaluation

  1. Specialized-knowledge QA
    • Representative method: REPLUG [1]
    • Evaluated by multiple-choice accuracy (see the first sketch after this list)
    • Dataset: MMLU [5]
  2. Open-domain QA
    • Representative method: REPLUG [1]
    • Evaluated by answer accuracy (exact match is common; see the second sketch after this list)
    • Datasets: Natural Questions (NQ) [6], TriviaQA [7]
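
As a sketch of how multiple-choice accuracy is typically computed: score each candidate option under the model and count an item correct when the highest-scoring option is the gold one. The item format and the dummy scorer below are hypothetical placeholders, not REPLUG's actual harness.

```python
# Multiple-choice accuracy sketch: the model "answers" by assigning
# each option a score (e.g. the option's log-likelihood as a
# continuation of the question); accuracy compares argmax to gold.
from typing import Callable

def choice_accuracy(items: list[dict],
                    score: Callable[[str, str], float]) -> float:
    correct = 0
    for item in items:
        pred = max(range(len(item["options"])),
                   key=lambda i: score(item["question"], item["options"][i]))
        correct += int(pred == item["answer"])  # gold option index
    return correct / len(items)

# Hypothetical stand-in for an MMLU-style item, with a dummy oracle scorer.
items = [{"question": "2 + 2 = ?", "options": ["3", "4", "5", "22"], "answer": 1}]
print(choice_accuracy(items, lambda q, o: float(o == "4")))  # 1.0
```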
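
For open-domain QA, answer accuracy is usually exact match after light normalization; the SQuAD-style normalization below is a common convention, assumed here rather than taken from REPLUG:

```python
# Exact-match accuracy sketch for NQ/TriviaQA-style evaluation:
# lowercase, strip punctuation and articles, collapse whitespace,
# then count a prediction correct if it matches any gold alias.
import re
import string

def normalize(s: str) -> str:
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def em_accuracy(preds: list[str], golds: list[list[str]]) -> float:
    hits = sum(any(normalize(p) == normalize(g) for g in gs)
               for p, gs in zip(preds, golds))
    return hits / len(preds)

print(em_accuracy(["The Eiffel Tower"], [["Eiffel Tower", "La Tour Eiffel"]]))  # 1.0
```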

2. References

[1] Shi W, Min S, Yasunaga M, et al. REPLUG: Retrieval-augmented black-box language models[J]. arXiv preprint arXiv:2301.12652, 2023.

[2]Ram O, Levine Y, Dalmedigos I, et al. In-context retrieval-augmented language models[J]. arXiv preprint arXiv:2302.00083, 2023.

[3] Gao L, Biderman S, Black S, et al. The Pile: An 800GB dataset of diverse text for language modeling[J]. arXiv preprint arXiv:2101.00027, 2020.

[4]Karpukhin V, Oguz B, Min S, et al. Dense Passage Retrieval for Open-Domain Question Answering[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020: 6769-6781.

[5]Hendrycks D, Burns C, Basart S, et al. Measuring Massive Multitask Language Understanding[C]//International Conference on Learning Representations. 2020.

[6]Kwiatkowski T, Palomaki J, Redfield O, et al. Natural Questions: a Benchmark for Question Answering Research[J]. Transactions of the Association for Computational Linguistics, 2019, 7: 452-466.

[7]Joshi M, Choi E, Weld D S, et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017: 1601-1611.