RAG Theory: Evaluation (Part 2)

Posted by BUAADreamer on 2024-01-16

0. Preface

This is the second post in the RAG evaluation series; it supplements the evaluation methods commonly used for RAG approaches.

1. Additional Evaluation Methods

Language Model Evaluation

  1. Perplexity
    • Representative method: ICRALM [2]
    • $\mathrm{PPL}=\left(\prod_{i=1}^n\frac{1}{p(w_i\mid w_1,\dots,w_{i-1})}\right)^{\frac{1}{n}}$
    • Taking the logarithm yields exactly the language model's cross-entropy loss, $\mathcal{L}=-\frac{1}{n}\sum_{i=1}^n\log p(w_i\mid w_1,\dots,w_{i-1})$, so $\mathrm{PPL}=e^{\mathcal{L}}$
    • Typically the WikiText [4] training set serves as the retrieval corpus, and perplexity is measured on the WikiText test set (a minimal sketch follows this list)
    • Tested on models such as GPT-2, GPT-3, ChatGPT, and LLaMA
  2. BPB (bits per UTF-8 encoded byte)
    • Representative method: REPLUG [1]
    • $\mathrm{BPB}=\frac{L_T}{L_B}\log_2\left(e^{\mathcal{L}}\right)=\frac{L_T}{L_B}\cdot\frac{\mathcal{L}}{\ln 2}$, where $L_T$ is the token count and $L_B$ the UTF-8 byte count
    • BPB is the evaluation metric recommended by the Pile [3], so it is usually measured on the Pile test set (see the second sketch below)
    • Tested on models such as GPT-2, GPT-3, ChatGPT, and LLaMA
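
To make the perplexity computation concrete, here is a minimal sketch using GPT-2 via Hugging Face transformers. It assumes any retrieved passages have already been prepended to the evaluated text; the model name and example string are illustrative, not the ICRALM setup.

```python
# Minimal perplexity sketch: score a text chunk with a causal LM and
# report PPL = exp(L). Assumes retrieved context is already prepended.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def cross_entropy(text: str) -> float:
    """Mean next-token cross-entropy L, in nats per token."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=input_ids makes the model return the shifted
        # next-token cross-entropy averaged over the sequence
        return model(input_ids=ids, labels=ids).loss.item()

loss = cross_entropy("Retrieval-augmented language models condition on retrieved passages.")
print(f"L = {loss:.3f} nats/token, PPL = {math.exp(loss):.2f}")
```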
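
BPB is then just a unit conversion of the same loss. A self-contained sketch with hypothetical counts:

```python
# BPB from cross-entropy: BPB = (L_T / L_B) * L / ln 2, with L in
# nats per token, L_T the token count, L_B the UTF-8 byte count.
import math

def bits_per_byte(loss: float, n_tokens: int, n_bytes: int) -> float:
    return (n_tokens / n_bytes) * loss / math.log(2)

# Hypothetical counts: 1000 tokens spanning 4200 UTF-8 bytes,
# at L = 3.0 nats/token.
print(f"BPB = {bits_per_byte(3.0, 1000, 4200):.3f}")  # 1.030
```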

Downstream Task Evaluation

  1. Specialized-knowledge QA
    • Representative method: REPLUG [1]
    • Evaluated by multiple-choice accuracy (see the first sketch after this list)
    • Dataset: MMLU [5]
  2. Open-domain QA
    • Representative method: REPLUG [1]
    • Evaluated by answer accuracy (exact match is common; see the second sketch after this list)
    • Datasets: Natural Questions (NQ) [6], TriviaQA [7]
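
As a sketch of how multiple-choice accuracy is typically computed: score each candidate option under the model and count an item correct when the highest-scoring option is the gold one. The item format and the dummy scorer below are hypothetical placeholders, not REPLUG's actual harness.

```python
# Multiple-choice accuracy sketch: the model "answers" by assigning
# each option a score (e.g. the option's log-likelihood as a
# continuation of the question); accuracy compares argmax to gold.
from typing import Callable

def choice_accuracy(items: list[dict],
                    score: Callable[[str, str], float]) -> float:
    correct = 0
    for item in items:
        pred = max(range(len(item["options"])),
                   key=lambda i: score(item["question"], item["options"][i]))
        correct += int(pred == item["answer"])  # gold option index
    return correct / len(items)

# Hypothetical stand-in for an MMLU-style item, with a dummy oracle scorer.
items = [{"question": "2 + 2 = ?", "options": ["3", "4", "5", "22"], "answer": 1}]
print(choice_accuracy(items, lambda q, o: float(o == "4")))  # 1.0
```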
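
For open-domain QA, answer accuracy is usually exact match after light normalization; the SQuAD-style normalization below is a common convention, assumed here rather than taken from REPLUG:

```python
# Exact-match accuracy sketch for NQ/TriviaQA-style evaluation:
# lowercase, strip punctuation and articles, collapse whitespace,
# then count a prediction correct if it matches any gold alias.
import re
import string

def normalize(s: str) -> str:
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def em_accuracy(preds: list[str], golds: list[list[str]]) -> float:
    hits = sum(any(normalize(p) == normalize(g) for g in gs)
               for p, gs in zip(preds, golds))
    return hits / len(preds)

print(em_accuracy(["The Eiffel Tower"], [["Eiffel Tower", "La Tour Eiffel"]]))  # 1.0
```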

2. References

[1] Shi W, Min S, Yasunaga M, et al. REPLUG: Retrieval-augmented black-box language models[J]. arXiv preprint arXiv:2301.12652, 2023.

[2]Ram O, Levine Y, Dalmedigos I, et al. In-context retrieval-augmented language models[J]. arXiv preprint arXiv:2302.00083, 2023.

[3] Gao L, Biderman S, Black S, et al. The Pile: An 800GB dataset of diverse text for language modeling[J]. arXiv preprint arXiv:2101.00027, 2020.

[4]Karpukhin V, Oguz B, Min S, et al. Dense Passage Retrieval for Open-Domain Question Answering[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020: 6769-6781.

[5]Hendrycks D, Burns C, Basart S, et al. Measuring Massive Multitask Language Understanding[C]//International Conference on Learning Representations. 2020.

[6]Kwiatkowski T, Palomaki J, Redfield O, et al. Natural Questions: a Benchmark for Question Answering Research[J]. Transactions of the Association for Computational Linguistics, 2019, 7: 452-466.

[7]Joshi M, Choi E, Weld D S, et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017: 1601-1611.