Fig. 9

Performance of BM25, PES1, and PES2. (A) Precision@10: for each query and each retriever, we computed (using Eq. 2) the fraction of the top ten retrieved chunks that originated from the key paper. PES2 achieved the highest precision of the three retrievers, with a median of 0.95. (B) Quality of LLM-generated answers: we assessed answer quality manually and, for the quantitative evaluation, assigned each generated answer to one of five categories: Excellent (4), Better (3), Good (2), Poor (1), and Wrong (0). PES2 yielded the highest-quality answers of the three retrievers, with a median score of 2.5.
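
As an illustration of the Precision@10 measure described in this caption, the following is a minimal sketch that assumes each retrieved chunk is tagged with the identifier of its source paper. The function name `precision_at_k`, the paper identifiers, and the toy rankings are hypothetical and are not taken from the study. The sketch only restates the definition given above (the fraction of the top ten chunks that come from the key paper) and summarizes per-query values by their median.

```python
from statistics import median


def precision_at_k(retrieved_sources, key_paper_id, k=10):
    """Fraction of the top-k retrieved chunks that come from the key paper.

    retrieved_sources: source-paper identifiers of the retrieved chunks,
    ordered from best to worst rank (hypothetical representation).
    """
    top_k = retrieved_sources[:k]
    hits = sum(1 for source in top_k if source == key_paper_id)
    return hits / k


# Toy rankings for three queries (illustrative values only, not study data).
# Each entry pairs a ranked list of chunk sources with the query's key paper.
toy_queries = [
    (["paper_a"] * 8 + ["paper_b"] * 2, "paper_a"),
    (["paper_c"] * 10, "paper_c"),
    (["paper_d"] * 5 + ["paper_e"] * 5, "paper_d"),
]

per_query_precision = [
    precision_at_k(sources, key_paper) for sources, key_paper in toy_queries
]

# Summarize a retriever by the median of its per-query Precision@10 values.
median_precision = median(per_query_precision)
print(per_query_precision, median_precision)
```

The median, rather than the mean, matches the summary statistic reported in the caption. The same `median` call could be applied to the 0 to 4 answer-quality scores of panel B.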