HallusionBench: You see what you think? Or you think what you see? An image-context reasoning benchmark challenging for GPT-4V(ision), LLaVA-1.5, and other multi-modality models F Liu, T Guan, Z Li, L Chen, Y Yacoob, D Manocha, T Zhou arXiv preprint arXiv:2310.14566, 2023 | 51 | 2023 |
Towards understanding in-context learning with contrastive demonstrations and saliency maps Z Li, P Xu, F Liu, H Song arXiv preprint arXiv:2307.05052, 2023 | 17 | 2023 |
HallusionBench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models T Guan, F Liu, X Wu, R Xian, Z Li, X Liu, X Wang, L Chen, F Huang, ... arXiv preprint arXiv:2310.14566, 2023 | 12 | 2023 |
SODAPOP: Open-ended discovery of social biases in social commonsense reasoning models H An, Z Li, J Zhao, R Rudinger arXiv preprint arXiv:2210.07269, 2022 | 9 | 2022 |
CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering Z Li, I Mondal, Y Liang, H Nghiem, J Boyd-Graber arXiv preprint arXiv:2401.13170, 2024 | 1 | 2024 |
Improving the TENOR of Labeling: Re-evaluating Topic Models for Content Analysis Z Li, A Mao, D Stephens, P Goel, E Walpole, A Dima, J Fung, ... Proceedings of the 18th Conference of the European Chapter of the …, 2024 | | 2024 |
PANDA (Pedantic ANswer-correctness Determination and Adjudication): Improving Automatic Evaluation for Question Answering and Text Generation Z Li, I Mondal, Y Liang, H Nghiem, JL Boyd-Graber arXiv preprint arXiv:2402.11161, 2024 | | 2024 |