What is ROUGE?
ROUGE is a software package for automated evaluation of summaries. It was developed by Chin-Yew Lin while he was at the Information Sciences Institute of the University of Southern California (USC/ISI).
Automated text summarization has drawn a lot of interest in the natural language processing and information retrieval communities in recent years. A series of workshops on automatic text summarization (WAS 2000, 2001, 2002), special topic sessions in ACL, COLING, and SIGIR, and government-sponsored evaluation efforts in the United States (DUC 2002) and Japan (Fukusima and Okumura 2001) have advanced the technology and produced a couple of experimental online systems (Radev et al. 2001, McKeown et al. 2002). Despite these efforts, however, there are no common, convenient, and repeatable evaluation methods that can be easily applied to support system development and just-in-time comparison among different summarization methods.
Following the machine translation community's recent adoption of automatic evaluation using the BLEU/NIST scoring process, we conducted an in-depth study of a similar idea for evaluating summaries. The results show that automatic evaluation using unigram co-occurrences between summary pairs, i.e. ROUGE, correlates surprisingly well with human evaluations across various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results. For the inception of ROUGE, please read Lin and Hovy's HLT-NAACL 2003 paper (Lin and Hovy 2003). For more details, please read Lin's paper "ROUGE: a Package for Automatic Evaluation of Summaries" (Lin 2004a). For the effect of sample size, please see "Looking for a Few Good Metrics: Automatic Summarization Evaluation - How Many Samples Are Enough?" (Lin 2004b). For the application of ROUGE to automatic machine translation evaluation, please see "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics" (Lin & Och 2004a) and "ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation" (Lin & Och 2004b).
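To make the unigram co-occurrence idea concrete, here is a minimal sketch of ROUGE-1 recall: the fraction of reference-summary unigrams that also appear in the candidate summary, with matches clipped by the candidate's counts. The function name, whitespace tokenization, and lowercasing are illustrative assumptions; the actual ROUGE package applies its own preprocessing (stemming, optional stopword removal) and also reports precision and F-measure.

    from collections import Counter

    def rouge_1_recall(candidate, reference):
        """Fraction of reference unigrams matched in the candidate (ROUGE-1 recall).

        Naive whitespace tokenization; the real ROUGE package applies its own
        preprocessing, so treat this as an illustration of the idea only.
        """
        cand_counts = Counter(candidate.lower().split())
        ref_counts = Counter(reference.lower().split())
        # Clipped overlap: a reference unigram counts at most as many times
        # as it occurs in the candidate.
        overlap = sum(min(count, cand_counts[token])
                      for token, count in ref_counts.items())
        return overlap / sum(ref_counts.values())

    if __name__ == "__main__":
        reference = "the cat was found under the bed"
        candidate = "the cat was under the bed"
        print(rouge_1_recall(candidate, reference))  # 6 of 7 reference unigrams -> ~0.857

Applying the same counting to bigrams and longer n-grams, longest common subsequences, or skip-bigrams yields the other ROUGE variants described in the papers cited above.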
For a brief review of ROUGE, please see my presentation "Cross-domain Study of N-gram Co-Occurrence Metrics - A Case in Summarization", given at the Workshop on Machine Translation Evaluation - Towards Systematizing MT Evaluation.
ROUGE has been used in DUC 2004 and will be used in DUC 2005 and the multilingual summarization evaluation to be held with the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.
If you would like to use ROUGE in your experiments, you can download the most recent version here. If you have any suggestions or comments, please contact me at: rouge AT berouge.com.