ROUGE: Recall-Oriented Understudy for Gisting Evaluation

A software package for automated evaluation of summaries

What is ROUGE?

ROUGE is a software package for automated evaluation of summaries. It was developed by Chin-Yew Lin while he was at the Information Sciences Institute of the University of Southern California (USC/ISI).

Automated text summarization has drawn a lot of interest in the natural language processing and information retrieval communities in recent years. A series of workshops on automatic text summarization (WAS 2000, 2001, 2002), special topic sessions in ACL, COLING, and SIGIR, and government-sponsored evaluation efforts in the United States (DUC 2002) and Japan (Fukusima and Okumura 2001) have advanced the technology and produced a couple of experimental online systems (Radev et al. 2001, McKeown et al. 2002). Despite these efforts, however, there are no common, convenient, and repeatable evaluation methods that can be easily applied to support system development and just-in-time comparison among different summarization methods.

Following the recent adoption by the machine translation community of automatic evaluation using the BLEU/NIST scoring process, we conducted an in-depth study of a similar idea for evaluating summaries. The results show that automatic evaluation using unigram co-occurrences between summary pairs, i.e. ROUGE, correlates surprisingly well with human evaluations based on various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results. For the inception of ROUGE, please read Lin and Hovy's HLT-NAACL 2003 paper (Lin and Hovy 2003). For more details, please read Lin's paper "ROUGE: a Package for Automatic Evaluation of Summaries" (Lin 2004a). For the effect of sample size, please see "Looking for a Few Good Metrics: Automatic Summarization Evaluation - How Many Samples Are Enough?" (Lin 2004b). For the application of ROUGE to automatic machine translation evaluation, please see "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics" (Lin & Och 2004a) and "ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation" (Lin & Och 2004b).
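
To make the unigram co-occurrence idea concrete, below is a minimal sketch of a ROUGE-1 recall computation in Python. It is an illustration only, not the official ROUGE package; the tokenization (lowercased whitespace splitting) and the example summaries are hypothetical.

    # Minimal sketch of ROUGE-1 recall: the fraction of reference unigrams
    # that also appear in the candidate summary, with per-word counts clipped.
    # Illustration only; not the official ROUGE package.
    from collections import Counter

    def rouge_1_recall(candidate: str, reference: str) -> float:
        cand_counts = Counter(candidate.lower().split())
        ref_counts = Counter(reference.lower().split())
        overlap = sum(min(n, cand_counts[w]) for w, n in ref_counts.items())
        total = sum(ref_counts.values())
        return overlap / total if total else 0.0

    # Hypothetical example: 5 of the 6 reference unigrams are matched -> 0.833
    reference = "the cat sat on the mat"
    candidate = "the cat was on the mat"
    print(f"ROUGE-1 recall: {rouge_1_recall(candidate, reference):.3f}")

The full package computes scores against multiple reference summaries and extends the idea beyond unigrams to longer n-grams, longest common subsequences, and skip-bigrams.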

For a brief review of ROUGE, please see my presentation at the Workshop on Machine Translation Evaluation - Towards Systematizing MT Evaluation, entitled "Cross-domain Study of N-gram Co-Occurrence Metrics - A Case in Summarization".

ROUGE has been used in DUC 2004 and will be used in DUC 2005 and the multilingual summarization evaluation to be held with the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.

If you would like to use ROUGE in your experiments, you can download the most recent version here. If you have any suggestions or comments, please contact me at: rouge AT berouge.com.

References

  • DUC. 2002. The Document Understanding Conference. http://duc.nist.gov.

  • Fukusima, T. and Okumura, M. 2001. Text Summarization Challenge: Text Summarization Evaluation at NTCIR Workshop2. In Proceedings of the Second NTCIR Workshop on Research in Chinese & Japanese Text Retrieval and Text Summarization, NII, Tokyo, Japan, 2001.

  • Lin, Chin-Yew and E.H. Hovy 2003. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In Proceedings of 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May 27 - June 1, 2003.

  • Lin, Chin-Yew and Franz Josef Och. 2004a. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, Spain, July 21 - 26, 2004.

  • Lin, Chin-Yew and Franz Josef Och. 2004b. ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland, August 23 - August 27, 2004.

  • Lin, Chin-Yew. 2004a. ROUGE: a Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26, 2004.

  • Lin, Chin-Yew. 2004b. Looking for a Few Good Metrics: Automatic Summarization Evaluation - How Many Samples Are Enough? In Proceedings of the NTCIR Workshop 4, Tokyo, Japan, June 2 - June 4, 2004.

  • McKeown, K., R. Barzilay, D. Evans, V. Hatzivassiloglou, J. L. Klavans, A. Nenkova, C. Sable, B. Schiffman, and S. Sigelman. 2002. Tracking and Summarizing News on a Daily Basis with Columbia’s Newsblaster. In Proceedings of the Human Language Technology Conference 2002 (HLT 2002), San Diego, CA, 2002.

  • NIST. 2002. Automatic Evaluation of Machine Translation Quality using N-gram Co-Occurrence Statistics.

  • Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2001. BLEU: a Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).

  • Radev, D. R., S. Blair-Goldensohn, Z. Zhang, and R. S. Raghavan. 2001. NewsInEssence: A System for Domain-Independent, Real-Time News Clustering and Multi-Document Summarization. In Proceedings of the Human Language Technology Conference (HLT 2001), San Diego, CA, 2001.

  • WAS. 2000. Workshop on Automatic Summarization, post-conference workshop of ANLP-NAACL-2000, Seattle, WA, 2000.

  • WAS. 2001. Workshop on Automatic Summarization, pre-conference workshop of NAACL-2001, Pittsburgh, PA, 2001.

  • WAS. 2002. Workshop on Automatic Summarization, post-conference workshop of ACL-2002, Philadelphia, PA, 2002.

     

     
