Optimal Representation of Text Documents

By: Jeff Clark    Date: Thu, 15 Feb 2007

Given a text document, what is the 'best' way to concisely represent its content within, say, a 600x600 pixel region? One procedure that would probably give good output is this:

  1. Find a person with excellent knowledge of the document topic and extraordinary literary skills
  2. Have them read and ponder the document and a set of related documents
  3. Give them lots of time to formulate a summary short enough to be easily legible within the 600x600 pixel space constraints
  4. Have it read by many people with varying degrees of knowledge of the domain and literary ability
  5. Pass the feedback from this sample of readers back to the summary author
  6. Let the summary author adjust the summary if she wishes
  7. Iterate over steps 4-6 until the author is content

This would be a time-consuming and expensive option. What is the best automated way to solve the same problem? Perhaps software that reads the text and automatically produces a summary? I don't have any experience with the state of the art in auto-summarization, but I suspect it often doesn't work very well.

How about tag clouds of the most frequent non-trivial words? They highlight high-frequency words but don't show any real structure within the document. I'm sure we can do better.
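
For reference, the counting behind a tag cloud can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list is a tiny made-up sample, and the function names (`top_words`, `font_sizes`) and the linear font-size scaling are my own assumptions rather than any standard.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real one would be far longer.
STOP_WORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "on",
    "for", "with", "as", "was", "at", "by", "be", "this", "are", "or",
}

def top_words(text, n=10):
    """Return the n most frequent non-trivial words with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    # Treat stop words and very short tokens as "trivial".
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)

def font_sizes(word_counts, min_pt=10, max_pt=48):
    """Map counts linearly onto a font-size range for drawing a tag cloud."""
    if not word_counts:
        return {}
    highest = max(c for _, c in word_counts)
    lowest = min(c for _, c in word_counts)
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts match
    return {
        w: min_pt + (c - lowest) * (max_pt - min_pt) / span
        for w, c in word_counts
    }

if __name__ == "__main__":
    sample = "The cat sat on the mat. The cat saw a dog. The dog ran."
    print(font_sizes(top_words(sample, 5)))
```

Note that the output is just a bag of words with sizes attached: nothing in it records which words appeared together or in what order, which is exactly the missing structure.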

I suspect software that detects named entities (people, places, organizations, products, etc.) might be a useful component of a solution. Perhaps something that creates a diagram illustrating the key entities and the relationships between them would help.
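
To make the entity idea a bit more concrete, here is a rough Python sketch. It is not real named-entity recognition: as a crude stand-in it treats runs of capitalized words as entities, and it treats "appear in the same sentence" as a relationship. The names `naive_entities`, `cooccurrence_edges`, and the `COMMON` word list are all hypothetical choices for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: capitalized words that are usually not entities at sentence start.
COMMON = {"The", "A", "An", "In", "On", "He", "She", "It", "They", "We", "This", "That"}

def naive_entities(sentence):
    """Crude stand-in for entity detection: runs of capitalized words."""
    found = []
    for run in re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", sentence):
        tokens = run.split()
        # Drop leading common words such as a sentence-initial "The".
        while tokens and tokens[0] in COMMON:
            tokens.pop(0)
        if tokens:
            found.append(" ".join(tokens))
    return found

def cooccurrence_edges(text):
    """Count how often each pair of entities shares a sentence."""
    edges = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        entities = sorted(set(naive_entities(sentence)))
        for pair in combinations(entities, 2):
            edges[pair] += 1
    return edges

if __name__ == "__main__":
    sample = "Alice met Bob in Paris. Bob called Alice. The meeting ended."
    for (a, b), weight in cooccurrence_edges(sample).most_common():
        print(a, "--", b, "weight", weight)
```

The weighted pairs could be fed to any graph-drawing tool, with entities as nodes and pair counts as edge thickness. A proper entity detector would replace `naive_entities`; the heuristic here will miss lowercase entities and pick up false positives.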

Any ideas?

 

