Optimal Representation of Text Documents

By: Jeff Clark Date: Thu, 15 Feb 2007

Given a text document, what is the 'best' way to represent its content concisely within, say, a 600x600 pixel region? One procedure that would probably give good output is this:

1. Find a person with excellent knowledge of the document topic and extraordinary literary skills
2. Have them read and ponder the document and a set of related documents
3. Give them lots of time to formulate a summary short enough to be easily legible within the 600x600 pixel space constraints
4. Have it read by many people with varying degrees of knowledge of the domain and literary ability
5. Pass the feedback from this sample of readers back to the summary author
6. Let the summary author adjust the summary if she wishes
7. Repeat steps 4-6 until the author is content

This would be a time-consuming and expensive option. What is the best automated way to solve the same problem? Perhaps software that reads the text and automatically produces a summary? I don't have any experience with the state of the art in auto-summarization, but I suspect it often doesn't work very well.

How about tag clouds of the most frequent non-trivial words? They highlight high-frequency words but don't show any real structure within the document. I'm sure we can do better.
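To make the tag cloud idea concrete, here is a minimal sketch in Python of the counting step behind one. The stop-word list, the function names, and the font-size range are all assumptions made up for illustration; a real tag cloud would use a much longer stop-word list and would also need a layout step to place the words.

```python
import re
from collections import Counter

# Tiny, hypothetical stop-word list; a real one would be far longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "that", "this", "for", "on", "with", "as", "be", "are", "was"}

def top_words(text, n=10):
    """Return the n most frequent non-trivial words with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    # "Non-trivial" here just means: not a stop word and longer than 2 letters.
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)

def font_sizes(top, min_pt=10, max_pt=48):
    """Scale counts linearly into a font-size range (sizes are assumptions)."""
    if not top:
        return {}
    hi = max(count for _, count in top)
    lo = min(count for _, count in top)
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {word: round(min_pt + (count - lo) * (max_pt - min_pt) / span)
            for word, count in top}
```

Calling font_sizes(top_words(document_text, 40)) would give a word-to-size mapping for the 40 most frequent words. Nothing in it records which words appear near each other, which is the missing structure.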

I suspect software that detects named entities (people, places, organizations, products, etc.) might be a useful component of a solution. Perhaps something that creates a diagram illustrating the key entities and the relationships between them would be useful.
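As a very rough sketch of what the data behind such a diagram might look like, the Python below treats runs of capitalized words as candidate entities and counts how often two candidates share a sentence. This capitalization heuristic is only a crude stand-in for real named entity detection, and every name in the code is hypothetical. It will miss lowercase entities, and it will mangle a name that directly follows a capitalized common word at the start of a sentence.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: a run of capitalized words is a candidate entity.
CAP_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")

# Hypothetical list of common words that are capitalized at sentence starts.
COMMON = {"The", "A", "An", "He", "She", "It", "They", "This", "That", "In", "On"}

def candidate_entities(sentence):
    """Return the sorted, de-duplicated candidate entities in one sentence."""
    found = {m.group(0) for m in CAP_RUN.finditer(sentence)}
    return sorted(found - COMMON)

def cooccurrence_edges(text):
    """Count, for each pair of candidates, the sentences they share."""
    edges = Counter()
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for a, b in combinations(candidate_entities(sentence), 2):
            edges[(a, b)] += 1
    return edges
```

Each key in the result is a pair of candidates and each value is the number of sentences that mention both, so the heaviest pairs are the natural first edges to draw. Co-occurrence only says that two entities are mentioned together, not what the relationship between them is.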

Any ideas?
