Twitter StreamGraphs has triggered some attention for Neoformix over the last month. Some of the major sources were:
Thanks very much to these people and the many others who have let people know about Neoformix !
Zach Beane has created another variation on a graphic to illustrate movie box office data. See Movie box office charts for the original but here are a few interesting bits:


In Zach's words:
Each page displays trends in the top 25 movies at the box office for each weekend in a year. The color is based on the movie's debut week. Because of that, long-running movies will gradually start to stand out from newer movies with different colors. There is an interactive version as well.
Related posts: Movies Ebb and Flow
I just posted a new application in my projects section called Twitter StreamGraphs. It is an interactive tool to let you create StreamGraphs from the latest tweets containing a given word or from a particular user. A few examples are shown below.



The application shows a StreamGraph for the latest 200 tweets which contain the search word. The default search word is 'interesting' but a new one can be typed into the text box at the top of the application. You can also enter a Twitter ID preceded by the '@' symbol to see the latest tweets from that user. A parameter to the URL can be used to specify the initial search word. For example, use http://www.neoformix.com/Projects/TwitterStreamGraphs/view.php?q=coffee to see the latest tweets about coffee. This makes it possible to link to a StreamGraph for your own tweets from your blog or within a twitter update.
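A minimal sketch of building such a link in code, assuming only the `q` parameter described above. The helper name `streamgraph_url` is hypothetical, and passing an '@' Twitter ID through `q` is an assumption based on the text box behaviour rather than something the application documents.

```python
from urllib.parse import urlencode

# Base address taken from the example URL in the post.
BASE_URL = "http://www.neoformix.com/Projects/TwitterStreamGraphs/view.php"


def streamgraph_url(query: str) -> str:
    """Return a link that opens the application with `query` as the initial search word."""
    # urlencode percent-escapes characters such as '@' so the link stays valid.
    return f"{BASE_URL}?{urlencode({'q': query})}"


print(streamgraph_url("coffee"))
print(streamgraph_url("@scobleizer"))  # assumed: '@' IDs are accepted via q as well
```

Escaping the query with `urlencode` keeps the link usable when the search word contains spaces or symbols.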
The StreamGraph shows the usage over time for the words most highly associated with the search word. One of these series, together with a time period, is in a selected state and coloured red. The tweets that contain this word in the given time period are shown below the graph. You can click on another word series or time period to see different matches. In the match list you can click on any word to create a different graph with tweets containing that word. You can also click on the user or comment icons and any URL to see the appropriate content in another window. If you see a large spike in one time period that hides the detail in all the other periods, it will be useful to click in the area to the left of the y-axis in order to change the vertical scale.
Credits go to Lee Byron for the visual ideas behind the StreamGraph (although I'm using a simpler symmetrical form), to Processing for the development tools, to Martin Porter for the Porter Stemming Algorithm, to Vaga for the two small icons, and to Summize for building a great API into the Twitter data.
Related posts are TwitArcs and Twitter Spectrum.
Happy Canada Day ! I've created a simple flag graphic using a few words that come to my mind when I think of Canada.
My last post explored the company and product names discussed on TechCrunch and how they varied over time. The number of posts written by the various authors and how it varied over time was also illustrated. An obvious follow-up analysis is to look at the interaction between author and company/product names. Do certain TechCrunch authors specialize in writing about particular companies or products ? Or do some authors avoid specific domains ?
I've done this analysis and presented the results below. For each of the top 6 authors and top 60 names the number of times each author used each name was determined. The first graph shows the breakdown for the top 10 names. The second has the same form and shows numbers 11-60 but I've broken it into a separate graph because it uses a different scale. This lets us see more details for these names. I have also colored the bars to show proportional use of the names. A deep blue color means that the name was used proportionally much more often for that author and a deep red shows that it was used proportionally much less often. Paler colors indicate a lesser degree of high(blue) or low(red) usage.
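The post does not say how "proportional use" was calculated. One plausible way, shown here purely as an assumption, is to compare each observed count with the count expected if every author used every name at the same overall rate. A ratio above 1 would map toward blue and a ratio below 1 toward red. The function name and the sample counts are made up for illustration.

```python
from collections import defaultdict


def usage_ratios(counts):
    """counts maps author -> {name: times used}.

    Returns author -> {name: observed / expected}, where expected is the count
    the author would have if names were used at the overall average rate.
    """
    author_totals = {author: sum(names.values()) for author, names in counts.items()}
    name_totals = defaultdict(int)
    for names in counts.values():
        for name, n in names.items():
            name_totals[name] += n
    grand_total = sum(author_totals.values())

    ratios = {}
    for author, names in counts.items():
        ratios[author] = {}
        for name, name_total in name_totals.items():
            expected = author_totals[author] * name_total / grand_total
            ratios[author][name] = names.get(name, 0) / expected if expected else 0.0
    return ratios


# Made-up counts, not the TechCrunch data.
sample = {
    "author_a": {"Google": 30, "Twitter": 10},
    "author_b": {"Google": 10, "Twitter": 30},
}
ratios = usage_ratios(sample)
print(ratios["author_a"])  # Google is over-used by author_a, Twitter under-used
```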

Some things that I spotted quickly from the larger version of the top 10 graphic include:
The data for this analysis was kindly provided by Yuvi from The StatBot.
TechCrunch is a weblog that reviews products or companies that are having an impact on the internet. Who do they write about and how do references to these companies or products vary over time ? I've analyzed the proper names referenced in their posts in the time frame May 1st, 2006 until April 30th, 2008 - 2 years of data. I discarded place names and people and plotted the top 8 names with the most references in a StreamGraph below.
The graph clearly shows the companies that have dominated TechCrunch coverage over the last 2 years. Google looks biggest with FaceBook, Yahoo, and Microsoft being quite large as well. You can spot the increase in coverage for Microsoft and Yahoo in Feb of this year due to the merger talks. Notice also that MySpace and FaceBook were fairly even until July 2007 when FaceBook began dominating. If you look closely you can also tell that Twitter has become important lately with the number of references in April 2008 very similar to both Microsoft and FaceBook.
The standard line graph for the same data lets you see some details more clearly. Google was indeed the most referenced company in all but a few months where it was barely exceeded by Yahoo (Sep 2006, Feb 2008) and FaceBook (Aug and Oct 2007). And references to Twitter did barely exceed Microsoft and Facebook in Apr 2008.
The standard line graph cannot usefully show the top 20 names because so many of the series overlap each other and can't be distinguished. The StreamGraph version for 20 names is much more useful at full size.

The above StreamGraphs show the texts delivered by Obama and McCain recently on the American economy. Click on either one to see more detail. Obama's remarks, given the title Renewing American Competitiveness, were delivered at Kettering University in Flint, Michigan on June 16th, 2008. John McCain delivered his remarks concerning America's Leadership in the Global Economy to the National Restaurant Association, in Chicago, Illinois, on May 19, 2008. Of course it's more informative to actually read the texts but these things do jump out from the graphics:
I've been having fun playing with Twitter data lately. It's a wonderful playground for those interested in analyzing text data. I'm also starting to actually use the service a bit more for early announcements of projects I'm working on. Feel free to follow Jeff Clark to see my updates. I try and keep my signal to noise ratio pretty good :-)
I have created another StreamGraph, this one for the book Little Brother, by Cory Doctorow. Click on it to see a larger version. It shows the distribution of proper noun references across the text. Here are a few things you can pick out from the graph:
See the post Tom Sawyer Character StreamGraph for a very brief description of how it was constructed. The design of the graph is based loosely on those created by Lee Byron.
Here is another for a different work by Cory, Down and Out in the Magic Kingdom.
The above image is a StreamGraph for the book The Adventures of Tom Sawyer, by Mark Twain. Click on it to see a larger version. It seems to do a pretty good job of communicating the ebb and flow of the various characters throughout the book. The Mississippi River figures prominently in the book so a stream-like representation of the text seems appropriate.
I have adapted the StreamGraph code used to create the various Twitter Topic Streams so I can create StreamGraphs from arbitrary text documents. The document is split up into 25 equal sized segments and the word counts are done within each segment. These segments are used in place of time along the horizontal axis of the StreamGraph. This document StreamGraph again focuses on capitalized words but ignores a few common ones like 'Mr' and 'Mrs'. I'm also using a longer format for the graph and showing two labels for each word series - one on the left half of the graph, and one on the right. The difference in label size for the same word can show whether it was used more frequently in the first or second half of the document. In the 'Tom Sawyer' graphic above you can clearly see that both 'Ben' and 'Mary' are more prominent in the first half of the text but that 'Huck' is more common in the second half.
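The segmenting step can be sketched roughly as follows. This is an illustration, not the code used for the graphic: it assumes "equal sized" means an equal number of words, uses a simple regular expression as the tokenizer, and treats any word starting with an upper-case letter as capitalized (so sentence-initial words are counted too). The names `segment_counts` and `IGNORED` are hypothetical.

```python
import re
from collections import Counter

# Common capitalized words to skip; the post names 'Mr' and 'Mrs' as examples.
IGNORED = {"Mr", "Mrs"}


def segment_counts(text, segments=25):
    """Split text into `segments` near-equal runs of words and count capitalized words in each."""
    words = re.findall(r"[A-Za-z']+", text)
    total = len(words)
    result = []
    for i in range(segments):
        # Integer boundaries keep the segment sizes within one word of each other.
        chunk = words[i * total // segments:(i + 1) * total // segments]
        result.append(Counter(w for w in chunk if w[0].isupper() and w not in IGNORED))
    return result


sample = "Tom met Huck. Mr Tom ran. Huck hid. Becky saw Huck"
for index, counts in enumerate(segment_counts(sample, segments=2)):
    print(index, dict(counts))
```

Each list entry then plays the role of one time period along the horizontal axis.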
Many people seemed to enjoy the Topic StreamGraph I made a few days back for Robert Scoble so I have created Topic StreamGraphs for some of the other top Twitter users. If you missed my post from last week on Twitter Topic Streams a quick explanation is that they illustrate the most interesting capitalized words used in the tweets for these people. I removed many common terms from consideration including most of the placenames although a few managed to squeak through.
Jonathan Feinberg has created an interesting toy for building excellent looking word clouds from submitted text. You can adjust the font, color scheme, and choose from a variety of layouts. It's similar in many ways to what I did with Word Hearts a couple of months ago. A few samples are shown below. Great work Jonathan !
The above StreamGraph illustrates the distribution of the most interesting capitalized words in the StatBot dataset of all the updates for the top 100 twitter users. I removed most place names (NY, Paris, Boston etc) and several common words like 'twitter', 'lol', 'company', 'web', and 'internet'. The interestingness of a word was quantified by a function of the total references as well as the burstiness of the word distribution.
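The exact interestingness function is not given in the post. Purely as an assumed stand-in, the sketch below measures burstiness with the variance-to-mean ratio of a word's per-period counts and multiplies it by a damped total. Both choices and the function name are hypothetical.

```python
import math
from statistics import mean, pvariance


def interestingness(series):
    """Score one word from its counts per time period (assumed formula, see lead-in)."""
    total = sum(series)
    if total == 0:
        return 0.0
    # Variance-to-mean ratio: 0 for perfectly even usage, large for spiky usage.
    burstiness = pvariance(series) / mean(series)
    # log1p damps the total so very common words do not swamp the score.
    return math.log1p(total) * burstiness


flat = [5, 5, 5, 5]     # steady usage
bursty = [0, 0, 20, 0]  # same total, concentrated in one period
print(interestingness(flat), interestingness(bursty))
```

With this scoring a word that is mentioned evenly all year ranks below one with the same total that spikes in a single period.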
The most 'interesting' words in this data are primarily product, technology, or technology event names with the exceptions of 'Scoble' and 'Obama'. This isn't surprising since the top twitter users are early-adopters interested in technology. I was a bit surprised at the large volume for Seesmic but discovered that it is a company founded by Loic Le Meur, the 6th top twitter user.
I also created the twitter topic stream for Robert Scoble shown below. The graphic does a pretty good job of highlighting the primary technologies Scoble focused on over the last year or so.
This StreamGraph shows the top twitter users based on the number of tweets sent during the period December 2006 until April 2008. Click the image to see a larger version with more of the labels legible.
I have mentioned before the wonderful stream-like visualizations created by Lee Byron. I've written some code so I can create my own using whatever data I want. The one above was constructed using the twitter data from The StatBot. You can click on it to see a larger version of the image. I left out the first few months which had a very low volume of data so this one runs from Dec 2006 to Apr 2008.
For a small number of series a simple line graph would be superior because you can directly see which values are larger at each point in time. These StreamGraphs do a better job of emphasizing the sum at each point and the breakdown into the various series. I think StreamGraphs are also better at showing lots of series that dominate for short parts of the timespan of interest. For example see the image below that shows movie revenues. There are a great many movies illustrated and each one is only present in a fairly small part of the overall range of time.

Lee Byron and Martin Wattenberg have written a short paper describing the design decisions and algorithms behind these types of graphics. Have a look at Stacked Graphs - Geometry & Aesthetics (pdf) if you are interested in the details.
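As a rough illustration of the simplest layout of this kind, the sketch below stacks the series around a horizontal centre line so that the total at each point is split evenly above and below it. It is a minimal symmetric stacking under that assumption, not the algorithm from the paper or the code behind the graphs here, and the function name is hypothetical.

```python
def symmetric_layers(series):
    """series: list of equal-length lists, one per layer.

    Returns a list of (bottoms, tops) pairs, one per layer, stacked so the
    whole graph is centred on y = 0 at every point.
    """
    n_points = len(series[0])
    totals = [sum(layer[t] for layer in series) for t in range(n_points)]
    # Start the first layer half the total below the centre line.
    bottoms = [-total / 2 for total in totals]
    layers = []
    for layer in series:
        tops = [b + v for b, v in zip(bottoms, layer)]
        layers.append((bottoms, tops))
        bottoms = tops  # the next layer sits directly on top of this one
    return layers


layers = symmetric_layers([[1, 2], [3, 2]])
for bottoms, tops in layers:
    print(bottoms, tops)
```

Each (bottoms, tops) pair gives the lower and upper edge of one band, ready to be drawn as a filled shape.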
Using data from The StatBot again I've built some graphs detailing usage of the top twitter users over the May 2006- May 2008 period. A line graph with this data is too crowded to interpret properly unless we restrict it to only a few top users so I decided to create a set of bar graphs instead. The pink bars are the highest for that month and the highest month of all is the last scobleizer month - 2005 tweets for April 2008. Here are a few observations:
I've constructed a graph showing how use of Twitter clients by power users has changed over time. I used a dataset containing all the tweets from the power users in the Twitterific Top 100 list which was graciously provided by Yuvi, over at The StatBot. Two full years of data, from May 1st, 2006 until April 30th, 2008 were used for the analysis. The main things that jump out for me are:
There have been some other recent posts giving statistics on the clients used most often to post updates to Twitter. One, from ReadWriteWeb, was called How We Tweet: The Definitive List of the Top Twitter Clients and was based on a random sample of over 37,000 tweets from the public timeline. The results were posted April 2, 2008 so I presume the data was collected shortly before then. The top 3 clients from their survey were:
Visit the original post for full results.
Yuvi, more recently, did a similar analysis based on the data he provided to me. He listed a number of findings by contrasting the two datasets including that the power users make use of both SMS txt messages (10% vs 5%) and Mobile Twitter (6% vs negligible) much more often than the typical user. He also claimed that the power users are using Twhirl less than the typical user (5% vs 7%). I believe this claim is incorrect.
The two studies mentioned above show an average client usage over very different time periods. The ReadWriteWeb study uses data from a 24 hour period around April 2, 2008 but the StatBot analysis uses a complete list of tweets that span a timeframe from March 21st, 2006 until May 25th, 2008. Drawing comparisons between two datasets based on such vastly different time periods should be done very cautiously. Twhirl is relatively new and Yuvi's analysis used lots of historical data before Twhirl was available.
The StatBot analysis showed that on average Twhirl was the 6th most popular client. In fact, if you look at the graph above at the point between Mar and Apr 2008, which corresponds to when the 'typical user' study was done by ReadWriteWeb, you can easily see that Twhirl was actually the second most popular client for power users. I've looked at the power user data for all tweets between Mar 31st and Apr 2nd, 2008 and there were 241 for Twhirl out of 1833 total - 13%, which is much higher than the ReadWriteWeb result of 7% for typical users. This makes sense to me - power users have more of an incentive to install a specialized client than an average user who doesn't use twitter very often.
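The share quoted above can be checked directly from the two counts. The variable names are only for illustration.

```python
twhirl_tweets = 241  # power user tweets sent with Twhirl, Mar 31st to Apr 2nd, 2008
all_tweets = 1833    # all power user tweets in the same window

share = twhirl_tweets / all_tweets
print(f"{share:.1%}")  # rounds to the 13% quoted in the text
```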
Just for fun, (well, and to try and get them to link to me ! ), I have generated graphs for two of the top power users: Robert Scoble and Chris Brogan. Note that the clients are coloured differently in the three graphs. Ideally, for easy comparison, they should be consistent. Here are a few observations concerning their patterns of use:
I've combined some visuals from a side project related to linguistics with twitter data to create TwitArcs. It takes the latest 100 tweets for a twitter ID or term of interest and creates a list representation that has arcs connecting messages sent to the same users or that use the same primary term. You can click on the left side to load the tweets for a new user, on the right side to load the tweets for a specific term, and in the middle to visit the actual tweet.
Thanks to Twitter and Summize for the data and Processing.org for the tools. Give TwitArcs a try !
TwitArcs (static image)
I've slightly improved the Twitter Spectrum application so that clicking on words used in conjunction with both terms properly uses OR in the query. I also changed the default search terms to 'from:jasoncalacanis' and 'from:scobleizer' to show how you can compare the tweets from two users rather than related to two terms.
Twitter Spectrum (static image)
Just for fun, I've modified my News Spectrum application to take data from Twitter instead. It's called Twitter Spectrum of course ! It uses the wonderful Summize which provides excellent search capability for Twitter data.
As before, one topic is coloured blue, the other red, and the associated words are coloured and positioned based on how highly they are associated with the two topics. Click on any word to see the related tweets. Give Twitter Spectrum a try ! As always, feedback is welcome.
Thanks to Twitter and Summize for the data, Processing.org for the tools, and Chris Harrison for the inspiration behind the design.
Twitter Spectrum (static image)
Introducing News Spectrum ! It is a visualization of the words used for two topics in the latest results from Google News. One topic is coloured blue, the other red, and the associated words are coloured and positioned based on how highly they are associated with the two topics. Click on any word to see the related Google News results.
This is a generalization of my recent Obama McCain News Spectrum that allows you to enter your own terms of interest. Press the 'Enter' key to generate the spectrum after entering your words. The layout algorithm has also been improved to minimize the number of overlapping words. Give News Spectrum a try ! As always, feedback is welcome.
Thanks to Google News for the data, Processing.org for the tools, and Chris Harrison for the inspiration behind the design.
News Spectrum (static image)
I was thinking about the Word Association Spectrums created by Chris Harrison and thought it might be interesting to create something similar using live data. I've come up with a little application that gets the latest google news results for two terms of interest and generates a word spectrum based on the words found in the results. I removed stop words in order to highlight the words more likely to be of interest. It's an obvious drawback that there are often many hard to decipher overlapping words but it's kind of fun to play with nevertheless. This initial version shows a news spectrum related to the terms 'Obama' and 'McCain'.
Obama McCain News Spectrum (static image)
The New York Times has published an interesting interactive diagram depicting the relationship between various diseases and the genes that are known to affect them. The large circle in the image below is zoomed in on one part of the diagram. [via FlowingData]
Chris Harrison has a wonderful collection of visualizations one of which I featured recently in More Color Name Graphics.
Chris recently posted a set of beautiful Word Association Spectrums based on an extremely large dataset from Google containing word bigram distributions. The example shown below is for the words 'war' and 'peace'. The horizontal position of the various words indicates whether they more frequently follow 'war' or 'peace' in the analyzed text. So the word 'memorial' is positioned very close to the left (at the bottom) because the bigram 'war memorial' occurs much more often (normalized by overall counts) than 'peace memorial'. The vertical position is random.
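One way to turn normalized bigram counts into a horizontal position is sketched below. The formula is an assumption made for illustration, not Chris Harrison's published method: each count is divided by the total for its pole word, and the position is the right-hand pole's share of the two rates. The function name and the counts are made up.

```python
def spectrum_position(left_count, right_count, left_total, right_total):
    """Return a position from 0.0 (far left pole) to 1.0 (far right pole).

    left_count / right_count: how often the word follows each pole word.
    left_total / right_total: total bigrams starting with each pole word.
    """
    left_rate = left_count / left_total
    right_rate = right_count / right_total
    if left_rate + right_rate == 0:
        return 0.5  # word never follows either pole: place it in the middle
    return right_rate / (left_rate + right_rate)


# Made-up counts: the word follows the left pole ('war') nine times as often.
print(spectrum_position(90, 10, 1000, 1000))
```

Normalizing by the totals matters when one pole word is much more common than the other, since raw counts would pull almost every word toward the more frequent pole.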
My own Document Contrast Diagrams also stretch out words along a horizontal axis based on the strength of association between two poles. My diagrams try and express a lot more information as well - probably too much. Chris's Word Association Spectrums carry less information. This simplicity allows for a much more elegant design. He has generated spectrums for other interesting word pairs like 'kids:adults' , 'good:evil', and 'american:chinese'. I might like to see versions that don't show the common prepositions so that the nouns, verbs, and adjectives stand out more.
Word Association Spectrum for War and Peace (click to visit Chris Harrison's Post)
I ran the speeches delivered by both Obama and Clinton last night after the May 6th primary results and used them to build a Document Contrast Diagram. See the link for a description of how to interpret the diagram.
May 6th Primary Speech Contrast Diagram (click to see larger version)
I have taken the speeches delivered by both Obama and Clinton last night after the May 6th primary results and used them to build a Document Cloud Comparison. It shows which words were used together by each speaker using linked word clouds. A static image is shown below for references to the word 'change' to give you a flavour but the real fun comes with exploring the interactive application.
If you enter a blank focus string in the application it shows a standard word cloud and colors words that are unique to one speaker or the other. The top words used by Obama and not by Clinton include 'side' , 'down', 'government', 'values', 'yes', 'lead' , 'life', 'kind', 'trust', and 'united' . Those used by Clinton uniquely include 'keep', 'feel', 'journey', 'working', 'invisible', 'west', and 'story'.
'change' Associations and References (static image)
Give it a try yourself. The application is written in Java so you may have to wait a few seconds for it to start up.
As I pointed out in my last post, Directed Sentence Drawings generated from a text make it extremely difficult to see in what order the various topics were discussed. A simple bar for each sentence, in the order they occurred in the text and coloured by topic, would be much better in most respects. I've built a graphic to show what I mean. I have also added the most frequent topic words for each set of 10 consecutive sentences.
State of the Union - Sentence Bars with Topic Colours (click to see larger version)
In my post earlier today about Sentence Drawings I mentioned that the overall shape of the graphic doesn't really express anything useful. I have come up with a variation on the idea that tries to address this.
In the sentence drawings produced by Stephanie Posavec or David Sparks each line segment is turned 90 degrees to the right relative to the previous one. This makes the overall shape highly sensitive to minor variations in the text which is why the overall shape doesn't carry much meaning - it's almost random.
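The turn-right rule can be sketched in a few lines. This is an illustration of the idea only, not the code used by either artist: it assumes segment length equals the number of words in the sentence, ignores topic colours, and uses a coordinate system where y grows upward. The function name is hypothetical.

```python
def sentence_drawing(sentence_lengths, start=(0, 0)):
    """Return the points visited when each sentence is drawn as a line segment
    that turns 90 degrees to the right of the previous one."""
    # With y growing upward, successive right turns cycle east, south, west, north.
    headings = [(1, 0), (0, -1), (-1, 0), (0, 1)]
    x, y = start
    points = [(x, y)]
    for i, length in enumerate(sentence_lengths):
        dx, dy = headings[i % 4]
        x, y = x + dx * length, y + dy * length
        points.append((x, y))
    return points


print(sentence_drawing([3, 2, 3, 2]))  # four sentences that trace a closed rectangle
print(sentence_drawing([4, 2, 3, 2]))  # one extra word in the first sentence shifts every later point
```

The second call shows the sensitivity: changing a single sentence length moves every point that follows it.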
I call my diagrams Directed Sentence Drawings because the direction of the line segments is a function of their topic. As before, each sentence is assigned a topic or remains neutral based on the vocabulary it contains. I place a neutral point in the middle of the diagram and four other topic points form a diamond shape around it (see below). For the State of the Union diagrams produced below I used the four topics Government, Domestic, Economy, and Security. The algorithm is as follows:
The diagram immediately below is constructed from the State of the Union Address for the year 2000. It shows there were many sentences about both Domestic and Economic issues, a fair number concerning Government and fewer about Security. The dominant colours give this away but also the overall shape makes it obvious. There is a greater density of lines near the Domestic and Economic topic nodes.
Directed Sentence Drawing for SOTU 2000
This next diagram is for the SOTU of 2001, the first delivered by George W. Bush. It's obvious that it was much shorter, had even less discussion of Security issues than Clinton's in 2000, and also not much sustained discussion about Domestic issues.
Directed Sentence Drawing for SOTU 2001
The SOTU for 2002 was delivered after 9/11 and clearly shows that Security has become the predominant concern.
Directed Sentence Drawing for SOTU 2002
This last diagram is for the SOTU of 2008 and shows that Security is still very topical but that Economic and Governmental issues are starting to recapture attention.
I posted a few weeks back on Stephanie Posavec's interesting graphics constructed from the text of Kerouac’s On the Road. One of her pieces featured Sentence Drawings that were generated using each sentence in sequence with line segments coloured to reflect the topic and sized based on the length of the sentence.
David Sparks has constructed a set of similar sentence drawings for the State of the Union addresses delivered by Bush over his 8 years in office.
David Sparks's Sentence Drawing for SOTU 2008 (click to see graphic with all 8 addresses delivered by Bush)
I find these interesting to look at. However, the dominant visual feature is the overall shape of the graphic and I don't think it really expresses anything useful.
Dolores Labs has posted an update on how people have used their color name data in various ways. They linked to my own Color Names Explorer - thank you very much ! Their post is called Color flowers, networks, photos, and even 3D and has several more interesting views of this data. The one that really caught my eye was by Chris Harrison who created a flower-like image by rendering the names in their associated color and varying the position by hue along the radius. I don't think many of these images, including my own, are particularly useful, but they sure are interesting to look at !
Chris Harrison's Color Name Flower (click to see larger version in original article)
Color Name Flower Closeup
There is a new Portfolio link available from all pages on my weblog. It links to a simple index of my most interesting or useful applications and gives a pretty good idea of the kinds of things I like to create.
I'm currently available for data analysis or visualization projects if anybody is interested in working together. I live near Toronto, Canada but I'm open to projects done remotely. I would be happy with creative projects that vary in size from a few days to a few months of work. Send me an email if you are interested.
I have taken the words spoken by both Obama and Clinton during the Pennsylvanian Democratic debate held on April 16th, 2008 and constructed from them a Document Cloud Comparison. Basically, it lets you see which words were used together by each speaker using linked word clouds. A few static images are shown below to give you a flavour but the real fun comes with exploring the interactive application.
If you enter a blank focus string in the application it shows a standard word cloud and colors words that are unique to one speaker or the other. The top words used by Obama and not by Clinton include 'politics' , 'decade', 'election', 'economic', 'somehow', 'generation' , 'mission', 'forward', and 'problem' . Those used by Clinton uniquely include 'york', 'begin', 'world', 'best', 'support', 'administration', 'police', and 'hope'.
'Country' Associations and References (static image)
'jobs' Associations and References (static image)
Give it a try yourself. The application is written in Java so you may have to wait a few seconds for it to start up.
I have taken the words spoken by both Obama and Clinton during the Pennsylvanian Democratic debate held on April 16th, 2008 and constructed from them a Document Contrast Diagram. See the link for a description of how to interpret the diagram.
It shows that they spoke roughly the same number of words but with Obama speaking slightly more. Both were slightly positive in overall emotional tone with some areas of negativity related to guns and security for Clinton and taxes for Obama. There was a great deal of overlap in the words used by the two speakers with the words 'kind', 'Democrats' , 'important', 'country', 'make', 'work', 'president', 'can', 'take' , 'right', and 'guns' being frequently used by both. 'Know' was used a lot by both but more often by Clinton. They both spoke each other's names much more than their own but Obama used Clinton's name more often than the reverse.
Key words used frequently and uniquely or much more often by Obama included 'true' , 'statement' , 'economic' , 'issues', 'election', 'confident', 'George' , 'American', 'policy', 'politics', 'income', 'change', 'General', 'ideas', 'Chicago', and 'individuals'. Words used frequently and uniquely or much more often by Clinton included 'decisions', 'stay', 'withdraw', 'Iran', 'failed', 'begin', 'world', 'military', 'best', 'York', 'administration', 'Philadelphia', 'impose' , 'order', 'police', and 'oil'.
Pennsylvanian Debate Contrast Diagram (click to see larger version)
I added the transcript for the Pennsylvanian Democratic debate held on April 16, 2008 to the interactive Transcript Analyzer. The image below is smaller (and more blurry) than from the application but gives a rough idea of what was discussed by which candidate and when. Here are the primary topics covered in order:
Notable by their absence were the words 'immigration' and 'nafta' .
Democrat Debate - Apr 16th, 2008 ( click for interactive application )
One small refinement was made to the application. The counts and bars for the various words will now also include simple plural variations. So references to 'jobs' will also include 'job', and references to 'gun' would also include 'guns'.
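A minimal sketch of this kind of plural folding, assuming the "simple plural variations" are just the word with and without a trailing 's'. The function name and the sample counts are hypothetical, and irregular plurals are not handled.

```python
from collections import Counter


def count_with_plurals(word, counts):
    """Return the count for `word` plus its simple singular/plural variants."""
    variants = {word, word + "s"}
    if word.endswith("s") and len(word) > 1:
        variants.add(word[:-1])  # 'jobs' also matches 'job'
    return sum(counts.get(v, 0) for v in variants)


# Made-up counts, not taken from the debate transcript.
counts = Counter({"job": 3, "jobs": 4, "gun": 2, "guns": 5})
print(count_with_plurals("jobs", counts))
print(count_with_plurals("gun", counts))
```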
Give the Transcript Analyzer a try yourself and, as always, feedback is welcome !
One of the areas I have been exploring here on Neoformix is the notion of constructing graphics in an algorithmic fashion from textual data. The site NOTCOT has just published an article on some interesting work by Stephanie Posavec that explores this same idea. She has constructed a number of different works based on the text of Kerouac’s On the Road. From NOTCOT's article:
The maps visually represent the rhythm and structure of Kerouac’s literary space, creating works that are not only gorgeous from the point of view of graphic design, but also exhibit scientific rigor and precision in their formulation: meticulous scouring the surface of the text, highlighting and noting sentence length, prosody and themes, Posavec’s approach to the text is not unlike that of a surveyor.
Here are a few images that will give you a taste and a rough idea of what they mean. Although definitely more on the artistic side of information visualization, I like these images and the ideas behind them a great deal.




Recently both Clinton and Obama delivered speeches related to the economy. Clinton's was more focussed specifically on the housing crisis. I took the text of Clinton's Halting the Housing Crisis and Obama's Renewing the American Economy and created a Document Contrast Diagram.
It clearly shows that they were about the same length, both slightly positive in overall emotional tone but Clinton's text varied more in tone. The large blue word circles for 'mortgage', 'housing', 'crisis', 'families', 'foreclosure' show the primary topic of interest for Clinton. Obama's mostly unique key terms were 'American', 'financial', 'risk', 'system', 'regulatory', and 'institutions'. The blue segments in the middle of Obama's speech show that he used words in that section more strongly associated with Clinton overall. This is where he discussed the housing crisis.
Obama/Clinton Economic Speech Contrast Diagram (click to see larger version)
Dolores Labs recently did an interesting experiment where they showed many people samples of colors and asked them what they should be called. They posted a graphic that showed the color names that people used for the various colors.
Dolores Labs' Color Name Cloud (click to see larger version in original article)
They also posted the raw data for other people to play with. Martin Wattenberg at IBM Research took the data and created a much more beautiful graphic. Nathan at FlowingData discusses the design differences in the post A Little Bit of Design Goes a Long Way With Infographics.
Wattenberg's Version of the Color Name Cloud (click to see larger version in original article)
I decided to try my hand at building a simple interactive 3D explorer for the data as well. I combined entries with the same name and found the average RGB values. The frequency count was used to highlight the more common names by scaling the size of the text in a manner likely similar to that used by Wattenberg. I then plotted the names in 3D using the red (x), green (y), and blue (z) components of the color value.
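The aggregation step might look roughly like the sketch below. It assumes names are matched after trimming and lower-casing, and it uses a square-root text size purely as a placeholder, since the actual scaling is not stated. Function names and sample entries are made up for illustration.

```python
import math
from collections import defaultdict


def summarize_colors(entries):
    """entries: iterable of (name, (r, g, b)).

    Returns name -> ((avg_r, avg_g, avg_b), count). The averaged RGB triple
    doubles as the (x, y, z) position of the name in the 3D view.
    """
    sums = defaultdict(lambda: [0, 0, 0, 0])
    for name, (r, g, b) in entries:
        acc = sums[name.strip().lower()]  # assumed: names are matched case-insensitively
        acc[0] += r
        acc[1] += g
        acc[2] += b
        acc[3] += 1
    return {name: ((r / n, g / n, b / n), n) for name, (r, g, b, n) in sums.items()}


def text_size(count, base=8.0, scale=6.0):
    """Hypothetical sizing rule: more frequent names are drawn larger."""
    return base + scale * math.sqrt(count)


entries = [("Red", (255, 0, 0)), ("red", (235, 20, 0)), ("teal", (0, 128, 128))]
for name, (rgb, count) in summarize_colors(entries).items():
    print(name, rgb, count, text_size(count))
```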
Color Name Cloud - initial view

Color Name Cloud - zoomed in view
The initial view is similar to Wattenberg's but not spaced out as nicely. My version also suffers from the fact that the size of the name depends on both frequency of use and how much blue the color happens to contain since the more blue a color has the closer it is drawn to the front of the display.
You can try out the color name explorer below. Can you find the shade somebody called 'baby poop' ?
I'm a proud citizen of Canada and have decided to include a bit more analysis of Canadian-themed data and text in the future.
The 2008 Ontario budget speech, which outlines the government's priorities for the coming year, was delivered yesterday. I have constructed a Document Contrast Diagram from the text of the 2007 Ontario Budget Speech and the 2008 Ontario Budget Speech.
Document Contrast Diagram for 2007/2008 Ontario Budget Speeches (click to see larger version)
My first post on Document Contrast Diagrams will give some guidance on how to interpret the image. Here are a few things I noticed that are illustrated by the diagram. You may have to view the larger version to see some of these details.
The image below shows the Document Contrast Diagram from the remarks made by both Clinton and Obama after the Super Tuesday primaries on Feb 5th.
Document Contrast Diagram for Clinton/Obama Super Tuesday Remarks (click to see larger version)
My first post on Document Contrast Diagrams will give some guidance on how to interpret the image. Here are a few things I noticed that are illustrated by the diagram. You may have to view the larger version to see some of these details.
A Document Contrast Diagram is a visual summary of the content of two text documents that illustrates shared words, words that are unique to one document or the other, word frequency, relative size of the two documents, distribution of emotional tone within the documents, related words based on co-occurrence, and the most common word in each document segment. Have a look below at the Document Contrast Diagram for the 2007 and 2008 US State of the Union (SOTU) Addresses. If you wish you can click on the image to see a larger version.
I'm hoping that much of the following is reasonably intuitive but here are a number of points regarding interpretation:
I've been consumed lately by the idea of taking two distinct documents and creating a large, visually interesting, static image that compares and contrasts them. I don't have time at the moment to explain how to interpret these but have a look at the images below. The blue text is the State of The Union Address for 2007 and the red is that for 2008.
Click on the images to see larger versions. The idea needs work still but it's starting to look promising.
Last week the New York Times published an interactive graphic called The Ebb and Flow of Movies: Box Office Receipts 1986-2007. It does a pretty good job of showing how the revenue of various movies rose and fell over time as well as more global patterns. The design does make it hard to directly compare movies against each other. It would be neat to pick a bunch of movies and see a set of traditional line graphs starting from the same point. Here is a close up:
And here are 4 years of data with labels showing the summer blockbuster periods. You can also clearly spot the peaks at the end of the years.
Lee Byron has done some other interesting work. One that really caught my eye when I first saw it is this stream-like visualization of music listening habits over time. The data comes from the Last.fm records for a particular user.
In the author's words:
After thinking about how I could show this whole sum in a presentable form, I decided on a sort of layered histogram. Each colored sliver represents a different artist listened to in the last 18 months. The sliver moves through time left to right growing thicker where it was more popular and thinner where it was less. The color indicates the first time the artist was listened to, warmer colors being more recent and cooler being further back. As a new artist is listened to it is put onto the outsides of the graph. The result is a wiggling tour through your listening history past.
Lee describes it as 'a sort of layered histogram' but I think of it as a 'stream graph' - it nicely shows how something varies over time and looks like a stream to me.
Back in 2006 I wrote about Martin Wattenberg's work called The Shape of Song and how it illustrates the repetitive patterns in music using translucent arches that connect identical passages of notes. At the time I mused about doing something similar for text:
Perhaps poetry or lyrics from songs might have an interesting structure but I suspect most text data wouldn't have enough repetition at the token or word level for this idea to be fruitful.
I did eventually develop the idea into Document Arc Diagrams, which use similarity of vocabulary.
I just stumbled across Children's Poetry & Limerick Visualizations by Lee Byron which stems from Wattenberg's concept as well. Lee describes the image below with these words:
The arcs represent rhyme, alliteration, homophone and repetition. Steps underneath the line represent rhythm. You can see these elements clearly represented in the classic childrens poem: "Hickory Dickory Dock".
Interesting work.
I added the transcript for the Ohio Democratic debate held on February 26, 2008 to the interactive Transcript Analyzer. The image below is smaller (and more blurry) than from the application but gives a rough idea of what was discussed by which candidate and when. Here are the primary topics covered in order:
It's also interesting that 'immigration' was mentioned in passing only once during this debate but was a primary topic in Texas. Also 'education' was not given any real attention.
Democrat Debate - Feb 26th, 2008
I have made another minor refinement to the application. Beside the word lines are shown the number of times each word was used by each candidate. For example for this debate 'Iraq' was used 8 times by Obama and 5 times by Clinton.
Give the Transcript Analyzer a try yourself and, as always, feedback is welcome !
I'm sure almost everybody reading this entry is aware of the tight race in the US Democratic primary between Clinton and Obama. There is a huge amount of coverage of this exciting and extremely important contest. A concept much-discussed lately is momentum. I've created a simple graphic to try and visualize the momentum.
The darker blue area shows Clinton's delegate counts over time. The lighter blue shows how much Obama's counts exceed those of Clinton. The small numbers show the actual difference at a point in time. For example, after Feb 5th (Super Tuesday), Obama had 30 more regular pledged delegates than Clinton - not counting super delegates.
Hillary Clinton currently has an advantage in super-delegates (241 to 181 for Obama) and this makes the race closer than depicted above. However, super-delegate support is not fixed - they are free to change who they support up until the time of the convention.
I just added the transcript for the Texas Democratic debate held on February 21, 2008 to the interactive Transcript Analyzer. The image below is smaller (and more blurry) than from the application but should give a rough idea of what was discussed by which candidate and when. Here are the primary topics covered in order:
Democrat Debate - Feb 21st, 2008
I did make a minor refinement to the application. The bars for the words of interest are now coloured to show the speaker. This makes it easy to tell, for example, that Obama used the word 'Iraq' in 7 separate segments but Clinton only used it in one segment.
Give the Transcript Analyzer a try yourself and, as always, feedback is welcome !
Pixish is a new site devoted to connecting visual artists with people interested in exploring and possibly using their work. You can sign up and post 'Assignments' that describe what you are looking for or you can submit designs to fulfill assignments. The site is still in beta mode and had a few hiccups when I played with it yesterday but it's an interesting idea.
Just for fun, I created a couple of designs with a modified version of Word Hearts and entered them into an assignment. They are looking for a T-Shirt design for typography lovers. Here are small versions of my 2 entries:
In a few of my previous interactive applications, namely Digg Explorer and the Race Results Analyzer, I have used small 'data objects' that get smoothly animated between different locations. Sometimes the set of data objects represent a data graphic - a pie chart or histogram for example.
I have just come across a research paper and video by Jeffrey Heer and George Robertson where they investigate the effectiveness of animated transitions in statistical data graphics. Their conclusion was that animated transitions can significantly improve graphical perception. The video is high quality and explains the ideas and results very well.
Note that this research did not use multiple constituent data objects as in my applications but the conclusion is likely valid in this context as well.
I have taken the remarks made by both Obama and McCain after the Potomac Primary results were in and constructed another Document Cloud Comparison. As before, a few static images are shown below to give you a flavour but the real fun comes with exploring the interactive application.
If you enter a blank focus string in the application it shows a standard word cloud and colors words that are unique to one speaker or the other. The top words used by Obama and not by McCain include 'change', 'tax', 'health', 'college', 'bush', 'lobbyists', 'jobs', 'rich', and 'iraq'. Those used by McCain uniquely include 'promise', 'serve', 'friends', 'strength', 'faith', 'dreams', and 'challenges'.
'Hope' Associations and References (static image)
'War' Associations and References (static image)
Give it a try yourself. The application is written in Java so you may have to wait a few seconds for it to start up.
I've been playing around with words and shapes again and just posted a little application I call Word Hearts that lets you generate heart shapes filled with words. Here are two sample images:

It's just in time for Valentines Day so have some fun!
I have taken the remarks made by both Clinton and Obama after the Super Tuesday results were in and constructed a Document Cloud Comparison. A few static images are shown below to give you a flavour but the real fun comes with exploring the interactive application.
Most Common Words (static image)
This first image shows part of the list of most common words for both speeches. Clinton mentions 'America' most frequently, Obama the word 'can'. Clinton uses the terms 'god', 'auto', 'veteran', and 'economy' which aren't mentioned at all by Obama. Interestingly, Obama's top unique words are 'time' and 'change'.
'Hope' Associations and References (static image)
The references to the word 'hope' clearly show Obama's use of repetition and rhythm. This is shown again in his use of the words 'time' and 'change' as shown below.
'Time' & 'Change' Associations and References (static image)
The last reference to 'change' caught my eye - We are the change that we seek. It's the declarative form of a famous quote by Gandhi - You must be the change you want to see in the world.
It's much more interesting to try it out yourself. Click on 'more' to give it a try. The application is written in Java so you will have to wait a few seconds for it to start up.
Word Association Clouds appear to be an interesting way to navigate within a document and get an understanding of the concepts discussed. I've also been playing around with the idea of using two of them linked together in order to explore the similarities and differences between two different documents.
The image below shows an example using the State of the Union addresses for both 2007 and 2008. The two clouds show the words related to the focus word in both documents in the same manner as for the single Word Association Cloud. The only difference is that colour is used to indicate words that are unique to one document or another. The words in blue on the left are unique to the 2007 SOTU and those in red on the right are unique to the 2008 SOTU. As before, you can click on a word to bring it in focus or click on the top edit box to change it. The clouds are linked in this case so that they always show the same word for both documents.
Document Cloud Comparison (static image)
We show here the words associated with 'energy' in both of the transcripts. The word 'supply' is most highly associated with 'energy' in the 2007 version and the blue colour shows that it isn't even used in the 2008 address. You can also easily see that 'wind', 'solar', 'electric' and 'vehicles' were all used in relation to 'energy' in 2007 but were not even mentioned in 2008. In 2008 the word 'security' is the most highly associated term. It does appear in 2007 but is not as prominent in relation to 'energy'.
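The colouring rule used here (blue for words found only in the first document, red for words found only in the second) comes down to a pair of set differences. Here is a minimal sketch in Python; the tokenizer, the function names, and the two toy strings are assumptions made for illustration and are not taken from the application or from the actual addresses.

```python
import re


def words(text):
    """Very rough tokenizer: lower-case runs of letters and apostrophes."""
    return set(re.findall(r"[a-z']+", text.lower()))


def contrast(doc_a, doc_b):
    """Split the combined vocabulary into unique-to-A, unique-to-B, and shared."""
    a, b = words(doc_a), words(doc_b)
    return {"only_a": a - b, "only_b": b - a, "shared": a & b}


# Invented toy strings standing in for the two documents
result = contrast("wind and solar energy supply", "energy security and oil")
for group in ("only_a", "only_b", "shared"):
    print(group, sorted(result[group]))
```

A word in `only_a` would be drawn in blue, a word in `only_b` in red, and shared words in the default colour.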
It's much more interesting to try it out yourself. Click on the image or 'more' to give it a try.
The image below is a Document Arc Diagram generated from the text for the State of the Union Address for 2008. There is some interesting structure evident. There are two very distinctive groupings of arcs. The first is focused on domestic issues and arises from repeated use of the terms America, Americans, Congress, trust, tax, veto, health, housing, technology, and jobs. The second group of arcs is based on repeated use of the terms America, Qaeda, troops, terrorists, iraq, iraqi, afghanistan.
You can enter your own text for analysis with the Document Arc Diagram Application.
State of the Union Address, 2008
I just added the transcript for the California Democratic debate held on January 31, 2008 to the interactive Transcript Analyzer.
Democrat Debate - Jan 31st, 2008
I have adapted my recent Digg Trends tool so that it can analyze data about weblog posts. A new version exists called Boing Boing Word Trends that loads summaries of the latest 500 posts from Boing Boing and lets you explore which words are used together and how usage has varied over the recent past.
Give Boing Boing Word Trends a try !
The American president recently presented the State of the Union Address for 2008. I noticed this Tag Cloud representation of the text. I'm sure there are several others already on the web as this is a standard analysis these days for any text of interest. It does a pretty good job of summarizing the content by listing the top keywords with a font scaled to their frequency.
In my recent tool Digg Trends I introduced something I call a Word Association Cloud. Visually, a Word Association Cloud looks like a standard Tag Cloud except the topmost word is made distinct in some manner. I've been using a faint block of color behind it. Rather than using font size to represent a simple word frequency the size here illustrates how good the correlation is with the primary word.
Word Association Cloud (static image)
In this example the primary word is 'Afghanistan' and the cloud clearly shows that the major words associated with it are 'iraq', 'america', 'freedom', 'pakistan' etc. The references within the text are also shown. I'm basically counting how often the various words occur near 'Afghanistan' but I'm also weighting this count based on how far apart the words are. You can click the primary word to enter edit mode and change it to whatever you wish. Or you can simply click on one of the associated words to make it the new primary word. This lets you navigate around easily to explore different words. If you change the primary word to a blank then a standard tag cloud is presented.
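For the curious, the distance-weighted counting can be sketched along these lines. Treat it as a rough Python illustration only: the 1/distance weighting, the window size, the tokenizer, and the function name are assumptions and not necessarily what the application does, and the sample sentence is invented. A real version would also filter out common stop words.

```python
import re
from collections import defaultdict


def association_scores(text, primary, window=10):
    """Score each word by how often and how closely it appears near `primary`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    primary = primary.lower()
    scores = defaultdict(float)
    for i, tok in enumerate(tokens):
        if tok != primary:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i or tokens[j] == primary:
                continue
            # Assumed weighting: adjacent words count 1, words 2 apart count 1/2, ...
            scores[tokens[j]] += 1.0 / abs(i - j)
    return dict(scores)


# Invented sample sentence
scores = association_scores(
    "freedom in afghanistan and iraq means freedom", "afghanistan", window=3
)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(word, round(score, 2))
```

The font size of each associated word would then be scaled from its score, just as a standard tag cloud scales from raw frequency.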
It's a simple idea but seems to give a useful perspective. I'm guessing somebody somewhere has done this before but I'm not aware of any examples. Please let me know if you find some. Give the Word Association Cloud for the State of the Union Address a try !
The design of the Digg Election Story Analyzer has been improved and generalized so that it can be used for all the topics and subtopics available on Digg. I'm calling the result Digg Trends. The tool loads the latest 500 popular stories for the desired topic and analyzes the text found in the story titles and descriptions. The image below shows the current results for the 'Technology' topic.
Static image - click it to launch the interactive application
The Digg Trends analysis focuses on four words at any given time. A different color is used for each. The graph at the top shows how the number of references to each of the four words varies over time. You can turn off the 'Stacked' checkbox to show a line graph which does a better job of showing which word is referenced the most at any given time.
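The data behind a graph like this is just a count of references per word per day, with the stacked view adding each word's series on top of the previous ones. Below is a small sketch in Python under stated assumptions: the function names are hypothetical, references are counted as whitespace-separated token matches, and the story titles are invented rather than real Digg data.

```python
from collections import Counter
from datetime import date


def daily_references(stories, focus_words):
    """stories: iterable of (date, text). Returns {word: Counter({date: n})}."""
    series = {w.lower(): Counter() for w in focus_words}
    for day, text in stories:
        tokens = text.lower().split()
        for w in series:
            series[w][day] += tokens.count(w)
    return series


def stacked(series, days, order):
    """Cumulative tops for a stacked graph: each layer sits on the ones before it."""
    tops = {}
    running = [0] * len(days)
    for w in order:
        running = [r + series[w][d] for r, d in zip(running, days)]
        tops[w] = running
    return tops


# Invented sample titles
stories = [
    (date(2008, 1, 16), "apple unveils thin laptop"),
    (date(2008, 1, 16), "apple stock jumps"),
    (date(2008, 1, 23), "digg adds new features"),
]
days = [date(2008, 1, 16), date(2008, 1, 23)]
series = daily_references(stories, ["apple", "digg"])
tops = stacked(series, days, ["apple", "digg"])
print(tops)
```

Plotting each word's own series gives the line graph; plotting the cumulative `tops` gives the stacked view, which shows total attention well but makes individual words harder to compare.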

For much of this past month of January, 2008 Apple has had much greater attention within the Digg community than Microsoft, Google, or Digg itself. There was a large spike in Apple references around Jan 16th which corresponds to the announcement of the MacBook Air at MacWorld. Attention to Digg was higher than Apple around January 23rd.
Recently I completed a small freelance project for the site ButterBeeHappy.com . Basically, the site lets you easily keep a journal of those things that make you happy or that you are grateful for. There is research to suggest that doing so has psychological benefit. The site is free to use and I've been enjoying using it the last couple of months.
My small piece of the puzzle is called the Honeycomb Navigator. It lets you see the words used most often in your entries as well as which other words are associated with them. You can also explore the things that made you, or other people, happy by hovering over and clicking on particular cells.
HoneyComb Navigator (static image)
In the example image the central hex on the left shows a particular user, in this case it's me - jclark. The outer hexes show the words most commonly used in my recent entries: julia, soccer, today, leanne etc. The middle ring of hexes on the left are the other users that most often used these same words in their entries. If you mouse over a user hex the right-hand area shows a random entry from that user. If you mouse over a word hex the right-hand area shows a random entry containing that word. You can click on a word or user to make it central.
You can try out the navigator by itself below or visit ButterBeeHappy.com to sign up yourself !
I have added the transcript for the South Carolina Democratic debate held on January 21, 2008 to the interactive Transcript Analyzer.
Democrat Debate - Jan 21st, 2008

Here are a few simple patterns that I noticed:
I have just posted another tool to my projects section. This one is called the Digg Election Story Analyzer and shows the trends in word usage over time and word associations for stories that reached popular status in the Digg US Elections 2008 topic. The tool loads the latest 500 popular stories and analyzes the text found in the story titles and descriptions. An 'attention timeline' and tag clouds of related words are then displayed.
Here are a couple of images to give you a taste. Of course, it's always more fun to just give the Digg Election Story Analyzer a try!


I have updated the Transcript Analyzer so that you can view different transcripts. Both the Democrat & Republican debates in New Hampshire on January 5th are available. There are two other debates as well.
Democrat Debate - Jan 5th, 2008

Republican Debate - Jan 5th, 2008

These images are a little compressed compared to the actual application but a few things still immediately jump out at me:
Here are the top ten posts on Neoformix that were visited the most often by people during 2007. All but two of them (6 and 9) are interactive applications written in Java/Processing and allow you to explore some data or create an interesting image.
Thanks to everyone who visited the site over the year, especially those who sent me feedback or linked to my content. Best wishes to everyone and may you have a happy, productive, and interesting 2008 !
Like lots of people this time of year I've been thinking about snow. Actually more than thinking - I've been shovelling it, walking in it, driving in it, and playing in it. My latest Text Toy stays with the snow theme. It allows you to generate snowflake-like graphics from a few words or phrases.
Check out the interactive application for the Text Snowflake Creator. You can enter your own text to generate images like:

Lots of people have been having fun with the Big Small application I posted a couple of weeks ago. In fact, I've had a couple of days with more than 25,000 pageviews. Not too bad for such a simple application !
The information provided by the Digg API is quite rich and very relevant to the community of Digg users. I've created a second visualization using the API, this one focussed on the relationships between the latest popular stories. The Digg Story Graph is an interactive visualization that shows the relationships between recent popular stories on Digg through the use of node and link diagrams. Stories can be visually connected through shared vocabulary, common topics, domain, submitter, or date submitted.



There is also a large version of the Digg Story Graph available. It requires 900x800 pixels for proper display and a decent CPU for good responsiveness. The smaller version shows the 100 latest popular Digg stories. The larger version will show 200 and support more word nodes.
Give it a try !
My Digg Explorer has attracted some attention of late culminating in it reaching the front page of Digg late last night. In the span of a couple of hours it received about 5,600 views, and reached a total of more than 7,000 views for the day. To put this in perspective, it's more than my site usually gets in a month. The application and my server handled the load with no trouble and remained very responsive throughout.
I did have a little bit of excitement when Digg decided to add new features to their site immediately before my app went popular. I was a little concerned they might break my application but the only impact was that two new top level categories had no predefined colour and appeared white. Within a few minutes I added colours for the new categories and had it posted to my server.

Thanks very much to everyone who dugg my little application, especially Reg 'Zaibatsu', Muhammad Saleem, and Andrew Sorcini who really got things rolling yesterday. I would also like to thank Stan Schroeder for the write-up in mashable - Beautiful Digg Tool Provides Wealth Of Interesting Data. Thanks also to Daniel Burka, the creative director at Digg, and Tom Carden of Stamen Design for triggering the attention I got within their organizations. Stamen partnered with Digg to produce the very popular Digg Labs visualizations of Digg data.

I have posted the interactive application called Big Small in my projects section. Now you can enter your own text to gene