Winter has finally ended in Markham where I live and it has seemed a very long and cold season this year. I decided to take a look at the weather data from Environment Canada and see whether my impression is supported by the data. The result is the graphic below. Click on it to see a larger version.
Yes, 2014 was the coldest winter in Markham since 1994. We had an average temperature during the winter of -8.2 C this year and in 1994 it was -9.2 C. Both last year and especially 2012 were warmer than usual so it likely felt that much worse in comparison. We also had the 4th most snow in the last 20 years so it was both very cold and snowy.
Toronto is the most multicultural city in the world. According to the 2011 National Household Survey, 46% of the population were foreign-born immigrants and 47% were members of a visible minority. (ref) These immigrants come from a wide variety of places across the globe and their diversity makes the city a truly remarkable place.
I have created a Dot Map that shows a single point for every person in the Toronto area, coloured by visible minority status. There are 5,700,628 in all and they are positioned at their place of residence and coloured based on the information from the 2011 census and National Household Survey. They do not depict actual individual locations but are based on the statistics over small areas.
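The dot placement described above can be sketched roughly as follows. This is only an illustration in Python, not the Processing code used for the map: the `scatter_dots` function, the `bbox` rectangles, and the counts are all hypothetical, and the real map also masked out non-residential areas, which this sketch ignores.

```python
import random

def scatter_dots(areas, seed=0):
    """Return one (x, y, group) dot per counted person.

    Each dot is placed uniformly at random inside the bounding box of the
    small area the person was counted in, so dots reflect area statistics
    rather than real individual locations.
    """
    rng = random.Random(seed)
    dots = []
    for area in areas:
        x0, y0, x1, y1 = area["bbox"]
        for group, count in area["counts"].items():
            for _ in range(count):
                dots.append((rng.uniform(x0, x1), rng.uniform(y0, y1), group))
    return dots

# Hypothetical input: two small areas with made-up counts.
areas = [
    {"bbox": (0.0, 0.0, 1.0, 1.0), "counts": {"South Asian": 3, "Chinese": 2}},
    {"bbox": (1.0, 0.0, 2.0, 1.0), "counts": {"Not a visible minority": 4}},
]
dots = scatter_dots(areas)
```

Each dot would then be drawn in the colour assigned to its group.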
This first image is zoomed in slightly and shows Toronto with only a few outlying areas. You can see regions of higher and lower population density as well as how the visible minorities are distributed across the city.
You can explore the map in detail with this Zoomable Dot Map of Toronto.
The section below is a close-up of the high-density string of condos along Yonge Street north of HWY 401. You can spot the blank rectangle of the cemetery to the left, the Don river valley, and commercial areas where no people reside.
The next image shows the white, predominantly Italian, area of Woodbridge with the South Asian concentration obvious to the west in Brampton.
It was created with population data from Statistics Canada and map reference data from OpenStreetMap. The OpenStreetMap data was taken from the very helpful Metro Extracts provided by Michal Migurski. The TileMill tool from MapBox was used to compose a map used to mask out non-residential areas and also the basemap underneath the dots. Custom code written with Processing was used to place the actual dots and create the final images. Thanks!
The calls people make into the 311 service line in Toronto give an interesting glimpse into the pulse of the city. The City of Toronto makes this data available through their Open Data initiative. I did some analysis and design work with it to produce a visualization for illuminating time-based patterns during 2012.
The visualization is a set of small multiple calendar heatmaps, one for each data series. The one shown above is for reports about 'long grass and weeds'. I was inspired to use this visual form by this example: Vehicles involved in fatal crashes by Nathan Yau. I experimented with a few different visual methods but this one did the best job of revealing both the seasonal and day of week patterns. I chose to use a unique colour scale for each series in order to maximize the amount of detail.
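For readers curious about the layout, here is a minimal Python sketch of how daily counts might be mapped to calendar heatmap cells with a separate colour scale per series. The function name `calendar_cells` and the sample counts are my own illustration, not the Processing JS code behind the visualization, and the sketch assumes every date passed in belongs to the given year.

```python
from datetime import date

def calendar_cells(daily_counts, year):
    """Map {date: count} to {(week_column, weekday_row): intensity}.

    weekday_row is 0 for Monday through 6 for Sunday. Intensity is the
    count divided by the largest count in this series, so each series
    gets its own colour scale.
    """
    jan1 = date(year, 1, 1)
    offset = jan1.weekday()  # weekday row of January 1st
    peak = max(daily_counts.values(), default=0)
    cells = {}
    for day, count in daily_counts.items():
        week = ((day - jan1).days + offset) // 7
        intensity = count / peak if peak else 0.0
        cells[(week, day.weekday())] = intensity
    return cells

# Made-up counts for three days in 2012.
counts = {date(2012, 1, 1): 2, date(2012, 1, 2): 8, date(2012, 1, 9): 4}
cells = calendar_cells(counts, 2012)
```

Scaling by the series maximum is one simple way to get a unique colour scale per series; other normalizations would work just as well.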
The image below shows the top 20 most common types of requests. Click on the image to load the full sized version. You can also view all the data series with an interactive version of the Toronto 311 Visualization.
This was created with Processing JS and contains information licensed under the Open Government Licence - Toronto.
One common pattern I see in many interactive applications is to support a person who is selecting a few items from some larger set. Often these items have various characteristics that the person wants to use in some way to guide their selection process. The characteristics can be numeric quantities, dates, categories, or names of things. Showing all the items in a list and allowing the person to sort by one of the attributes is often a decent default solution.
In other cases it's more useful to consider multiple attributes at a time during the selection process. Maybe you want items that are high in one attribute, low in another, and are from a particular category. Ideally the selection process should be one of exploration and successive refinement where various filtering criteria are adjusted until some small subset of items is defined and they can be investigated individually.
I have built an example of this concept which I call the Visual Book Selector. The books are directly represented with small circles and filters can be applied to progressively exclude books by various criteria. The filters are depicted visually as gates through which some of the items can pass and others cannot. The image below shows one possible configuration.
There are about 1000 books which start in the top segment of the display when no filters have been applied. In this example three of the category gates have been opened so books from those categories can pass through. The ones that don't pass this filter pile up near their closed gate which helps give some understanding of their distribution. The books that pass the first criterion encounter a second filter on the average rating of the book from Google Book reviews. This filter gate is set to only allow books having an average rating of at least 4.0 to pass through. The final gate does a pattern match on author name and lets the 4 books that have passed all of the criteria through to the bottom.
The best way to get a feel for it is to try out the Visual Book Selector yourself. You can use the dropdown selectors on the left of each segment barrier to choose different criteria on which to filter. Hover over a book to see details and click on its circle to visit the corresponding Google Books page.
The list of books and their categories comes from the 2009 article in the Guardian 1000 novels everyone must read: the definitive list. The other data was gathered from Google Books.
I should also note that an excellent solution to this multi-attribute selection/exploration problem posed here is the Elastic Lists concept by Moritz Stefaner. It supports what's called Facet Browsing and enhances it with the visualization of proportions and distributions as well as animated transitions.
Recently YouTube had a video that showed all six Star Wars movies at once. They were placed in a 2 by 3 matrix and had an audio track of all the movies superimposed. It was an interesting experiment that has since been removed based on copyright grounds. Before it was removed I was able to do some simple analysis on the video and extract some details of the individual episodes of the Star Wars series.
Basically, I produced something very similar to a classic work called Cinema Redux™ by Brendan Dawes, done in 2004. Each individual movie in the series was reduced to a collection of small snapshots taken at 1 second intervals. The snapshots are laid out 60 images per row so a row corresponds to a minute in the film. These 'fingerprint' images reveal some aspects of the film structure.
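The layout rule is simple enough to show in a few lines. This is a Python sketch of the arithmetic only, not the Processing code I used; the function name and the 8 by 6 pixel thumbnail size are hypothetical.

```python
def snapshot_position(second, thumb_w, thumb_h, per_row=60):
    """Return the (x, y) pixel position of the snapshot taken at `second`.

    With one snapshot per second and 60 snapshots per row, the row index
    is the minute of the film and the column is the second in that minute.
    """
    row, col = divmod(second, per_row)
    return col * thumb_w, row * thumb_h

# Hypothetical 8x6 pixel thumbnails: the snapshot at 2 min 5 s.
x, y = snapshot_position(125, 8, 6)
```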
Click on any of these images to see higher resolution versions.
I used some fairly simple code in Processing to analyze the video and create the output images.
Last week the wonderful Guardian Datablog published an interesting post called Obesity worldwide: the map of the world's weight. It contains a map that shows with color the rates of obesity around the world. The accompanying chart gives data for different time frames and for both male and female which you can select and view on the map. When I saw the chart I immediately thought of a number of interesting questions that could not be easily answered with the map or chart.
Much of my past work has been driven by personal curiosity. That, together with my own background in science, has shaped my work such that most of it has been exploratory in nature. Recently I have been thinking more about the storytelling or communicative aspect of data visualization. This has been triggered by my admiration for the amazing work of the New York Times Graphics Department, and the writings of Alberto Cairo, Robert Kosara, Andy Kirk, and Jonathan Stray.
I decided to try to build an interactive visualization that helped answer the questions above. I also tried to build something that explicitly highlighted some of the more interesting aspects of the data without sacrificing freeform exploration. I settled on using a Slopegraph which was first described by Edward Tufte and is featured on the cover of Cairo's excellent book The Functional Art.
This first image shows the trend for male obesity organized by continent. It's a difficult problem to show labels for so many countries along one axis so I tried to alleviate it by letting the user expand or hide countries by continent group. In this case 'North America' is expanded to show its individual countries. Labels are only shown if they don't overlap with others. The largest countries by population are placed first.
Individual country lines can be clicked on to emphasize them with colour.
The third example shown below charts female values on the left against male values on the right in order to emphasize gender differences.
The interactive visualization includes a 'stepper' that takes the user through four different views. This helps introduce functionality gradually as well as serving to emphasize important patterns in the data.
In addition to the people and organizations mentioned above I would like to acknowledge the people behind Processing and Processing JS which was used to build the application. The code for the dashed lines comes from J David Eisenberg. Thanks!
In 2006, I started this blog as an outlet for my creative personal work as well as to gather in one place references to interesting work by other people. Since then, Neoformix has grown into a full-time business for me specializing in the development of custom data visualizations. I have just spent some time giving the website its first facelift in 7 years. I hope you like it!
I've tried to simplify the design and emphasize that Neoformix is a business by designing a main page that highlights some projects and moving the blog to a secondary page. Thanks to Twitter Bootstrap for a powerful front-end framework which I made use of in the redesign.
About five years ago I posted a simple little application called Word Hearts which lets you fill a heart shape with words. Last year it was the most visited page on my site despite the fact that it was still a java applet based application which many modern browsers won't render. I have updated this tool to use ProcessingJS so it runs well in modern browsers. There is also enhanced functionality like:
Here are a couple of examples of what you can do:
Launch the interactive version of Word Hearts to try it out.
I have built another little digital humanities project based on the text of the 62 stories in Grimm's Fairy Tales. This one is called Grimm's Story Metrics and presents an interactive matrix of stories together with various metrics calculated from their text. You can click on a column to sort by that data, click again to reverse the direction, and click on a story name to open it in another window. The image below shows the stories sorted by the 'Royalty' metric which indicates, as you would expect, how many references there are to words related to the topic of royalty. Click on the image to go to the interactive tool.
Hovering over any of the bars shows details about that particular measurement. Most of the metrics, like 'Royalty', are based on topics and the details shown are the words characteristic of that topic used in the story. So, for example, the details for 'Royalty' in the 'Frog-Prince' are princess, prince, king, kingdom which are listed in frequency order. These topical metrics are normalized based on total words in the story so longer stories have no scoring advantage.
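As a rough illustration of a length-normalized topic metric like the one described above, here is a small Python sketch. The tokenizer, the `topic_score` name, and the sample sentence are assumptions made for the example; the actual tool may count and normalize differently.

```python
import re
from collections import Counter

def topic_score(text, topic_words):
    """Return (score, details) for one story and one topic.

    score is the number of topic-word occurrences divided by the total
    number of words, so longer stories have no scoring advantage.
    details lists the topic words found, most frequent first.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w in topic_words)
    score = sum(counts.values()) / len(words) if words else 0.0
    details = [w for w, _ in counts.most_common()]
    return score, details

royalty = {"princess", "prince", "king", "kingdom"}
score, details = topic_score("The king and the princess met the king", royalty)
```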
The 'Lexical Diversity' is a ratio of the number of unique words in the story to the total words. These stories are fairly short and you can observe a rough inverse relationship between 'Story Length' and 'Lexical Diversity'. 'Clever Hans' is an outlier in this relationship. If you examine the text for this story you'll see that there is a great deal of repetition.
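The ratio itself is easy to compute. Here is a minimal Python sketch, assuming a simple lowercase word tokenizer (the tokenization used for the actual metric is not specified here):

```python
import re

def lexical_diversity(text):
    """Ratio of unique words to total words (0.0 for empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

ratio = lexical_diversity("the cat and the hat")  # 4 unique words out of 5
```

A story with a lot of repeated phrasing reuses the same words, which pulls this ratio down.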
Area of the words reflects frequency in the text. The top three most similar words are considered for connections with the word similarity metric defined by collocation within the text. The outer ring of words only have one weak connection to another word in the graph.
My previous post on the Grimm's Fairy Tale Network showed a graph illustrating the strongest connections between the various stories. I used a few techniques to try and prevent the usual mess of connections that often obscure the relationships of interest.
Another way of tackling graphs with lots of connections is to only show a small portion of the graph at a time and use interaction to provide navigation. This lets you browse around a complex network of nodes and relations and repeatedly get views centered on a node of interest. I've created an example of this for the Grimm's fairy tale data which I call the Grimm Fairy Tale Connection Browser.
The image below shows the connections to the story 'Little Red Riding Hood'. The larger circles are stories and the smaller ones represent key words in the collection. The inner ring shows the words and stories closely connected to the story of interest. The outer ring gives the related stories and words that are related but with less strength. You can click on any story or word to make it the new focus node. Click on the image below to launch the interactive version.
This second example shows the stories and other words highly related to the word 'wolf'. The interactive tool shows the Gutenberg version of the stories in a panel on the right. When a new story is made the central focus of the visualization the right panel shows the story text.
This was created with Processing JS.
I have had some fun playing around analyzing the text of the stories in Grimm's Fairy Tales. There are 62 stories in this set and they contain many popular tales such as Little Red Riding Hood, Snow White, and Rapunzel. The text analyzed is the English translation by Edgar Taylor and Marian Edwardes available at Project Gutenberg.
The graphic below is a simple network showing which stories are connected through the use of a common vocabulary. There are three different strengths of connection shown and I've tried to minimize the usual 'hairball' nature of these types of diagrams by only showing the top three connections for a story. Some stories will have more than three links because the link meets the top-three threshold for the story on the other end of the link. The shade of blue simply indicates the number of connections for that story - the darker the shade the more connections. Click on the image to see a larger version.
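The top-three rule, including the reason some stories end up with more than three links, can be sketched like this in Python. The function name, the data layout, and the similarity scores are made up for the illustration; it is not the code used to build the graphic.

```python
def prune_to_top_k(similarity, k=3):
    """Keep an edge if it is among the k strongest for either endpoint.

    similarity: {(a, b): score} with each unordered pair listed once.
    Returns a set of frozenset({a, b}) edges.
    """
    neighbours = {}
    for (a, b), score in similarity.items():
        neighbours.setdefault(a, []).append((score, b))
        neighbours.setdefault(b, []).append((score, a))
    kept = set()
    for node, scored in neighbours.items():
        for _, other in sorted(scored, reverse=True)[:k]:
            kept.add(frozenset((node, other)))
    return kept

# Made-up scores, pruned with k=1 to keep the example small.
scores = {("A", "B"): 0.9, ("A", "C"): 0.5, ("B", "C"): 0.4}
edges = prune_to_top_k(scores, k=1)
```

In this toy example story A ends up with two links even though k is 1, because the A to C link is the strongest one for C.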
The diagram shows in the upper-right corner for example that 'Little Red Riding Hood' is strongly linked to 'The Wolf and the Seven Little Kids'. My analysis shows that the strength of this connection is due to them both using words like wolf, stones, door, belly, scissors, drowned, and devour.
The project 'Novel Views' consists of a series of visualizations of the novel Les Miserables by Victor Hugo. The text analyzed is the English translation by Isabel F. Hapgood available at Project Gutenberg.
This graphic shows where the names of the primary characters are mentioned within the text. Click on any of these images to see larger versions.
Characters are listed from top to bottom in their order of appearance. The horizontal space is segmented into the 5 volumes of the novel. Each volume is subdivided further with a faint line indicating the various books and, finally, small rectangles indicate the chapters within the books. In the 5 volumes there are a total of 48 books and 365 chapters. The height of the small rectangles indicates how frequently that character is mentioned in that particular chapter.
A word used in multiple places in a text can be interpreted as a connection between those locations. Depending on the word itself the connection could be in terms of character, setting, activity, mood, or other aspects of the text. This graphic shows a number of these word connections.
The 365 chapters of the text are shown with small segments on the inner ring of the circle with the first chapter appearing at the top and proceeding clockwise from there. The outer ring shows how the chapters are grouped into books of the novel and the book titles are shown as well. The words in the middle are connected using lines of the same color to the chapters where they are used. The edge bundling technique together with the Volume - Book - Chapter hierarchy of the text are used so the patterns of connections are more easily revealed.
I really like the effect and it's completely automatic which opens up some interesting possibilities. The original base image is by Steve McCurry and is of Sharbat Gula. A retrospective on her life done by National Geographic can be found here.
In my last post about visualizing Movement in Manhattan I mentioned that it would be interesting to explore a more direct view of the data by using an animation. I have created such a video based on a fresh collection of tweets from Monday, April 30th. I gathered new data because I realized that my previous data set was collected over the weekend and I suspected that a weekday might provide more obvious patterns. It compresses 24 hours of data into 1 minute of video. Here it is:
I was influenced by the 'Fireflies' video showing iPhone traces done by Michael Kreil. In particular, I like the idea of using larger but more transparent graphics to represent the increased uncertainty when drawing interpolated locations. Basically, if a person tweets at location A and then again at location B ten minutes later the model I used assumes they moved at a constant speed in a straight line between those two events. This is an obviously crude approximation and leads to unrealistic paths in many cases. By increasing the transparency in between the two measured events it shows this uncertainty in a visual manner.
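Here is a small Python sketch of that interpolation idea. The straight-line, constant-speed model is the one described above; the specific alpha and radius formulas are hypothetical values chosen for the example, not the ones used in the video, and the two tweets are assumed to have different timestamps.

```python
def interpolate(a, b, t):
    """Estimate a person's position at time t between two tweets.

    a and b are (time, x, y) tuples for consecutive tweets by the same
    person, with a[0] < b[0]. Returns (x, y, alpha, radius).
    """
    ta, xa, ya = a
    tb, xb, yb = b
    f = (t - ta) / (tb - ta)          # fraction of the way from a to b
    x = xa + f * (xb - xa)
    y = ya + f * (yb - ya)
    # Uncertainty is 0 at either measured tweet and 1 halfway between.
    uncertainty = 1.0 - abs(2.0 * f - 1.0)
    alpha = 1.0 - 0.8 * uncertainty   # hypothetical: more transparent mid-path
    radius = 2.0 + 6.0 * uncertainty  # hypothetical: larger mid-path
    return x, y, alpha, radius

# Two tweets ten minutes apart; where is the person after five minutes?
x, y, alpha, radius = interpolate((0, 0.0, 0.0), (10, 100.0, 50.0), 5)
```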
Again, as I saw in the original version, the patterns of tweets, both moving and static, are quite chaotic. You can easily see the rise and fall of tweets over the changing time of day and some local patterns that look interesting, but the patterns are still a bit of a jumble.
The geolocated tweets were collected with the library Twitter4J which was used from code written in Processing. I used this tutorial created by Jer Thorp to get started with the library. Code from this flow field sample by Daniel Shiffman was used as a starting point to create my flow maps. The background map is from OpenStreetMap. Thanks everyone!
Inspired by the beautiful and elegant Interactive Wind Map created by Fernanda Viegas and Martin Wattenberg I have begun to explore the flow of people within a city. An ideal dataset to do this would include the GPS traces from thousands of people wearing trackers for weeks as they go about their daily lives. Organizations such as crowdflow.net and OpenPaths collect voluntarily donated data of this type and might be fruitful to explore. I decided, instead, to use geolocated tweets to try and see how the movement of people is affected by the urban landscape.
The image below shows an area of Manhattan roughly from Houston Street north to 72nd Street which corresponded to the region with the most geolocated tweets that I collected. It includes Times Square, Grand Central Station, the Empire State Building, Rockefeller Center, the southern portion of Central Park, and many other well known landmarks. The blue and red markings are an attempt to show the flow of people based on the data.
Basically, tweets sent by the same person within a 4 hour time-window were used as samples of speed and direction. These samples were used to construct a vector field representing the average flow of people within the area. The vector field and total tweet density over the space were then used to simulate the movement of people. Particles, representing people, were released at locations where actual tweets were recorded and their subsequent movement was determined by the flow field. The particles start out blue and gradually change through purple to red over time so each trace shows the direction of movement. Locations where there is little movement will have blue dots or very short blue traces. Longer traces with more red show a greater speed at that point.
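A stripped-down Python sketch of the two steps (building the average flow field, then tracing a particle through it) might look like the following. It is an illustration under several assumptions of mine: each sample is assigned to the grid cell where it starts, the grid is a plain dictionary, and the tweet-density weighting and colour changes are left out. The real maps were built in Processing, starting from Daniel Shiffman's flow field sample.

```python
def build_flow_field(samples, cell_size, window=4 * 3600):
    """Average velocity per grid cell from pairs of tweets by one person.

    samples: list of ((t0, x0, y0), (t1, x1, y1)) with times in seconds.
    Pairs further apart in time than `window` (4 hours) are ignored.
    Returns {(col, row): (vx, vy)}.
    """
    sums = {}
    for (t0, x0, y0), (t1, x1, y1) in samples:
        dt = t1 - t0
        if dt <= 0 or dt > window:
            continue
        vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
        cell = (int(x0 // cell_size), int(y0 // cell_size))
        sx, sy, n = sums.get(cell, (0.0, 0.0, 0))
        sums[cell] = (sx + vx, sy + vy, n + 1)
    return {cell: (sx / n, sy / n) for cell, (sx, sy, n) in sums.items()}

def trace_particle(field, start, cell_size, steps, dt):
    """Move a particle through the field and return the path it follows."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        cell = (int(x // cell_size), int(y // cell_size))
        vx, vy = field.get(cell, (0.0, 0.0))  # no data: particle stays put
        x, y = x + vx * dt, y + vy * dt
        path.append((x, y))
    return path

# Made-up samples: two usable pairs and one outside the time window.
samples = [
    ((0, 5, 5), (10, 25, 5)),
    ((0, 6, 5), (10, 6, 25)),
    ((0, 15, 5), (20000, 15, 25)),
]
field = build_flow_field(samples, cell_size=10)
path = trace_particle(field, (5, 5), cell_size=10, steps=2, dt=1)
```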
The density and direction of the flow patterns seem reasonable but they do appear fairly chaotic - much more so than the patterns seen in wind flow for example. This makes sense for several reasons. First, people are much less deterministic than the molecules that make up the air. Second, the environment that they exist in is extremely complex. Third, statistically we are dealing with a much smaller sample size: in this case, roughly 34,000 geolocated tweets with only 9,600 path segments. If we had a million times more data then the average patterns would be clearer. Another important factor is that this data was collected over a few days and so there may be clear patterns for specific times of day that are mixed together visually.
I have produced three more images that separate out the data by time of day. This first one only uses data from 6-11 am. It does appear to be a bit simpler and shows a few interesting patterns but it is still fairly chaotic. There is a strong flow east out from Central Park near 65th Street. There is also a more scattered flow from the east into New York University near the bottom left.
The afternoon flow map shows a greater overall density indicating a greater number of locations from which people are tweeting. There also appears to be a strong convergence on the area of 14th Street - 4th Avenue.
The evening map is also quite busy with lots of small local patterns. There is heavy action between 50th and 57th Streets. Comparing these three versions is easier with this Flickr lightbox version of the images.
Overall, there are lots of flows and some of them likely reflect real movement of people within Manhattan. Many others probably just reflect noisy data because the sample size is so small. It's difficult to distinguish between the two cases here. The technique itself might warrant further study with more data. Another interesting avenue to explore would be to more directly visualize the data with an animation like this 'Fireflies' video showing iPhone traces done by Michael Kreil.
The geolocated tweets were collected with the library Twitter4J which was used from code written in Processing. I used this tutorial created by Jer Thorp to get started with the library. Code from this flow field sample by Daniel Shiffman was used as a starting point to create my flow maps. The background map is from OpenStreetMap. Thanks everyone!
This is Part 4 of a set of posts related to the analysis of the Data Visualization Field on Twitter. For context or more information you may want to read those other posts first. They are:
In the previous posts we have seen that there are two fairly cohesive subgroups of twitter accounts that emerged from our analysis of the original 1000 accounts. I've been calling them the 'blue' and the 'red'. They were determined by looking exclusively at the references to twitter IDs within the tweets that were sent.
Presumably the fact that there are two fairly distinct groups would also be reflected in what they are discussing. I've done some analysis of the words used within the tweets for both groups. English stop words ('the', 'and', 'or', ...) and other words commonly found in tweets ('new', 'via', 'like', 'day', ...) were excluded. Word clouds definitely have their limitations but I believe they can be an effective way to get a qualitative feel for a body of text. I have used Wordle to construct word clouds for the two groups.
It's clear that the blue group tweets a lot about 'art', 'code', 'design', 'processing', 'project', 'app' and 'workshop'. The red group tweets about 'data', 'visualization', 'design', 'infographic', and 'visual'. There is some overlap for sure but it's clear that they emphasize different things in what they are talking about.
Right from the very start I was calling the whole set of accounts the 'Data Visualization Field'. Of course, a more accurate description was that I was looking at the 'Set of Accounts on Twitter Connected Through Tweet Mentions from @moritz_stefaner, @datavis, @infosthetics, @wiederkehr, @FILWD, @janwillemtulp, @visualisingdata, @jcukier, @mccandelish, @flowingdata, @mslima, @blprnt, @pitchinteractiv, @bestiario140, @eagereyes, @feltron, @stamen, and @thewhyaxis'. It doesn't exactly roll off the tongue. From looking at these word clouds it appears that the red group could reasonably be named 'The Data Visualization Field' and the blue group something like 'Computational Artists and Designers'.
If we want to contrast these two groups more directly we can look for words that are used much more frequently in tweets of one group than the other. I've done this for words that met both an overall frequency threshold and an author support threshold - they were used by at least 10% of the group members. The bar charts show the frequency proportion. So, for example, in the large sample of tweets I looked at from both of the two groups if you count the number of times the word 'makerbot' was used then 99% of those instances were in tweets from people in the blue group.
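The frequency proportion can be sketched in Python as follows. The function names, the threshold values, and the sample tweets are all made up for the example, and I have assumed here that the author support threshold only needs to be met in at least one of the two groups.

```python
from collections import Counter

def word_stats(tweets):
    """tweets: list of (author, [words]). Returns counts, authors per word, group size."""
    counts, authors = Counter(), {}
    for author, words in tweets:
        counts.update(words)
        for w in set(words):
            authors.setdefault(w, set()).add(author)
    return counts, authors, len({a for a, _ in tweets})

def blue_share(blue_tweets, red_tweets, min_total=3, min_support=0.10):
    """Return {word: fraction of its uses that come from the blue group}."""
    bc, ba, bn = word_stats(blue_tweets)
    rc, ra, rn = word_stats(red_tweets)
    shares = {}
    for w in set(bc) | set(rc):
        total = bc[w] + rc[w]
        if total < min_total:
            continue  # fails the overall frequency threshold
        b_sup = len(ba.get(w, ())) / bn if bn else 0.0
        r_sup = len(ra.get(w, ())) / rn if rn else 0.0
        if max(b_sup, r_sup) < min_support:
            continue  # fails the author support threshold
        shares[w] = bc[w] / total
    return shares

# Made-up tweets, already split into words.
blue = [("a", ["makerbot", "code"]), ("b", ["makerbot", "makerbot"])]
red = [("c", ["data", "code"]), ("d", ["data", "data"])]
shares = blue_share(blue, red, min_total=2)
```

A share near 1.0 marks a word used almost exclusively by the blue group, and a share near 0.0 marks a red word.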
This shows even more clearly the different things that these two groups emphasize.
The recent post on Data Visualization Field Subgroups had an interesting reaction on Twitter that I didn't expect. Many people that were placed in the 'red group' by the community detection algorithm in Gephi joked about being part of the 'team' and being happy to represent it and be grouped together with the others. Jen Lowe lightheartedly suggested a scrimmage at #eyeo between the red and blue. There was much less reaction from the 'blue group', likely because I'm embedded within the reds myself and so they likely paid more attention to my posts and the subsequent reaction on twitter.
There do, indeed, seem to be two fairly cohesive groups of people here but I suspect there are very many connections between the groups as well. We can use some simple network analysis to get a feel for this. Here are a few statistics calculated on the blue and red groups only:
|Statistic||Blue||Red|
|Number of Nodes||216||244|
|Total Intergroup links||665||1329|
|Total Intragroup links||5405||5047|
|Percent Intergroup links||10.96%||20.84%|
Both groups are pretty similar in most respects. The primary difference is that blue group members have on average more incoming links and that the percentage of intergroup links going from someone in one group to someone in the other is roughly double for reds. Remember that a link from A to B means that A referenced B in a tweet through a reply, a retweet, or just mentioning them in some context. When considering just the links between these two groups the people in red are referring to the people in blue at twice the rate of the reverse.
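The percentages in the table are consistent with dividing the intergroup links by all links that start in the group. A quick check in Python, reading the first column of numbers as blue and the second as red (which matches the statement that the rate is roughly double for reds):

```python
def percent_intergroup(intergroup, intragroup):
    """Share of a group's outgoing links that point into the other group."""
    return 100.0 * intergroup / (intergroup + intragroup)

blue_pct = percent_intergroup(665, 5405)    # about 10.96
red_pct = percent_intergroup(1329, 5047)    # about 20.84
ratio = red_pct / blue_pct                  # about 1.9, i.e. roughly double
```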
If you look at the graph showing both groups together (edges not drawn) it's clear that some nodes, for example blprnt and pitchinteractiv, are on the border between the groups which suggests they likely have a fair number of cross-group connections.
By looking at the details of the connections and their strengths we can quantify the 'blueness' or 'redness' of any particular node. This indicates how embedded they are within their own group. We can also do this separately for both incoming and outgoing links but I'll keep it simple for now and show one value that reflects both types of links together. This first table shows the top blue accounts (by degree) sorted by how 'blue' they really are.
|Blue Account||Degree||Blueness %|
You can see that feltron, blprnt, eyeofestival, and ben_fry are all tending towards the red which matches what we see in the network graphic where they are on the border. This table below shows how 'blue' the top twitter IDs are that were placed in the red group. Again we see that some accounts had significant linkages to the blue group.
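One plausible way to compute a single 'blueness' value that combines incoming and outgoing links is sketched below in Python. This is an assumed definition for illustration (the share of an account's total link weight that connects it to blue accounts), and the account names and weights are invented; the exact formula behind the tables may differ.

```python
def blueness(node, links, group_of):
    """Percent of a node's link weight (in plus out) shared with blue accounts.

    links: {(source, target): weight}
    group_of: {account: "blue" or "red"}; links to other accounts are ignored.
    """
    blue = total = 0.0
    for (src, dst), weight in links.items():
        if node == src:
            other = dst
        elif node == dst:
            other = src
        else:
            continue
        if other not in group_of:
            continue
        total += weight
        if group_of[other] == "blue":
            blue += weight
    return 100.0 * blue / total if total else 0.0

# Invented accounts and weights.
group_of = {"x": "blue", "b1": "blue", "r1": "red"}
links = {("x", "b1"): 3, ("r1", "x"): 1}
value = blueness("x", links, group_of)  # 3 of 4 units of weight are blue
```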
There was some interesting discussion yesterday on Twitter about my post on the Data Visualization Field on Twitter. Moritz Stefaner pointed out that he didn't see a big improvement over his VIZoSPHERE and a quite similar topology. Furthermore, he noted that if you rotate my version 90 degrees counter-clockwise many of the primary nodes line up fairly closely with his. He's right, and it's something I missed noticing completely. It's not really surprising that an analysis of most of the same twitter accounts using a different connectedness metric would yield similar results. I do still feel the map based on tweet text account references is slightly better at the detailed local level but I have no objective evidence that this is the case.
Another interesting thing I learned yesterday was that Lynn Cherny did an excellent analysis of Moritz's data back in September which is reported in Combing Through the Infovis Twitter Network Hairball. She focused on the detection of sub-communities within the network using both Gephi and NetworkX and has some nice results.
Following Lynn's lead I have spent some time looking at the communities within my data. Doing this analysis with Gephi yields subgroups that look like this:
The modularity score was .356 which is slightly under the .4 boundary for significance. By visual inspection of the image above it seems clear that there are two coherent groups to the left and four other groups that are intermixed and less clearly defined. These two coherent groups correspond pretty well to what I saw by eye yesterday. The top-left blue group has people who focus on computational design, generative art, or design in general. The bottom-left red group, as I noted yesterday, seem focused more on the practical aspects of data visualization.
Below is a map showing only the blue group. I've also shown the top 3% of edges. I wasn't able to emphasize the flows as much as I would have liked but you can see some of the stronger edges and their direction. One of the strongest relationships visible in this map goes from @eyeofestival to @blprnt which indicates that a relatively high fraction of the tweets sent by @eyeofestival mention @blprnt.
Here is the map for the red group below. Note that you can click on any of these images to get PDF versions where you can zoom in or search for a particular account.
I consider myself one small part of a community on Twitter that focuses on information visualization, computational design, and interaction design. Collectively we tweet about our personal work, highlight other work of quality or that has interesting characteristics, critique approaches or individual designs, discuss tools and techniques, and suggest interesting datasets or projects. I'm grateful to be connected with such an interesting group of people and I've learned a great deal from them.
Moritz Stefaner is an important part of this group and in July 2011 he created an interesting map of this community he calls The VIZoSPHERE. Basically, he started from a set of 18 selected twitter accounts, found their friends and followers and included any twitter account that met a minimum criterion of connectedness. A small version of part of this map is below. Node sizes reflect the number of followers within this community.
It's a fairly standard graph view of the network data and the sheer number of connections makes them extremely difficult to traverse. Like many such large network graphs the primary utility seems to come from seeing which nodes are largest and seeing which ones seem to be grouped together, presumably reflecting nodes that have a similar set of connections to the rest of the network or strong connections between them. This can sometimes visually suggest sub-groups within the overall community.
After stumbling across this work recently I decided to explore the same problem myself. Rather than rely on follower information for connectedness I decided to analyze the actual tweets sent and look for mentions of twitter IDs. These could be retweets, replies, or just references to someone in a tweet. For a given twitter account we are essentially looking at who they talk to or talk about. Unlike the binary nature of the follower connections we can also measure the strength of this connection by looking at how often one person mentions another.
I started with the same set of accounts that Moritz used: @moritz_stefaner, @datavis, @infosthetics, @wiederkehr, @FILWD, @janwillemtulp, @visualisingdata, @jcukier, @mccandelish, @flowingdata, @mslima, @blprnt, @pitchinteractiv, @bestiario140, @eagereyes, @feltron, @stamen, and @thewhyaxis. I looked at the 1000 latest tweets (or as many as they had if they hadn't sent 1000) and found all the twitter accounts they mention. For each mentioned account I calculated its support - the number of accounts in the original 18 that mentioned it - and used that ranked list to enlarge my set to 50. The latest 1000 tweets for this larger set were retrieved and analyzed in the same way to enlarge the community to 100. I repeated this once more and used tweets from these 100 accounts to finally get a list of the top 1000.
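A minimal sketch of one enlargement step as described above, written in Python rather than the Processing used for the original work. The function names, the mention pattern, and the toy tweets are all assumptions made up for illustration.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")  # assumed pattern for a Twitter ID in tweet text

def support_ranking(tweets_by_account):
    """Rank mentioned accounts by support: how many distinct accounts mention them."""
    supporters = defaultdict(set)
    for sender, tweets in tweets_by_account.items():
        for text in tweets:
            for name in MENTION.findall(text):
                name = name.lower()
                if name != sender.lower():  # ignore self-mentions
                    supporters[name].add(sender.lower())
    # Highest support first; ties broken alphabetically so the order is stable
    return sorted(supporters, key=lambda n: (-len(supporters[n]), n))

def enlarge(tweets_by_account, target_size):
    """Grow the current set of accounts to target_size using the support ranking."""
    current = [a.lower() for a in tweets_by_account]
    for name in support_ranking(tweets_by_account):
        if len(current) >= target_size:
            break
        if name not in current:
            current.append(name)
    return current

# Toy data, invented for the example
toy = {
    "blprnt": ["thanks @flowingdata", "nice work @golan"],
    "eagereyes": ["@flowingdata has a new post", "re @blprnt"],
    "feltron": ["reading @flowingdata and @golan"],
}
print(enlarge(toy, 5))
```

Repeating the same step with the tweets of the enlarged set would give the 18 to 50 to 100 to 1000 progression, assuming the tweets for each new set are fetched in between.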
The total number of tweets analyzed for these 1000 accounts was 821,407 and I used them to determine a directed connection strength between each pair of accounts. This connection data was loaded into Gephi which I used to produce the graph below.
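The exact formula for the directed connection strength isn't given here. One plausible definition, assumed in this Python sketch, is the fraction of an account's tweets that mention the other account. The Source, Target, Weight column layout is a common edge-list convention and is likewise an assumption about how the data might be handed to a graph tool.

```python
import csv
import io
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")  # assumed pattern for a Twitter ID

def connection_strengths(tweets_by_account):
    """Return {(sender, mentioned): fraction of the sender's tweets that mention that account}."""
    strengths = {}
    for sender, tweets in tweets_by_account.items():
        if not tweets:
            continue
        counts = Counter()
        for text in tweets:
            # Count each mentioned account at most once per tweet
            for name in {m.lower() for m in MENTION.findall(text)}:
                if name != sender.lower():
                    counts[name] += 1
        for name, n in counts.items():
            strengths[(sender.lower(), name)] = n / len(tweets)
    return strengths

def to_edge_csv(strengths):
    """Serialise the directed edges as CSV text with Source, Target, Weight columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (src, dst), weight in sorted(strengths.items()):
        writer.writerow([src, dst, f"{weight:.3f}"])
    return out.getvalue()

# Toy tweets, invented for the example
toy = {
    "eyeofestival": ["@blprnt speaks today", "tickets on sale",
                     "@blprnt and @golan on stage", "see you soon"],
    "blprnt": ["thanks @eyeofestival", "new sketch", "more code", "hello"],
}
s = connection_strengths(toy)
print(to_edge_csv(s))
```

Because the strength is normalised by the sender's own tweet count, the edge from A to B generally differs from the edge from B to A, which is what makes the graph directed.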
For a searchable and zoomable version use the PDF.
As in Moritz's VIZoSPHERE there were so many connections that I didn't think they provided any useful information that could be seen with the eye so I left them out. They are used to lay out the nodes for each account, and the node sizes are determined by the degree - the number of edges coming into or out of the node. The bigger nodes can be read off from this graph - @blprnt, @moritz_stefaner, @flowingdata, @visualizingdata, @janwillemtulp, @infosthetics, @golan, @mariuswatz, @reas, @ben_fry, @brainpicker, @nytimes, @timoreilly. Many of these larger nodes are, unsurprisingly, the original seed accounts we started with.
Looking at the details of which accounts are placed near each other seems to give reasonable results. @Eyeofestival is near @blprnt, @krees near @periscopic, and @mccandelish near @infobeautiful. It's very likely that many nodes are placed near each other based on more global or indirect factors so there are still likely some surprising juxtapositions.
Many of the initial seed accounts are in the lower left part of the diagram and seem to reflect a subgroup focused more on the practical aspects of data visualization. The top left accounts seem more to be in the area of computational design, generative art, or design in general. @Blprnt seems to lie between these 2 subgroups. The right side of the diagram seems to be more general media and data sources. I suspect that many of the accounts on the left side mention those on the right but the reverse is not true. In fact, I suspect that many of the accounts on the right side aren't really part of the community in that they don't strongly interact with it. They are sources but not contributors. It would be interesting to repeat my enlargement process from the original seed accounts with some minimum criterion for two-way interaction.
The nodes are colored based on the total number of incoming links which represent people in this community mentioning that account. The darker the color the more incoming links there are. So there are a lot of different people within this community referring to @blprnt, @flowingdata, @brainpicker and @nytimes for example. You can't extract much quantitative detail from a color range but it does give you a feel for which accounts are highly referenced. Note that the color is based on the absolute number of incoming links - not the proportion of incoming to total links. That would be a more interesting measure but I couldn't easily map it to color with Gephi.
This looks like an interesting view of the data and I'm curious to explore a few related variations. Note that prominence within this graphic is a fairly crude measure of overall contribution to the field of data visualization. Many key figures in the field, Stephen Few for example, don't use twitter and so aren't represented here even though his critiques have a huge impact and are discussed within the twittersphere. Many others, such as Ben Shneiderman (@benbendc) and Edward Tufte (@edwardtufte), do use twitter but not extensively and not to a level that reflects their value to the field. They do appear in this map but have very small bubbles.
I have created many word portraits in the past and have always limited myself for the sake of simplicity to completely horizontal or vertical words. My interest in word portraits has been re-ignited by a recent client project and I've started to play with allowing angled text.
In this first example below the words are flat when near the horizontal middle and gradually turn to vertical at the edges. I also swap the orientation below the vertical middle.
In the next example the angle of the word is determined by the brightness level at that point in the image. White regions are flat and dark are vertical. This gives a reasonable contoured effect because the brightness levels in the image vary in a natural fashion.
For this last one the words are all angled towards a point on one of Einstein's eyes.
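The three angle rules described above can be sketched as small functions. This is only an illustration in Python: the linear mappings, the 0 to 90 degree range, and the function names are assumptions, since the exact formulas used for the portraits aren't stated.

```python
import math

def angle_from_position(x, y, width, height):
    """Flat at the horizontal middle, vertical at the left and right edges.
    The sign is swapped below the vertical middle."""
    offset = (x - width / 2) / (width / 2)  # -1 at left edge, 0 in the middle, +1 at right edge
    angle = offset * 90.0
    return -angle if y > height / 2 else angle

def angle_from_brightness(brightness):
    """brightness in 0..255: white (255) gives 0 degrees (flat), black (0) gives 90 (vertical)."""
    return (1.0 - brightness / 255.0) * 90.0

def angle_towards(x, y, focus_x, focus_y):
    """Angle in degrees of the line from (x, y) to a focus point such as an eye."""
    return math.degrees(math.atan2(focus_y - y, focus_x - x))

print(angle_from_position(100, 10, 100, 100))
print(angle_from_brightness(128))
print(angle_towards(0, 0, 1, 1))
```

Each function returns an angle for a word placed at a given spot, so the same placement code could be reused with any of the three rules.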
This post was modified on February 15th, 2012 to reflect changes in the software.
Spot is an interactive real-time Twitter visualization that uses a particle metaphor to represent tweets. The tweet particles are called spots and get organized in various configurations to illustrate information about the topic of interest.
Spot has an entry field at the lower-left corner where you can type any valid Twitter search query. The latest 200 tweets will be gathered and used for the visualization. Note that Twitter search results only go back about a week so a search for a rare topic may only return a few. When you enter a query the URL is changed so you can easily bookmark it or send it to someone. The query brainpicker gives you a display something like this:
At the top left, next to the logo, are six icons to access the different views. The first is called Banner mode and is shown above. Basically, tweets that share a lot of the same words are grouped together and the top five groups are shown. Tweets are often grouped because they are retweets of the same original content but this doesn't have to be the case. They may be tweets from different people that don't even know each other but happen to be discussing the same thing. The intent is to show quickly the most popular things people are saying about a particular topic. Tweets that are more unique are placed in the phyllotaxy spiral to the right.
All the tweet spots show an image of the sender and at any time can be clicked on to see the tweet details. Clicking on the text of an open tweet will show the original in another browser window. Click on the background or an open tweet spot to close it or you can directly click on another spot.
Here is a complete list of the views and what they show:
The Word View, again for the query brainpicker:
The string 'brainpicker' matches the wonderful twitter account by Maria Popova and the results shown above are mainly retweets of or discussions about the tweets she has sent. You can also do a search for @brainpicker including the @ sign to see the latest tweets sent from that account. This uses the standard Twitter API to get the data and so can go back farther in time. The Word View for this query clearly shows the Brainpicker focus on books, reading, writing, art, and maps.
You can also retrieve the latest tweets from a twitter list. Here is an example for a list I created by analyzing who was on various lists created about data visualization. In the search field enter @Top100in/datavis and you should get something like this for the User View:
I was inspired to create this when playing with the wonderful Twitter visualization called Revisit by Moritz Stefaner. Another influence was the Stamen work on Digg swarm which is no longer active but there is a video. My academic background in physics makes it natural for me to think in terms of interacting particles.
Performance is pretty good with the Chrome browser, and decent in Firefox and Safari. It will not work in Internet Explorer (except perhaps the new IE 9). It seems to work reasonably well on the newer iPads although the search field is broken currently in that environment. The application will go out and get new tweets periodically. For popular queries the analysis and display of those tweets will often cause lagging to occur.
Here is a Multiscale Mosaic of Obama created from hundreds of pictures taken during his time in office.
The Van Gogh Portrait Mosaics were fun but I wanted to try an example that uses photographs as opposed to paintings. I settled on a portrait of Obama because of the widespread availability of photographs of him that are free of copyright restrictions. The subimages for this design are taken from the White House's Flickr photostream and seem to have been primarily taken by Pete Souza. I downloaded the 1000 most 'interesting' photos from the stream and used those as input to my process. I also manually selected and hand-centered about 10 interesting regions from these images to augment the set.
Here is a close-up showing the detail near the eye and nose.
Here are four mosaic portraits of Vincent Van Gogh. The primary images and all the various component tiles are regions of paintings by Van Gogh.
A few more details on the multiscale mosaic process can be found in the post Multiscale Mosaics. The portrait images are all from WikiMedia Commons. The other Van Gogh paintings came from here. I created these by writing custom code in Processing.
I have been further refining my multiscale mosaic technique in search of the overriding goal of reconstructing an image from sub-images in such a way that balances the clarity of the large target image and the sub-images. I have tried out lots of ideas and the ones that seem to have the most potential for creating interesting multiscale mosaics are:
I have used a cropped region of Vincent Van Gogh's painting Self-Portrait With Grey Felt Hat as my target image while developing these ideas. The sub-images are sections of Van Gogh paintings. Most are the central squares of the paintings; a few are manually selected square regions that focus on some interesting detail.
These techniques do seem capable of producing interesting mosaic images that can carry meaning at multiple visual scales.
The post Mona Mosaics showed a number of ways to segment a flat surface and build mosaics by filling regions with the average colour for that region in some underlying image. Here is another example of the same technique but this time using a Phyllotaxy spiral, sometimes called a Fibonacci spiral. It's an arrangement commonly found in plant growth - for example in the Sunflower.
Jim Bumgardner has an excellent tutorial where he develops the idea and gives code for producing the pattern and several variations. I'm using something based on his Example 10 code to produce the mosaic below from a simple radial gradient. I love the swirling spirals in opposite directions found in the pattern.
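For reference, the usual golden-angle construction of a phyllotaxy spiral can be sketched in a few lines of Python. This is not Jim Bumgardner's Example 10 code; the function name and the `spacing` parameter are assumptions for illustration.

```python
import math

GOLDEN_ANGLE = math.pi * (3 - math.sqrt(5))  # about 137.5 degrees, in radians

def phyllotaxis_points(count, spacing=4.0, cx=0.0, cy=0.0):
    """Return (x, y) centres for `count` cells arranged in a phyllotaxy spiral."""
    points = []
    for n in range(count):
        r = spacing * math.sqrt(n)   # the square root keeps the cell density roughly even
        theta = n * GOLDEN_ANGLE     # each cell is rotated by the golden angle from the last
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points

pts = phyllotaxis_points(500)
print(pts[:3])
```

For a mosaic, each point would then be drawn as a small shape filled with the average colour of the underlying image around that location.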
And of course we must apply it to the Mona Lisa image as well.
In the previous posts Mona Mosaics, Recursive Mona, and Blended Mona I played around with some ideas for reconstructing the famous Mona Lisa image in different ways. One of the things I did was to build up the image from smaller versions of itself. I was using simple image tinting and blending to get reasonable results.
This time I'm going to select sub-images from a set of pictures and use those to build the large image. This has been done for many years now and there are various tools to support it but I thought it would be interesting to try it myself. For this test rendering I'm using a small set of 23 images related to pizza. For simplicity they are all square images so they map well to the square regions determined by my algorithm. The algorithm selects the best-matching sub-image for each region and if the match isn't very good then it sub-divides the region and tries again at a smaller scale. This version uses blending to try and balance clarity of both the sub-images and the global picture.
For purposes of comparison here is the same image with no blending applied. You can see the sub-images more clearly but the overall image is only vaguely defined. This could be improved by using smaller sub-image pixels or a larger collection of sub-images to choose from.
The previous post, Recursive Mona, showed an image of the Mona Lisa constructed from smaller versions of itself. One of the things I don't like about that image, and most other 'photographic mosaic' type images, is that the grid structure controlling the sub-images is so visually prominent. Using multiple scales as I did helps to some degree but the regularity detracts from the overall image.
I've tried to improve this by breaking down the squares that require a more detailed rendering into subsquares in a more varied fashion. There are now 5 or 6 different splitting algorithms used to get the sub-components. This reduces the number of places where you see large numbers of consecutive tiles with the same geometry.
Another technique I've tried out is to blend the sub-images into the overall image at their edges. This tends to smooth out the edges between adjacent sub-images so it looks more natural and also has the effect of strengthening the overall global image. Here is Mona again with both of these techniques applied.
One of the ideas presented in Mona Mosaics was to break down an image into square areas at different scales where the colour doesn't vary much. A natural extension of this is to redraw a tinted version of the original image inside each square. Repeat a few times and you get a version of the starting image built recursively from smaller and smaller versions of itself. Here is an example of the concept applied again to the Mona Lisa.
A couple of years ago I explored reconstructing images based on Delaunay triangulation and Voronoi decomposition. Inspired by the work of Jonathan Puckey and Andy Gilmore I've revisited the idea of rebuilding images using some geometric-based simplification.
The source image for all these examples is the Mona Lisa. The first rendering is a simple square grid where the colour of each square is the average colour in that region of the underlying image. By using a smaller grid size one can obviously get more detail than is shown here.
The image beside it is much more interesting. I start by looking at large square regions to see how much the colour varies. If it is fairly consistent then that implies there is less detail in that region and I can draw it as a simple large square. If the colour variation is higher than some threshold I look at the smaller subsquares and repeat the process recursively until some lower size is reached. This gives us a version of the image that has smaller more detailed squares where the image varies a lot and larger blocks of colour elsewhere.
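The recursive subdivision described above can be sketched as follows. This Python version works on a small grayscale grid rather than a real colour image, and it uses the max-minus-min spread as the variation measure; both choices, along with the function name, are assumptions for illustration. The side length is assumed to be a power of two.

```python
def decompose(image, x, y, size, threshold, min_size=1):
    """Return a list of (x, y, size, mean) squares covering the given square region.

    image is a list of rows of grayscale values; regions whose values vary by
    more than `threshold` are split into four sub-squares, down to `min_size`.
    """
    values = [image[j][i] for j in range(y, y + size) for i in range(x, x + size)]
    mean = sum(values) / len(values)
    spread = max(values) - min(values)  # assumed measure of colour variation
    if spread <= threshold or size <= min_size:
        return [(x, y, size, mean)]     # consistent enough: draw one block
    half = size // 2
    squares = []
    for dy in (0, half):
        for dx in (0, half):
            squares += decompose(image, x + dx, y + dy, half, threshold, min_size)
    return squares

# Toy 4x4 "image": three flat quadrants and one busy quadrant
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 0, 255],
    [10, 10, 255, 0],
]
squares = decompose(image, 0, 0, 4, threshold=20)
for square in squares:
    print(square)
```

The flat quadrants come back as single large squares while the busy quadrant is broken down to individual pixels, which is the behaviour the rendering relies on.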
Images 3 and 4 are similar but use triangular regions rather than squares. Another wrinkle which I added to the recursive process is to define a location on the base image that shows the 'center of attention'. I then vary the colour consistency threshold based on distance from that point. This allows for manually defining, to a limited degree, where the regenerated image will be more detailed. For these examples I used a point in the middle of the Mona Lisa's face.
The next 2 versions use circular regions which don't fill all the space so a background colour shows through.
These 2 fill the background of each circle with the average colour of that region and this gives a much more pleasing result.
This last image uses a recursive triangle decomposition as well but the sub-triangles are defined in a more varied fashion.
Edward Tufte defines Sparklines as intense, simple, word-sized graphics, that should also be high-resolution. They are a very useful technique, especially when combined with the idea of small multiples.
I generated the example below based on the results of the 2011 Major League Soccer regular season. In this case, a whisker-style sparkline was generated for each team to show the complete Win-Loss-Tie sequence for the season. A small upward blue bar shows a win, a grey bar in the middle a tie, and a downward red bar is, of course, a loss.
The graphic succinctly illustrates how each team did over the season. A few interesting tidbits:
Here are a couple of portraits done with a simple radial scan technique. Arc segments are drawn that are coloured by sampling an image source.
I created some print graphics for Live Magazine back in February. I enjoyed the project a great deal and would be very happy to tackle more print projects. Send me an email at email@example.com if you are interested.
The graphic shows a streamgraph illustrating the top selling automobiles in the UK from 1973 until 2010. The various series were sorted to group the same brands together as much as possible and to add the newer brands to the outside of the graph.
I used custom code created with Processing to create vector output in PDF format and then fine-tuned the graphics with Adobe Illustrator.
I made a couple of minor changes to the Neoformix.com website. The first was that I removed the google Ads. They made virtually no money and cluttered the display up unnecessarily. The second change was that I added a 'Tweet' button at the bottom of every article page to make it easier to share my content on Twitter.
I experimented a bit with adding a more 3D impression to the image by using a tool to bring forward the brighter parts of the image. This was done more for the Gandhi image since the highlighted parts of Hitler didn't correspond very well to depth. The tool I used was DeepImage by Daniel Hawkes.
It has been very gratifying to see the interest in my recently launched Tweet Topic Explorer. In the week since it was made available there have been posts about it on Infosthetics, FlowingData, Cool Infographics, and many other places. It has also had over 1,200 tweets sent about it. Thank you everyone for trying it out and telling your friends!
Much of the initial attention came from people in Europe looking at non-English accounts. The tool was enhanced a few days after launch to ignore stop words in German, Italian, Spanish, French, and Dutch. It's not a perfect implementation and of course misses many common languages but it does make the tool more useful for many more people.
Another request for improvement that I was able to deliver was the capacity to analyze the tweets from Twitter Lists. You can now enter a list name in the field to see a Word Cluster Diagram for the latest tweets from the people on the list. The volume of tweets on a list is usually pretty high so the last 800 tweets (which is how many are used by the tool) will not go very far back in time. When using the Tweet Topic Explorer with a list the tweets on the right are enhanced to include the account and icon for the author of each tweet.
Here is the result for the Twitter List @Top100In/DataVis:
And here are a few others without the tweet list shown. @mashable/marketing:
One problem I face on a daily basis is to decide for a given Twitter account whether I want to follow it or not. I consider many factors when making the decision such as language of their tweets, frequency, whether they interact on twitter with other people I admire, or if I have some personal or geographic connection with them. But the most critical factor for me is whether they tweet about things that match my interests. Sometimes you can get a hint about this by looking at their short one line twitter bio but the best way is usually to scan their latest tweets.
I have created a new tool to help see which topics a person tweets about most often. It also shows the other twitter users that are mentioned most frequently in their tweets. I call it the Tweet Topic Explorer. I'm using the recently described Word Cluster Diagrams to show the most frequently used words in their tweets and how they are grouped together. This example below is for my own account, @JeffClark, and shows one word cluster containing twitter,data,visualization,list,venn, and streamgraph. Another group has word,cloud,shaped,post etc. It's a bit hard to see in this small image but there is a cluster about Toronto where I live and mentions of run, marathon, soccer. Also, there are bubbles for some of the people on Twitter I mention the most often: @flowingdata, @eagereyes, @blprnt, @moritz_stefaner, @dougpete.
For all these images below you can click on them to go to a live version of the tool.
Here is another example showing the full tool. This one is for one of my favourite accounts to follow, @brainpicker, by Maria Popova. In this case the word 'book' has been highlighted with a click and the list to the right shows the tweets that contain the word. The words in the tweet list are coloured if they appear in the word cluster diagram. Clicking a different word bubble will select that word instead. You can click on any twitter @ID in the tweet list to load the data for that account. The tool is currently configured to load the last 800 tweets. For my account this goes back a couple of years in time but for more prolific tweeters it may only span a few weeks. The entry field at the lower left lets you explore the tweets for any twitter user.
Here are a few more examples of the word cluster diagrams generated from some twitter accounts. @acarvin is doing an extraordinary job of covering the events in the Middle East.
A few years back I introduced the idea of Clustered Word Clouds which use word size to indicate frequency but also use positioning and word colour to group words together that were highly correlated in the text. It works reasonably well I think. See the example below:
I've come up with a new variation on this idea that tries to improve a couple of things. In many word clouds, including those generated by Wordle and my clustered clouds, the font size of the words is proportional to the word frequency. This has the effect that words with many letters (for example 'indisposed') cover a much greater area than a word with fewer letters (say 'ill') if they have the same word count. Some word clouds are constructed so that the area of the word is proportional to the word count rather than font height. This often has the opposite effect of unnaturally emphasizing words with fewer letters. My new design uses solid circles of colour whose area is proportional to the count. I think they may do a slightly better job of giving the proper visual emphasis to the words.
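Making the area, rather than the radius, proportional to the count means the radius grows with the square root of the count. A minimal Python sketch, where the function name and the `area_per_count` scale factor are assumptions:

```python
import math

def bubble_radius(count, area_per_count=100.0):
    """Radius of a circle whose area equals count * area_per_count."""
    return math.sqrt(count * area_per_count / math.pi)

# Words with the same count get the same circle regardless of their length
for word, count in [("indisposed", 12), ("ill", 12), ("book", 48)]:
    print(word, round(bubble_radius(count), 2))
```

A word used four times as often gets a circle with twice the radius and four times the area.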
By using larger blocks of colour I think it's also easier to visually distinguish the groups in a clustered cloud. I'm calling this new variation a 'Word Cluster Diagram'. The one below is for the same text as the older style above but the clustering algorithm, and stop word list are a bit different so they aren't directly comparable. I think it has some promise although it's not as space efficient as using the words on their own.
Five years ago today, I published my first entry on Neoformix.com. I wasn't really sure if anyone would pay attention. You have, and for that I thank you all. Thanks especially to everyone who has written about my work or passed it along to your friends.
Except for the first few months, virtually all the images, interactive applications, and analysis presented on this blog were created using code I wrote with Processing. Thanks very much to Casey Reas, Ben Fry, and the community around that wonderful tool. Thanks to all the amazing researchers, coders, artists, and designers that have most directly influenced my work, especially: Ben Shneiderman, Martin Wattenberg, Fernanda Viégas, Ben Fry, Casey Reas, Chris Harrison, Nathan Yau, Lee Byron, Moritz Stefaner, Jonathan Feinberg, Gui Borchet, Jer Thorp, Robert Kosara, Andrew Vande Moere, Manuel Lima, Frederik Vanhoutte, Mario Klingemann, Robert Hodgin, and Tom Carden.
I've selected images from a few representative posts from the past five years. Click on the image to visit the respective post. Thanks again everyone and I'm looking forward to what the next five years will bring!
I have been collecting tweets containing the words 'love' and 'hate' for a couple of years now and decided to analyze them to see what could be discovered. It was a fun project that I finished just in time for Valentine's Day. I hope you love it!
For the data I chose to use every tenth tweet containing the word 'love' and every tenth tweet containing the word 'hate' from all of 2010. This yielded 658,391 love tweets and 503,489 hate tweets. Incidentally, this means there were roughly 6.5 million tweets last year containing 'love' and about 5 million containing 'hate'.
The first set of diagrams in the graphic show the love/hate ratio for various sets of related words. Basically, I counted the number of times a word appeared together with 'love' and together with 'hate'. A simple percentage of 'love' associations out of the total gives a basic measure of sentiment - let's call it the Love Quotient ;) A value near 100% means the word is used almost exclusively with 'love' and never with 'hate' and the graph will show hearts all the way to the right side. Each full heart represents 5% over the 50% neutral point so, for example, 'amazon' has six and a bit hearts showing so its Love Quotient is about 82%.
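The Love Quotient arithmetic is simple enough to write down directly. In this Python sketch the function names and the counts in the example are made up; only the percentage definition and the 5% per heart scale come from the description above.

```python
def love_quotient(love_count, hate_count):
    """Percentage of a word's love/hate associations that are with 'love'."""
    total = love_count + hate_count
    if total == 0:
        raise ValueError("word never appeared with 'love' or 'hate'")
    return 100.0 * love_count / total

def hearts(quotient):
    """Number of hearts: one full heart per 5 points above the 50% neutral point.
    A negative result means the word leans towards 'hate'."""
    return (quotient - 50.0) / 5.0

q = love_quotient(82, 18)   # hypothetical counts
print(q, hearts(q))
```

With these hypothetical counts the quotient is 82%, which works out to 6.4 hearts, matching the "six and a bit" reading in the graphic.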
Using simple word association is a pretty crude measure of sentiment. It obviously would be fooled by a sarcastic tweet like: Ugg - liver and onions again. Don't you just love the food in the cafeteria? Even so, by looking at large quantities of data it seems to give reasonable results in many cases. The data definitely settles the age-old question: pie > cake!
The diagram with all the photos is actually a Treemap. Surprisingly, this is the first treemap to appear on Neoformix since my second post back in April of 2006 about The Map of the Market. This one shows the people who were mentioned most frequently with the word 'love'. It's dominated by celebrities, mostly singers who appeal to young teenagers.
The StreamGraph shows how the word 'love' was used together with various sports over the course of 2010. The term 'football' combines references to both american football and international football (soccer). You can see the peak in June for the World Cup and peaks for both hockey and skating during the winter olympics in February.
Text analysis and creation of the various graphics was done with custom code created in Processing. The Treemap diagram used the Treemap library created by Benjamin B. Bederson and Martin Wattenberg. Thanks!
President Obama delivered the State of the Union speech last night for 2011. I've created a few diagrams that compare it with the speech from last year to try and understand how it differs.
First we have two Sentence Bar Diagrams for the speeches from 2010 and 2011. Sentence Bar diagrams use color coding to show the topic of the various sentences in the text and bar length to show how long the sentences are. In these diagrams I did combine adjacent pairs of sentences so it wouldn't be too long. These two texts are almost the same length, have a very similar breakdown over the four topics, and both have a segment towards the end about security issues. The 2011 speech has slightly more emphasis on the domestic issues of education and less on economic matters.
This next diagram shows the words that were used much more frequently in 2010 vs 2011. For example, the word 'families' - the third down the list, was used 17 times in 2010 but only 2 times in 2011. Other prominent words from last year compared to this year: bill, businesses, security, national, recovery, act, banks, energy, and insurance.
This one below shows the words used much more often this year than last year: new, world, race, future, high, technology, research, education, progress, and innovation.
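A simple way to find words used much more in one text than another is to compare raw word counts, which is reasonable here because the two speeches are almost the same length. The Python sketch below assumes that approach; the ranking measure actually used for the diagrams isn't stated, and the function names and toy texts are invented.

```python
import re
from collections import Counter

def word_counts(text):
    """Lower-cased word frequencies for a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def more_frequent_in(a_text, b_text, top=10):
    """Words used more often in a_text than in b_text, largest count difference first."""
    a, b = word_counts(a_text), word_counts(b_text)
    diffs = {w: a[w] - b[w] for w in a if a[w] > b[w]}
    return sorted(diffs.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

# Toy texts standing in for the two speeches
speech_2010 = "families need jobs families need banks"
speech_2011 = "innovation needs research families innovation"
print(more_frequent_in(speech_2010, speech_2011))
print(more_frequent_in(speech_2011, speech_2010))
```

For texts of different lengths the counts would need to be normalised, for example per thousand words, before comparing.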
Finally, we have a Document Contrast Diagram comparing the two speeches.
I've been exploring algorithmic generation of images from combinations of simple shapes. I'm using alpha-blending with grayscale sub-components and then taking the various shades of gray created through overlap and recoloring based on a palette. Here are a couple that I think turned out pretty well.
The June 2010 Issue of the Harvard Business Review contains a small data visualization piece by myself and Scott Berinato. It's called Six Ways to Find Value in Twitter's Noise and has a StreamGraph showing tweets about the iPad during the launch weekend. I collected and analyzed the data and created the StreamGraph. Scott did a great job picking out some interesting features and explaining what it all means. It was a fun project and it's great to see my work in such a prestigious print magazine. Thanks for the opportunity Scott!
A few weeks ago I had the pleasure of reading Makers, a novel by Cory Doctorow. It's an interesting story, well told, and filled with stimulating ideas related to technology, creative culture, and intellectual property.
Cory makes his work available for free download so I was able to create a Document StreamGraph based on the text of the book. The document is split up into 24 equal sized segments and the word counts are done within each segment. These segments are used in place of time along the horizontal axis of the StreamGraph. I chose to show capitalized words and the resulting image does a reasonable job of illustrating the ebb and flow of the various characters within the narrative.
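The segmenting step can be sketched like this in Python. The function name, the word pattern, and the toy sentence are assumptions; note that this naive version also counts ordinary words capitalized at the start of a sentence, which a real analysis would probably want to filter out.

```python
import re
from collections import Counter

def segment_counts(text, segments=24):
    """Split the text into equal-sized runs of words and count capitalized words in each."""
    words = re.findall(r"[A-Za-z']+", text)
    size = max(1, -(-len(words) // segments))  # ceiling division: words per segment
    result = []
    for k in range(segments):
        chunk = words[k * size:(k + 1) * size]
        result.append(Counter(w for w in chunk if w[0].isupper()))
    return result

# Toy text with invented character names, split into 3 segments of 4 words
toy = "Alice builds things with Bob while Carol writes about Alice and Bob"
counts = segment_counts(toy, segments=3)
for k, c in enumerate(counts):
    print(k, dict(c))
```

Each segment's counter supplies one column of the StreamGraph, with the segment index standing in for time.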
I really like the work of Tatiana Plakhova and have been following her Flickr stream since last year. Some of her images make me think of alien life forms or cities of the distant future. The one on the top right here below reminds me of Cerenkov Radiation.
Using her image from the top left above as inspiration I created a simple animation that tries to recreate her style. This video isn't great quality but seems to get the idea across.
One charting technique that I really like is to take time series for related data that occurred over different time periods and align them to a common starting point so they can more easily be compared. One good example is this graph comparing this recession to the last five in terms of employment decline. Another one, this time interactive, is from the NYT and depicts Paths to the Top of the Home Run Charts.
I have created a couple of simple line charts showing cumulative point production (goals + assists) for selected NHL players over their careers. I'm actually using Adjusted Points which try to control for the fact that teams played fewer games in the past and rule changes and other factors impact the ease of scoring goals over time. Data is from Hockey-Reference.com.
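Both ideas, aligning series to a common start and accumulating points over a career, are easy to express in code. This Python sketch uses invented numbers and hypothetical function names; it does not reproduce the Adjusted Points calculation from Hockey-Reference.com.

```python
from itertools import accumulate

def align_to_start(series):
    """Convert (year, value) pairs to (years since first entry, value) pairs."""
    ordered = sorted(series)
    start = ordered[0][0]
    return [(year - start, value) for year, value in ordered]

def cumulative_by_age(seasons):
    """Convert (age, points) pairs to (age, running career total) pairs."""
    ordered = sorted(seasons)
    totals = accumulate(points for _, points in ordered)
    return [(age, total) for (age, _), total in zip(ordered, totals)]

# Invented season totals for a hypothetical player
seasons = [(19, 50), (18, 40), (20, 60)]
print(cumulative_by_age(seasons))
print(align_to_start([(1979, 40), (1980, 50), (1981, 60)]))
```

Plotting the running totals against age puts players from different eras on the same horizontal axis, which is what makes the career curves directly comparable.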
This first chart shows many of the top players from the past. I only showed data up until age 43. Gordie Howe did get points in the NHL at age 51 but they were negligible in the overall results other than to illustrate his amazing longevity as a player. The graph clearly shows why Wayne Gretzky is called the 'Great One'. You can also see the various plateaus due to injury for Lemieux, and early career end for Bobby Orr (who is also the only defenceman shown here).
The second graph keeps Gretzky and Richard for comparison but highlights many of today's top stars. Crosby appears to have a legitimate chance to match Gretzky but has a long way to go...
I have been collecting tweets containing the word 'love' for more than a year now and just analyzed a sample to see what other words are being used in conjunction with 'love'. I naively assumed I'd see lots of company or product names as the top non-generic terms. There were a few near the top - iphone, ipod, and starbucks for example. The most commonly used non-generic terms were actually almost all Twitter accounts for singers. The person with the most references was @justinbieber. Note that I analyzed 1 out of every 50 tweets so the counts shown here are ~50 times smaller than the real totals for the year.
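The sampling and scaling mentioned in the note can be sketched like this. It is a rough Python illustration, assuming the tweets are held in a list and that a Twitter account is anything matching `@` followed by word characters; the function names are hypothetical.

```python
import re
from collections import Counter

def sampled_mention_counts(tweets, step=50):
    """Count @account mentions in every step-th tweet."""
    counts = Counter()
    for tweet in tweets[::step]:
        # Use a set so an account is counted once per tweet.
        counts.update({m.lower() for m in re.findall(r"@\w+", tweet)})
    return counts

def estimate_totals(counts, step=50):
    """Scale sampled counts back up to an estimate for the full set."""
    return {term: n * step for term, n in counts.items()}

# Made-up data: 150 tweets, sampled at positions 0, 50 and 100.
tweets = ["love @a"] * 100 + ["love @b and @a"] * 50
sampled = sampled_mention_counts(tweets)
estimated = estimate_totals(sampled)
```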
During the last few months the total for @justinbieber exceeded the next top 14 combined. The streamgraph also shows a strong decrease for @mileycyrus and @ddlovato. References to @jonasbrothers seem to have split into separate streams for both @nickjonas and @joejonas.
Here is a PDF version of the streamgraph.
Wouldn't it be cool if your twitter client could directly show tweets with small embedded images? Things like stock charts, graphical weather reports, server status, traffic reports, graphical emoticons expressing emotional state of your friends, mini-graphical movie ratings with thumbs up/down or stars, sports record summaries, or a million others that I haven't thought of? Perhaps something like this?
This shouldn't be very hard. In fact, I think all that's required is the following:
Step 1 is easy. There are hundreds of URL shorteners already in existence. We just need to adopt one that indicates by its name that it points to a small embeddable image. An alternative that would avoid having to get different companies to adopt the same convention would be to use a special hashtag to indicate the same thing. Have all tweets with any link and the tag #inlinedimage handled by showing the image inline. If the link is invalid or doesn't point to a small image then the twitter client should revert to showing the text form.
Step 2 is easy as well since Twitter clients already show images in tweets - the user avatar images. I chose the size constraint by measuring the space used by TweetDeck to show the text of a tweet - I got about 237x62 pixels. This is just slightly bigger than the standard half banner size of 234x60 used for online advertising so I chose that instead.
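The client-side check for the hashtag variant could look something like the sketch below. It is only an illustration in Python: the function name is hypothetical, the link detection is a simple regex, and the image size is assumed to be measured by the caller after fetching the link (no real Twitter client API is used here).

```python
import re

# The tag and size limit come from the proposal above.
INLINE_TAG = "#inlinedimage"
MAX_WIDTH, MAX_HEIGHT = 234, 60
URL_PATTERN = re.compile(r"https?://\S+")

def should_render_inline(tweet_text, image_size):
    """Decide whether a client should show the linked image inline.

    image_size is (width, height) as measured by the caller, or None
    if the link was invalid or did not point to an image.
    """
    has_tag = INLINE_TAG in tweet_text.lower().split()
    has_link = URL_PATTERN.search(tweet_text) is not None
    if not (has_tag and has_link) or image_size is None:
        return False  # fall back to showing the plain text form
    width, height = image_size
    return width <= MAX_WIDTH and height <= MAX_HEIGHT
```

Anything that fails the check is simply displayed as an ordinary text tweet, so clients that ignore the convention lose nothing.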
Here are a few more things that could be added to make this even more useful:
I think many people would find this valuable and it seems quite simple to accomplish. Come on TweetDeck, Twhirl, and other Twitter Client companies - get to work!
This morning I came across the interesting post Visualizing time series data embedded in tweets by Chris McDowall. The basic idea he discusses is to send time series data in tweets and have twitter clients recognize the format and present it as a small graph (or Sparkline) embedded in the tweet stream rather than just text. Chris seems to have been inspired by the Twitter Data proposal.
It's an intriguing idea and Chris created a proof of concept twitter client called the Twitter Sparkline Visualizer.
One problem I see is that a twitter client that doesn't recognize the special data format would show the cryptic form which would probably be undesirable in most cases. Also, the 140 character limit of a tweet would put a fairly tight boundary on how much could be encoded. In a comment on the post, Tom Carden suggested looking at the Google Charts API as a "good example of a concise vocabulary for passing chart data around using URLs".
Tom's suggestion triggered an idea for me: Use any RESTful API like Google Charts to encode small charts in a URL, then use a URL shortener to construct a tweetable link representing the chart. Furthermore, we can use a specially named URL shortener that indicates to a twitter client that all of its links point to small inline charts. This lets a twitter client determine efficiently that a given link can be rendered inline.
It makes sense to generalize the idea further to support use of any small image rather than charts in particular.
About ten days ago I was contacted by Scott Berinato, an editor at the Harvard Business Review, who was interested in writing up some of my visualization work for the HBR Research blog. We had a nice chat and he subsequently published Four Ways of Looking at Twitter which profiled my four twitter visualization tools.
He did a wonderful job and the article got lots of attention on Twitter. I've been tracking many of the tweets about the article and there have been at least 1500 tweets sent by various people telling their friends to read it. All the extra attention has made this the busiest week on Neoformix over the past year. Thank you to Scott for creating the article and thanks also to everybody who passed it along to all their friends!
I was looking for pictures of the new Apple iPad and stumbled across this image of Apple Form Factor Evolution. It's got lots of images of Apple products on a nice simple white background and was perfect fodder to use with the Image Foam Technique so I made this version of the Apple logo from the product sub-images.
In a recent post I showed the Top 20 Individual Data Visualizations Mentioned on Twitter and remarked that many of the most frequently mentioned twitter links were to collections of visualizations. Shown below is a meta list of the top collection-type data visualization or infographic links.
Here are the top product type links in the field according to Twitter data between March 24 and Dec 31, 2009.
Michael Deal has published an interesting collection of graphics in his Charting the Beatles project. This first snippet below shows the beginnings of a graph illustrating authorship and collaboration in songwriting throughout their song collection. The full graphic clearly shows the trend towards less collaboration over time in songwriting, the increasing contribution from George, and increasing contribution by outside contributors.
This second image is from a chart showing references in Beatles songs to earlier songs. There are full images and several other interesting graphics on his site.
For many people Twitter has become the best place for discovering the latest and most interesting work in a variety of fields. In my twitter client I keep a search column open that gets constantly updated with the latest tweets pertaining to data visualization or infographics and I see lots of beautiful content flow by. I've been collecting these tweets for quite a while and thought it would be interesting to analyze them and see which visualizations were shared through twitter the most often.
Many of the top links in the domain were articles containing collections of visualizations chosen to be the 'Top NNN' by some panel of experts. For example, the top most shared link was 50 Great Examples of Data Visualization by Web Designer Depot. I will have another post in the near future that lists the most popular of these types of links as well as separate lists for products/frameworks and news/analysis. For this list I chose to focus instead on references to individual data visualizations or infographics.
Here are the top 20 ordered by popularity. Click on either the link or image to go to the original article.
Note that the link made popular on Twitter for #9 Death and Taxes was actually a link to an image on imageshack and I have used instead a link to the original source of the material.
The tweets for this entire analysis were collected from March 24, 2009 until December 31, 2009. Only the first link to a specific item from each Twitter ID was counted so that one person did not unfairly impact the results by tweeting frequently about the same thing.
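The counting rule in the last sentence can be sketched in a few lines of Python. The function name and the (twitter_id, link) input format are assumptions for illustration.

```python
from collections import Counter

def count_unique_sharers(shares):
    """Count each link at most once per Twitter ID.

    shares is an iterable of (twitter_id, link) pairs, one per tweet.
    """
    seen = set()
    counts = Counter()
    for twitter_id, link in shares:
        if (twitter_id, link) not in seen:
            seen.add((twitter_id, link))
            counts[link] += 1
    return counts

# Made-up data: "ann" tweets the same link three times but counts once.
shares = [("ann", "viz1"), ("ann", "viz1"), ("bob", "viz1"),
          ("ann", "viz2"), ("ann", "viz1")]
popularity = count_unique_sharers(shares)
```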
Items 11-20 are listed below.
Here is a Shaped Word Cloud for tweets containing 'android' from 2009. I removed the tokens 'android' and '#android' from the analysis. You can click on the words to jump to Twitter Search and see the matching tweets. It's pretty clear that android is a 'google' 'phone' and is related to 'iphone' and 'htc'.
I've taken another look at the set of tweets from 2009 that contain 'Obama'. This time I started by focusing on the most popular hashtags that were used. This graph shows the top 10 hashtags, their distribution over the course of 2009, and the total references to them. The top hashtag by far was #tcot which stands for 'Top Conservatives on Twitter'.
How do tweets that contain #tcot differ from those that don't have it? What words seem especially associated with the tag? What topics do people using the tag seem to be focusing on?
I've done an analysis on the word frequency inside tweets containing the tag versus tweets without it. This chart below shows the words that are used much more frequently in the #tcot tweets compared to the baseline. Words on the left like 'CARE' and 'BUSH' are used at a rate of around 100-120% of the baseline rate. Words on the right like 'BHO' (shorthand for Barack Hussein Obama) and 'RASMUSSEN' are used around 500% of the baseline rate - or, in other words, they occur around five times as often in #tcot tweets as they do in non-#tcot tweets.
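As a rough illustration of the percentages above, here is a small Python sketch. It assumes that a word's rate is the fraction of tweets containing it, which may differ from the exact measure behind the chart; the function name and sample tweets are made up.

```python
from collections import Counter

def relative_usage(tagged_tweets, baseline_tweets):
    """Return {WORD: percent of baseline rate} for words in tagged tweets.

    Assumption: rate = (tweets containing the word) / (number of tweets).
    """
    def tweet_counts(tweets):
        counts = Counter()
        for tweet in tweets:
            counts.update(set(tweet.upper().split()))
        return counts

    tagged = tweet_counts(tagged_tweets)
    baseline = tweet_counts(baseline_tweets)
    result = {}
    for word, t_count in tagged.items():
        b_count = baseline[word]
        if b_count:  # skip words never seen in the baseline
            result[word] = (100 * t_count * len(baseline_tweets)) / (
                b_count * len(tagged_tweets))
    return result

# Invented sample: 4 tagged tweets and 10 baseline tweets.
tagged = ["bho speech", "bho poll", "care plan", "tax vote"]
baseline = ["bho"] + ["care"] * 2 + ["other"] * 7
usage = relative_usage(tagged, baseline)
```

In this toy data 'BHO' appears in half of the tagged tweets but only a tenth of the baseline, so it scores 500%.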
The chart is an interesting collection of terms and is an attempt at distilling what the people who use the tag #tcot are saying in relation to Obama. Some notable words in the set are 'DANGEROUS', 'SOCIALIZED', 'EXPOSE', 'RADICALS', 'ARROGANT', 'MARXIST', 'COMMUNIST', 'CLIMATEGATE'.
I collected all the public tweets containing 'Obama' during 2009. There were over 5 million recorded during the course of the year. I've done some analysis on a sample containing every 20th tweet. This first graph simply shows the distribution over the course of the year of the number of times the name 'Obama' was used. The curve has a big peak during the inauguration, a few smaller ones in February and March and is then remarkably level for the rest of the year.
This set of graphs shows other words that were used frequently in the tweets about Obama and that had distributions with a high concentration near specific dates during the year. When ordered by the peak date for each graph they give an interesting graphical narrative of Obama-related events during 2009.
It's been snowing where I live for the last month or so and I've been playing around with generating a dove image from snowflake constituents. This first image is constructed from smaller snowflakes built using the Text Snowflake Creator based on the words PEACE, LOVE, and TRUTH. The dove image is from Wikimedia Commons.
This second version uses the three unicode snowflake characters in the font Arial Unicode MS. I've also applied a small variation in color.
Thank you everybody for your interest in Neoformix over the past year. I wish you all a Wonderful and Happy 2010!
These are the 20 most popular posts published on Neoformix during 2009 ordered by their popularity. There are a large number of popular posts based on the Shaped Word Cloud concept and a few more on the related Image Foam Technique.
Note that many of the most popular parts of Neoformix visited during the past year were for projects published prior to 2009 and include Twitter StreamGraphs, Twitter Venn, Big Small, and Word Hearts.
I'm very pleased to announce that an image from my Twitter StreamGraphs tool was chosen as the cover for the current issue of ACM Crossroads - the Student Journal of the Association for Computing Machinery. There is also a small writeup inside about the image. It depicts the streamgraph for the phrase 'data visualization' and suits the issue well since it is dedicated to the Social Web. The entire issue is available online.
Thanks to Chris Harrison, the editor-in-chief, for inviting me to contribute the image and to Senior Editor Jill Duffy for sending me some copies of the issue.
Fifty-six papers in forty-five countries published a front page article today calling for action at the climate summit in Copenhagen. I've taken the text of the article and created a couple of images. The first is a Clustered Word Cloud which shows the more prominent words from the article grouped into clusters based on whether they were used together.
This second image takes the word clusters and arranges them in a starburst type pattern. The visual form was influenced by the Word Associations work by Chris Harrison. It's a little more interesting to look at and makes the groupings more obvious but has the drawback that the words are smaller than in the first format.
Last night Obama outlined the new policy in Afghanistan in a speech at West Point entitled The Way Forward in Afghanistan and Pakistan. Like many people, I have mixed feelings towards a larger military effort in the region. I have tried to represent that ambivalence with an animated word cloud based on the speech that transitions from one symbol to another.
Text pagers are usually carried by persons operating in an official capacity. Messages in the archive range from Pentagon, FBI, FEMA and New York Police Department exchanges, to computers reporting faults at investment banks inside the World Trade Center.
The archive is a completely objective record of the defining moment of our time. We hope that its entrance into the historical record will lead to a nuanced understanding of how this event led to death, opportunism and war.
I have taken this data and done an analysis for 100 phrases selected to summarize the events of that horrible day. I have focused on the time period from 8am until 8pm, September 11th, 2001.
This video below shows a Phrase Burst Visualization of the data. The larger the text the more frequently it was used during the 12 hour period. Text appears bright during the times of high usage and fades away otherwise. The color hues are cosmetic. This phrase burst visualization is basically a word cloud where the brightness of the words varies according to how prominent the words were during specific periods of time. You can drag the playhead for the video around to examine specific times.
Perhaps a more useful view of the data is provided by this set of timeline graphs. They are ordered by the time of the highest peak for the phrase and in this arrangement provide a narrative of the events.
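The ordering used for the timeline graphs can be sketched as follows. The graphs themselves were made with Processing; this is just a short Python illustration with a hypothetical function name and invented counts, assuming each phrase has a list of counts per time slot.

```python
def order_by_peak(series_by_phrase):
    """Return phrases sorted by the time slot of their highest count."""
    def peak_index(item):
        _, counts = item
        # Index of the largest value; the earliest slot wins on ties.
        return max(range(len(counts)), key=counts.__getitem__)

    ordered = sorted(series_by_phrase.items(), key=peak_index)
    return [phrase for phrase, _ in ordered]

# Invented counts for three placeholder phrases over three time slots.
timeline = order_by_peak({
    "phrase b": [0, 1, 5],
    "phrase a": [9, 1, 0],
    "phrase c": [0, 7, 1],
})
```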
Video, graphing, and analysis done with custom code created with Processing.
I believe that the recent Swine Flu pandemic has been dramatically overplayed in the media. This morning I came across the image below on dataviz.tumblr.com that shows the number of deaths in the last 300 days from various causes including Swine Flu. There are a lot of things done really well here - the most important of which is that the deaths due to swine flu are put in a proper context.
Unfortunately the choice of using a solid red bar for emphasis beside the bar graph for Swine Flu deaths confuses the message because at first glance the bar can be interpreted as an extension of the bar graph itself. The first impression (and for some viewers the only impression) is that the deaths due to swine flu are exceptionally high - the very myth that the graphic is trying to dispel.
I have made a small intervention to the graphic that I believe makes the message less likely to be confused. The bar has been replaced with a text label and three arrows that can't be confused with an extension of the graph itself but still draw attention to the relatively small number of deaths for Swine Flu.
In a recent post I defined the idea of Twitter ListMates as IDs that are frequently grouped together on the same twitter lists. The listmates for some starting ID give an interesting perspective on how that ID is perceived by others and are in some sense similar to it.
If the starting 'seed' ID is highly characteristic of some particular domain then the highest ranking listmates will also be characteristic of that domain. As a concrete example, let's start from infosthetics, the twitter account for one of the central websites in the area of data visualization. The top ranking listmates are: flowingdata, datavis, and infobeautiful which are all very important voices in the domain.
If we start with all four of these IDs, find the lists they are on, and see who else appears on the same lists the most often we can get an excellent quality list of twitter IDs for the field of data visualization. By starting with a small set of IDs rather than just one we introduce less bias into the result. Another technique that can be used to improve quality is to only use twitter lists whose name matches the domain as well - for example include the members of a list called 'datavis' but not of one called 'friends' when determining the listmates.
I have used this technique to define a number of twitter lists for various domains and saved them under the twitter ID Top100in. The lists defined so far are:
I have updated Twitter StreamGraphs to support the new twitter lists. You just enter a list in the standard format in the text box to see the graph for the latest 1000 tweets from all members of the list. The standard format looks like this: @scobleizer/web-innovators.
In Twitter ListMates I introduced a name for the idea of people who are often grouped together on Twitter lists. The idea has value because listmates have been grouped together by multiple people who independently decided that those accounts are similar in some sense. Doing this type of analysis starting from my account, JeffClark, helped me find new people to follow.
I have repeated the process for four other accounts to try and confirm that this technique is indeed useful. The results are shown below.
For Robert Scoble (scobleizer) we get:
For Shaquille O'Neal (THE_REAL_SHAQ) we get:
For John Mayer (johncmayer) we get:
And for Alex Payne (al3x), an engineer at Twitter:
Again, it seems to give good results: Scoble is grouped with other influential people in the field of technology; Shaq with a mixture of athletes and other celebrities; John Mayer with musicians and celebrities; and Alex with a mixture of developers, other twitter employees, and people influential in technology.
In the recent post called Twitter List Profile Clouds I explored how the Twitter list names to which a person has been added can reveal how they are perceived across the twittersphere. Another interesting idea is that when somebody adds an account to a list they are implicitly defining a relation between that account and every other account on the same list. They are essentially making a declaration that all the members of the list share some characteristic. The name of the list usually offers a clue about how all the list members are related.
So, for example, the fact that datavis and flowingdata both appear on a list together means that somebody thinks they are similar in some sense. And if the list name is called 'datavisualization' then that reveals how the list creator thinks they are similar.
I think of two accounts that appear on a list together as 'listmates'. It seems a reasonable name for the concept and follows the pattern of schoolmates, roommates, teammates etc. If you take all the Twitter Lists that an account is listed on and find all the members of those lists you can define a set of users related to the starting account. Keep track of how many times they appear in total and you also get a numeric score for how similar they are.
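The scoring described above can be sketched in Python like this. It assumes the list memberships have already been fetched into a dictionary; the function name and the sample lists are made up, and no Twitter API calls are shown.

```python
from collections import Counter

def listmates(seed, lists):
    """Score accounts by how many lists they share with the seed.

    lists maps a list name to the set of account IDs on that list.
    Returns [(account, shared_list_count), ...], highest score first.
    """
    scores = Counter()
    for members in lists.values():
        if seed in members:
            scores.update(m for m in members if m != seed)
    return scores.most_common()

# Invented memberships for illustration.
sample_lists = {
    "datavis": {"seed", "acct_a", "acct_b"},
    "visualization": {"seed", "acct_a"},
    "friends": {"acct_x", "acct_y"},  # seed is not on this list
}
ranked = listmates("seed", sample_lists)
```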
I tried out the idea using my own account, JeffClark, as a starting point. Here are my top 25 Twitter Listmates:
The list is a who's who of people I respect and admire in the field of data visualization and I'm very pleased that others have grouped us together. I believe this technique has promise for finding interesting new accounts to follow.
Jer Thorp has been doing some amazing work over the last couple of years. He just wrote an excellent post called Two Sides of the Same Story: Laskas & Gladwell on CTE & the NFL where he introduces a small visualization tool to look at the similarities and differences between two articles published in October about head injuries and the NFL. The articles are Game Brain, by Jeanne Marie Laskas and Offensive Play, by Malcolm Gladwell. The image below shows an example of what his tool can do.
I have previously explored the idea of comparing and contrasting document pairs with my Document Contrast Diagrams. The diagram below was created from the same two articles that Jer used in his analysis. There are obviously a lot of differences between the two visualizations both in appearance and in the technical means of constructing the diagrams but the underlying organizational metaphor is the same:
Jer's tool seems designed more to be for interactive exploration whereas mine is focused more on creating static diagrams that try and show more information all at once. Mine also tries to illustrate emotional tone (with the little coloured triangles), the overall document size difference, and the fraction of unique or shared vocabulary.
Just to be clear, I'm NOT suggesting Jer used my work as a starting point for his own - although I'd be flattered if he did! It's just a case of two people tackling the same problem and independently coming up with a fairly obvious approach to represent the information. Those of you who like my work should check out his blog blprnt.com. Jer has recently published the source code for a number of his projects and has plans to set free the code for this tool as well.
Twitter recently introduced the Twitter List feature which lets people define sets of user accounts that are related in some manner. The lists are given a name and can be followed by other people who are interested in seeing all the tweets from the accounts in the list. Popular twitter users such as Robert Scoble appear on thousands of lists - 3963 for Robert at this time. My Twitter ID, JeffClark, appears on a more modest 40 lists for comparison.
The act of assigning someone to a list is a type of tagging operation and the name of the list gives a clue regarding how that person is regarded by others. I've used the new Twitter List API to get the names of all the lists that Scoble currently appears on. Some simple counting (using code of course) gives us a table showing the most common names for lists that he appears on. The first few entries are:
I have used these names and frequency counts to generate a Shaped Word Cloud that illustrates the various list names that list creators associate with Scoble.
Here is another Twitter List Profile Cloud below - this one is for Guy Kawasaki. It has many similarities to the one for Scoble but there are some names much more prominent for Guy: marketing, business, and entrepreneurs for example.
And here is a third. Can you guess who it's for?
Here is another Delaunay Image, this one created from a well known photograph by Steve McCurry for the National Geographic. The subject was Sharbat Gula and a retrospective on her life done by National Geographic can be found here.
Here are a couple more Voronoi designs based on the same image.
I created these images with custom software written in Processing that relies heavily on the Mesh library by Lee Byron. I also used the Mesh demo created by Marius Watz as a starting point for my code. Thanks!
One reason the images I referenced in my previous post caught my eye was that I've been playing around with a similar technique for a couple of months now. I dusted off the code and improved it to support Delaunay images as well as to do shading of the triangles or polygons.
Image 1 below shows a Delaunay image constructed from the Mona Lisa. The triangles in the first image are coloured evenly and the shade is the average colour of the three vertices. Image 2 is the same except I'm colouring the triangle pixels based on a function of how far they are from the various vertices and the colours at those vertices. It gives a much more realistic image.
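The two shading styles can be illustrated with a short Python sketch. The flat version averages the three vertex colours as described. For the smooth version I have swapped in standard barycentric interpolation as one plausible distance-based function; the original Processing code may well use a different weighting, and all function names here are hypothetical.

```python
def flat_colour(c0, c1, c2):
    """Image 1 style: one colour per triangle, the average of its vertices."""
    return tuple((a + b + c) / 3 for a, b, c in zip(c0, c1, c2))

def barycentric_weights(p, v0, v1, v2):
    """Weights of point p relative to the triangle's three vertices."""
    (x, y), (x0, y0), (x1, y1), (x2, y2) = p, v0, v1, v2
    det = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    w0 = ((y1 - y2) * (x - x2) + (x2 - x1) * (y - y2)) / det
    w1 = ((y2 - y0) * (x - x2) + (x0 - x2) * (y - y2)) / det
    return w0, w1, 1 - w0 - w1

def interpolated_colour(p, vertices, colours):
    """Image 2 style (assumed): blend vertex colours by barycentric weight."""
    weights = barycentric_weights(p, *vertices)
    return tuple(sum(w * c[i] for w, c in zip(weights, colours))
                 for i in range(3))

# A made-up triangle with red, green and blue corners.
verts = [(0, 0), (4, 0), (0, 4)]
cols = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
```

With this weighting a pixel at a vertex takes exactly that vertex's colour and pixels in between blend smoothly, which is why the result looks more realistic than flat shading.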
I've removed the triangle edges in image 3 and image 4 is the original for reference. I like this technique because you can easily control where the resulting image is more detailed by just using more control points in that region or by shading the triangles differently.
There is a related type of diagram that is based on Voronoi cells. This next image is the Voronoi diagram using the same control points as above. The regions are polygons of arbitrary number of sides rather than triangles. The last image uses more control points to get more details from the underlying base image.
I created these images with custom software written in Processing that relies heavily on the Mesh library by Lee Byron. I also used the Mesh demo created by Marius Watz as a starting point for my code. Thanks!
I stumbled across this image by Hugo Dechesne and liked the sense of depth suggested by the stacked tiles. Click on his image to see a higher resolution version.
I've tried to recreate the technique and applied it to a more famous image. The second version below just uses smaller tiles. I'm pretty happy with how it came out for such a simple technique but I still prefer the shading in Hugo's images. I think he's using a more diffuse and rounded shadow.
I love typographic designs. When I was doing my first work with Word Portraits a year ago it occurred to me that I could probably make a really cool children's ABC book where the representative images were constructed with words or letterforms. I thought it might be visually interesting and that potentially there might even be an educational benefit for the kids learning to read if the images helped them remember the beginning letter for the word. I haven't pursued the idea yet but I just stumbled across a beautiful example of the same idea. It's called alphabeasties: and other Amazing Types and was created by Werner Design Werks.
Here are a couple of images from the book:
I encountered this via grain edit.
Last week I produced several Document Contrast Diagrams comparing speeches by various political leaders in the UK. The diagrams were used in an article for The Times called How the party leaders' speeches compare. See the article for all three diagrams and a description of how to interpret the diagram. The one for David Cameron and Gordon Brown is shown below.
Thanks to Jonathan Richards and The Times for the opportunity to get some exposure for the technique.
A couple of months ago I attended the grand opening of a new exhibit at the Toronto Zoo called Tundra Trek. While I was there I noticed they were promoting it with a cool composite design made from symbols of local landmarks. I couldn't find it online at the time but just looked again and found it at adsoftheworld.com. Design by Lowe Roche, Canada.
Thanks to Joe Sapiano, a long-time zoo volunteer (and my father-in-law), for the invitation to the event.
Here is a typographical piece about peace. It's called 'Peace Dove' and uses the word 'peace' translated into 21 different languages - English, Hindi, Chinese, French, Russian, Dutch, Hebrew, German, Greek, Czech, Filipino, Arabic, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Portuguese, and Swahili. How many can you find? The dove image is from Wikimedia Commons and I used Google Translate to get the word in the different languages.
My original image had the characters shown in reverse order for both Arabic and Hebrew. The image shown below has been corrected. Thanks to Ori Folger for pointing out the problem.
This is the fourth part in a series analyzing a year's worth of tweets containing the word 'apple'. The first three sections are:
The graphs below showing distribution over time have been normalized to remove the trend of increasing number of tweets over time. This helps show the underlying patterns related to the specific term of interest. Note that the month labels are positioned at the beginning of the month.
Here are a few observations:
I added five more links in my Portfolio section a couple of days ago. The link is (currently) found near the top right of all the pages on Neoformix. If you are looking for a post of mine based on the memory of an image it might prove to be a useful starting point since all the links have small thumbnail images associated with them.
In case you weren't aware the Archive link brings you to a page showing all the posts on Neoformix. It does take a while to load and I will likely reorganize them by year before 2010 begins.
Here is a StreamGraph prepared from the text of Obama's speech to the UN. I've tried to show more words than in some of my other text-based StreamGraphs but I'm not sure it is successful. More words means more slices and less of a chance that you can follow an individual slice through the speech to see the rise and fall of its frequency.
This is the third part in a series analyzing aspects of a year's worth of tweets containing the word 'apple'. The first part of the series discussed Apple Brand References in Tweets and showed which Apple brands were referenced the most and their distribution over time. It also included word clouds showing the terms most often associated with each of the primary brands. One of these is shown below for 'ipod'.
It's interesting and gives some indication of the other topical words related to 'ipod' and their relative frequency. One thing it doesn't do is show what people feel about ipods. Do they Love them? Hate them? Can we figure it out from all this data?
One simple method of approaching this problem is to see which emotion-laden adjectives or declarations occur together with the various brands in tweets. This is a crude form of sentiment mining that makes no attempt at detecting sarcasm or the even more common inversion due to modifiers like 'not'. The size limitations of tweets mean that they seldom express ideas in a subtle or linguistically complex fashion so it might be appropriate to use such a simplistic approach - especially when we are dealing with large volumes of tweets like we are here (570,464).
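A crude version of this co-occurrence counting might look like the following Python sketch. The function name, the whitespace tokenizing, and the sample brand and emotion word sets are assumptions for illustration; as noted above, nothing here handles sarcasm or modifiers like 'not'.

```python
from collections import Counter, defaultdict

def emotion_profile(tweets, brands, emotion_words):
    """Count how often each emotion word appears in the same tweet as a brand.

    brands and emotion_words are sets of lower-case tokens.
    """
    profile = defaultdict(Counter)
    for tweet in tweets:
        tokens = set(tweet.lower().split())
        for brand in brands & tokens:
            profile[brand].update(emotion_words & tokens)
    return profile

# Invented tweets and word lists.
tweets = ["love my new iphone", "itunes is awful",
          "cool ipod and cool iphone"]
profile = emotion_profile(tweets,
                          brands={"iphone", "ipod", "itunes"},
                          emotion_words={"love", "cool", "awful"})
```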
I have repeated the word association analysis done in Apple Brand References in Tweets but have restricted the words of interest to a small set of terms that are often used to express feelings. Have a look:
There appears to be considerable variation in the spectrum for the different brands. People seem to find 'iphone', 'ipod', 'nano', and 'shuffle' to be cool and interesting. They love the 'mac' and are much more negative towards 'itunes'. I suspect this technique might indeed be valuable.
This is a second installment in a series analyzing aspects of a year's worth of tweets containing the word 'apple'. The previous post showed which Apple brands were referenced the most and their distribution over time. This one focuses on the other companies mentioned in tweets containing 'apple'. The data is from Sep 1st, 2008 until Aug 31, 2009 and I collected a total of 2,852,320 tweets in this time frame and analyzed every fifth tweet emitted (570,464 of them) to get the results below.
Apart from Apple itself, the most frequently mentioned company in the data was Google followed closely by Microsoft. The spread over time is very spiky for all the companies but some exhibit very little attention apart from the spikes - Dell, Adobe, and Facebook for example. Verizon shows a significant drop off in attention over this time span and both AT&T and Palm have become discussed more often over time. As in my previous post these distribution graphs have been normalized using the number of tweets in each time period to remove the overall trend of increasing twitter use from the picture. They are also scaled independently in the vertical direction in order to show the most detail for each graph.
I've created accentuated word clouds that show the words used in conjunction with the various companies to give some idea of what was being discussed in relation to Apple and the respective company. In an accentuated word cloud the sizes of the words are a function of both the frequency with which they occur and their prominence relative to a baseline text.
Which Apple brands are most discussed on Twitter? I have analyzed a large set of tweets that contain the word 'apple' sent out over the course of an entire year - from Sep 1st, 2008 until Aug 31, 2009. I collected a total of 2,852,320 tweets in this time frame and analyzed every fifth tweet emitted (570,464 of them) to get the results below.
The following graph shows the distribution over time of the number of tweets containing the word 'apple'. There is an obvious overall rising trend as we would expect since the use of twitter has grown greatly over the course of that year. There are also many large peaks throughout the year and at least one large trough in late Aug 2009 that was likely due to a failure in my data collection infrastructure. There also appears to be a relative slowdown in activity in July and August but examining data from multiple years might be necessary to confirm this.
By far the most frequently mentioned Apple brand in the data was 'iphone', which had 97,166 references in the 570,464 tweets (17%). The bar graphs below show the total number of references for some other Apple brands and how they compare to 'iphone'. Also shown is the distribution of the brand usage over time. These distribution graphs have been normalized using the number of tweets in each time period to remove the overall trend from the picture. They are also scaled independently in the vertical direction in order to show the most detail for each graph.
Note that these results are for tweets containing 'apple' and the brand in question. There are obviously a lot of tweets that mention these brands without explicitly referencing 'apple' but they are not a part of this analysis.
I lined up the initial graph showing the total number of tweets with the brand distribution graphs and you can see that several of the peaks in number of tweets in March correspond to big spikes in references to 'iphone', 'ipod', and 'mac'. The brands 'safari', 'shuffle', 'ilife', and 'iwork' have surprisingly few references apart from the big spikes - people just aren't tweeting about them. All the top 6 brands, together with 'nano' and 'itouch', seem to have more consistent chatter about them in the twittersphere. The term 'leopard' (as in Snow Leopard) is obviously of more recent interest.
These graphs above give a great idea of how often the various brands were mentioned and the distribution over time. What are people actually saying about these brands? I've created accentuated word clouds that show the words used in conjunction with the various brands.
In an accentuated word cloud the sizes of the words are a function of both the frequency with which they occur and their prominence relative to a baseline text. For example, the word 'new' may be used quite frequently in tweets about 'iphone' but if it is used proportionally less often than in other tweets it will be made smaller. Similarly, a word like '3gs' may appear much more frequently together with 'iphone' than in other tweets and so its size is increased.
Today, in Arlington Virginia, Obama delivered some Back to School remarks to the students of America. Here are a few choice snippets:
Where you are right now doesn’t have to determine where you’ll end up. No one’s written your destiny for you. Here in America, you write your own destiny. You make your own future.
No one’s born being good at things, you become good at things through hard work.
So today, I want to ask you, what’s your contribution going to be? What problems are you going to solve? What discoveries will you make? What will a president who comes here in twenty or fifty or one hundred years say about what all of you did for this country?
One of the trending phrases on twitter lately has been 'True Blood' due to the popularity of the True Blood TV series. I've noticed lately that most trending terms in twitter have quite a large number of spam tweets and this is no exception. I've used Twitter Venn to try and get a feel for what the proportion of the spam tweets are for this topic. A quick glance at the search results showed large numbers of spam tweets mentioning free grocery money or gift certificates so I did a twitter Venn of 'True Blood' versus 'grocery'.
Based on the tweets at this time there are 8597 tweets/day for 'True Blood' that don't mention 'grocery' and 3781 that mention both. This gives us a spam proportion of approximately 3781 / (3781 + 8597) = 31% without even including spam tweets that don't mention 'grocery'. If you look at the red word cloud for 'True Blood' without 'grocery' you can see that there are several other spammy words that are fairly prominent - 'won', 'free', 'cash', 'gift', 'cards'. This suggests that the amount of spam for this topic is even higher.
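The arithmetic above can be reproduced with a small helper. This is only a sketch: the function name is made up for illustration, and the two inputs are the tweets/day figures quoted above.

```python
def spam_proportion(both_per_day, topic_only_per_day):
    """Share of the topic's tweets that also contain the spam marker word.

    both_per_day: tweets/day mentioning the topic AND the marker word.
    topic_only_per_day: tweets/day mentioning the topic without the marker.
    """
    return both_per_day / (both_per_day + topic_only_per_day)


# Figures quoted in the text for 'True Blood' versus 'grocery'.
share = spam_proportion(3781, 8597)
print(f"{share:.1%}")  # prints 30.5%, i.e. roughly 31%
```

Because only one marker word is counted, the result is a lower bound on the spam share.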
These numbers do change quickly because they are based on the latest tweets only. To do an accurate analysis would require looking at more data over a greater period of time.
I have been having fun recently exploring how the use of words in tweets varies over the time of day ( #1, #2, #3, #4, and #5 ). A minor change in the code I use for the analysis of the text in the tweets lets me look instead at how use of words varies over the course of a week. The dataset contains over a million tweets sent from Toronto during June and July, 2009 so we have roughly 8 weeks of data. I've binned the data into 2 hour segments by day of the week.
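A minimal sketch of the day-of-week binning, assuming each tweet is available as a (datetime, text) pair already expressed in local Toronto time. The function names and the whitespace tokenization are illustrative assumptions, not the code used for the charts.

```python
from datetime import datetime

BIN_HOURS = 2
BINS_PER_DAY = 24 // BIN_HOURS   # 12 two-hour bins per day
BINS_PER_WEEK = 7 * BINS_PER_DAY  # 84 bins, Monday first


def week_bin(ts):
    """Map a timestamp to its bin index within the week (0 = Monday 0-2am)."""
    return ts.weekday() * BINS_PER_DAY + ts.hour // BIN_HOURS


def weekly_word_counts(tweets, word):
    """Count tweets containing `word` in each 2-hour bin of the week."""
    counts = [0] * BINS_PER_WEEK
    for ts, text in tweets:
        # Naive tokenization: lowercase and split on whitespace.
        if word in text.lower().split():
            counts[week_bin(ts)] += 1
    return counts
```

With roughly 8 weeks of data, each bin sums about 8 occurrences of the same 2-hour slot, so a single unusual day can dominate a bin.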
As in the charts below, many of the time series show obvious daily patterns with no apparent variation across the different days. Note that the day of week labels are positioned at noon of the respective day.
Other words show strong peaks for certain days of the week. The terms 'tgif' (Thank God It's Friday), '#followfriday', and 'mondays' appear in the expected locations. Why is 'father' localized to Sunday ? And 'michael' on Thursday ?
Let's check out the terms that have similar shaped curves to these words. For 'father' we get:
From these temporally related terms I suspect the tight association between 'father' and Sunday is because of Father's Day. This year it fell on Sunday, June 21st, which is within the range of data used for this analysis.
Similarly for 'michael' we get the graphs below and it's easy to see that Michael Jackson died on a Thursday.
Here are a few terms that seem relatively high on weekends:
Overall the technique seems to work well for analyzing day of week patterns. As is often the case, much of what gets revealed seems obvious in retrospect. I suspect, however, that this type of analysis could discover non-obvious patterns as well.
Here is a fifth post in a series looking at word usage by time of day in tweets. The first four posts are useful background material if you haven't read them yet:
If you look at the time series for the top ten words you will notice that many of them have a very similar shape. For the words 'lol', 'new', 'time', 'love', 'know', 'great', and 'twitter' they all seem to peak around 1-2am, drop off to a lowest point between 3-5am, and gradually rise during the day. Why should there be a relationship between the curves for these words ? Do lots of people write tweets that use these words together ? Or is there some special temporal relationship between these words ?
The answer is much simpler. One of my readers, Kyle McDonald, posed an interesting question: is tweet density (tweets over time) relatively constant throughout a day? The data I'm using here all comes from Toronto. It's a single location and is therefore from a single time zone, which is important when looking into the time of day that the words were used. If we look at the curve for number of tweets over time of day for this data we get this:
So, no, the tweet density is not relatively constant throughout the day for a specific location. This curve is very similar to the common shape we see for the set of words listed above. The counts for these words are basically just tracking the number of tweets. Or, in Kyle's terms, the word count density over time is just tracking the tweet density. So the interesting features in the curve for the word 'love' seem to arise because more tweets are getting sent out during those times of day and are not due to any special temporal property of the word itself.
Kyle goes on to suggest that it would be really helpful to see the same plots normalized by tweet density. Here are the normalized curves for the same set as above:
Many of these normalized plots are basically flat except for noise. Those for 'new', 'time', 'know', and 'twitter' seem to show no special relationship with time that isn't accounted for by the simple fact that more tweets are occurring in total during certain periods. Several of the other words still show strong peaks, 'lol', 'day', and 'today' for example. The series for 'toronto' now has a jagged set of peaks evident just before 6am which were not apparent in the raw time series shown in blue. This technique does indeed appear to be useful in highlighting those words that are used preferentially during certain times of day.
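The normalization can be sketched in a few lines, assuming the word counts and the total tweet counts are lists over the same time buckets. The function name is hypothetical.

```python
def normalize_by_density(word_counts, tweet_counts):
    """Divide each bucket's word count by the total tweets in that bucket.

    Buckets with no tweets are reported as 0.0 to avoid dividing by zero.
    """
    return [w / t if t else 0.0 for w, t in zip(word_counts, tweet_counts)]


# A word that simply tracks overall volume comes out flat after normalizing.
raw = [2, 4, 8]
volume = [10, 20, 40]
flat = normalize_by_density(raw, volume)  # every bucket has the same rate
```

A word whose normalized curve still has peaks is being used preferentially at those times, which is the effect described above.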
This is another post in a set looking at word usage by time of day in tweets. This time the data includes all the tweets sent from Toronto in June and July of 2009. The post Temporal Correlation for Words in Tweets probably has the most relevant background.
Each of these sets below consists of 5 line graphs showing usage of the word in tweets with the time of day along the horizontal axis. The first series, in black, is the word of interest. The next 2, in blue, are highly correlated with the focus word - the words tend to be used during the same times of day as the word of interest. The last 2 words, in red, have a negative correlation.
Note that these aren't necessarily the words with the strongest correlation. From the stronger matches I've selected the ones that seem most insightful. Many of the strongest positive correlations arise because the words are often used together within the same tweets. For example there are quite a few tweets that talk about eating 'pancakes' or 'eggs' at 'brunch' so it isn't especially surprising that their time of day profiles are similar. The combination 'yoga' and 'pets' seems a bit more surprising. I've checked in the actual tweets and can't find any that contain both words at once.
The negative correlation between 'yoga' and 'guns' isn't very strong but I find it kind of amusing. The strong correlation between 'drunk' and 'ill' and the negative match with 'gym' seems appropriate.
A mysterious person calling herself the perfumeladi contacted me a few weeks ago and asked nicely for a Shaped Word Search puzzle for perfumes. Here it is:
I'm continuing my exploration of how frequently words are used in tweets during the various times of day. If you haven't seen them already, you might want to read Time Series for Word Counts in Tweets and Temporal Correlation for Words in Tweets for background information and details about the dataset.
Here are some word graphs for a few different beverages. 'Coffee' shows the strongest time dependence and is of course at its peak during the morning hours. Both 'beer' and 'wine' rise gradually from about noon until 2-3am. The graphs for 'tea' and 'water' are both pretty flat (but noisy).
Some more collections of graphs follow. You can spot the trends yourself so I won't describe them all. Note that many of these charts are quite noisy. They could obviously be improved by using more data although I am already analyzing half a million tweets to get these results. Using 30 minute time slices rather than the 15 minute slices I'm currently using would smooth out the graphs as well.
The graph for 'happy' has some unusual peaks that look like they occur around 10am, 11am, noon, 1pm, and 2pm. I'm not sure what explains this regularity. These tweets are from Toronto during the month of July, which includes the data for Canada Day on July 1st. Here are the graphs for the words highly correlated with 'happy':
In my last post, Time Series for Word Counts in Tweets, I showed some graphs illustrating how often a word was used in tweets during the various times of day. I'm using the same data here, 575,962 tweets sent from the Toronto area in the month of July 2009. Some of the graphs show very similar shapes, for example 'morning', 'breakfast', and 'coffee' in the set below.
We can spot these visually but if we are analyzing a large number of words, say 1000 or more, it would be useful to be able to calculate the similarity of the curves in order to find matches automatically. We want 'scale invariant' matches - curves with the same shape but not necessarily the same scale. Our curves are just plots of 96 numbers - since I'm summing the counts within 15 minute time buckets and 24 hours * 4 (buckets/hour) = 96 buckets. We can compare two curves by looking at the correlation between their time series values. If the curves go up and down in the same places then they are visually similar and the correlation gives us a way to quantify this.
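A minimal sketch of the comparison, using the standard Pearson correlation coefficient written out in plain Python. Treating "correlation" as Pearson's r is an assumption here, and the function name is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length series of bucket counts."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("series must be non-empty and the same length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A completely flat series has no shape to compare, so report 0.
    return numerator / denominator if denominator else 0.0


# Same shape at ten times the scale still correlates almost perfectly,
# which is the 'scale invariant' behaviour wanted here.
small = [1, 3, 2, 5]
large = [10, 30, 20, 50]
r = pearson(small, large)
```

Because the means are subtracted and the result is divided by both spreads, multiplying a series by a constant does not change the coefficient.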
If we select a time series for a word of interest we can calculate the correlation between that series and each of the others in turn. Then we can show the graphs with the highest correlation to see those with the most similar profile over time of day. Here are the top matches for 'morning':
The correlation coefficient is shown to the right of the graph. A value of '1' means perfect correlation, around '0' is no correlation, and a value of '-1' means an inverse or negative relationship. Below are shown some series that show no correlation at all with 'morning'. I was surprised to see that 'bed' isn't used in tweets around the same time of day as 'morning'.
Here are a few examples of negatively correlated words. The relationship isn't quite as strong as for the best positive matches: -0.55 compared to +0.90.
So the word with the strongest inverse relationship with 'morning' is 'bored'. Interesting - I guess people don't get bored in the morning as much as the rest of the day.
I have been playing around with a fairly large collection of tweets looking into the patterns of word usage over the time of day. The dataset contains 575,962 tweets that were sent out from accounts located within 50 miles of Toronto during the month of July, 2009. For each of the most common 1000 words (except for stop words) I counted how often they were used in each 15 minute period of the day. The counts for all the days in July were simply added together so the shape of the series is for a typical July day. The following graph shows the time series plotted for the most common word - 'lol'.
Both the beginning and end of the horizontal axis represent midnight and noon is in the middle. This graph shows a peak around roughly 2-3am and a low point around 6am.
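The counting step described above can be sketched as follows, assuming tweets arrive as (datetime, text) pairs in local time. The names and the whitespace tokenization are illustrative assumptions, and restricting to the 1000 most common words is left out.

```python
from collections import defaultdict
from datetime import datetime

BUCKET_MINUTES = 15
BUCKETS_PER_DAY = 24 * 60 // BUCKET_MINUTES  # 96 buckets


def time_of_day_counts(tweets, stop_words=frozenset()):
    """Return {word: [count per 15-minute bucket]}, summed over all days."""
    series = defaultdict(lambda: [0] * BUCKETS_PER_DAY)
    for ts, text in tweets:
        # The date is ignored, so every day adds into the same 96 buckets.
        bucket = (ts.hour * 60 + ts.minute) // BUCKET_MINUTES
        for word in text.lower().split():
            if word not in stop_words:
                series[word][bucket] += 1
    return series
```

Bucket 0 covers 12:00-12:15am and bucket 95 covers 11:45pm-midnight, matching the midnight-to-midnight axis.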
If we look at the traces for the #1, #10, and #100 most popular words and keep the vertical scale the same we don't have any detail in the smaller series ( for 'girl' ).
If we scale each graph independently so that the fine details are present for each series then we can no longer tell when looking at a set of graphs which ones have the larger counts.
I've been experimenting with drawing both the absolute and independently scaled versions on the same graph so that both the detail and overall magnitude are evident.
It seems to work pretty well. I've used the darker line with the filled area underneath for the absolute scale to give it more prominence.
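One way to prepare the two versions of a series for plotting is sketched below. The function name is hypothetical, and `global_max` is assumed to be the largest count across all the words being charted together.

```python
def scale_series(series, global_max):
    """Return (absolute, independent) versions of a series, both in 0..1.

    absolute: scaled against the largest value across all charted words,
              so heights can be compared between graphs.
    independent: scaled against this series' own maximum, so its fine
              detail fills the full height of the graph.
    """
    own_max = max(series) or 1  # avoid dividing by zero for an all-zero series
    absolute = [v / global_max for v in series]
    independent = [v / own_max for v in series]
    return absolute, independent
```

Drawing both lists on one axis gives the overall magnitude from the first and the shape from the second.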
Here is a set of graphs for some obviously time-dependent terms:
These series seem more interesting than those with a more even distribution over time. Rather than visually scanning a large set of graphs to find these candidates I constructed a metric that measures the clumpiness of each series and used that to focus my search.
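The post does not say how the clumpiness metric was defined. One plausible stand-in, offered purely as an assumption, is one minus the normalized Shannon entropy of the series: an evenly spread series scores 0 and a series concentrated in a single bucket scores 1.

```python
import math


def clumpiness(series):
    """Hypothetical clumpiness score in [0, 1] based on Shannon entropy.

    This is an illustrative substitute, not the metric used for the graphs.
    """
    total = sum(series)
    n = len(series)
    if total == 0 or n < 2:
        return 0.0
    # Entropy of the series treated as a probability distribution over buckets.
    entropy = -sum((v / total) * math.log(v / total) for v in series if v > 0)
    return 1.0 - entropy / math.log(n)
```

Sorting words by a score like this would put strongly time-dependent terms near the top without scanning every graph by eye.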
There is an obvious similarity evident in the curves for 'morning', 'breakfast', and 'coffee'. In a future post I will describe a technique for detecting these matching curves automatically and show some results based on it.
I just recently finished gathering a complete year of tweets containing the word 'apple' - from Aug 7th, 2008 until Aug 6th, 2009. There were approximately 2.7 million public tweets over that year containing the word. I have used a sample comprised of every 10th tweet of the complete set to create a shaped word cloud showing the words most frequently used. This is a re-creation of a shaped word cloud visualization I did in January that only included tweets from Jan 20-21, 2009.
The dominant words don't seem too surprising. You can click on the words to jump to Twitter Search and see the matching tweets.
In Word Clouds from Adjusted Counts I introduced the idea of accentuated word clouds and mentioned the possibility of breaking down a collection of tweets by geographic origin and contrasting the word counts to uncover geographic patterns. I've done something similar with a large collection of tweets sent from either Toronto, London, or San Francisco. They are actually a 1% sample of all the public tweets sent within 50 miles of the respective city centers during the month of July, 2009.
The three blocks of words reflect those words used frequently and proportionally more often in tweets being sent from the respective cities. Apart from the city names, some prominent words are:
The prominence of 'pumper' for Toronto puzzled me a bit so I looked into the data more closely. There is a series of twitter accounts similar to ToFireE that pump out alerts for every emergency fire unit dispatched in the city. They include reason for dispatch, location information, and also the vehicle which is often named pumper-nnn where nnn is some number.
Another interesting thing that you can pick out from the clouds is that San Francisco tweets contain a lot more hashtags than in London or Toronto. Those that seem largest are: #science, #gaming, #loss, #prop8, #discount, #ffs, #weight, #wine, #sfgiants. It might be interesting to more carefully examine the proportion of tweets that contain hashtags and whether it is changing over time.
I have created another set of Shaped Word Search puzzles. This set of 26 puzzles is in black and white and will print nicely on a black and white printer. The theme is animals and the simple silhouette images are from the freeware font called 'Animals' by Alan Carr.
All 26 puzzles are found in a single PDF file. There are actually two versions: easy and hard. The hard versions use a smaller font size so there are more letters, add more partially matching distractors, and have more of the words in reverse order.
Feel free to print these off for your own personal use but don't post the PDF anywhere else or try and sell it. Have fun!
This new collection of Shaped Word Search puzzles is based on vehicle designs by cemagraphics. They all use the same transportation-related word list which I constructed with a little help from Google Sets.
Click on the images below or use these links to get high-quality PDF versions to print: ladybug, dragonfly, mantis, and ant. They look great when printed in colour but probably not so good in grayscale.
I'm continuing to explore the idea of accentuated Word Clouds that I introduced in the previous post about New Testament Word Clouds. This time I compared the news coverage from four different sources about Obama's recent speech delivered in Ghana. The source texts are from the New York Times, Fox News, Al Jazeera, and AllAfrica.com.
The first word cloud was created from the text of all four articles put together and does a reasonable job of showing the key words for the event. The top words are 'Obama' , 'Africa', 'Ghana', 'president', 'life', 'future' etc.
These four accentuated clouds below are created by comparing each source article in turn against the overall collection. They illustrate the words that are used frequently and proportionally more often in that particular text.
Here are a few prominent words that I notice from a quick glance at these clouds:
The word cloud below was created from the text of the four gospels of the New Testament of the Christian Bible. I used the King James Version from the wonderful Project Gutenberg. The primary words of emphasis are not surprising - 'jesus' , 'son', 'father', 'lord', and 'god'.
Lately I have been exploring the idea of using clouds built from relative word frequency counts to emphasize the differences between a text and some baseline text. I'm leaning toward calling these accentuated word clouds.
I have created four separate accentuated word clouds for each of the gospels and show them below. The baseline text was all four gospels together so each cloud shows which words are used frequently and proportionally more often in that text versus the overall collection. This kind of cloud illustrates the unique aspects of that particular text.
Let's look at a word that is very prominent in one of the clouds. In the gospel of John, the word 'jews' seems central but it either doesn't appear or is very small in the other three. The number of times it appears in the four gospels is 5, 6, 5, and 67 for Matthew, Mark, Luke, and John respectively. If you calculate the number of occurrences per 1000 lines to account for the different sizes of the various texts then you get 1.4, 2.6, 1.3, and 23.2 times/1000 lines.
These accentuated word clouds appear to be doing a good job of highlighting the terms that are characteristic of the various gospels. It is certainly possible to design a visualization that more directly shows the relative frequency of the key words in different texts but the visual simplicity of these accentuated word clouds has some advantages.
When trying to understand something it is often very useful to compare and contrast the data of interest with some related data. This can serve to emphasize the unique characteristics of the data you are studying. Another way of thinking about it is that you are filtering out the background noise in order to clarify the signal.
I mentioned in the recent post Shaped Word Cloud: Canada that I had adjusted the word counts according to how frequently they occurred in a baseline dataset. In this post I give a graphic example of the effects of this type of adjustment.
The data used is a collection of 16,504 tweets gathered during the month of June, 2009 and containing the word 'starbucks' . They are every 10th tweet of the full 165,040 that I collected during this time period. I also discarded the tweets that were obviously non-English. The words 'starbucks' , 'coffee' , and any twitter ID were not used in the analysis.
The following word cloud was constructed from the word frequencies found. It includes stop words and the cloud shows that 'in' , 'to' , 'at', 'is' and many other small words are frequently used in the text. The problem is that this is true for any sizable amount of English text and so this word cloud doesn't illustrate any real useful information specific to 'starbucks'. For this reason, stop words are almost always excluded from word clouds.
This next cloud was generated from the same data and the only change was that stop words were excluded. Now we can start to see some interesting emotion-laden words like 'love' , 'good' , 'work' , 'like' as well as some that are obviously characteristic of the search term like 'hot' , 'cup', 'mocha', 'frap', and 'drinking'.
To reveal more detail specific to 'starbucks' I have adjusted the word counts in this final cloud based on how frequently the words occurred in a baseline data set. The baseline I used here was a collection of tweets containing the word 'coffee' taken over the same time period as the original starbucks tweets. I won't describe the math in detail but, basically, I boosted the counts for words by a factor that is a function of the word frequency rate in the two data sets. If a word is used much more frequently in the starbucks data than the coffee data then its count is elevated so that it becomes more prominent in the cloud.
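Since the exact math is not given, the following is only a sketch of one adjustment of this kind. The boost factor (the ratio of smoothed usage rates raised to a power `alpha`), the add-one smoothing, and all the names are assumptions chosen for illustration.

```python
from collections import Counter


def accentuated_weights(target_words, baseline_words, alpha=0.5, smoothing=1.0):
    """Return {word: weight} for words in the target text.

    weight = raw count * (target rate / baseline rate) ** alpha
    Rates are smoothed so words missing from the baseline do not divide by zero.
    """
    target = Counter(target_words)
    baseline = Counter(baseline_words)
    vocab_size = len(set(target) | set(baseline))
    target_total = sum(target.values()) + smoothing * vocab_size
    baseline_total = sum(baseline.values()) + smoothing * vocab_size
    weights = {}
    for word, count in target.items():
        target_rate = (count + smoothing) / target_total
        baseline_rate = (baseline[word] + smoothing) / baseline_total
        weights[word] = count * (target_rate / baseline_rate) ** alpha
    return weights
```

With `alpha=0` the weights fall back to plain counts, and larger values push the cloud further toward words that are distinctive to the target text.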
This word cloud is much more revealing of those things discussed in tweets together with 'starbucks'. Some of the large terms include, '#starbucks', several variations on 'frap', 'ruling', 'fructose', 'lemonade', 'venti', 'card', and 'sponsorship'.
By choosing different baseline datasets it is possible to accentuate different perspectives of the original data. For example, breaking down a collection of tweets by geographic origin and contrasting the data using this technique would let you uncover geographic patterns. What are people saying about Starbucks in San Francisco that is different from what they say in New York , or London ? If you break up the tweet collection by time you can answer questions like: What are people saying about Starbucks at lunchtime versus in the morning ? Or, What are they saying on Tuesdays versus Saturdays ?
I believe this technique may prove very useful in revealing information from large amounts of text.
The blog Computational Legal Studies has a word cloud using the text of the Declaration of Independence created with Wordle. I liked the idea and so to help all my American readers celebrate the 4th of July I've created a word cloud using the same text in the shape of the US map. I added some stars to fill out the shape better. The word colors are random.
Happy Canada Day ! This is a Shaped Word Cloud created from the text of approximately 168,000 tweets containing the word 'canada'. The tweets were gathered over an 11 month period from July 31, 2008 to June 30, 2009.
Basically, the larger the word the more frequently it appears in the text. Stop words were discarded. I also adjusted the size based on the relative frequency of the word in the canada dataset versus a baseline dataset containing tweets about india and china. A word like 'country' or 'travel' is used approximately the same for canada as for india and china and so will be de-emphasized. Words like 'hockey' , 'canadian', 'snow' and place names within canada will appear bigger. Because of the baseline content the result will not properly reflect any strong associations between canada and india or canada and china. As usual you can click on a word to see the current twitter search results.
Here is another Shaped Word Search in honour of Canada Day tomorrow, July 1st. This one is in the shape of a map of Canada and uses Provinces, Territories, and cities in the word list. Click on the image or here for the PDF version.
In honour of Canada Day tomorrow, July 1st, I have created a Shaped Word Search with a maple leaf design and words I associate with Canada. I improved my tool slightly to sort the words in alphabetical order so it is more convenient to look them up. Thanks to Joe S. for the suggestion. Click on the image or here for the PDF version.
Here is a Venn Diagram made with Twitter Venn that shows the relative frequency of tweets made about the recent deaths of three celebrities - Michael Jackson, Farrah Fawcett, and Ed McMahon. This analysis was done around 7am EST today and the absolute numbers for tweets/day will certainly increase as more people in the US come online. I expect the proportions among the various combination regions to stay roughly the same.
A couple of points of interest:
To explore the data using the interactive application click on the image below or this link: Twitter Venn for #michaeljackson, #farrahfawcett, and #edmcmahon.
Here is a different view of the relationships between the Twitter employee accounts first presented in this post. I measured the similarity between all the twitter employee accounts based on the overlap in vocabulary used in their last 200 tweets. A clustering algorithm was then used to group them together based on the pairwise similarity scores. The algorithm was tuned to limit clusters to have a maximum of 8 members.
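The post does not name the similarity measure. A simple sketch of vocabulary overlap, assuming Jaccard similarity over the set of words each account used, could look like this (function names are illustrative):

```python
def vocabulary(tweets):
    """Set of lowercase words used across an account's tweets."""
    return {word for text in tweets for word in text.lower().split()}


def jaccard(vocab_a, vocab_b):
    """Shared words divided by total distinct words; 0.0 if both are empty."""
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0
```

Computing this for every pair of accounts gives the pairwise scores that a clustering algorithm can then group on.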
The image below was created from the cluster members data, the similarity between clusters, and the similarity within each cluster. To minimize line clutter I am only drawing a connection if it is one of the top 2 strongest for either end node. The clustering and layout code is based on what I used for the Toronto Twitter Community project but has been recently enhanced to support some new client work.
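The edge-thinning rule can be sketched as below, assuming similarities are stored in a dictionary keyed by node pairs with one entry per unordered pair. The data layout and function name are assumptions.

```python
def strongest_edges(similarity, top_n=2):
    """Keep an edge if it is among the top_n strongest for either endpoint.

    similarity: {(node_a, node_b): score}, one entry per unordered pair.
    Returns the set of (node_a, node_b) keys to draw.
    """
    edges_by_node = {}
    for (a, b), score in similarity.items():
        edges_by_node.setdefault(a, []).append((score, (a, b)))
        edges_by_node.setdefault(b, []).append((score, (a, b)))
    keep = set()
    for edges in edges_by_node.values():
        edges.sort(reverse=True)  # strongest first
        keep.update(edge for _, edge in edges[:top_n])
    return keep
```

Every node keeps at least its own strongest links, so no cluster ends up disconnected, while weak links between well-connected nodes are dropped.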
Here is the PDF version of the Twitter Employee Clusters.
Here is another example of a Shaped Word Search. This one uses a Twitter Bird as the image and a list of words related to twitter. I also experimented a bit with adding distractors in order to make the puzzle more difficult. There are a couple of partial matches for each word mixed in to the letter matrix. Click on the image or here for the PDF version.
I celebrated Father's Day this weekend with my wife's parents. While there, I spent a frustrating and unsuccessful 15 minutes looking for one of the few remaining words in a giant word search my father-in-law was working on. We found out later by checking online that there was an error and the word wasn't even present in the puzzle!
Much more enjoyable was the hour or so we spent doing a virtual tour of Malta using Google Earth. My father-in-law was born there and we had great fun zooming in with the aerial views finding the house where he lived, the church where he was baptized, etc. We were also able to easily see wonderful pictures of the many famous churches and natural features like the Blue Grotto. It's a beautiful and fascinating place and I'd love to visit sometime.
Well, the ideas of Malta, word search puzzles, and the usual mishmash from my coding projects mixed together in my brain while I was sleeping and I woke up early realizing I could easily write a tool to create 'Shaped Word Search Puzzles'. Basically, I can take a template image and a list of words and automatically construct a word search puzzle shaped and coloured to match the image.
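A rough sketch of how such a tool might work, assuming the template image has already been reduced to a grid of True/False cells (True where letters may go). Colouring, distractors, and image loading are omitted, and every name here is hypothetical.

```python
import random
import string

# Eight directions: horizontal, vertical, and diagonal, forwards and backwards.
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (-1, 1), (0, -1), (-1, 0), (-1, -1), (1, -1)]


def build_puzzle(mask, words, seed=0, attempts=2000):
    """Place words inside the True cells of mask, then fill the rest randomly.

    Returns (grid, placed) where grid is a list of rows of single characters
    (a space outside the shape) and placed maps each word to its cells.
    """
    rng = random.Random(seed)
    rows, cols = len(mask), len(mask[0])
    grid = [[None] * cols for _ in range(rows)]
    placed = {}
    # Longest words first, since they are the hardest to fit.
    for word in sorted((w.upper() for w in words), key=len, reverse=True):
        for _ in range(attempts):
            r, c = rng.randrange(rows), rng.randrange(cols)
            dr, dc = rng.choice(DIRECTIONS)
            cells = [(r + i * dr, c + i * dc) for i in range(len(word))]
            fits = all(
                0 <= y < rows and 0 <= x < cols and mask[y][x]
                and grid[y][x] in (None, ch)  # empty, or crossing on same letter
                for (y, x), ch in zip(cells, word)
            )
            if fits:
                for (y, x), ch in zip(cells, word):
                    grid[y][x] = ch
                placed[word] = cells
                break
    for y in range(rows):
        for x in range(cols):
            if not mask[y][x]:
                grid[y][x] = " "
            elif grid[y][x] is None:
                grid[y][x] = rng.choice(string.ascii_uppercase)
    return grid, placed
```

Words that cannot be placed within the attempt limit are simply left out of `placed`, so a caller can check which words made it into the puzzle.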
The first example is below and uses a Maltese Cross with a list of words related to Malta. Most of the words are place names but there are a few other things mixed in as well. For example, Pastizzi are one of my favourite Maltese foods.
I have uploaded the set of tweets I used to create the Iran Election Word Cloud to the wonderful Many Eyes and created a Phrase Net visualization for the data. This image below shows the net for the pattern word1 and word2. So, for example, the arrow connecting 'police' to 'riot' means there were lots of instances of the phrase 'police and riot'.
See below for the interactive version.
I have updated my Tweet Narrative about the Iran election. This one uses 141,000 tweets from the time period June 14-20th, 2009. I have also improved the algorithm that selects the characteristic tweets. The changes are difficult to describe succinctly but did reduce the number of tweets that started with 'RT'. This helps meet my primary goal of constructing a readable summary of the content. For this analysis I also only counted the first 10 tweets from any particular account which helps prevent the tweets from a few individual accounts from dominating the results.
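The per-account cap can be sketched as a simple filter, assuming the tweets are (account, text) pairs already in chronological order. The function name is illustrative.

```python
from collections import Counter


def cap_per_account(tweets, limit=10):
    """Keep only the first `limit` tweets from each account, preserving order."""
    seen = Counter()
    kept = []
    for account, text in tweets:
        if seen[account] < limit:
            seen[account] += 1
            kept.append((account, text))
    return kept
```

Applying this before any counting keeps a handful of very active accounts from dominating the word statistics.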
|WTF! They're bringing tanks on the streets in Tehran #iranelection *|
|@Change_for_Iran 5:17am people outside are burning Saderat bank building or as it seems from this far #iranelection *|
|@IranNewsNow HUGE NEWS!!!! CNN reports that GRAND AYATOLLA SANAI has issued FATWA to resist govt that steals #IranElection *|
|Iran supreme leader orders probe of vote fraud #iranelection *|
|BEST FILTER SHEKAN: www.julo.free4r.com/prox.html #IranElection *|
|Please postpone maintenance! #nomaintenance #iranelection *|
|Twitter Reschedules Maintenance Around #IranElection Controversy *|
|Iran has blocked "#iranelection" Use #Tehran or #Iranians *|
|#iranelection cyberwar guide for beginners *|
|unconfmd major incident at Azadi - shooting - fires - ppl running #Iranelection *|
|pls everyone change your location on tweeter to IRAN inc timezone GMT+3.30 hrs - #Iranelection - cont.... *|
|NYT publishing sensitive names of Iranians on Twitter. Get them to stop! #NYTfail #iranelection *|
|BLOCK @serv_ SPREADING MISINFOMATIONS #iranelection *|
|Tehran march TODAY 5pm - 7Tir Sq - Meydan 7 Tir - silent - sea of green - #Iranelection *|
|Show support for #iranelection add green overlay to your Twitter avatar with 1-click - http://helpiranelection.com/ *|
|news - Mousavi & Khatami have delivered joint letter to Ministry of Justice demanding release of protestors - #Iranelection *|
|"Change does not roll in on the wheels of inevitability, but comes through continuous struggle." -Dr.Martin Luther King #iranelection *|
|DOA Remix (Death of the Ayatollahs). Theme song for #IranElection www.myspace.com/revolutionofthemindhiphop *|
|Today - Sea of Green - Imam Khomeine Sq - 4pm - Tehran - All wear BLACK - we pray together - #Iranelection *|
|MOUSAVI - 25% inflation means IGNORANCE - THIEVING - CORRUPTION - where is the wealth of my nation? #Iranelection RT *|
|RT @andylevy BREAKING: Faulty #iranelection results attributed to Clerical errors. *|
|confirmed - Saeed Rajaie's (a prominent Iranian wartime martyr) wife has been arrested while praying in Qom - #Iranelection *|
|[Mashable] Facebook Releases Persian Translation for #IranElection Crisis http://tinyurl.com/kuzmc4 *|
|#iranelection Khamenei: (summery) (( correction )) Crowed yell: Death to england *|
|situation in Iran is now CRITICAL - nation is heartbroken - suppression is iminent - #Iranelection *|
|Mousavi's offices are trashed, Mousavi's staff in police custody, Mousavi is missing. #iranelection #gr88 #clarification *|
|#IranElection Must watch video & read transcript at the same time. Chills Pls RT after you watch. http://bit.ly/10qe5H *|
|whenwill we all stand together ascitizens of thewrld and demandour elected officials tohelp? one day wecould be in that crowd #iranelection *|
|Google Earth to update satellite images of Tehran #Iranelection http://twitition.com/csfeo *|
|Unconfirmed: Bomb Blast in Khomeini's shrine #iranelection *|
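The per-account limit mentioned above (only the first 10 tweets from any one account are counted) is easy to sketch in Python. The `(account, text)` tuple layout and the function name `cap_per_account` are assumptions for illustration, not the actual code behind the narrative.

```python
from collections import Counter

def cap_per_account(tweets, limit=10):
    """Keep at most `limit` tweets per account, preserving the input order.

    tweets: iterable of (account, text) pairs, assumed to be in time order.
    """
    seen = Counter()
    kept = []
    for account, text in tweets:
        key = account.lower()  # treat @Name and @name as the same account
        if seen[key] < limit:
            seen[key] += 1
            kept.append((account, text))
    return kept
```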
This is a Shaped Word Cloud created from the text of approximately 84,000 tweets containing the term #iranelection. The larger the word the more frequently it appears in the text. As usual you can click on a word to see the current twitter search results.
Feel free to follow JeffClark on Twitter to get more updates on my work.
The world is watching with great interest the demonstrations in Iran related to the recent election. The twittersphere is filled with discussion of the event and, of course, much of it is redundant. I have built a Tweet Narrative based on a collection of about 60,000 tweets containing the tag #IranElection. Basically, I divided the tweets into 30 groups based on the time they were published and then statistically selected the one tweet most representative of each time slot.
|RT @StopAhmadi WTF! They're bringing tanks on the streets in Tehran #iranelection *|
|We people of iran want peace! #CNNfail #iranelection *|
|RT @persiankiwi students being killed in tehran uni dorm in amirabad right now. this must stop. #Iranelection *|
|Follow @Change_for_Iran 5:17am people outside are burning Saderat bank building or as it seems from this far #iranelection *|
|RT @parinaz AhmadiN revoked all permits of foreign media & has instructed them to stop reporting or they will face jail time. #IRANelection *|
|Will you wear green tomorrow to support freedom in Iran? #iranelection #greenscream *|
|RT @greenscreamiran: World to wear green tomorrow for freedom in Iran. RT please. #IranElection #greenscream *|
|RT @IranNewsNow: HUGE NEWS!!!! CNN reports that GRAND AYATOLLA SANAI has issued FATWA to resist govt that steals #IranElection RT THIS *|
|RT @persiankiwi March is NOT CANCELLED today. Mousavi is in danger of being killed. #Iranelection *|
|RT @persiankiwi: March Started: ADVICE - carry photos of imam khomeini. they cannot shoot at us with these. #Iranelection *|
|RT @persiankiwi for later we need proxy address to upload film. we have no upload possibility now, can anyone help? #Iranelection *|
|RT @persiankiwi: Valli Asr st closed to traffic - tens of thousands marching - unbelievable sight. #Iranelection *|
|RT @herrcafe RT @phelo Telegraph reports of Iranian Interior Ministry leak that Ahmedinajad came in thir #IranElection - http://bit.ly/GGUy2 *|
|RT @persiankiwi: streets very dangerous now. groups of militia on motorbikes searching for protesters. #Iranelection *|
|RT @stephenfry Functioning Iran proxies 220.127.116.11:8080 18.104.22.168:808 22.214.171.124:808 126.96.36.199:8080 #IranElection *|
|RT @persiankiwi Gohardasht in Karaj - confirmed - people in street batles with militia - #Iranelection *|
|RT @IranRiggedElect: Please postpone Twitter maintenance #IranElection @twitter @ev @bs @ded @ej @lg @nk @rk @vl @al3x @stop #nomaintenance *|
|RT @nttajohn maintenance is postponed, twitter will be posting press release soon #nomaintenance #iranelection *|
|RT IRAN: we are moving location - seperating - situation in Tehran is tense - cant explain #Iranelection *|
|RT From Iran: CONF: #IRANELECTION tag/string is not filtered in #iran. Plz KEEP USING IT! #iran9 *|
|People in Iran, use https://twitter.com/ instead of http://twitter.com/ to avoid hashtag filtering #Iran9 #IranElection #tehran #iranians *|
|RT from inside Iran: rumour spreading Tehran - Army Generals have met in secret - Army considering position #Iranelection #iran9 *|
|RT @stephenfry @arashamel Pls get this out to your followers. #iranelection has been blocked in Iran. Switch to #Iranians , #Tehran, #Iran9 *|
|RT @stephenfry RT: pls get this out to your followers. #iranelection has been blocked in Iran. Switch to #Iranians , #Tehran, and #Iran9 ... *|
|RT @persiankiwi only official march today is valli asr. others may be a trap - avoid others - #Iranelection #gr88 *|
|#iranelection Iran has banned all foreign journalists from reporting on the sts. *|
|RT @twistedchick: RT URGENT: Army forces entering Tehran. Barricade streets where protests are on. Now. #iranelection #gr88 *|
|RUMOUR: the former prince of #Iran, Reza Pahlavi has announced returning to #Tehran in 36h. #IranElection #GR88 *|
|RT [redacted]: unconfmd major incident at Azadi - shooting - fires - ppl running #Iranelection #gr88 *|
|RT @PCMag: The U.S. State Department asked Twitter to delay downtime to help with #IranElection. *|
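The Tweet Narrative idea (split the tweets into equal time slots, then pick one representative tweet per slot) can be sketched in Python as follows. The post does not spell out the statistic used to choose the tweet, so the scoring here is an assumption: each tweet is scored by how widely its words are shared by the other tweets in the same slot. The function names and the numeric timestamps are also illustrative.

```python
import re
from collections import Counter

def words(text):
    """Lower-cased set of the words in a tweet."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def time_buckets(tweets, n_buckets):
    """Split (timestamp, text) pairs into n_buckets equal-length time slots.

    Timestamps are plain numbers (e.g. seconds); tweets must be non-empty.
    """
    times = [t for t, _ in tweets]
    start, span = min(times), max(times) - min(times)
    buckets = [[] for _ in range(n_buckets)]
    for t, text in tweets:
        # The latest tweet would land one slot past the end, so clamp it.
        index = min(int((t - start) / span * n_buckets), n_buckets - 1) if span else 0
        buckets[index].append(text)
    return buckets

def characteristic_tweet(texts):
    """Pick the tweet whose words are, on average, shared by the most tweets."""
    if not texts:
        return None
    doc_freq = Counter(word for text in texts for word in words(text))

    def score(text):
        tokens = words(text)
        return sum(doc_freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    return max(texts, key=score)

def narrative(tweets, n_buckets=30):
    """One characteristic tweet (or None) per time slot."""
    return [characteristic_tweet(bucket) for bucket in time_buckets(tweets, n_buckets)]
```

Averaging over the words in a tweet, rather than summing, keeps long tweets from winning just because they contain more words.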
I have posted a small update to the Twitter StreamGraphs application to make it more useful. Previously it used Twitter Search to get results for simple queries of the type 'from:twitterid'. Twitter Search currently only gives results going back about 14 days - it used to be much longer. For most people who don't tweet frequently this resulted in a poor quality streamgraph because there weren't many results to work with.
I'm now using the standard Twitter API to retrieve the tweets for any simple user query and it will graph up to a maximum of 1000 tweets regardless of how far back they go. The difference is shown below for Clay Shirky. The second image shows the new improved results which, for him, go back almost a year. The graph is much richer than the first one which can only draw on tweets from the last two weeks.
Here is the result of a Twitter Venn query for Iran, Iraq, and Afghanistan. The recent controversial elections in Iran have obviously grabbed a lot of attention in the Twittersphere. It's interesting that the number of tweets mentioning both Iran and Iraq is roughly the same as the number mentioning Afghanistan and Iraq even though tweets about Iran are so dominant.
Click on the image to see the current Twitter venn diagram for these three terms.
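The counts behind a three-term Venn diagram like this one can be sketched in a few lines of Python. This is only an illustration of the counting step, not the Twitter Venn code; the function name `venn_counts` and the whole-word matching rule are assumptions.

```python
import re
from collections import Counter

def venn_counts(tweets, terms):
    """Count tweets in each region of the Venn diagram.

    Returns a Counter keyed by the frozenset of terms a tweet mentions.
    Matching is on whole words, so 'iran' does not match 'iranian'.
    """
    patterns = {t: re.compile(r"\b" + re.escape(t.lower()) + r"\b") for t in terms}
    counts = Counter()
    for tweet in tweets:
        text = tweet.lower()
        present = frozenset(t for t, p in patterns.items() if p.search(text))
        if present:  # tweets mentioning none of the terms are outside the diagram
            counts[present] += 1
    return counts
```

The size of each circle is then the sum of all regions containing that term, and each overlap is the count for the matching combination.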
I recently made some improvements in my graph display code for a client and have used it to create a new graph showing the vocabulary relationships between many celebrities on Twitter. The post More Twitter Account Graphs explains a little about what the similarity is based on.
The central people in this set appear to be RyanSeacrest, PaulaAbdul, and TheEllenShow. The similarity score between Ryan and Paula is 19.8% and the top words connecting them together are: 'radio', 'game', 'guys', 'adam', 'movie', 'coast', 'studio', and their respective Twitter IDs.
Another interesting grouping is BarackObama, schwarzenegger, and timoreilly. The similarity score between Obama and Schwarzenegger is 16.7% with the top connecting words being 'health' , 'care', 'video', 'president', 'address', 'vote', and 'event'.
I included jtimberlake in the analysis as well but he was removed from the final graph because he wasn't connected strongly enough with anybody else. His closest match was only 4.5% and was with Oprah.
After my previous John Lennon Flower Portrait I had the Beatles on my brain and stumbled across a lovely set of photographs of beetles on COLOURlovers. I have tried creating an image of The Beatles using beetles but haven't yet come up with a decent design. Instead I made this beetle outline image from 24 different species. I have seen a lovely physical display of beetles arranged in this manner but I'm not sure where it was. It may have been at the Royal Ontario Museum.
Here is another way to look at Obama's speech in Cairo calling for A New Beginning with Muslims. It uses a standard node link graph to show which words were used near each other in the text. There are virtual springs connecting words that are used frequently together and forces pushing apart nodes so they don't overlap too much. The nodes in orange have been fixed to a certain location and the other nodes move based on the springs and forces until a stable configuration is reached. This allows us to stretch out the graph and easily see where terms lie along a spectrum between 2 or more words of interest.
This first view shows that there was more discussion of 'peace' than 'war' and that words like 'palestinian', 'israel', and 'god' were highly associated with 'peace' relative to the other highlighted words.
This second view below is of the same graph but with different words pegged in place. The terms 'nuclear' 'weapons' and 'united' 'states' are both closer to 'iran' than the other countries. Similarly, 'women' 'denied' 'equal' is more associated with 'afghanistan'.
An obvious way to improve these would be to use word stemming to combine different forms of the same word. For example, 'muslim' and 'muslims' would use one node, as would 'peaceful' and 'peace'. This would reduce the number of nodes and probably more clearly expose any relationships.
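To make the pinned-node idea concrete, here is a stripped-down Python sketch of a spring layout in which some nodes are fixed and the rest drift until the spring forces balance. It is a simplification and not the code used for the images: it leaves out the repulsive forces that stop nodes overlapping, the springs have zero rest length, and the function name, spring strengths and coordinates are invented for the example.

```python
def relax_springs(start, springs, pinned, step=0.1, iterations=500):
    """Move unpinned nodes until the spring forces on them roughly balance.

    start:   {node: (x, y)} initial positions
    springs: {(node_a, node_b): strength}, stronger for words used together more
    pinned:  set of nodes that stay exactly where they were placed
    The step must be small relative to the spring strengths or the
    positions will oscillate instead of settling.
    """
    pos = {node: list(xy) for node, xy in start.items()}
    for _ in range(iterations):
        force = {node: [0.0, 0.0] for node in pos}
        for (a, b), strength in springs.items():
            for axis in (0, 1):
                pull = strength * (pos[b][axis] - pos[a][axis])
                force[a][axis] += pull  # a is pulled toward b
                force[b][axis] -= pull  # b is pulled toward a
        for node, (fx, fy) in force.items():
            if node not in pinned:
                pos[node][0] += step * fx
                pos[node][1] += step * fy
    return {node: tuple(xy) for node, xy in pos.items()}

# Hypothetical strengths: 'israel' is tied three times more strongly to 'peace'.
layout = relax_springs(
    start={"peace": (0.0, 0.0), "war": (10.0, 0.0), "israel": (5.0, 3.0)},
    springs={("israel", "peace"): 3.0, ("israel", "war"): 1.0},
    pinned={"peace", "war"},
)
```

With these made-up numbers the free node settles a quarter of the way from 'peace' to 'war', which is the kind of position along a spectrum that the pinned views are meant to reveal.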
Obama just delivered a speech in Cairo calling for A New Beginning with Muslims. Here is a StreamGraph prepared from the text. It does a reasonable job of illustrating which major themes were covered at the various points in the speech.
datavisualization.ch has a quick review of a new Google offering called Google Squared. It allows you to see the results of a query organized in a table. One of the suggested queries is 'dog breeds' which seemed to work pretty well. The next one I tried was 'mammals' and it seemed OK as well until I looked more closely at the images shown for 'jaguar' and 'wolverine'...
Dave Winer recently investigated Who do the people of Twitter follow?. He looked at which twitter accounts were followed by the most employees of Twitter and was curious about how that might be related to the accounts suggested to new Twitter users when they sign up.
His idea sparked one of my own - what are the relationships between Twitter employees themselves with respect to similarity of the vocabulary used in their tweets ? Here is the graph created using the same layout technique described in my recent post Twitter Account Graphs.
As a whole, the group of twitter employees seems to be well connected based on this vocabulary similarity metric. There are a few people floating around on their own - thuske, akshay_abd, jeremy, lukester, and em33. There is also a doublet separated from the others - keerthi and mikelimondba. They both only have about 40 tweets so this link is more tenuous than the others which are based on the latest 200 tweets. The bottom right shows a fairly cohesive subgroup connected to most of the rest through ej or perhaps mzsanford/abdur. Co-founder biz seems to be a more central figure by this measure than CEO ev.
WeFollow has quickly become one of the primary directories of Twitter users. The site lets people assign up to 3 tags to their own account in order to describe their interests. People visiting WeFollow can then see for each tag the list of matching accounts sorted by number of followers.
When you categorize yourself on WeFollow, it sends out a tweet to all your followers having the form: 'Just added myself to the http://wefollow.com twitter directory under: #tag1, #tag2, #tag3'. This automatic viral message has helped WeFollow spread across the twittersphere. Some people have complained that they see too many of these and call them spam. Personally, I find it interesting to see how the people I'm following classify themselves.
These automatic registration messages can be tracked using Twitter Search and reveal lots of information about WeFollow that isn't publicly available on their own site. I have analyzed the set of WeFollow registration tweets for the two month period Mar 28 - May 28, 2009. There were 144,506 tweets matching my search pattern in this time frame, or roughly 2400 new people added to the directory per day. Here is the graph over time:
The peak during this time frame occurred at the end of March and was about 6000. The time period for the analysis was shortly after the WeFollow launch which likely accounts for the rough gradual decline shown. It would be nice to see the data for the launch date but unfortunately limitations in Twitter Search prevent me from accessing this data. There appears to be a new peak showing up at the end of May and there are two obvious troughs around April 10th and 22nd. I've checked other data streams I'm monitoring and they don't show troughs or 'holes' during these two dates so it looks pretty likely that there was a problem with WeFollow infrastructure during those periods rather than it being a data collection problem.
The main page of WeFollow shows the 'top tags' but bases this on the number of followers of the people using those tags rather than the tag count itself. Which tags are actually used most often ? An analysis of our sample gives this graph:
The top three tags by follower count on the WeFollow site are Celebrity, TV, and Entrepreneur. When ranking instead by the number of people who actually self-assign these tags, Celebrity ranks 12th, TV 44th, and Entrepreneur 3rd. This shows quite clearly that the average account tagged Celebrity or TV has more followers than, say, those tagged with Blogger.
The WeFollow registration tweets also show which tags are used together. I've constructed a couple of different types of graphics to illustrate the tag similarity relationships. This first one is a Clustered Word Cloud and shows colored groups of tags that are frequently used together. The big blue group in the middle seems to contain many of the most frequently used tags and doesn't appear particularly cohesive. Many of the others do, at least subjectively, seem to make sense. Here are a couple of example clusters from the image: (church, conservative, christian, pastor, tcot) , (publishing, poetry, books, writing, poet).
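Counting how many people self-assign each tag only needs the registration tweets themselves. Here is a small Python sketch that relies on the message format quoted earlier ('... twitter directory under: #tag1, #tag2, #tag3'). The regular expression and the function names are my own illustration, and the sample tweets in the usage are invented.

```python
import re
from collections import Counter

# Everything after "wefollow.com twitter directory under:" is the tag list.
TAG_LIST_RE = re.compile(r"wefollow\.com twitter directory under:(.*)", re.IGNORECASE)

def extract_tags(tweet):
    """Return the lower-cased tags from a WeFollow registration tweet."""
    match = TAG_LIST_RE.search(tweet)
    if not match:
        return []
    return [tag.lower() for tag in re.findall(r"#(\w+)", match.group(1))]

def count_tags(tweets):
    """Number of registration tweets using each tag (a tag counts once per tweet)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(set(extract_tags(tweet)))
    return counts

registrations = [
    "Just added myself to the http://wefollow.com twitter directory under: #blogger, #tech, #music",
    "Just added myself to the http://wefollow.com twitter directory under: #Blogger, #celebrity",
    "Unrelated tweet about #tech",
]
ranking = count_tags(registrations).most_common()
```

Sorting by these counts instead of by follower totals is what moves a tag like Celebrity down the list.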
This last image was created using the same layout technique as my recent Twitter Account Graphs. Basically, the tag nodes are positioned near others that they are 'similar to' in the sense that they are often used together.
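The raw material for both tag graphics is a count of how often two tags are chosen by the same person. A minimal Python sketch of that counting step is below; the clustering and layout are separate steps not shown here, and the function name and sample tag lists are hypothetical.

```python
from collections import Counter
from itertools import combinations

def tag_cooccurrence(tag_lists):
    """Count how many people used each pair of tags together.

    tag_lists: one list of tags per registration tweet.
    Pairs are stored in alphabetical order so (a, b) and (b, a) are the same key.
    """
    pairs = Counter()
    for tags in tag_lists:
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs

# Invented sample data for illustration.
sample_lists = [
    ["church", "christian", "tcot"],
    ["christian", "church"],
    ["poetry", "books", "writing"],
]
pair_counts = tag_cooccurrence(sample_lists)
```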
The world is watching carefully the things happening in North Korea and there are lots of tweets discussing the issue. I have created a Shaped Word Cloud using 4000 tweets from the last few days and using the North Korean flag as a template. As usual you can click on a word to see the current twitter search results.
Here is another graph showing a larger set of twitter accounts and their relationships based on a measure of shared vocabulary. The middle left cluster contains many Twitter accounts who discuss web technology including Twitter itself. I'm familiar with many of these accounts and know that the ones around my own icon ( JeffClark ) discuss data visualization (eagereyes, flowingdata, datavis, infosthetics). At the bottom right is a cluster of accounts that I follow which are focused on computational art (blprnt, flight404, toxi, mariuswatz, golan, reas, natzke). The group at the very top contains accounts with an interest in music or entertainment.
To create this graph I'm connecting nodes with a virtual spring if their similarity was greater than 9%. The stronger the similarity the shorter the spring. There are also long springs connecting extremely dissimilar nodes to push them apart but these are not shown. I've tried to avoid the usual tangled mess by not connecting nodes of medium similarity and also by only connecting two nodes if the link is one of the three strongest for either node.
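The rule for deciding which springs to draw can be sketched in Python like this. I am reading 'one of the three strongest for either node' as 'in the top three of at least one of the two nodes', which is an interpretation; the function name and the dictionary layout are also just for illustration, and the long repelling springs are left out.

```python
from collections import defaultdict

def select_edges(similarity, threshold=0.09, top_k=3):
    """Choose which pairs get a visible spring.

    similarity: {(node_a, node_b): score}, each pair listed once.
    A pair is kept if its score exceeds the threshold and the link is
    among the top_k strongest links of at least one of its two nodes.
    """
    links = defaultdict(list)
    for (a, b), score in similarity.items():
        links[a].append((score, b))
        links[b].append((score, a))
    strongest = {
        node: {other for _, other in sorted(pairs, reverse=True)[:top_k]}
        for node, pairs in links.items()
    }
    return [
        (a, b, score)
        for (a, b), score in similarity.items()
        if score > threshold and (b in strongest[a] or a in strongest[b])
    ]
```

Dropping the medium-strength links is what keeps the picture from turning into the usual tangle, at the cost of hiding some real but weaker relationships.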
At the end of the previous post, Tweet Stream Similarity, I suggested using a network graph to visualize the similarity relationships between the twitter accounts. Here is such a graph for the same small set of accounts I looked at before:
It nicely shows the small group of technology-related accounts (techcrunch, timoreilly, cshirky), the (britneyspears, mariahcarey) entertainment link, and the fact that the nfl account is not closely related to these others. It's interesting that the twitter ceo, ev, is connected to both the technology group and the entertainment group.
The mariahcarey link to the nba surprised me a bit and I looked into the details. Some of the shared words that caused the link are 'basket' ( as in Easter basket for mariah, and basketball basket for the nba) , and 'shoot' ( as in photo shoot for mariah and shoot the ball for the nba). It's obvious my metric will confuse different senses of the same word. There are many other shared words between these two accounts like friends, guys, baby, twitter, vegas, and everybody. I'm currently using the latest 200 tweets for each user in the analysis. Using more tweets might give better results.
In my recent Twitter Spam post I showed two Twitter accounts that had an almost identical set of tweets. Being able to detect this situation automatically might have obvious benefit in detecting invalid accounts that should be disabled. We can do this by calculating a text similarity measure between the set of tweets coming from the two accounts. A high degree of similarity (say > 80%) is suggestive of automated duplication. This, coupled with some other likely indicators of spam (lots of links to commercial websites, high rate of updates, very low followers/following ratio, lots of followers showing spam-like behaviour) should be good enough for Twitter to find lots of spam accounts automatically.
A tweet stream similarity metric has some other potential uses as well. Given a set of accounts, we could group them into clusters based on similarity of tweet content. Or we could help a twitter user find new people to follow that seem to have shared interests based on tweet content.
There are lots of different functions that can be used to calculate text similarity. The current one I have designed is based on word frequency and excludes standard stop words (the,of,and...) , ignores URLs, ignores some words extremely common in tweets (RT, via), and discounts some other words found often in tweets (like, good, day, thanks...) . This metric can be refined over time and is fairly crude. It completely ignores word order for example and does not consider the semantics of the text at all. I'm hoping it is useful for detecting similarities at a broad topical level.
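As a rough illustration of a metric along these lines, here is a Python sketch that builds a weighted word-frequency vector for each account and compares the two with cosine similarity. The choice of cosine similarity, the 0.5 discount, and the short word lists are all assumptions made for the example; the actual function behind the scores in these posts is not spelled out.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on"}  # abbreviated list
IGNORED = {"rt", "via"}                       # extremely common in tweets
DISCOUNTED = {"like", "good", "day", "thanks"}
DISCOUNT = 0.5                                # assumed weight for discounted words
URL_RE = re.compile(r"https?://\S+")

def term_weights(tweets):
    """Weighted word counts for one account's tweets."""
    weights = Counter()
    for tweet in tweets:
        text = URL_RE.sub(" ", tweet.lower())  # ignore URLs entirely
        for word in re.findall(r"[a-z']+", text):
            if word in STOP_WORDS or word in IGNORED:
                continue
            weights[word] += DISCOUNT if word in DISCOUNTED else 1.0
    return weights

def similarity(tweets_a, tweets_b):
    """Cosine similarity between two tweet streams, from 0.0 to 1.0."""
    wa, wb = term_weights(tweets_a), term_weights(tweets_b)
    dot = sum(wa[w] * wb[w] for w in wa.keys() & wb.keys())
    norm_a = math.sqrt(sum(v * v for v in wa.values()))
    norm_b = math.sqrt(sum(v * v for v in wb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Like the metric described above, this ignores word order and meaning, so two accounts using 'shoot' in different senses would still look alike.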
I have used my metric to calculate the tweet stream similarity between all pairs of 9 fairly well known twitter personalities. I used the last 200 tweets from each account for the analysis with the exception of britneyspears who only has 144 at this time. The lowest similarity score was 2.8% for ev (the twitter ceo) vs nfl (news about the National Football League). The highest was 20.3% and was between cshirky (Clay Shirky - American writer, consultant and teacher on the social and economic effects of Internet technologies) and timoreilly (Tim O'Reilly - founder and CEO of O'Reilly media). The highest score for THE_REAL_SHAQ ( Shaquille O'Neal ) was with the nba twitter account. The highest score for MariahCarey was with britneyspears. The metric seems to be doing a reasonable job. Here is the complete list:
An obvious next step is to use a better way to visualize this information. I'm thinking of using a network layout with nodes positioned closely and connected for high similarity scores and positioned far apart for low similarity scores. I'm hoping that it would illustrate nicely any structure within the group.
I have taken the collection of tweets I gathered for the American Idol StreamGraph and run them through my tool for creating a Characteristic Tweets Summary to produce the following output. My initial attempt included some obvious spam tweets so I had to refine my technique a little bit. Basically, a twitter spammer who repeated the same text over and over was highly likely to have one of their tweets selected as the 'characteristic tweet' for the time period containing the spam. The refinement was to only analyze one tweet per user per time period.
In the output table I also de-emphasized the twitter account for each tweet since they are statistically selected to be representative of an aggregate. The trailing '*' is a link to the original tweet which, of course, shows the proper attribution.
|May 03, 2009||American Idol winner David Cook's brother dies of cancer. *|
|May 04, 2009||'American Idol' star David Cook's brother Adam dies of brain cancer! *|
|May 05, 2009||getting ready to watch american idol. *|
|May 06, 2009||Headed home for american idol *|
|May 07, 2009||very mad because Allison Iraheta got off American Idol *|
|May 08, 2009||tickets for the american idol tour go on sale saturday @ 10!!!!!!!!! *|
|May 09, 2009||Just got tickets to the American Idol tour!!!! *|
|May 10, 2009||Tickets for the American Idol 2009 Summer tour on Sale|Tour Dates ... http://tinyurl.com/rdmcyl *|
|May 11, 2009||Can't wait to see American Idol!!!! *|
|May 12, 2009||getting ready to watch American Idol *|
|May 13, 2009||American Idol i'm waiting for who is going home tonight !!!! *|
|May 14, 2009||@jordanknight who cares about american idol...you're my american idol! *|
|May 15, 2009||RT @kingsthings: who do you want to win American Idol? *|
|May 16, 2009||What is the difference between the American Idol and Eurovision? *|
|May 17, 2009||Clouds on horizon for "American Idol" juggernaut? (Reuters) http://ow.ly/7q1O *|
|May 18, 2009||britney to perform on American Idol finale? *|
|May 19, 2009||getting ready to watch american idol. come on,kris! *|
|May 20, 2009||American Idol finale!!!! come on kris!!! even though adam has it, i really want you to win!!!! *|
|May 21, 2009||Kris won the american idol *|
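The spam refinement described above (only one tweet per user per time period goes into the analysis) could look something like this in Python. The `(period, user, text)` tuple layout and the function name are assumptions for the sketch, and which of a user's tweets to keep is not specified in the post; here the first one in each period wins.

```python
def one_tweet_per_user_per_period(tweets):
    """Keep only the first tweet from each user within each time period.

    tweets: iterable of (period, user, text) tuples in chronological order,
    where period is any hashable label such as a date string.
    """
    seen = set()
    kept = []
    for period, user, text in tweets:
        key = (period, user.lower())
        if key in seen:
            continue  # this user already contributed a tweet in this period
        seen.add(key)
        kept.append((period, user, text))
    return kept
```

A spammer repeating the same text fifty times in a day then counts once, so the repeated text no longer dominates the word statistics for that day.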
Sorry - I couldn't resist. The fish images are Reef Fish of the Commonwealth of the Northern Mariana Islands and the tank outline comes from the free font Tanks-WW2.
Here is a Twitter StreamGraph created from the query "American Idol" OR #idol in the date range of May 3-21, 2009. I had to use a custom version of my tool that used tweet data harvested in a different manner from the online version which is limited to viewing the last 1000 tweets only. Given such a popular topic, 1000 tweets only goes back a few minutes and is uninteresting.
A couple of observations:
It would be interesting to see the graph for a longer time period but Twitter Search is currently only returning data that goes back around 21 days.
I recently stumbled across a collection of text art creations at The Gawno Magazine. Those of you who have enjoyed my Einstein Word Portrait or other designs created from text might find it interesting. A few sample designs are shown below. See the larger versions including references to the original art at Micrography: Text Art and Typography
Yesterday I described how I stumbled across a set of twitter accounts obviously being used for spam. I also mused that it shouldn't be that hard to detect them algorithmically. Well, I happened to check them today and found that Twitter reports they have been 'suspended due to strange activity' ! The accounts had existed for quite some time since they had sent out over 1000 updates and had a substantial number of followers. I suspect Twitter likely detected them automatically and shut them down as part of a regular process but it does seem a bit of a coincidence that they were shut down so shortly after I wrote about them...
I was looking at the Twitter StreamGraph for 'Star Trek' a little while ago and noticed an unusual pattern. There was a peak caused by many users sending the exact same tweet which contained a long list of trending hashtags that were otherwise unrelated - #googlefail, #whyitweet, #hubble, #star trek, #gmail etc. The tweet actually does vary slightly in that a different ow.ly URL is used but they all lead to the same place. It's obvious twitter spam carefully constructed to catch the attention of people following the trending terms.
Here are snapshots of two of the accounts showing that their last 6 tweets were identical. They have different numbers of followers with the one account acquiring an impressive 924 - more than I have. Presumably there is a large set of spam accounts and many follow each other. Other characteristics that seem to suggest spam besides the redundancy are no evidence of @replies and the fact that every single tweet seems to mention a product name and include a link.
I suspect it wouldn't be too hard to detect these algorithmically.
Here are a couple of venn diagrams created with Twitter Venn for some topics in the news. The first shows H1N1 vs 'swine flu' and it clearly shows that the less technical name is occurring much more frequently in tweets and also that there is a fair amount of overlap. The second example compares 'star wars' with 'star trek' and has a very similar appearance to the first. I'm surprised that with the launch of the new Star Trek movie it doesn't dominate references to Star Wars even more although it does have about a 5-6 x advantage right now. It may be because there was some discussion recently on twitter about the many plot parallels between the new Star Trek movie and the original Star Wars. Notice in the word cloud for tweets containing both terms the high visibility of 'rips' and 'off'.
Click on the images to see the current diagrams inside the interactive tool.
A wonderful example of a composite logo is that of Unilever, one of the world's largest consumer goods companies. There are 25 small icons put together to form a large 'U'. Here is a description of the various icons and what they represent.
Just for fun I've taken the individual icons and rearranged them in a few different ways. Below, see Unilever Man, Woman, and Baby.
The outline icons came from AIGA Signs and Symbols.
Happy Mother's Day to all the moms out there ! Here are a couple of simple designs to celebrate. The first was created with my recent custom tool for filling space with images and the second was made using the old Word Hearts application. You can use it to create a customized version with your own words and colors.
I've been thinking lately about composite images that are built from smaller sub-images as in my recent Butterfly Falcon and Butterfly Plane. While wandering in the store yesterday I saw some notepads with some interesting composite image cover designs. I've found the designers online at illo Art. A couple of their designs are shown in small form below.
Integra-Magazine is a biannual popular journal on Integrative Tourism and Development, published by respect, an Austrian based NGO. I recently gave permission for them to use my World Peace image on the cover of their next issue which has the theme of Peace/Tourism and Conflict. It came off the press recently and the cover image is shown below. The site is in both English and German.
This is a different kind of spider man. The image was generated with custom code written in Processing that is a variation on the code used for my Word Portraits. I was inspired by Quasimondo (Mario Klingemann) as mentioned in my last post to experiment with more complex constituent images and image rotations. Source images are Man Silhouette and Spider.
The excellent computational artist known as Quasimondo (Mario Klingemann) has posted an interesting set of images to Flickr that he created with an algorithm he calls 'image foam'. The technique has some similarities to what I do to generate some of my images - World Peace and Einstein, for example. The base concept is to fill 2D space using component images (or letters) without any overlap. Quasimondo has used more complex and colourful constituent images and placed them in a more varied and interesting manner than I have. Smaller versions of a few of his images are shown below - click on them to see his originals.
Here is another example of the Characteristic Tweets idea. The troubles of GM and Chrysler have been in the news for some time now and have been widely discussed in the twittersphere. I have a personal connection to Chrysler having grown up in Windsor, Ontario where they are a major part of the economy and I still have family members who work there.
I have analyzed six months of tweets containing 'chrysler' for the time period Nov 1, 2008 until Apr 30, 2009 - around 66,000 in all. Rather than finding a characteristic message for every day I have split the set into 20 equal-time periods and found the most representative for each period. It tells the sad story fairly well I think. Let's hope if I repeat the exercise in another six months that it has a happier ending.
|Nov 02, 2008||odanielpavon: No big sellers in sight to save troubled Chrysler (AP): AP - In crises past, Chrysler has somehow managed to stamp out a b..|
|Nov 13, 2008||reutersbiz: Goldman suspends GM rating, Chrysler urges aid: DETROIT (Reuters) - Goldman Sachs suspended its rating.. http://tinyurl.com/6pwcgo|
|Nov 21, 2008||mayankchandak: Chrysler's Web Edition vehicle package: includes WiFi, iPod touch and a Dell Mini 9: Chrysler has been toying with in-car ..|
|Dec 05, 2008||odanielpavon: Senators grill auto CEOs, eye GM-Chrysler deal (Reuters): Reuters - The chief executives of General Motors Corp and Chrysl..|
|Dec 12, 2008||michaelreuter: Chapeau! US Senate "No bailout for GM, Ford, Chrysler"|
|Dec 17, 2008||nishachittal: Is chrysler really closing all its plants for a month??|
|Jan 03, 2009||odanielpavon: Chrysler gets $4 billion U.S. government loan (Reuters): Reuters - Chrysler LLC on Friday received an initial $4 billion emergency loa..|
|Jan 05, 2009||studentoflife: Chrysler U.S. December sales drop 53%|
|Jan 20, 2009||karlturnbull: fiat to buy 35% stake in chrysler|
|Jan 26, 2009||magneda2: Reuters: Chrysler urges dealers to order cars, cut costs: NEW ORLEANS (Reuters) - Chrysler LLC on Sunday.. http://tinyurl.com/b46y2x|
|Feb 05, 2009||dugg: GM, Chrysler offer to buyout nearly 100% of hourly workers: General Motors is offering buyouts to virtually all .. http://tinyurl.com/c2sw64|
|Feb 14, 2009||googlenewsbiz: GM, UAW talks break off Chrysler talks stall - Reuters:|
|Feb 18, 2009||latimesnews: GM, Chrysler seek billions more in federal loans: General Motors asks for $9.1 billion to $16.6 billion and says.. http://tr.im/grbq|
|Feb 26, 2009||feedsontap: Acquisition Chrysler company_beingacquired Fiat SpA company_acquirer|
|Mar 12, 2009||MobileAuto: Chrysler threatens Canada pull out - The Associated Press|
|Mar 20, 2009||wopularall: Fiat says it won't assume Chrysler debt in deal http://ff.im/-1CV1s|
|Apr 01, 2009||alexanderwatson: Obama: Bankruptcy only option for GM and Chrysler.|
|Apr 08, 2009||toledonews: Chrysler rolls out new Jeep Grand Cherokee after government scolding http://ff.im/-20515|
|Apr 15, 2009||borgellaj: Fiat CEO warns Chrysler unions: cut costs or we walk|
|Apr 30, 2009||SecurityCanada: Chrysler will file for Chapter 11 bankruptcy|
There are huge numbers of Twitter status messages being created every day. I've been tracking tweets containing the word 'obama' for more than 250 days now and on average there are more than 10,000 tweets/day. There is so much data that it can be overwhelming to try and extract useful information. The nature of the twitter platform means that any useful information for a particular topic is highly fragmented. There is also a large amount of redundant information especially since so many tweets are actually 'retweets'.
Can we construct something approaching a narrative from all the bits? Can we eliminate much of the redundancy? I've started to tackle this problem with the following approach:
As an example I have analyzed a sample of tweets taken from Obama's first 100 days in office. The table below shows the characteristic tweet for each day. I used every 25th tweet containing 'obama' in the time period and discarded non-English tweets. This left me with approximately 75,000 tweets for the analysis. It seems to work fairly well. You can read through them and get a pretty good summary of the various Obama-related events that have recently occurred.
|Jan 20, 2009||charlesta: watching Barack Obama's inauguration on TV|
|Jan 21, 2009||francis_gt: watching obama's inauguration speech|
|Jan 22, 2009||GeorgeReese: Obama retakes the oath of office tonight :)|
|Jan 23, 2009||Hops11: Obama overturned global gag rule! YES!|
|Jan 24, 2009||PoliticsFix: Obama reverses abortion-funds policy - http://is.gd/h1VQ - WFIE-TV|
|Jan 25, 2009||odanielpavon: Some global adversaries ready to give Obama chance (AP): AP - In his inaugural address, President Barack Obama signaled conciliation t..|
|Jan 26, 2009||dustytrice: Breaking: Obama will direct EPA to move swiftly to grant 14 states the right to set strict auto emission standards on Mon (via @Populista)|
|Jan 27, 2009||nyycarl07: @ricksanchezcnn Obama's Al-Arabiya interview/Mitchell Mideast visit...mending fences with the Arabic world..meaningful dialog..long overdue!|
|Jan 28, 2009||YahooNews: Obama open to compromise on $825B stimulus bill (AP)|
|Jan 29, 2009||keramurphy: Obama signed the Lilly Ledbetter Equal Pay Bill. Love it.|
|Jan 30, 2009||binikadwa: Even Obama's rooting for the Steelers|
|Jan 31, 2009||bigkumadog: Obama's half brother arrested on charge of drug possession: NAIROBI, Kenya - George Obama, the half brother of U.. http://tinyurl.com/dzazy8|
|Feb 01, 2009||wbaustin: Obama Takes Jab at Chief of Staff at Alfalfa Club Dinner: President pokes fun at his volatile chief of staff Rah.. http://tinyurl.com/cbhkrd|
|Feb 02, 2009||caerickson: Rooney just thanked Obama for supporting the Steelers!|
|Feb 03, 2009||Headline_News: Daschle withdraws as HHS nominee: Former Sen. Tom Daschle has asked President Obama to withdraw his nomination f.. http://tinyurl.com/d66eaj|
|Feb 04, 2009||idigg: Obama To Cap Executive Pay At $500K For Bailout Recipients http://tinyurl.com/abqpq2|
|Feb 05, 2009||gregspradlin: Reading about Fairey and AP......AP alleges copyright infringement of Obama image .. http://tinyurl.com/czxvat|
|Feb 06, 2009||nelking: @joshcagan Headline: "Senate Struggles on Stimulus in Nighttime Session" Related news: Obama adds Dr. Ruth to Economic Advisory Board|
|Feb 07, 2009||latimesnational: Artist of famed Obama poster arrested in Boston: Police in Boston say the artist famous for his "Hope" posters o.. http://bit.ly/FPN6|
|Feb 08, 2009||inaug: #Inauguration Lompoc man has front row seat at Obama inauguration - Lompoc Record: Lompoc man has f.. http://tin.. http://tinyurl.com/bm ...|
|Feb 09, 2009||ElkhartTruth: Obama: "We've got the best workers right here in Elkhart." #obamaelkhart|
|Feb 10, 2009||jclayiv: watching the obama press conference|
|Feb 11, 2009||fwstylewatch: breaking... michelle obama's march vogue cover finally unveiled!|
|Feb 12, 2009||Love_The_Oscars: Obama praises Lincoln's legacy at Ford's Theatre|
|Feb 13, 2009||Politisite: Republican Senator Judd Gregg withdraws as Obama's Commerce Pick over conflict on stimulus #tcot|
|Feb 14, 2009||NewsOnTwitter: MSNBC - Obama: Stimulus bill is 'major milestone': President Barack Obama, savoring his first major victo.. http://tinyurl.com/cvv6gc|
|Feb 15, 2009||lemonhed77: news update Air Force One is one 'spiffy ride,' Obama says: It's longer than a hockey rink, has two f.. http://tinyurl.com/b8wky4|
|Feb 16, 2009||imacsweb: Obama decides on task force to oversee auto industry reform rather than appoint "car czar" http://tinyurl.com/cv66z3|
|Feb 17, 2009||keyc: Pres. Obama Signs Stimulus Bill in Denver | http://keyc.tv|
|Feb 18, 2009||timesnews: Obama to unveil mortgage foreclosure plan http://www.timesoftheinternet.com/47845.html|
|Feb 19, 2009||caniba: Obama goes to Ottawa, ON, Canada and what do the Internets call it? #Obamawa -- I don't say this enough but... I love you Internets.|
|Feb 20, 2009||ThomasGalvin: thinks its funny that Obama is lecturing mayors to "spend wisely"|
|Feb 21, 2009||roadkillrefugee: Obama's Weekly Video Address: Quickest & Broadest Tax Cut EVAH! http://tinyurl.com/dxdg7b|
|Feb 22, 2009||IvorKellock: Obama aims to halve deficit by 2013 http://ff.im/-1aRkZ|
|Feb 23, 2009||AccordionGuy: Sasha Obama Keeps Seeing Creepy Bush Twins While Riding Tricycle Through White House: http://is.gd/kyi1|
|Feb 24, 2009||sumbonet: NewsOnTwitter: BBC NEWS - Japan PM visits Obama White House: Japan's Prime Minister Taro Aso will be the first... http://ff.im/-1bNY1|
|Feb 25, 2009||amyz5: For those who missed my post speech commentweet last night: Obama is to Jindal as Dylan is to the Jonas Brothers. #nsotu|
|Feb 26, 2009||neilkelty: Disappointed in President Obama's budget.|
|Feb 27, 2009||profchandler: RT: @NewsHour: At 11:45 Obama will address Marines at Camp Lejeune.expected to announce withdrawal of U.S. combat forces from Iraq Aug 2010|
|Feb 28, 2009||headlinenews: AP: Obama moved toward commanders in Iraq decision: WASHINGTON (AP) -- President Barack Obama leaned heavily .. http://tinyurl.com/chavyl|
|Mar 01, 2009||ReddingNews: Data on Obama's Helicopter Breached Via P2P?: Tiversa, headquartered in Cranberry Township, Pa., reportedly disc.. http://tinyurl.com/cd28gf|
|Mar 02, 2009||thebodybreaks: Obama nominates Gov. Sebelius for health post: Kansas Gov. Kathleen Sebelius, President Obama's nominee to head .. http://tinyurl.com/d989ha|
|Mar 03, 2009||atifunaldi: Sources: Obama to shelve species rule|
|Mar 04, 2009||TechGlance: Obama taps Julius Genachowski to head the FCC http://tr.im/h10T|
|Mar 05, 2009||leeharveydent: Watching CNN: Obama's Rx for health care reform.|
|Mar 06, 2009||news_by_robots: Obama to Lift Ban on Funding for Embryonic Stem Cell Research @Washington_Post|
|Mar 07, 2009||caketeagirl: Pleased about Obama's decision to reverse Bush's limits on stem cell research|
|Mar 08, 2009||ftantillo: "The Rock" Obama on SNL = awesome|
|Mar 09, 2009||Atticus_James: yay obama and stem cell research!|
|Mar 10, 2009||HootieMcBoob: Go Obama on the stem cell research! WOOOT! :D|
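The exact steps used to choose each day's characteristic tweet are not spelled out here, so the following is only a rough sketch of one plausible way to do it, not the method behind the table above. It assumes the tweets arrive as `(day, text)` pairs and scores each tweet by how common its words are among that day's tweets. The function names and the scoring rule are illustrative assumptions.

```python
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def characteristic_tweets(tweets):
    """Pick one tweet per day from an iterable of (day, text) pairs.

    Assumed scoring rule: the average, over a tweet's distinct words, of the
    number of that day's tweets containing the word. Highest score wins.
    """
    by_day = defaultdict(list)
    for day, text in tweets:
        by_day[day].append(text)

    result = {}
    for day, texts in by_day.items():
        # Document frequency: how many of the day's tweets contain each word.
        doc_freq = Counter(w for t in texts for w in set(tokens(t)))

        def score(t):
            words = set(tokens(t))
            return sum(doc_freq[w] for w in words) / len(words) if words else 0.0

        result[day] = max(texts, key=score)
    return result
```

Averaging over distinct words keeps long tweets from winning simply because they contain more words. Removing stop words before scoring would probably be needed on real data.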
Brain Pickings just built a typographic visualization using Wordle based on the title text from the various TED talks. If you don't know about TED already then be sure to check it out. They provide 'riveting talks by remarkable people, free to the world' and it's some of my favourite content on the web.
Brain Pickings used the title text from this spreadsheet to generate their cloud. I've taken both the title and summary text to produce my own shaped word cloud based on their logo. Click on a word to see the related talks, pick one and then watch it!
Yesterday President Obama delivered an address to the National Academy of Sciences. I am a strong believer in the critical importance of science and technology as a means of improving the average quality of life in our world and it was refreshing to hear from a president who believes the same. Here are a few snippets:
At such a difficult moment, there are those who say we cannot afford to invest in science. That support for research is somehow a luxury at a moment defined by necessities. I fundamentally disagree. Science is more essential for our prosperity, our security, our health, our environment, and our quality of life than it has ever been.
we are restoring science to its rightful place ... Under my administration, the days of science taking a back seat to ideology are over. Our progress as a nation – and our values as a nation – are rooted in free and open inquiry. To undermine scientific integrity is to undermine our democracy.
The streamgraph below was created from the complete text of the speech. Click on it to see a high resolution PDF version.
I did a fair number of posts last year that analyzed various texts related to the US election. A number of different techniques were used including StreamGraphs , Speech Contrast Diagrams, an interactive transcript visualizer, and, of course, word clouds. I introduced Martin Krzywinski in my last post as the creator of Circos. Martin has also done some excellent work in the area of lexical analysis and visualization of text in the post Lexical Analysis of 2008 US Presidential and Vice-Presidential Debates — who's the Windbag?
Here is a portion of one of his graphics that illustrates thematic profiles for Obama and McCain during a debate. It has some conceptual similarity to my interactive transcript visualizer.
These word clouds below were created by Martin and use different colours to show the words spoken uniquely by Obama in green, uniquely by McCain in blue, and by both men in white. The first one shows nouns and the second is limited to adjectives. I think the idea of limiting the cloud to a particular part of speech is a fruitful one to explore.
In the same document Martin also formulates and calculates an interesting 'windbag index' that is a composite of measures of repetition in various aspects of speech.
FlowingData recently had an interesting guest post about an alternative way of visualizing tabular data. It was by Martin Krzywinski and featured his visualization tool called Circos. Circos can produce a wide variety of information-rich, radial-based diagrams.
Some of the comments on FlowingData were quite negative and inspired a follow-on post by Nathan titled Narrow-minded Data Visualization. His post and the many related comments are interesting reading for those who care about data visualization and the tension between traditional/novel, expert/amateur, and cautious/exuberant approaches.
Some of these diagrams are very information-dense and might be a challenge to decode for those without much experience in interpreting them but I believe they are likely a powerful technique in the right situation. I suspect that no matter what your feelings are on the utility you will find it stimulating to examine a few example diagrams created with Circos.
I was too busy yesterday to create this for Earth Day so here it is one day late. Besides, shouldn't every day be earth day? Around 3500 tweets containing the text 'Earth day' were analyzed and the shaped word cloud below was created based on the frequency of the other words used. Click on a word to see the latest matching tweets. I used the same globe image as in My World Has Room For Wildlife and World Peace. The image was made with NASA World Wind.
A few months back I was contacted by someone at McKinsey & Company for permission to include a graphic in a publication they were producing. They used my tool Twitter Spectrum to create an image illustrating the words associated with the terms 'collaboration' and 'individualism' in the latest tweets on Twitter. This was used in a section of a printed book called What Matters - Ten questions that will shape our future. The book was distributed to leading business executives and world leaders at the World Economic Forum annual meeting in Davos, Switzerland at the end of January. I was very pleased to be associated, even in such a small way, with such a prestigious undertaking.
The online version of the content does not include the image but the scanned image is shown below. It shows that 'collaboration' was used more frequently than 'individualism' in tweets. Dominant terms related to collaboration are: blogging, power, world, strategy, socialtext, and tomorrow. Terms related to individualism include rugged, hyper, sovereignty, obama, and american.
The image above was generated in Nov 2008. Just for fun I have created the current spectrum to see how it compares. It looks quite different and is much more balanced. Note that McKinsey manually recreated the image they used in order to get the colours they wanted.
There has been a lot of attention on Twitter this week to three celebrity-related topics. Early in the week there was a lot of discussion about Susan Boyle, the contestant on Britain's Got Talent. In the middle of the week there was Ashton Kutcher becoming the first Twitter user to have more than 1,000,000 followers. Finally, on Friday, Oprah joined Twitter and featured it on her show.
I've used Twitter Venn to compare the current rate at which these three people are being referred to in the TwitterSphere. Susan Boyle is slightly behind Oprah right now and far ahead of Ashton Kutcher. These results reflect the current zeitgeist and could be quite different tomorrow. It's also interesting to note the high frequency of the hashtag #herebeforeoprah within oprah references. Click on the image or this link to see what it's like right now.
We are 108 days into the year and Neoformix has had 28 posts to date. This works out to about 1 post every 4 days. I'm making a public commitment right now to try and post more often with a target of averaging 1 post/day for the rest of the year. I will continue to highlight my own work but you can expect to see more posts about other data visualization related material on the web.
The Mesh Web Conference just finished in my hometown of Toronto. I didn't attend but it looked like it would have been an interesting experience. I built another shaped word cloud based on tweets containing the text 'mesh09' sent over the last few days. The larger the word, the more frequently it was used. Click on any word to see the related tweets in Twitter Search. It seems to illustrate the primary topics and speakers reasonably well.
The image map below was constructed from the most popular content words in tweets about the Web 2.0 Expo taking place in San Francisco. The larger the word, the more frequently it was used. Click on any word to see the related tweets in Twitter Search.
I have used Twitter Venn to look at the tweets containing references to the leaders of the US, Canada, and the UK during the G20. This is a snapshot taken around 9:30 EST on April 2nd. I combined the word clouds for several Venn regions into the single image below for ease of comparison. A few obvious observations:
Click here or the image below to see the latest Twitter Venn for the names of these leaders in the context of G20.
The wonderful collaborative data visualization site Many Eyes has just introduced a new type of text visualization called a Phrase Net. It does a brilliant job of letting you explore a text and see which words are related to each other. The image below shows words related through simple juxtaposition. Words in darker blue more often appear at the beginning of a pair and those in lighter blue at the end. The image clearly shows that 'strawberry fields', 'good day', and 'blackbird singing' occur frequently in the dataset. The data here is a set of the lyrics from Beatles songs.
static image - embedded interactive version is below
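To make the idea of relating words through simple juxtaposition concrete, here is a small sketch that counts ordered pairs of adjacent words. It is not how Many Eyes implements Phrase Net. The function name `adjacent_pairs` and the crude tokenizer are my own assumptions.

```python
import re
from collections import Counter


def adjacent_pairs(lines):
    """Count ordered (first, second) pairs of words that sit side by side."""
    pairs = Counter()
    for line in lines:
        words = re.findall(r"[a-z']+", line.lower())
        # zip pairs each word with its right-hand neighbour on the same line.
        pairs.update(zip(words, words[1:]))
    return pairs


def position_counts(pairs):
    """For each word, count how often it starts a pair and how often it ends one."""
    starts, ends = Counter(), Counter()
    for (first, second), n in pairs.items():
        starts[first] += n
        ends[second] += n
    return starts, ends
```

The start and end counts are the kind of information the darker and lighter blue shading conveys: a word that mostly starts pairs would be drawn darker.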
I made two minor changes to the Twitter StreamGraphs application. First of all, you now get the latest 1000 tweets containing your search term rather than the latest 200. Secondly, I changed the default search term to 'data visualization'. The first change should help make it a little more useful although for any popular term 1000 tweets doesn't go back very far.
One technique that might be useful is to add common words to the query. Because a tweet must contain every word, the query matches fewer tweets and the latest 1000 reach farther back in time. For example a search on 'coffee' will likely only show a couple of hours back but a search on 'coffee to the and' goes back almost two days. If you are using the 'q=' parameter in the URL then separate different words with '+' like so: http://www.neoformix.com/Projects/TwitterStreamGraphs/view.php?q=coffee+to+the+and
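If you want to build such a URL programmatically, the sketch below shows one way using Python's standard library. The base URL and the `q` parameter come from the example above. The helper name `streamgraph_url` and its `padding_words` argument are hypothetical.

```python
from urllib.parse import urlencode

BASE = "http://www.neoformix.com/Projects/TwitterStreamGraphs/view.php"


def streamgraph_url(search_term, padding_words=()):
    """Build a view.php URL, joining the query words with '+'."""
    query = " ".join([search_term, *padding_words])
    # urlencode encodes spaces as '+', which is the separator the tool expects.
    return BASE + "?" + urlencode({"q": query})


print(streamgraph_url("coffee", ["to", "the", "and"]))
```

Calling it with `"coffee"` and the padding words `to`, `the`, `and` reproduces the URL shown above.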
Twitter StreamGraph for 'Data Visualization' (click to use interactive tool)
Obama delivered an Address to the Joint Session of Congress last night. I have compared it below to the last couple of State of the Union Addresses using the Sentence Bars technique. It clearly shows a shift away from Security issues towards Economic and Domestic issues. The summary pie charts at the bottom make it even more obvious - Security fell from around 40% in 2007 and 2008 to 13% this year. Vocabulary associated with the Economy was about 20% in 2007 and 2008 and 40% this year. References to Domestic issues have also increased dramatically this year.
This previous post shows diagrams for 2001 and 2002 as well. The 2001 version, which was prior to 9/11, shows a vocabulary profile almost identical to the one this year - Government 21%, Domestic 29%, Economy 38%, and Security 13%.
Address to Congress - Sentence Bars with Topic Colours (click to see larger version)
I really admired the interactive graphic produced by the New York Times that showed the map of popular Super Bowl words used on Twitter. I created something of my own using pretty much the same data. Mine supported many powerful features like the ability to zoom in and out of the map, to see individual tweets, and also to filter the tweets that were shown. Despite this, I preferred the NYT visualization over my own because the design was so accessible and it directly showed something interesting on the map with a minimum of fuss. I decided that I would try and emulate the design in my next twitter visualization.
Lots of people were twittering away last night during the Academy Awards so I gathered the data and constructed a new visualization very similar in design to the NYT Superbowl map. I grouped words into several categories and let you select which one to see. The categories are:
This first sample map shows which people were being mentioned the most in tweets shortly after 9:30pm (EST) in the black text. The person being discussed throughout most of the country at that time was Ben Stiller and there were a few areas talking more about Joaquin Phoenix or Natalie Portman. The text in bright red shows the top adjective associated with that person in that location during that time period. If there was no adjective used for that person/location/time combination then the most common adjective associated with that person for any time or location is shown in a darker red color. This technique was not used in the NYT SuperBowl visualization but seemed like a good way to show more meaningful information.
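The adjective fallback rule described above can be sketched as follows. This is only an illustration under assumed data shapes, not the code behind the map. It assumes the adjectives have already been extracted into `(person, location, period, adjective)` tuples, and all names are hypothetical.

```python
from collections import Counter, defaultdict


def build_adjective_tables(observations):
    """observations: iterable of (person, location, period, adjective)."""
    local = defaultdict(Counter)    # keyed by (person, location, period)
    overall = defaultdict(Counter)  # keyed by person only
    for person, location, period, adjective in observations:
        local[(person, location, period)][adjective] += 1
        overall[person][adjective] += 1
    return local, overall


def top_adjective(local, overall, person, location, period):
    """Return (adjective, is_fallback); adjective is None if nothing is known."""
    counts = local.get((person, location, period))
    if counts:
        # Specific match: would be drawn in bright red.
        return counts.most_common(1)[0][0], False
    if overall.get(person):
        # No local data: use the person's overall top adjective (darker red).
        return overall[person].most_common(1)[0][0], True
    return None, True
```

Returning the `is_fallback` flag alongside the adjective lets the drawing code choose between the bright and the darker red.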
The second example map shows the movies being discussed just prior to the end of the show.
You can grab the handle on the timeline and drag it around or use the arrow keys on your keyboard to move back and forth a single time period. Give the interactive version of the Oscar Twitter Map a try!
The Atlantic has a nice interactive feature accompanying the article The Shaping of America by Richard Florida. It shows Patents per Capita for US cities and how they have varied over time as well as population and income data. It has many similarities in structure to my recent interactive maps on Obama's inauguration and the Superbowl as well as the Walmart and Target expansion maps from FlowingData.
This year is the 200th anniversary of the birth of Charles Darwin and it has been 150 years since the publication of On the Origin of Species. I read the book when I was 12 or 13 and followed it up with The Descent of Man. It's likely I had encountered the idea of evolution before this but I can still recall, thirty years later, how astounded I was with the weight of the evidence presented and especially the enormous explanatory power of his ideas.
The fact that all the amazingly diverse life on earth shares a common ancestry seems difficult for some people to accept but I find it inspiring. Darwin was a master at discovering and illustrating the patterns in the forms of life he observed. His theory of evolution is an excellent example of the power of simple iterated processes to generate both great beauty and complexity.
I have created a structural map of the Twitter community based in the area where I live, Toronto, Canada. Tweets were gathered using the Twitter Search API for a radius of 50 miles around Toronto during the two-week period January 17-31, 2009. This yielded 337,782 tweets - approximately 24,000 tweets/day. Of these, 147,166 tweets contained an @ reply directed to at least one other twitter ID. These messages were analyzed to count the number of tweets between pairs of people both residing within the Toronto area. The final raw dataset defining the structure of the community had 3938 distinct twitter IDs and 18,831 relations connecting them together.
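The pair-counting step can be illustrated with a short sketch. It is not the code used for the map. It assumes the tweets are available as `(author, text)` pairs and that the set of Toronto-area IDs is already known, and it treats any @mention in the text as a reply, which is a simplification. All names are hypothetical.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def reply_edges(tweets, local_ids):
    """Count tweets exchanged between pairs of local IDs.

    tweets: iterable of (author, text). Returns a Counter keyed by an
    alphabetically sorted (id_a, id_b) tuple, so direction is ignored.
    """
    local = {i.lower() for i in local_ids}
    edges = Counter()
    for author, text in tweets:
        a = author.lower()
        if a not in local:
            continue
        # A set ensures a tweet naming the same ID twice counts only once.
        for target in {m.lower() for m in MENTION.findall(text)}:
            if target in local and target != a:
                edges[tuple(sorted((a, target)))] += 1
    return edges
```

The keys of the returned counter are the relations and the distinct IDs appearing in them are the nodes. The counts can serve as edge weights for a later clustering step.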
Such a large set of nodes and edges is difficult to represent visually in a pleasing manner. With this many connections a standard node and edge diagram is usually a tangled mess. I tried to overcome this by clustering nodes together in a hierarchical fashion based on the connections between them. Connections between individual twitter IDs are only drawn if they are within the same group. Edges are drawn at the group level to show relationships between groups.
If you want to examine the structural map more closely then look at the PDF version of the Toronto Twitter Community Structure. You can do a text search on a name within a PDF viewer to see where you are and who you've been grouped with. Note that a bug in my tools prevented me from placing the text labels precisely where I wanted and when they overlap with another label the text search may fail.
The first image below shows the overall structure. The area in the red square is shown at a higher resolution in the image that comes after it.
This sequence of 5 images takes us from the whole community to a small group surrounding Mathew Ingram (mathewi), who is a technology columnist at The Globe and Mail and is well known in the Toronto Twitter community. It clearly shows the Toronto-based people he communicated with on Twitter using the @ reply mechanism during the last two weeks of January. For one of these people, sarah_mitchell, there was no direct connection but they were grouped together because Sarah had strong ties to tmaduri.
The second last image shows connections coming out of Mathew's group into several other groups. This shows that there are connections between some of the individuals inside these groups. The clustering algorithm imposes constraints on how large the groups can get so they can't all be placed in one giant group. These high order group connections can occur at every level so it is possible that Mathew had conversations with other people, perhaps some on the far side of the map in the red region. However, the intent is to show most of the people that were tightly bound within the same immediate group.
Note that this diagram completely ignores who follows whom on Twitter. It's based entirely on @reply data. There may very well be important members of the community that didn't use @replies to other Toronto people during the time period and so do not show up.
By the way, feel free to follow me on Twitter if you would like to discuss this or hear more about my work in the future. I'm at http://twitter.com/JeffClark
I've gathered some tweets sent during the time of the superbowl last night and created another map-based tool to visualize them. I took every 10th tweet from 6:00 pm to 10:30 pm EST that contained 'superbowl' or 'super bowl' and plotted all those for which I could find the longitude and latitude coordinates of the author. This process yielded 5711 tweets to explore. The interactive tool can be found here - Superbowl Twitter Visualization.
Here are a few screen shots captured from the tool. The first map shows that references to the two teams seem to be fairly equal and there are no obvious large clumps for one or the other.
The attention to the event was more global than I expected given that this is a sport primarily played in the US and Canada.
Here is a word cloud for the Superbowl played last night shaped with the logos of the Steelers and Cardinals. I used a sampling of all the relevant twitter messages between 5:30 and 10:30 EST for the text that was analyzed. In addition to the usual stop words I removed 'superbowl', 'super' and 'bowl' from the analysis since those words were used to select the text and would dominate too much. This cloud is clickable and brings you to matching tweets.
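The counting step behind a cloud like this can be sketched in a few lines. This is an illustration only: the stop word list here is a tiny stand-in for a real one, and the function name is hypothetical. The three query words are the ones named above.

```python
import re
from collections import Counter

# A tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "for", "on", "at", "it"}
# Words used to select the tweets, removed so they do not dominate the cloud.
QUERY_WORDS = {"superbowl", "super", "bowl"}


def word_frequencies(tweets, extra_excluded=QUERY_WORDS):
    """Count words across tweets, skipping stop words and the query words."""
    excluded = STOP_WORDS | set(extra_excluded)
    counts = Counter()
    for text in tweets:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in excluded:
                counts[word] += 1
    return counts
```

The resulting counts would then determine the relative size of each word in the cloud.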
Around the middle of December 2008 many parts of Canada and the US had fairly severe snowstorms. The name 'Snowmageddon' became popular for describing the event. I've collected data from Twitter on how this term, and the spelling variation 'snowmaggedon', was used. I've built a new interactive visualization to support exploring these messages along the dimensions of time, geography, and the actual message content. This is based on the work I did recently showing the tweets related to Obama's inauguration. For the impatient you can try out the application here: Snowmaggedon on Twitter.
Here are some maps showing the results. This first one shows a few things:
This second image shows the Great Lakes region around what appears to be the peak, which occurred around 2pm on Dec 19th. Starbursts are used to highlight references around the current time and the time slider has been adjusted here to show the maximum number of starbursts.
A close-up of Toronto with the filters set to 'love' and 'hate' shows that the extreme weather caused extreme emotions in some people. Note that a little bit of noise was added to the latitude and longitude coordinates so that at high magnifications there is some separation between messages that were sent from the location 'Toronto'.
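Adding a little noise to coordinates is often called jittering. Here is a minimal sketch of the idea. The uniform distribution and the 0.01 degree scale are assumptions chosen for illustration, not the values used in the tool.

```python
import random


def jitter(lat, lon, scale=0.01, rng=random):
    """Shift a coordinate pair by uniform noise of at most `scale` degrees."""
    return (lat + rng.uniform(-scale, scale), lon + rng.uniform(-scale, scale))


# Five messages geocoded to the same approximate 'Toronto' point end up
# at slightly different positions on the map.
rng = random.Random(0)  # seeded so the layout is repeatable
points = [jitter(43.65, -79.38, rng=rng) for _ in range(5)]
```

Passing a seeded generator keeps the positions stable between runs, so the dots do not jump around each time the map is redrawn.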
A map showing the entire world shows that use of the term was not limited to Canada and the US.
Of course the real fun comes from playing with the interactive version so make sure you give Snowmaggedon on Twitter a try!
There was obviously a lot of attention directed towards Obama's inauguration last week. Naturally, this extended to the twittersphere and there were a huge number of Twitter messages regarding the event. I've built a new interactive visualization to support exploring these messages along the dimensions of time, geography, and the actual message content.
The video below is a bit fuzzy but gives an idea of what the tool does. For those who want to jump right to the application you can find it here: Obama Inauguration As Seen Through Twitter.
The application supports scrolling back and forth through time or animating the changes over time. The map support allows the standard zooming and panning as well as having special buttons to frame both the US and the whole world with one click. The mouse wheel can also be used to zoom in and out. There is text filtering support where you can enter one or two different strings and see which messages match.
I started with about 126,000 tweets but took a subset based on those that expressed strong sentiment and for which I was able to get proper geographic coordinates. The final application uses 11,389 messages. The sentiment filtering step likely had the effect of removing many of the non-English messages since it involved counting English high impact words (love, hate, beautiful, sucks, etc.).
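A filter based on counting high impact words might look roughly like this. It is a sketch, not the actual filter: the lexicon contains only the four example words mentioned above, and the threshold of one hit is an assumption.

```python
import re

# Only the four example words from the text; the real list is assumed to be longer.
HIGH_IMPACT = {"love", "hate", "beautiful", "sucks"}


def impact_score(text, lexicon=HIGH_IMPACT):
    """Number of high impact words in the text."""
    return sum(1 for w in re.findall(r"[a-z']+", text.lower()) if w in lexicon)


def strong_sentiment(tweets, min_hits=1):
    """Keep only tweets containing at least `min_hits` high impact words."""
    return [t for t in tweets if impact_score(t) >= min_hits]
```

Because the lexicon is English only, a tweet written in another language will almost always score zero and be dropped, which matches the side effect described above.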
This is my first map-based application on Neoformix and it was created much more easily because I was able to build it on top of something called Modest Maps. Special thanks go to Tom Carden of Stamen Design and the others who created it!
Give it a try! Obama Inauguration As Seen Through Twitter
This Twitter word map is constructed from tweets containing 'obama' during January 1-21, 2009. Higher frequency words are larger and you can click on any word to jump to Twitter Search and see the matching tweets. This is done using a simple HTML image map I generate along with the image. The base image is derived from the iconic Hope/Progress image designed by Shepard Fairey.
Here is another version of the shaped word cloud for tweets containing 'apple'. The major difference from the previous one is that this one is clickable. You can click on the words to jump to Twitter Search and see the matching tweets! This uses a simple HTML image map I generate along with the image. I also made the image a bit smaller so you are more likely to see it all at once.
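For anyone curious what generating an image map involves, here is a minimal sketch. It assumes the layout step has produced a pixel bounding box for each word. The search URL prefix is a placeholder assumption, and the function name is hypothetical.

```python
from html import escape
from urllib.parse import quote_plus

# Placeholder search URL prefix; substitute whatever the search page actually is.
SEARCH_URL = "http://search.twitter.com/search?q="


def image_map(name, boxes):
    """Build an HTML <map> with one rectangular <area> per word.

    boxes: iterable of (word, left, top, right, bottom) in image pixels.
    """
    lines = [f'<map name="{escape(name)}">']
    for word, left, top, right, bottom in boxes:
        href = SEARCH_URL + quote_plus(word)
        lines.append(
            f'  <area shape="rect" coords="{left},{top},{right},{bottom}" '
            f'href="{escape(href)}" alt="{escape(word)}">'
        )
    lines.append("</map>")
    return "\n".join(lines)
```

The image itself is then linked to the map with a `usemap` attribute, for example `<img src="cloud.png" usemap="#cloud">` for a map named `cloud`.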