Neoformix Blog

A Short Reflection on Storytelling in Data Visualization

By: Jeff Clark    Date: Mon, 28 Apr 2014

The role of storytelling in Data Visualization has become much discussed over the last year or so. One reason I find this aspect of Data Visualization so interesting is that my own natural tendencies are to focus on exploratory visualization. Much of my own past data visualization work is weak in the storytelling side of things. Coming from a scientific background and personally enjoying the act of discovering patterns in data means my default approach is to build exploratory tools. For me, personally, this whole storytelling aspect seems a rich area to mine in order to improve my work.

I just finished listening to the latest Data Stories podcast called Visual Storytelling which is a discussion of the topic by hosts Moritz Stefaner, Enrico Bertini, and their guests Alberto Cairo and Robert Kosara. It's an excellent conversation from a number of perspectives on the subject and I found it very stimulating. If you haven't already heard it then make sure you have a listen.

I was surprised that one aspect of the topic wasn't discussed in the podcast: storytelling techniques in data visualization can be abused to express falsehoods. One thing that is of critical importance to me in data visualization work is that it is grounded in reality - it's based on data which are, hopefully, objectively true or based on some real measurements. To be sure, there is often uncertainty involved and for some topics objectivity is difficult but still, data visualization should be about describing reality as best we can.

Like many people with an engineering, mathematical, or scientific background, I'm suspicious of salesmanship and marketing. I'm wary of other people using emotion and a good story to persuade me to believe something that isn't true. I have some concern that data visualization work that emphasizes storytelling is more likely to be 'Data Fiction' - or propaganda. The designer, through careful choice of selected facts, use of emotion, drama, conflict, and all the other techniques of storytelling can craft a message at odds with reality. The use of 'data' will even lend an air of authority to that message.

Storytelling is a powerful tool for leading a person efficiently to the main points uncovered in a dataset and can dramatically increase the impact of a work. It's very important that the story emerges from quality data and that this connection is open to inspection. Let's make sure that all our data stories are true.

Markham Winter of 2014

By: Jeff Clark    Date: Tue, 01 Apr 2014

Winter has finally ended in Markham where I live and it has seemed a very long and cold season this year. I decided to take a look at the weather data from Environment Canada and see whether my impression is supported by the data. The result is the graphic below. Click on it to see a larger version.

Yes, 2014 was the coldest winter in Markham since 1994. We had an average temperature during the winter of -8.2 C this year and in 1994 it was -9.2 C. Both last year and especially 2012 were warmer than usual so it likely felt that much worse in comparison. We also had the 4th most snow in the last 20 years so it was both very cold and snowy.



Toronto Visible Minorities

By: Jeff Clark    Date: Fri, 27 Sep 2013

Toronto is the most multicultural city in the world. According to the 2011 National Household Survey, 46% of the population were foreign-born immigrants and 47% are members of a visible minority. (ref) These immigrants come from a wide variety of places across the globe and their diversity makes the city a truly remarkable place.

I have created a Dot Map that shows a single point for every person in the Toronto area, coloured by visible minority status. There are 5,700,628 in all and they are positioned at their place of residence and coloured based on the information from the 2011 census and National Household Survey. They do not depict actual individual locations but are based on the statistics over small areas.

This first image is zoomed in slightly and shows Toronto with only a few outlying areas. You can see regions of higher and lower population density as well as how the visible minorities are distributed across the city.



You can explore the map in detail with this Zoomable Dot Map of Toronto.

The section below is a close-up of the high-density string of condos along Yonge Street north of HWY 401. You can spot the blank rectangle of the cemetery to the left, the Don river valley, and commercial areas where no people reside.



The next image shows the white, predominantly Italian, area of Woodbridge with the South Asian concentration obvious to the west in Brampton.



This work was inspired by the previous creations of Eric Fischer, and Dustin Cable.

It was created with population data from Statistics Canada and map reference data from OpenStreetMap. The OpenStreetMap data was taken from the very helpful Metro Extracts provided by Michal Migurski. The TileMill tool from MapBox was used to compose a map used to mask out non-residential areas and also the basemap underneath the dots. Custom code written with Processing was used to place the actual dots and create the final images. Thanks!

Toronto 311 Visualization

By: Jeff Clark    Date: Fri, 06 Sep 2013

The calls people make into the 311 service line in Toronto give an interesting glimpse into the pulse of the city. The City of Toronto makes this data available through their Open Data initiative. I did some analysis and design work with it to produce a visualization for illuminating time-based patterns during 2012.


The visualization is a set of small multiple calendar heatmaps, one for each data series. The one shown above is for reports about 'long grass and weeds'. I was inspired to use this visual form by this example: Vehicles involved in fatal crashes by Nathan Yau. I experimented with a few different visual methods but this one did the best job of revealing both the seasonal and day of week patterns. I chose to use a unique colour scale for each series in order to maximize the amount of detail.

The image below shows the top 20 most common types of requests. Click on the image to load the full sized version. You can also view all the data series with an interactive version of the Toronto 311 Visualization.



This was created with Processing JS and contains information licensed under the Open Government Licence - Toronto.

Visual Book Selector

By: Jeff Clark    Date: Wed, 08 May 2013

One common pattern I see in many interactive applications is to support a person who is selecting a few items from some larger set. Often these items have various characteristics that the person wants to use in some way to guide their selection process. The characteristics can be numeric quantities, dates, categories, or names of things. Showing all the items in a list and allowing the person to sort by one of the attributes is often a decent default solution.

In other cases it's more useful to consider multiple attributes at a time during the selection process. Maybe you want items that are high in one attribute, low in another, and are from a particular category. Ideally the selection process should be one of exploration and successive refinement where various filtering criteria are adjusted until some small subset of items are defined and they can be investigated individually.

I have built an example of this concept which I call the Visual Book Selector. The books are directly represented with small circles and filters can be applied to progressively exclude books by various criteria. The filters are depicted visually as gates through which some of the items can pass and others cannot. The image below shows one possible configuration.

There are about 1000 books which start in the top segment of the display when no filters have been applied. In this example three of the category gates have been opened so books from those categories can pass through. The ones that don't pass this filter pile up near their closed gate which helps give some understanding of their distribution. The books that pass the first criteria encounter a second filter on the average rating of the book from Google Book reviews. This filter gate is set to only allow books having an average rating of at least 4.0 to pass through. The final gate does a pattern match on Author name and allows 4 books to the bottom which have passed all of the criteria.

The best way to get a feel for it is to try out the Visual Book Selector yourself. You can use the dropdown selectors on the left of each segment barrier to choose different criteria on which to filter. Hover over a book to see details and click on it's circle to visit the corresponding Google Books page.

The list of books and their categories comes from the 2009 article in the Guardian 1000 novels everyone must read: the definitive list. The other data was gathered from Google Books.

I should also note that an excellent solution to this multi-attribute selection/exploration problem posed here is the Elastic Lists concept by Moritz Stefaner. It supports what's called Facet Browsing and enhances it with the visualization of proportions and distributions as well as animated transitions.

Star Wars Movie Fingerprints

By: Jeff Clark    Date: Wed, 27 Mar 2013

Recently YouTube had a video that showed all six Star Wars movies at once. They were placed in a 2 by 3 matrix and had an audio track of all the movies superimposed. It was an interesting experiment that has since been removed based on copyright grounds. Before it was removed I was able to do some simple analysis on the video and extract some details of the individual episodes of the Star Wars series.

Basically, I produced something very similar to a classic work called Cinema Redux™ by Brendan Dawes, done in 2004. Each individual movie in the series was reduced to a collection of small snapshots taken at 1 second intervals. The snapshots are layed out 60 images per row so a row corresponds to a minute in the film. These 'fingerprint' images reveal some aspects of the film structure.

Click on any of these images to see higher resolution versions.

Episode I: The Phantom Menace StarWars1


Episode II: Attack of the Clones StarWars2


Episode III: Revenge of the Sith StarWars3


Episode IV: A New Hope StarWars4


Episode V: The Empire Strikes Back StarWars5


Episode VI: Return of the Jedi StarWars6

I used some fairly simple code in Processing to analyze the video and create the output images.

Obesity Slopegraph

By: Jeff Clark    Date: Tue, 26 Feb 2013

Last week the wonderful Guardian Datablog published an interesting post called Obesity worldwide: the map of the world's weight. It contains a map that shows with color the rates of obesity around the world. The accompanying chart gives data for different time frames and for both male and female which you can select and view on the map. When I saw the chart I immediately thought of a number of interesting questions that could not be easily answered with the map or chart.

  1. What is the trend over time?
  2. Do these trends exist worldwide?
  3. Which countries are exceptions to the trend?
  4. Which countries have the highest or lowest rates of obesity?
  5. Are there large gender-based differences in obesity rates in various countries?

Much of my past work has been driven by personal curiousity. That, together with my own background in science, have shaped my work such that most of it has been exploratory in nature. Recently I have been thinking more about the storytelling or communicative aspect of data visualization. This has been triggered by my admiration for the amazing work of the New York Times Graphics Department, and the writings of Alberto Cairo, Robert Kosara, Andy Kirk, and Jonathan Stray.

I decided try and build an interactive visualization that helped answer the questions above. I also tried to build something that explicitly highlighted some of the more interesting aspects of the data without sacrificing freeform exploration. I settled on using a Slopegraph which was first described by Edward Tufte and is featured on the cover of Cairo's excellent book The Functional Art.

This first image shows the trend for male obesity organized by continent. It's a difficult problem to show labels for so many countries along one axis so I tried to alleviate it by letting the user expand or hide countries by continent group. In this case 'North America' is expanded to show its' individual countries. Labels are only shown if they don't overlap with others. The largest countries by population are placed first.

Individual country lines can be clicked on to emphasize them with colour.

The third example shown below charts female values on the left against male values on the right in order to emphasize gender differences.

The interactive visualization includes a 'stepper' that takes the user through four different views. This helps introduce functionality gradually as well as serving to emphasize important patterns in the data.

In addition to the people and organizations mentioned above I would like to acknowledge the people behind Processing and Processing JS which was used to build the application. The code for the dashed lines comes from J David Eisenberg. Thanks!

Neoformix Site Redesign

By: Jeff Clark    Date: Tue, 19 Feb 2013

In 2006, I started this blog as an outlet for my creative personal work as well as to gather in one place references to interesting work by other people. Since then, Neoformix has grown into a full-time business for me specializing in the development of custom data visualizations. I have just spent some time giving the website it's first facelift in 7 years. I hope you like it!

I've tried to simplify the design and emphasize that Neoformix is a business by designing a main page that highlights some projects and moving the blog to a secondary page. Thanks to Twitter Bootstrap for a powerful front-end framework which I made use of in the redesign.

Word Hearts Updated

By: Jeff Clark    Date: Tue, 05 Feb 2013

About five years ago I posted a simple little application called Word Hearts which lets you fill a heart shape with words. Last year it was the most visited page on my site despite the fact that it was still a java applet based application which many modern browsers won't render. I have updated this tool to use ProcessingJS so it runs well in modern browsers. There is also enhanced functionality like:

  • You can fill circles, diamonds, stars, and squares as well as the original heart shape
  • There are more fonts to choose from
  • You can easily use small symbols like hearts, happy faces etc., in your list of words
  • A nice color picker
  • Word orientation options
  • Vary the word colors so it looks more interesting
  • Save your image

Here are a couple of examples of what you can do:



Launch the interactive version of Word Hearts to try it out.

This was created with Processing JS and also uses the JSColor color picker and the JQuery Font Chooser. Thank you!

Grimm's Fairy Tale Metrics

By: Jeff Clark    Date: Thu, 31 Jan 2013

I have built another little digital humanities project based on the text of the 62 stories in Grimm's Fairy Tales. This one is called Grimm's Story Metrics and presents an interactive matrix of stories together with various metrics calculated from their text. You can click on a column to sort by that data, click again to reverse the direction, and click on a story name to open it in another window. The image below shows the stories sorted by the 'Royalty' metric which indicates, as you would expect, how many references there are to words related to the topic of royalty. Click on the image to go to the interactive tool.


Hovering over any of the bars shows details about that particular measurement. Most of the metrics, like 'Royalty', are based on topics and the details shown are the words characteristic of that topic used in the story. So, for example, the details for 'Royalty' in the 'Frog-Prince' are princess, prince, king, kingdom which are listed in frequency order. These topical metrics are normalized based on total words in the story so longer stories have no scoring advantage.

The 'Lexical Diversity' is a ratio of the number of unique words in the story to the total words. These stories are fairly short and you can observe a rough inverse relationship between 'Story Length' and 'Lexical Diversity'. 'Clever Hans' is an outlier in this relationship. If you examine the text for this story you'll see that there is a great deal of repitition.

This was created with Processing JS. The text analyzed is the English translation by Edgar Taylor and Marian Edwardes available at Project Gutenberg. Thank you!

Les Miserables Word Graph

By: Jeff Clark    Date: Fri, 25 Jan 2013

Here is a word graph for the text of the novel Les Miserables by Victor Hugo. Click on the image to see a large (4 MB) version which makes all the words legible.


Area of the words reflects frequency in the text. The top three most similar words are considered for connections with the word similarity metric defined by collocation within the text. The outer ring of words only have one weak connection to another word in the graph.

Grimm Fairy Tale Browser

By: Jeff Clark    Date: Tue, 22 Jan 2013

My previous post on the Grimm's Fairy Tale Network showed a graph illustrating the strongest connections between the various stories. I used a few techniques to try and prevent the usual mess of connections that often obscure the relationships of interest.

Another way of tackling graphs with lots of connections is to only show a small portion of the graph at a time and use interaction to provide navigation. This lets you browse around a complex network of nodes and relations and repeatedly get views centered on a node of interest. I've created an example of this for the Grimm's fairy tale data which I call the Grimm Fairy Tale Connection Browser.

The image below shows the connections to the story 'Little Red Riding Hood'. The larger circles are stories and the smaller ones represent key words in the collection. The inner ring shows the words and stories closely connected to the story of interest. The outer ring gives the related stories and words that are related but with less strength. You can click on any story or word to make it the new focus node. Click on the image below to launch the interactive version.


This second example shows the stories and other words highly related to the word 'wolf'. The interactive tool shows the Gutenberg version of the stories in a panel on the right. When a new story is made the central focus of the visualization the right panel shows the story text.


This was created with Processing JS.

Grimm's Fairy Tale Network

By: Jeff Clark    Date: Tue, 15 Jan 2013

I have had some fun playing around analyzing the text of the stories in Grimm's Fairy Tales. There are 62 stories in this set and they contain many popular tales such as Little Red Riding Hood, Snow White, and Rapunzel. The text analyzed is the English translation by Edgar Taylor and Marian Edwardes available at Project Gutenberg.

Story Connections

The graphic below is a simple network showing which stories are connected through the use of a common vocabulary. There are three different strengths of connection shown and I've tried to minimize the usual 'hairball' nature of these types of diagrams by only showing the top three connections for a story. Some stories will have more than three links because the link meets the top-three threshold for the story on the other end of the link. The shade of blue simply indicates the number of connections for that story - the darker the shade the more connections. Click on the image to see a larger version.


The diagram shows in the upper-right corner for example that 'Little Red Riding Hood' is strongly linked to 'The Wolf and the Seven Little Kids'. My analysis shows that the strength of this connection is due to them both using words like wolf, stones, door, belly, scissors, drowned, and devour.

Novel Views: Les Miserables

By: Jeff Clark    Date: Tue, 08 Jan 2013

The project 'Novel Views' consists of a series of visualizations of the novel Les Miserables by Victor Hugo. The text analyzed is the English translation by Isabel F. Hapgood available at Project Gutenberg.

Character Mentions

This graphic shows where the names of the primary characters are mentioned within the text. Click on any of these images to see larger versions.


Characters are listed from top to bottom in their order of appearance. The horizontal space is segmented into the 5 volumes of the novel. Each volume is subdivided further with a faint line indicating the various books and, finally, small rectangles indicate the chapters within the books. In the 5 volumes there are a total of 48 books and 365 chapters. The height of the small rectangles indicate how frequently that character is mentioned in that particular chapter.

Radial Word Connections

A word used in multiple places in a text can be interpreted as a connection between those locations. Depending on the word itself the connection could be in terms of character, setting, activity, mood, or other aspects of the text. This graphic shows a number of these word connections.


The 365 chapters of the text are shown with small segments on the inner ring of the circle with the first chapter appearing at the top and proceeding clockwise from there. The outer ring shows how the chapters are grouped into books of the novel and the book titles are shown as well. The words in the middle are connected using lines of the same color to the chapters where they are used. The edge bundling technique together with the Volume - Book - Chapter hierarchy of the text are used so the patterns of connections are more easily revealed.


(More...)

Delaunay Images II

By: Jeff Clark    Date: Tue, 02 Oct 2012

A few years back I played around with creating Delaunay Images as described here and here. That work was inspired by these Delaunay Images created by Jonathan Puckey.

The delaunay process involves creating a triangular mesh in order to construct a more abstract version of a starting image based on some control points. In the past I either manually selected the control points or chose them randomly. I just recently came across some javascript code by 'atm2' for creating these types of images and discovered that it uses a more clever approach. Basically, edge detection is done on the base image and the delaunay control points are chosen from points on the edges. Using this idea as a starting point I modified the code a bit to make the triangles more transparent as they decrease in size. This basically lets us create a triangularized abstract version of an image while letting the details of the original show through in key areas. An example is below:

I really like the effect and it's completely automatic which opens up some interesting possibilities. The original base image is by Steve McCurry and is of Sharbat Gula. A retrospective on her life done by National Geographic can be found here.

Ablaze

By: Jeff Clark    Date: Thu, 20 Sep 2012

I recently came across an interesting javascript tool to generate images based on connecting lines between pairs of moving invisible points if they come within a threshold distance. It's called Ablaze and it was created by Patrick Gunderson. It's got a bunch of options to give you some creative control over what gets produced.

Movement in Manhattan Video

By: Jeff Clark    Date: Tue, 08 May 2012

In my last post about visualizing Movement in Manhattan I mentioned that it would be interesting to explore a more direct view of the data by using an animation. I have created such a video based on a fresh collection of tweets from Monday, April 30th. I gathered new data because I realized that my previous data set was collected over the weekend and I suspected that a weekday might provide more obvious patterns. It compresses 24 hours of data into 1 minute of video. Here it is:

I was influenced by the 'Fireflies' video showing iPhone traces done by Michael Kreil. In particular, I like the idea of using larger but more transparent graphics to represent the increased uncertainty when drawing interpolated locations. Basically, if a person tweets at location A and then again at location B ten minutes later the model I used assumes they moved at a constant speed in a straight line between those two events. This is an obviously crude approximation and leads to unrealistic paths in many cases. By increasing the transparency in between the two measured events it shows this uncertainty in a visual manner.

Again, as I saw in the original version, the patterns of tweets, both moving and static are quite chaotic. You can easily see the rise and fall of tweets over the changing time of day and some local patterns that look interesting but the patterns are still a bit of a jumble.

The geolocated tweets were collected with the library Twitter4J which was used from code written in Processing. I used this tutorial created by Jer Thorp to get started with the library. Code from this flow field sample by Daniel Shiffman was used as a starting point to create my flow maps. The background map is from OpenStreetMap. Thanks everyone!

Movement in Manhattan

By: Jeff Clark    Date: Wed, 18 Apr 2012

Inspired by the beautiful and elegant Interactive Wind Map created by Fernanda Viegas and Martin Wattenberg I have begun to explore the flow of people within a city. An ideal dataset to do this would include the GPS traces from thousands of people wearing trackers for weeks as they go about their daily lives. Organizations such as crowdflow.net and OpenPaths collect voluntarily donated data of this type and might be fruitful to explore. I decided, instead, to use geolocated tweets to try and see how the movement of people is affected by the urban landscape.

The image below shows an area of Manhattan roughly from Houston Street north to 72nd Street which corresponded to the region with the most geolocated tweets that I collected. It includes Times Square, Grand Central Station, the Empire State Building, Rockefeller Center, the southern portion of Central Park, and many other well known landmarks. The blue and red markings are an attempt to show the flow of people based on the data.

Basically, tweets sent by the same person within a 4 hour time-window were used as samples of speed and direction. These samples were used to construct a vector field representing the average flow of people within the area. The vector field and total tweet density over the space were then used to simulate the movement of people. Particles, representing people, were released at locations where actual tweets were recorded and their subsequent movement was determined by the flow field. The particles start out blue and gradually change through purple to red over time so each trace shows the direction of movement. Locations where there is little movement will have blue dots or very short blue traces. Longer traces with more red show a greater speed at that point.

The density and direction of the flow patterns seem reasonable but they do appear fairly chaotic - much more so than the patterns seen in wind flow for example. This makes sense for many reasons. One, people are much less deterministic than the molecules that make up the air. Secondly, the environment that they exist in is extremely complex. Also, statistically we are dealing with a much smaller sample size. In this case, roughly 34,000 geolocated tweets with only 9,600 path segments. If we had a million-times more data then the average patterns would be more clear. Another important factor is that this data was collected over a few days and so there may be clear patterns for specific times of day that are mixed together visually.

I have produced three more images that separate out the data by time of day. This first one only uses data from 6-11 am. It does appear to be a bit simpler and shows a few interesting patterns but it is still fairly chaotic. There is a strong flow east out from Central Park near 65th Street. There is also a more scattered flow from the east into New York University near the bottom left.

The afternoon flow map shows a greater overall density indicating a greater number of locations from which people are tweeting. There also appears to be a strong convergence on the area of 14th Street - 4th Avenue.

The evening map is also quite busy with lots of small local patterns. There is heavy action between 50th and 57th Streets. Comparing these three versions is easier with this Flickr lightbox version of the images.

Overall, there are lots of flows and some of them likely reflect real movement of people within Manhattan. Many others probably just reflect noisy data because the sample size is so small. It's difficult to distinguish between the two cases here. The technique itself might warrant further study with more data. Another interesting avenue to explore would be to more directly visualize the data with an animation like this 'Fireflies' video showing iPhone traces done by Michael Kreil.

The geolocated tweets were collected with the library Twitter4J which was used from code written in Processing. I used this tutorial created by Jer Thorp to get started with the library. Code from this flow field sample by Daniel Shiffman was used as a starting point to create my flow maps. The background map is from OpenStreetMap. Thanks everyone!

Datavis Subgroup Word Analysis

By: Jeff Clark    Date: Mon, 05 Mar 2012

This is Part 4 of a set of posts related to the analysis of the Data Visualization Field on Twitter. For context or more information you may want to read those other posts first. They are:

  1. The Data Visualization Field on Twitter
  2. Data Visualization Field Subgroups
  3. Datavis Blue-Red Connections

In the previous posts we have seen that there are two fairly cohesive subgroups of twitter accounts that emerged from our analysis of the original 1000 accounts. I've been calling them the 'blue' and the 'red'. They were determined by looking exclusively at the references to twitter IDs within the tweets that were sent.

Presumably the fact that there are two fairly distinct groups would also be reflected in what they are discussing. I've done some analysis of the words used within the tweets for both groups. English stop words ('the' , 'and' , 'or', ... ) and other words commonly found in tweets ('new', 'via', 'like', 'day', ...) were excluded. Word clouds definitely have their limitations but I believe they can be an effective way to get a qualitative feel for a body of text. I have used Wordle to construct word clouds for the two groups.




It's clear that the blue group tweets a lot about 'art', 'code', 'design', 'processing', 'project', 'app' and 'workshop'. The red group tweets about 'data', 'visualization', 'design', 'infographic', and 'visual'. There is some overlap for sure but it's clear that they emphasize different things in what they are talking about.

Right from the very start I was calling the whole set of accounts the 'Data Visualization Field'. Of course, a more accurate description was that I was looking at the 'Set of Accounts on Twitter Connected Through Tweet Mentions from @moritz_stefaner, @datavis, @infosthetics, @wiederkehr, @FILWD, @janwillemtulp, @visualisingdata, @jcukier, @mccandelish, @flowingdata, @mslima, @blprnt, @pitchinteractiv, @bestiario140, @eagereyes, @feltron, @stamen, and @thewhyaxis'. It doesn't exactly roll off the tongue. From looking at these word clouds it appears that the red group could reasonably be named 'The Data Visualization Field' and the blue group something like 'Computational Artists and Designers'.

If we want to contrast these two groups more directly we can look for words that are used much more frequently in tweets of one group than the other. I've done this for words that met both an overall frequency threshold and an author support threshold - they were used by at least 10% of the group members. The bar charts show the frequency proportion. So, for example, in the large sample of tweets I looked at from both of the two groups if you count the number of times the word 'makerbot' was used then 99% of those instances were in tweets from people in the blue group.




This shows even more clearly the different things that these two groups emphasize.

Datavis Blue-Red Connections

By: Jeff Clark    Date: Fri, 02 Mar 2012

The recent post on Data Visualization Field Subgroups had an interesting reaction on Twitter that I didn't expect. Many people that were placed in the 'red group' by the community detection algorithm in Gephi joked about being part of the 'team' and being happy to represent it and be grouped together with the others. Jen Lowe lightheartedly suggested a scrimmage at #eyeo between the red and blue. There was much less reaction from the 'blue group', likely because I'm embedded within the reds myself and so they likely paid more attention to my posts and the subsequent reaction on twitter.

There does, indeed, seem to be two fairly cohesive groups of people here but I suspect there are very many connections between the groups as well. We can use some simple network analysis to get a feel for this. Here are a few statistics calculated on the blue and red groups only:

Characteristic Blue Red
Number of Nodes 216 244
Total In-Links 6734 5712
Total Out-links 6070 6376
Avg In-Links 31.18 23.41
Avg Out-Links 28.1 26.13
Total Intergroup links 665 1329
Total Intragroup links 5405 5047
Percent Intergroup links 10.96% 20.84%

Both groups are pretty similar in most respects. The primary difference is that blue group members have on average more incoming links and that the percentage of intergroup links going from someone in one group to someone in the other is roughly double for reds. Remember that a link from A to B means that A referenced B in a tweet through a reply, a retweet, or just mentioning them in some context. When considering just the links between these two groups the people in red are referring to the people in blue at twice the rate of the reverse.

If you look at the graph showing both groups together (edges not drawn) it's clear that some nodes, for example blprnt and pitchinteraciv, are on the border between the groups which suggests they likely have a fair number of cross-group connections.

By looking at the details of the connections and their strengths we can quantify the 'blueness' or 'redness' of any particular node. This indicates how embedded they are within their own group. We can also do this separately for both incoming and outgoing links but I'll keep it simple for now and show one value that reflects both types of links together. This first table shows the top blue accounts (by degree) sorted by how 'blue' they really are.

Degree Blueness %
134 99.03
166 98.5
147 98.39
136 97.51
149 96.78
148 96.38
191 93.69
231 92.76
232 90.46
276 88.57
249 87.18
149 86.99
181 85.62
123 84.42
126 84.18
135 77.7
207 73.75
187 73.19
309 66.23
132 54.73

You can see that feltron, blprnt, eyeofestival, and ben_fry are all tending towards the red which matches what we see in the network graphic where they are on the border. This table below shows how 'blue' the top twitter IDs are that were placed in the red group. Again we see that some accounts had significant linkages to the blue group.

Degree Blueness %
165 35.48
326 24.34
163 18.27
290 18.25
198 17.71
146 15.9
149 14.48
142 11.49
180 10.34
172 7.98
154 7.57
243 7.45
133 6.17
244 5.77
140 4.66
239 2.46
199 1.44
138 1.36
204 0.8
163 0.44

Older Posts...