This is the fifth post in a series looking at word usage by time of day in tweets. The first four posts are useful background material if you haven't read them yet.
If you look at the time series for the top ten words, you will notice that many of them have a very similar shape. The curves for 'lol', 'new', 'time', 'love', 'know', 'great', and 'twitter' all seem to peak around 1-2am, drop to a low point between 3am and 5am, and rise gradually during the day. Why should the curves for these words be related? Do lots of people write tweets that use these words together? Or is there some special temporal relationship between these words?
The answer is much simpler. One of my readers, Kyle McDonald, posed an interesting question: is tweet density (tweets over time) relatively constant throughout the day? The data I'm using here all comes from Toronto. Because it is a single location, it all falls in a single time zone, which matters when looking at the time of day that words were used. If we look at the curve for the number of tweets by time of day for this data, we get this:
So, no, the tweet density is not relatively constant throughout the day for a specific location. This curve is very similar to the common shape we see for the set of words listed above. The counts for these words are basically just tracking the number of tweets. Or, in Kyle's terms, the word count density over time is just tracking the tweet density. So the interesting features in the curve for the word 'love' seem to arise because more tweets are getting sent out during those times of day and are not due to any special temporal property of the word itself.
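To make the comparison concrete, here is a minimal sketch of how the two raw curves could be computed. It assumes hourly bins and a list of (local timestamp, text) pairs. The sample tweets and the function names tweets_per_hour and word_per_hour are hypothetical and are not taken from the code behind these posts.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample data: (local timestamp, tweet text) pairs.
tweets = [
    (datetime(2010, 3, 1, 1, 15), "lol that was great"),
    (datetime(2010, 3, 1, 1, 40), "love this new song"),
    (datetime(2010, 3, 1, 4, 5), "cannot sleep"),
    (datetime(2010, 3, 1, 13, 30), "lunch time in toronto"),
    (datetime(2010, 3, 1, 13, 45), "love the weather today"),
]

def tweets_per_hour(tweets):
    """Number of tweets in each of the 24 hours of the day."""
    counts = Counter(ts.hour for ts, _ in tweets)
    return [counts.get(hour, 0) for hour in range(24)]

def word_per_hour(tweets, word):
    """Number of tweets per hour that contain `word` as a whole token."""
    counts = Counter(
        ts.hour for ts, text in tweets if word in text.lower().split()
    )
    return [counts.get(hour, 0) for hour in range(24)]

density = tweets_per_hour(tweets)
love = word_per_hour(tweets, "love")
```

If a word's curve rises and falls in step with the tweet curve, as in the plots discussed here, the word count is mostly reflecting how many tweets were sent in each hour.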
Kyle goes on to suggest that it would be really helpful to see the same plots normalized by tweet density. Here are the normalized curves for the same set as above:
Many of these normalized plots are basically flat except for noise. Those for 'new', 'time', 'know', and 'twitter' seem to show no special relationship with time beyond the simple fact that more tweets occur in total during certain periods. Several of the other words still show strong peaks: 'lol', 'day', and 'today', for example. The series for 'toronto' now has a jagged set of peaks just before 6am that were not apparent in the raw time series shown in blue. This technique does indeed appear to be useful for highlighting words that are used preferentially at certain times of day.
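The normalization can be sketched as a per-bin division of the word count by the tweet count. This is one plausible reading of "normalized by tweet density", not necessarily the exact calculation behind the plots. The counts and the function name normalize_by_density are hypothetical.

```python
def normalize_by_density(word_counts, tweet_counts):
    """Fraction of tweets in each time bin that contain the word.

    Bins with no tweets give None rather than dividing by zero.
    """
    return [
        word / total if total else None
        for word, total in zip(word_counts, tweet_counts)
    ]

# Hypothetical counts for four time bins.
tweet_counts = [200, 50, 100, 400]
flat_word = [20, 5, 10, 40]    # rises and falls with tweet volume
peaky_word = [4, 1, 20, 8]     # over-represented in the third bin

flat_curve = normalize_by_density(flat_word, tweet_counts)
peaky_curve = normalize_by_density(peaky_word, tweet_counts)
```

In this toy data the first word comes out flat (one tweet in ten in every bin), while the second keeps a peak in the third bin. That is the same distinction seen above between words like 'new' and words like 'lol'.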