The Gumbaby Project recently passed its one-year anniversary. To celebrate, here are a few visualizations of the data I've gathered: some made in Wordle, and some using Google Maps.
The Wordle clouds (the word art above the maps) were created from some of the longest comment threads on sites I've covered. They represent the most common words in each thread, excluding prepositions, conjunctions, and other common words; larger words appear more often than smaller ones. I also removed a few words that showed up inordinately often for mechanical reasons: the names of months and the phrase "posted by," which blogs generate alongside every post.
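For the curious, the frequency counting behind a cloud like this can be sketched in a few lines of Python. (Wordle did the real work here; this is just an illustration of the filtering described above, with a made-up snippet of thread text and a deliberately tiny stopword list.)

```python
import re
from collections import Counter

# Common words to drop: prepositions, conjunctions, and the mechanical
# blog boilerplate mentioned above ("posted by", month names).
# A real stopword list would be much longer; this is a sample.
STOPWORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "on",
             "posted", "by", "january", "february", "march", "april", "may",
             "june", "july", "august", "september", "october", "november",
             "december"}

def word_frequencies(thread_text):
    """Count how often each word appears in a thread, ignoring stopwords."""
    words = re.findall(r"[a-z']+", thread_text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

# A larger count would be rendered as a larger word in the cloud.
freqs = word_frequencies("Posted by Ann in May: cancel my account. Cancel it!")
print(freqs.most_common(2))  # → [('cancel', 2), ('ann', 1)]
```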
I really like representing the threads this way. Word frequency in these cases reveals the basic message conveyed by a majority of commenters. Of course, I did not separate "stranger" commenters from "blog reader" commenters (as I've started to think of the two camps), so the messages are a little mixed; in the cancel-AOL thread, for example, the word "stupid" shows up more than it might if I were only looking at people who wanted their accounts canceled.
Now on to the maps. What you're seeing here is 1) a Western Hemisphere dataset containing only reports with at least state- or province-level specificity, and 2) an international dataset containing everything else: commenters who gave only their nationality (including the US and Canada) and commenters who gave street, city, or province addresses outside the Western Hemisphere dataset.
If you click the individual points, you will see the most specific information available from the commenter's report. (In cases where commenters have left street addresses, which happens alarmingly often, I have included only their ZIP code, to try to protect anonymity they may not have known they wanted.) Keep an eye out for points labelled with just a country or state name: because the underlying data was vague, these are not very accurate spatially, so they may not give a clear sense of pattern. (This is one of the reasons I separated the US/Canadian and international data. I probably should have separated city/ZIP and state/province data, too.)
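Reducing a reported street address to just its ZIP code, as described above, is easy to do mechanically. Here's a hedged sketch (the regex and function name are my own illustration, not the actual processing I used):

```python
import re

def anonymize_us_address(address):
    """Keep only the 5-digit ZIP code from a reported US street address,
    so the map point can't identify an individual commenter."""
    # Match a 5-digit ZIP, optionally with a +4 extension, and keep
    # only the 5-digit part.
    match = re.search(r"\b(\d{5})(?:-\d{4})?\b", address)
    return match.group(1) if match else None

print(anonymize_us_address("123 Elm St, Springfield, IL 62704"))  # → 62704
print(anonymize_us_address("somewhere in Ohio"))                  # → None
```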
Both are overlaid on a helpful world population density map I found. I have also had fun overlaying this data on Webfoot's US Census data map, so I suggest finding that map (click "Browse The Directory") and trying it out with my own data at a close-ish zoom level.
Why am I picking these particular overlays? Well, bloggers and their readers often cast aspersions on the strangers leaving errant comments, suggesting that they are "backwoods hicks," or from Republican-voting states, and so on. These overlays suggest otherwise. For the most part, the distribution of commenters within the US basically follows population density patterns, with few exceptions; the international distribution seems to follow patterns of technological development and English-language use. So the population data suggests that what bloggers are saying is not necessarily true. It also suggests that if Internet literacy education is needed, it's needed pretty much everywhere, not just in rural areas where adoption of the technology may be more recent.
I should note that in my hamfisted haste to get this data out, I did NOT separate out the “blog reader” and “blogger” data from the “stranger” data. So this is not entirely clean data about gumbaby “strangers,” and I haven’t indicated here which data is from strangers and which from readers. However, I can tell you that strangers were much more likely to include their geographic information than bloggers and readers were, so I don’t feel this has a huge impact.
* * *
Here’s additional fine print about the dataset, for reasons of academic accuracy — it probably won’t be of interest to most people, but I want to be sure I’m clear for those who do care.
The data on the maps was coded from the original blog comment threads in AtlasTI. Every time a commenter mentioned his or her location or nationality, it was coded as “geographic location.” Some country codes in email addresses were included, when they did not conflict with other information in the post. Repeat commenters were coded only once.
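Inferring a country from an email address's country-code domain, as mentioned above, works roughly like this. (This is a hypothetical illustration: the mapping is a small sample, and the function name is my own; ATLAS-style qualitative coding was done by hand, not by script.)

```python
# Sample mapping of country-code top-level domains to countries.
# A complete table would cover all ccTLDs.
CC_TLDS = {"uk": "United Kingdom", "de": "Germany", "au": "Australia",
           "ca": "Canada", "fr": "France"}

def country_from_email(email):
    """Return a country name if the email ends in a known country-code TLD,
    otherwise None (e.g. for generic TLDs like .com)."""
    tld = email.rsplit(".", 1)[-1].lower()
    return CC_TLDS.get(tld)

print(country_from_email("someone@example.co.uk"))  # → United Kingdom
print(country_from_email("someone@example.com"))    # → None
```

As the text notes, a code like this would only be trusted when it didn't conflict with other information in the comment itself.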
These data come from my dissertation data set, so they don’t include data from all of the posts I’ve put up here at Gumbaby. At the moment, here’s how I’m defining comment threads to be included in my dissertation:
- they’re on blogs, not on forums, social networking sites, listserv archives, guestbooks, etc.;
- bloggers or their readers have identified "strangers" as "doing something wrong";
- that identification has been made within the page itself, not just by linking elsewhere or by personal communication to me;
- the topic is within the realm of popular knowledge and not overly technical (which excludes most of the “plz send me teh codez” threads).
So that’s the data included here; if you really, really want a list of the sites under consideration, I guess I can send it to you. I should also note that, because of technical difficulties, a few major threads that *should* count here were *not* included in this round. The Overhaulin’ thread, in particular, was corrupted by my coding software, and I couldn’t get to it before I packed up this data and went on vacation. I expect that thread to include a huge amount of additional North American data, which may in fact skew rural or suburban rather than urban. (The file magically uncorrupted itself — o thanx, AtlasTI >:-P — so it should eventually be included in the map above along with the others left out.)
Unfortunately, skew in general is a problem with this dataset. The blogs I’ve covered were gathered through a “snowball sample,” by referral, rather than through a methodical search of a large number of blogs. Many were found via links from MetaFilter or were sent in by very Internet-savvy people, many of them developers or networkers. Not to mention that some threads have garnered far more comments than others: some run to the hundreds, while others have only two or three. This is likely to over-represent certain groups; for example, the thread where women are trying to sell their wedding dresses is quite long.
As a result, it can’t really be said that trends in this geographic data are indicative of “Internet illiterates” as a whole. The cool thing, though, is that the data following basic population density patterns says something even if the sample is skewed: it might indicate that the sample I’m looking at turns out to be pretty random after all. I have to check with my profs on whether this is a reasonable conclusion.
Because unfortunately, other aggregate demographic data I’ve pulled seems to have dramatic patterns. Namely, the “strangers” overwhelmingly identify as female. So I’m not sure how random this dataset really is, even if the geography pans out evenly. The wedding dress thread and the Maury Povich thread may be skewing the gender data a lot — both are very long, and most strangers commenting on them are female. Not to mention that it seems a majority of the *bloggers* and their readers identify as male. That could indicate that using blogs and commenting are somehow gendered behavior, whether it’s a matter of language-use patterns or inequalities in technology skills. Blog “insiders” could simply be attempting to enforce masculine online behavior among strangers.
There ya are. More to come, as time and data permit.