After waking up today, I said to myself: “Hey. Today seems like a good day to run useless statistics that may just be totally off base.”
Well. Here’s what I did.
I got a copy of the planet.ubuntu config file, and started to work through it. First pass on the script was to yank all the URLs out.
$ HOSTS=`cat config.ini | grep -v "^#.*" | grep "\[.*\]" | tr -d "[" | tr -d "]"`; for x in $HOSTS; do echo $x >> hostnames; done
Fancy. Now, let’s see how many lines we have:
$ cat hostnames | wc -l
404
Badass. This is already looking great. That, and it’s funny.
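For what it’s worth, the same extraction can be sketched as a single sed call. This is only a sketch: list_sections is a made-up helper name, and it assumes every feed sits on its own line as a [URL] section header, with comment lines starting with #.

```shell
# list_sections (hypothetical helper): print the name of every
# [section] header in an INI-style file, skipping "#" comment lines.
list_sections() {
    sed -n -e '/^#/d' \
           -e 's/^\[\(.*\)\][[:space:]]*$/\1/p' "$1"
}

# Usage: one feed URL per line, overwriting any previous run.
# list_sections config.ini > hostnames
```

Writing with > instead of >> means a second run replaces the list rather than doubling it.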
Now that I have a list of the hosts, I wanted to see how many servers self-identified.
$ HOSTS=`cat hostnames`; for x in $HOSTS; do ID=`curl $x | grep "<generator>"`; if [ "x$ID" != "x" ]; then echo "$x $ID" >> positive-ids; fi; done;
After that finished, I checked the result:
$ cat positive-ids | wc -l
253
This is a really really bad way of doing it. I never said it was pretty. More on this later.
Next, I wanted an overview of how many dead hosts there are. Ping won’t work for this ( filtering ping is not only normal, but a good idea ), so I used curl ( again ).
$ HOSTS=`cat hostnames`; for x in $HOSTS; do curl $x > /dev/null; if [ $? -ne 0 ]; then echo "$x" >> errord; fi; done
Well, that ran and the output looked good, so I took a look at it:
$ cat errord | wc -l
11
Great.
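A sketch of the same dead-host check with a timeout, so one hung server can’t stall the whole run. The names fetch and list_dead are made up for illustration. Unlike the one-liner above, --fail makes curl treat HTTP errors ( 404 and friends ) as dead too, which is an assumption about what “dead” should mean.

```shell
# fetch (hypothetical helper): succeed only if the URL answers within
# 10 seconds with a non-error HTTP status. The body is thrown away.
fetch() {
    curl --silent --fail --location --max-time 10 --output /dev/null "$1"
}

# list_dead (hypothetical helper): read URLs on stdin, one per line,
# and print the ones that fetch could not retrieve.
list_dead() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue          # skip blank lines
        fetch "$url" < /dev/null || printf '%s\n' "$url"
    done
}

# Usage:
# list_dead < hostnames > errord
```

Redirecting curl’s stdin from /dev/null keeps it from eating the rest of the URL list inside the loop.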
Now, let’s go back to the positive-ids. I extracted the data from the <generator> tags using a bit of sed-voodoo.
$ sed -n -e 's/.*<generator>\(.*\)<\/generator>.*/\1/p' ./positive-ids > blog-engines
Now, I have a file full of all the blog engines ( or homebrew software ). So, I did a quick check on it.
$ cat blog-engines | wc -l
144
Careful readers will point out that this number is less than the count of my positive ids. Yes, you’re right. My script snagged newlines, so a few lines in positive-ids are “runover” from the one before. The blog-engines count is the good one.
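One way around the runover lines, as a sketch: join each feed into a single line before matching, and allow attributes on the tag. extract_generator is a hypothetical helper name, and it assumes a feed carries at most one generator element with plain text inside ( no CDATA ).

```shell
# extract_generator (hypothetical helper): read one feed on stdin and
# print the text inside its <generator> element, if there is one.
extract_generator() {
    # Fold the feed onto one line so a tag split across lines still matches,
    # then capture the element text without surrounding whitespace.
    tr '\n' ' ' |
        sed -n 's/.*<generator[^>]*>[[:space:]]*\([^<]*[^<[:space:]]\)[[:space:]]*<\/generator>.*/\1/p'
}

# Usage inside the loop over hostnames:
# ID=$(curl -s "$x" | extract_generator)
```

With one feed giving at most one output line, the line count of the results file matches the number of blogs that reported.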
So, now. Let’s figure out what the most popular feed generator is.
$ cat blog-engines | sort | uniq -c | sort -n -r > counts
And the results? Well, I’m getting there!
42 http://wordpress.org/?v=3.0.1
22 http://wordpress.com/
21 http://wordpress.org/?v=2.9.2
8 http://wordpress.org/?v=3.0.2
5 LiveJournal / LiveJournal.com
5 http://wordpress.org/?v=2.8.4
5 Blogger
4 http://wordpress.org/?v=3.0
3 Dotclear
2 Serendipity 1.5.4 - http://www.s9y.org/
2 mod_virgule
2 http://wordpress.org/?v=abc
2 http://wordpress.org/?v=2.9.1
2 http://pipes.yahoo.com/pipes/
2 Apache Roller (incubating) 4.0.1 (20090102102238:dave)
1 TYPO3 - get.content.right
1 Tumblr (3.0; @schwuk)
1 Tumblr (3.0; @paultag)
1 Tumblr (3.0; @castrojo)
1 Tumblr (3.0; @bholtsclaw)
1 Serendipity 1.2 - http://www.s9y.org/
1 PyBlosxom http://pyblosxom.sourceforge.net/ 1.3.2 2/13/2006
1 http://wordpress.org/?v=3.1-beta1-16590
1 http://wordpress.org/?v=3.1-alpha
1 http://wordpress.org/?v=2.9
1 http://wordpress.org/?v=2.8.6
1 http://wordpress.org/?v=2.7.1
1 http://wordpress.org/?v=2.6.5
1 http://wordpress.org/?v=2.5.1
1 http://wordpress.org/?v=2.2.2
1 blosxom 2.1.2+dev
1 blosxom/2.1.2
So, remember. There are 404 total blogs. Let’s come up with some statistics!
Results!
The totals ( combined )
Wordpress: 114
LiveJournal: 5
Blogger: 5
Tumblr: 4
Dotclear: 3
blosxom: 3
Serendipity: 3
Other: 5
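The combined totals fold every version-specific string into one engine name. A sketch of that grouping follows, with combine_engines as a made-up helper name. The matching rules are an assumption ( a case-insensitive substring match per engine ), and anything unmatched lands in Other, so what ends up there depends entirely on these rules.

```shell
# combine_engines (hypothetical helper): read generator strings on stdin
# and print "count engine" lines, largest count first.
combine_engines() {
    awk '
        { s = tolower($0) }
        s ~ /wordpress/   { n["Wordpress"]++;   next }
        s ~ /livejournal/ { n["LiveJournal"]++; next }
        s ~ /blogger/     { n["Blogger"]++;     next }
        s ~ /tumblr/      { n["Tumblr"]++;      next }
        s ~ /dotclear/    { n["Dotclear"]++;    next }
        s ~ /blosxom/     { n["blosxom"]++;     next }
        s ~ /serendipity/ { n["Serendipity"]++; next }
                          { n["Other"]++ }
        END { for (k in n) print n[k], k }
    ' | sort -k1,1nr -k2,2
}

# Usage:
# combine_engines < blog-engines
```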
Percentage-wise:
Of all the blogs, 28.2% reported that they run Wordpress. LiveJournal, Blogger, and the other engines not named in this paragraph ( combined ) each power roughly 1.2% of the blogs on planet.ubuntu. Tumblr has 1.0%. Dotclear, blosxom and Serendipity each power 0.7% of the blogs on planet.ubuntu.
Out of all of the blogs that reported, 80% run Wordpress. LiveJournal, Blogger, and other reporting engines hold 3.5% ( each ). Tumblr holds a solid 2.8% ( myself included ). Dotclear, blosxom and Serendipity hold 2.1% each.
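The percentages are plain division against two denominators: all 404 blogs, and the 142 that reported an engine. A quick sketch for checking them, with pct as a made-up helper name:

```shell
# pct (hypothetical helper): print PART as a percentage of TOTAL,
# rounded to one decimal place.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

pct 114 404   # Wordpress, out of all blogs       -> 28.2
pct 114 142   # Wordpress, out of reporting blogs -> 80.3
pct 5 142     # LiveJournal or Blogger, reporting -> 3.5
pct 11 404    # hosts that threw an error         -> 2.7
```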
Overview
Wordpress is clearly the most popular blog engine that reported. It powers an astounding 80% of the reporting blogs, which is the same as 28.2% of all blogs on planet.ubuntu. My script only identified the feed generator for 142 of the 404 blogs. That’s a mild 35.1% reporting rate.
LiveJournal and Blogger each power roughly 3.5% of the reporting blogs, which corresponds to about 1.2% of the total number of blogs. Sweet.
2.7% of the blogs failed to render a page at the URL set in planet.ubuntu. They have either been deleted, or their domain has expired. All the domains that threw an error were on personal domain names.
I did not see any Drupal strings, so I think there must be a bug somewhere in the code.
I plan to re-write this at some point to be a bit more accurate. For now, I think that’s enough work.