For the last few years, I’d wanted to write a spider that would crawl the web and gather some statistical information. Specifically, I wanted to see how many sites reported being capable of serving PHP pages. Finally, about two hours ago, I sat down to write the system that would collect this data, and within about 25 minutes, I had it doing my dirty work for me. After I finished writing the core of the system, I then added a quick reporting tool to give me real-time statistics, because I was far too impatient to wait a few days, weeks, or even months before I could look at some numbers.
Keep in mind the following:
- The spider gathers hostnames dynamically; it started by spidering the Google homepage (not search results), and went from there.
- The recursion algorithm is not only imperfect, it's also impatient: if a site doesn't respond within 5 seconds, it's skipped (which is uncommon).
- Only homepages (indexes) are spidered. This is still very effective, but will limit the results.
- The recursion algorithm skips /^www[0-9]?/ subdomains, but will crawl others.
- The system does not care if one site resides on the same physical or virtual server as any of the others.
- The stats collection engine only gathers what the server is configured to tell it.
- We only check the HTTP headers; we do not spider each site's pages looking for links to PHP scripts.
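The subdomain rule above amounts to a one-line filter. Here's a rough sketch in Python (the actual spider's code isn't shown in this post, and the exact anchoring of the pattern against the hostname's first label is my assumption):

```python
import re

# Assumed interpretation of the /^www[0-9]?/ rule: skip hosts whose
# leading DNS label is "www", "www0" ... "www9", crawl everything else.
WWW_RE = re.compile(r"^www[0-9]?$")

def should_crawl(hostname):
    """Return True if this subdomain should be recursed into."""
    first_label = hostname.split(".", 1)[0]
    return not WWW_RE.match(first_label)
```

So `blog.example.com` gets crawled, while `www.example.com` and `www2.example.com` are treated as already covered.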
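And the headers-only check is not much more than this. A minimal sketch, again with assumptions flagged: `check_host` and `advertises_php` are illustrative names, and the `X-Powered-By` / `Server` headers are simply the usual places PHP shows up, not a claim about what my collector inspects:

```python
import urllib.request

def advertises_php(headers):
    """Does a mapping of response headers advertise PHP anywhere?"""
    blob = " ".join(headers.get(k, "") for k in ("X-Powered-By", "Server"))
    return "php" in blob.lower()

def check_host(hostname, timeout=5.0):
    """HEAD only the homepage; None means the host was too slow or down."""
    req = urllib.request.Request("http://%s/" % hostname, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return advertises_php(resp.headers)
    except OSError:
        return None  # impatient, just like the spider: skip and move on
```

Note the obvious caveat from the list above: a server configured to suppress these headers looks like a non-PHP host to this kind of check.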
This initial crawl will probably take at least a couple of weeks to reach some useful numbers. Spidering sites to grab new URLs, check the database, ignore dupes, and recurse into the index to do it all again takes an average of 1.7 seconds per hostname. Grabbing header data and recording statistics eats up about 3.0 seconds per host, and the collector only works through hosts in batches each run.
Since all you likely care about, much like me, are the numbers themselves, you can find them here. (Statistical data is no longer available.) The text file will be updated there every five minutes. If you're interested in the latest figures and that file seems stale (i.e., more than 10 minutes old), send me an email and I'll go kick the crap out of the server and remind it who's boss (until it becomes sentient, that is).
Over the coming days and weeks, I'll spend a few more minutes here and there improving the collection times, but the data itself is obviously what matters most in this project. For now, all I care about is PHP market saturation, but I'm also collecting data on rival languages, web server software, cookies, page caching, OS and version information, and more. Sure, other folks collect the same kind of information, but look at me… do I look like them? Then shut up.