PHP Saturation Survey
Posted by Dan in Uncategorized on 5 January, 2012
For the last few years, I’d wanted to write a spider that would crawl the web and gather some statistical information. Specifically, I wanted to see how many sites reported being capable of serving PHP pages. Finally, about two hours ago, I sat down to write the system that would collect this data, and within about 25 minutes, I had it doing my dirty work for me. After I finished writing the core of the system, I then added a quick reporting tool to give me real-time statistics, because I was far too impatient to wait a few days, weeks, or even months before I could look at some numbers.
Keep in mind the following:
- The spider gathers hostnames dynamically; it started by spidering the Google homepage (not search results), and went from there.
- The recursion algorithm is not only imperfect — it’s impatient. If a site doesn’t respond within 5 seconds, it’s omitted (uncommon).
- Only homepages (indexes) are spidered. This is still very effective, but will limit the results.
- The recursion algorithm skips /^www[0-9]?/ subdomains, but will crawl others.
- The system does not care if one site resides on the same physical or virtual server as any of the others.
- The stats collection engine only gathers what the server is configured to tell it.
- We only check the HTTP headers; we do not spider the site itself to see if it links to any PHP scripts within itself.
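The rules in the list above can be sketched in a few lines of Python. This is not the spider's actual code, which isn't shown here. The function names are made up for illustration, the use of a HEAD request is an assumption, and so is the idea that PHP shows up in the `X-Powered-By` or `Server` response header (which is where it usually appears when the server is configured to reveal it).

```python
import http.client
import re

# The skip rule from the list, applied to the start of the hostname.
WWW_PATTERN = re.compile(r"^www[0-9]?")

# Assumption: PHP announces itself in one of these response headers.
PHP_HEADERS = ("x-powered-by", "server")


def is_skipped(hostname):
    """True for hostnames such as www.example.com or www2.example.com."""
    return WWW_PATTERN.match(hostname.lower()) is not None


def fetch_headers(hostname, timeout=5.0):
    """Request only the homepage headers; return a dict, or None on failure."""
    conn = http.client.HTTPConnection(hostname, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        return dict(conn.getresponse().getheaders())
    except (OSError, http.client.HTTPException):
        return None  # hosts that are slow or unreachable are simply omitted
    finally:
        conn.close()


def reports_php(headers):
    """True if the headers advertise PHP (only what the server chooses to tell us)."""
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    return any("php" in lowered.get(name, "") for name in PHP_HEADERS)
```

A server that hides its version strings will be counted as "no PHP" by a check like this, which is exactly the limitation noted in the list.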
This initial crawl will probably take at least a couple of weeks to reach some useful numbers. Spidering sites to grab new URLs, check the database, ignore dupes, and recurse into the index to do it all again takes an average of 1.7 seconds per hostname. Grabbing header data and recording statistics eats up about 3.0 seconds per host, and only works in chunks of hosts per run.
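For a rough sense of scale, the two per-host timings can be turned into a duration estimate. This sketch assumes both phases run strictly one host at a time with no overlap, and the host count used below is a made-up figure, not a number from the crawl.

```python
SPIDER_SECONDS_PER_HOST = 1.7  # spidering, dupe checks, and recursion
HEADER_SECONDS_PER_HOST = 3.0  # grabbing headers and recording statistics
SECONDS_PER_DAY = 86400


def crawl_days(host_count):
    """Estimated days to process host_count hosts sequentially."""
    total_seconds = host_count * (SPIDER_SECONDS_PER_HOST + HEADER_SECONDS_PER_HOST)
    return total_seconds / SECONDS_PER_DAY


# Hypothetical example: a quarter of a million hostnames
print(round(crawl_days(250_000), 1))  # 13.6
```

Under those assumptions, two weeks of crawling covers roughly a quarter of a million hosts.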
Since all you likely care about – much like myself – are the numbers themselves, you can find them here. (Statistical data is no longer available.) The text file will be updated there every five minutes. If you’re interested in the latest figures and that file seems to be stale (i.e. – >10 minutes old), send me an email and I’ll go kick the crap out of the server and remind it who’s boss (until it becomes sentient, that is).
Over the next couple of days and weeks, I’ll spend a few more minutes here and there to improve the collection times, but the actual data itself is quite obviously of the most value in this project. For now, all I care about is PHP market saturation, but I’m also collecting data on rival languages, web server software, cookies, page caching, OS and version information, and more. Sure, other folks collect the same kind of information, but look at me…. do I look like them? Then shut up.
Quick: How Many Potential Key Combinations Exist In 4096-bit Encryption?
Posted by Dan in Uncategorized on 12 September, 2011
Bash ‘for’ Loop and Filenames With Spaces
Posted by Dan in *NIX, Bash Scripting, Computer Science on 12 August, 2011
A quick post for my own future reference, primarily.
After banging my face off the desk for a while trying to figure out how to batch-convert a heaping spoonful of Excel files with space-laden names to CSV for a project I’m doing for my wife, I found a solution in the $IFS shell variable. Thus:
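#!/bin/bash
# Split on newlines only, so filenames containing spaces stay in one piece
IFSTMP=$IFS
IFS=$(echo -en "\n\b")
for i in $(ls -1 *.xls); do
    # Quote the name so it is passed along as a single argument
    xls2csv "$i" > "$i.csv" 2>/dev/null
done
# Put the original separators back
IFS=$IFSTMP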
Easy as pie. And I don’t mean like a mince pie or something that many folks don’t even like, but like convincing a toddler to eat a chocolate cream pie. Yeah, that easy.
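Why the $IFS change helps: by default, the shell splits the output of $(ls -1 *.xls) on spaces, tabs, and newlines, so a name with a space in it turns into two words. The Python sketch below only imitates that splitting to show the difference. It is a simplified model, not how Bash is implemented, and the filenames are made up.

```python
def split_fields(text, separators=" \t\n"):
    """Loosely imitate shell word splitting on the given separator characters."""
    for char in separators:
        text = text.replace(char, "\n")
    return [field for field in text.split("\n") if field]


listing = "Q1 report.xls\nbudget 2011.xls\n"

print(split_fields(listing))          # ['Q1', 'report.xls', 'budget', '2011.xls']
print(split_fields(listing, "\n\b"))  # ['Q1 report.xls', 'budget 2011.xls']
```

The trailing \b in the script is there for a reason: command substitution strips trailing newlines, so $(echo -en "\n") alone would come back empty. Following the newline with a backspace keeps it in place.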
Phishers Are Getting Lazy
Posted by Dan in Random Moments of "Dude, What the @#$%?!?" on 28 June, 2011
Back in the early to mid-1990s, access to email really was free, though the general Internet was rather expensive. AOL would charge $9.95 US per month for up to five hours in that month (no rollover), and then $2.95 US per additional hour[1]. CompuServe, which I preferred over AOHell at the time, was a whopping $12.80 US per hour for basic and $22.80 US per hour for web access. They then generously reduced their cost to just $8 US and $16 US per hour, respectively[2]. My favorite ISP of the time, however, was Prodigy, which was a mere $9.95 US per month for up to five non-rollover hours, then $2.95 US per additional hour[3]. That’s without remembering, of course, that some new providers like MSN were coming out with flat-rate plans. And the best part was that – remember this?!? – Internet access was like old home phone plans. Nights and weekends were cheaper, while business hours were considered “peak times” and would cost more. And that’s not including anything about the speed of your modem and how those rates were considered “premium” rates and were also billed accordingly.
Price didn’t matter, though. I was a teenager without a credit card, and my mother thought the Internet was too dangerous. Thankfully, many of the dial-up ISPs of the time were all too willing to offer free trials, and they did a poor job of keeping track of who tried what, when. You couldn’t keep an always-on connection, but if you spread it out among several, you could connect for a good 5-10 hours per week.
Obviously, I’ve seriously digressed here. Far enough from the opening statement of free email, and a universe apart from the title of this rant. Be patient. I’m coming full-circle in a few moments, starting with the free email bit.
What The Heck Is World Environment Day?
Posted by Dan in PHP, Programming, Random Cool Stuff, Random Moments of "Dude, What the @#$%?!?" on 23 June, 2011
Last Tuesday evening, with a large party/barbecue planned for the then-upcoming weekend, I found myself swamped once again with a combination of client projects and work needing to be done around the house. Wishing I’d used my time more wisely that and the previous day, I had come up with an idea for something I was going to use as an internal organization system and self-motivator for completing more projects per week. I’d hoped that, in the future (as it was too late at the time), I’d stay better focused. My idea: a hyped-up TODO list for the week that needs to be 75% complete by each Wednesday, affording more time for spending with the family on the weekend. I decided to call it (rather obviously) the Wednesday Challenge.
Because I’m a geek, I started building it out (in PHP, natch), with hooks to my office computer to display alerts on my desktop, SMS for text alerts throughout the day, and of course, email. Eventually, I hoped, I’d hook it to the API for my Google Calendar as well. The plan was, with enough reminders throughout the day, I’d be less susceptible to distractions leading to a complete derailing of my originally-planned schedule. Obviously, I needed a domain name, too, because – who knows? – maybe friends or family would want to use the system, too. So I opted for a simple version of my uninspired project name: wedchallenge.com. I registered it that Tuesday night.
Enter actor Don Cheadle and fashion model Gisele.
I had noticed that I was getting a bunch of traffic on the domain, so I checked to see if perhaps the domain name had been a drop — that is, a domain which had once been registered, but then expired. Oddly enough, it all seemed to be not only organic, but without referrals. That meant it was being typed directly into the browser. I puzzled over it for many, many seconds before I got sidetracked on something else.
Then, one evening, while preparing dinner, I happened to see a few commercials, thanks to the useless and skip-laden DVRs we have with DirecTV (that’s another story altogether). One commercial caught my eye, not because of the content or the actors, but the domain: wedchallenge.org. At first glance, I thought it was my own domain. Then it clicked: that’s why I’m getting hits, and that’s exactly why it’s all type-in traffic.
The wedchallenge.org site has nothing to do with my little project. Instead, theirs is actually the WED Challenge (World Environment Day Challenge), a campaign launched by the United Nations Environment Programme. Apparently, it was held on 5 June, 2011, but ads were still being run on TV at least as recently as the 16th or 17th of the month.
So while I can’t offer any information, advice, or anything at all with regard to the WED Challenge, I can tell you that, if you came in here via wedchallenge.com, you should instead go to wedchallenge.org. And while I find it a really odd coincidence, you can accept my half-hearted apology, I suppose, since you’re apparently not just all interested in my idea. ;-P
Announcing the Release of the System Detonation Library for PHP
Posted by Dan in *NIX, Computer Science, Open Source Technology, PHP, Programming, Random Cool Stuff, Random Moments of "Dude, What the @#$%?!?", Server Administration on 3 June, 2011
As discussed somewhat at length in a rapidly-devolving thread on the PHP General mailing list, I am in favor of a function that, when called, will initiate on the host system a self-destruct sequence. Well, being a nice, sunny, spring Friday morning, I decided to offer just that:
Introducing the first public release of the System Detonation Library for PHP.
This useless extension provides one function with one purpose: to cause your server to explode. Due to the obvious hazards involved, including (but not limited to) loss of hardware, limbs, and potentially life and liberty, this has only been tested on one single occasion, using a PC with Ubuntu 10.10 and a heavily-modified SVN version of PHP 5.3.6. Thankfully, as the test was successful, there were no serious injuries.
First, you may download the package here.
Second, as a very basic course on the compilation and installation of this unofficial PHP extension, here are some simple instructions for Linux users. All others are on their own, and this may (read: probably will) not work anyway…. which is a shame, because I know plenty of Windows boxes that should have the right to self-destruct as well.
1. Download the package above.
2. Extract it: tar -zxf detonate-0.2.tar.gz
3. Change to the newly-created directory where the files are located: cd detonate-0.2/
4. Build the wrappers for your version of the Zend/PHP API: phpize (NOTE: on Ubuntu-built packages, this command may be: phpize5)
5. Build the necessary makefiles for your system: ./configure --with-detonate
6. Compile the code: make
7. Install the binary (as root, or using sudo): make install
8. Edit your php.ini to load the newly-installed extension by adding this line: extension=detonate.so
9. If you plan to use it via the CLI, you’re done. For use on the web, remember to reload/restart your web server.
10. Create a basic PHP script with the following: <?php detonate(); ?>
11. Check your insurance coverage.
12. Run the script created in Step #10.
And that’s all there is to it. Feel free to install this on all of your systems and use it as a replacement for exit or die() in your scripts. Because, unlike die(), this function will absolutely get the point across, once and for all.
Replacing One Character In A String With A Random Character
Posted by Dan in *NIX, Computer Science, Open Source Technology, PHP, Programming on 24 April, 2011
Just an hour or so ago, Ron Piggot asked a question on the PHP General mailing list. The original question was how he could replace a single matching character in a string containing multiple matches with another random character. I mocked up a working example in about five minutes or so. It’s far from perfect, and not very elegant, but it’ll work as a starting point of reference.
The code I sent back in reply follows:
<?php
$str = '%ECARBME%TIPLUP%%%%%%%E%%';
$chars = 'abcdefghijklmnopqrstuvwxyz';

echo multi_replace($str,'%',$chars).PHP_EOL; // Case-sensitive, random, straight replace
echo multi_replace($str,'%',$chars,0,1,1).PHP_EOL; // Case-insensitive, random, random casing

function multi_replace($str,$targ,$chars,$sens=true,$random=true,$caserand=false) {
    // If we don't want case sensitivity, set the regexp modifier
    $mod = $sens == false || is_null($sens) ? 'i' : '';

    // Escape $targ so any regexp metacharacters in it are matched literally
    $pattern = '/'.preg_quote($targ,'/').'/U'.$mod;

    // Loop through while $targ is still found in $str (using the same
    // pattern as the replace, so case-insensitive matches are not missed)
    while (preg_match($pattern,$str) === 1) {
        // If we're randomizing, pick the character; else the first (or only)
        if ($random) {
            $replace = $chars[mt_rand(0,(strlen($chars) - 1))];
        } else {
            $replace = substr($chars,0,1);
        }

        // If we want random casing, do that; else make no change
        if ($caserand) {
            if (mt_rand(0,1) === 0) {
                $replace = strtolower($replace);
            } else {
                $replace = strtoupper($replace);
            }
        }

        // Perform the match and replace only one character now
        $str = preg_replace($pattern,$replace,$str,1);
    } // End of loop

    // Return the modified string
    return $str;
}
Windows Server Says, “Network Cable Unplugged” When It’s Not?!?
Posted by Dan in Computer Science, Desktop "Administration", Network Technology, Server Administration, Windows on 31 January, 2011
Once again, stuck managing a Windows box. Yeah, I know, I’ll whine, bitch, moan, and cry you a river another time.
The Problem: Using the secondary NIC (PNET/VLAN), I found a lot of packet collisions during negotiation, handshaking, and identification, causing Windows to give up and basically say, “well, since it’s not working, the cable must physically have been removed, because there’s no way I could ever be wrong.”
Wro…. err…. incorrect, Windows. (You’re wrong.)
The Discoveries: The truth was, at least in my case, that it wasn’t properly handling the gigabit capabilities of the card on the box. I’m not the administrator for these machines (though they’re housed in our datacenter), so I can’t be certain that nothing had changed recently, but their staff said nothing at all had been modified. Perhaps that really was the case, and nothing had been changed; Windows has been known to do stranger things than this, of course, sometimes out of the blue.
The Solution (for my case): Go to the screen where you can view your network adapters (your version of Windows dictates the path of navigation, hence the ambiguity). Next, right-click the adapter with the “Network Cable Unplugged” message and click “Properties.” Click the appropriate button to configure the network adapter. Then click the tab on that dialog for “Settings” or something of the like (sorry, but I logged out in a hurry, so this is from memory), and you’ll see a list of parameters on the left, with their values on the right. Find one related to speed and duplex, and if you see it set to “Auto” or similar, drop it to “100Mbps Full Duplex” and click OK. Close the properties dialog by clicking “OK” and see if the settings are already bringing the network adapter back online. If not, disable and re-enable the adapter, and – if it was indeed the same issue – you should be back online within a few seconds.
Distributing php.net’s Synchronization Infrastructure
Posted by Dan in *NIX, Open Source Technology, PHP, Random Cool Stuff, Server Administration on 27 January, 2011
Several days ago, the primary server hosting all of the data comprising the php.net site for synchrony with all of the mirrors around the world became completely inaccessible. Due to security policies with the provider hosting the server, it was some time before we were able to have the machine returned to normal operational status. As a result, network content became stale, and automated tests on the mirrors saw them as outdated and deactivated them. The outage exposed a flaw that, though only an inconvenience this time, has the potential to grow into something more serious – including a sort of self-denial-of-service, if you will, if it went unnoticed for several days and all mirrors were seen as outdated.
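To make the failure mode concrete, here is a minimal Python sketch of a freshness check. It is an illustration only: the real tests and their threshold aren't described here, so the three-day cutoff, the function name, and the mirror names are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical cutoff; the actual threshold used by the mirror tests is not given here.
MAX_AGE = timedelta(days=3)


def active_mirrors(last_updated, now, max_age=MAX_AGE):
    """Return the mirrors whose content is still fresh enough to stay in rotation."""
    return sorted(name for name, stamp in last_updated.items() if now - stamp <= max_age)


# Every mirror pulls from the same primary, so when the primary stalls
# they all stop updating at about the same moment.
stalled_at = datetime(2011, 1, 20, 12, 0)
mirrors = {"mirror-a.example": stalled_at, "mirror-b.example": stalled_at}

print(active_mirrors(mirrors, now=stalled_at + timedelta(days=1)))  # both still active
print(active_mirrors(mirrors, now=stalled_at + timedelta(days=4)))  # []
```

With a single source of truth, a check like this removes every mirror at once, which is the self-denial-of-service scenario described above.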
Mark Scholten from Stream Service, an ISP based in the Netherlands and provider of an official mirror for their countrymen at nl.php.net, offered to put up a second rsync server, which gave me an idea: take the load off the primary server by distributing it across three regions.
(Image: the tri-colored map of the three rsync regions.)
Mark set up the European (EU) box in their Amsterdam datacenter. We (Parasane) had already set up an emergency rsync mirror in case the primary dropped out again, and that machine would be repurposed for the Americas (AA). I also contacted Chris Chan at CommuniLink in Hong Kong for what would become the Asia-Pacific (AP) region. Chris had applied to the official waiting list to become a PHP mirror back in February of 2010.
Compiling data over the course of the last 12 months from mirrors in our network which had it readily available, accurate, and up to date, I drew out a plan for the regions so as to limit the load and stress on each new mirror. Thus, the tri-colored map above. I also learned in the process that we will have served roughly 223 gigabytes of data over HTTP, network-wide, by the end of January, 2011, which averages out to about 1.9GB per mirror, per day, with the 115 active mirrors we have worldwide as of right now.
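A quick check of the averages quoted above, as a Python sketch. Dividing 223 GB by 115 mirrors gives about 1.9 GB per mirror, so the per-mirror figure works out if the total and the average cover the same period. The second calculation shows what the per-day figure would be if 223 GB were spread across all 31 days of January. Which reading applies is not something this sketch can settle.

```python
TOTAL_GB = 223        # network-wide HTTP traffic quoted above
ACTIVE_MIRRORS = 115  # active mirrors quoted above
DAYS_IN_JANUARY = 31


def per_mirror_gb(total_gb, mirrors):
    """Average traffic per mirror over whatever period total_gb covers."""
    return total_gb / mirrors


def per_mirror_per_day_gb(total_gb, mirrors, days):
    """Average daily traffic per mirror if total_gb spans the given number of days."""
    return total_gb / mirrors / days


print(round(per_mirror_gb(TOTAL_GB, ACTIVE_MIRRORS), 1))                           # 1.9
print(round(per_mirror_per_day_gb(TOTAL_GB, ACTIVE_MIRRORS, DAYS_IN_JANUARY), 3))  # 0.063
```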
I’ve set myself an arbitrary deadline of 30 April, 2011, to have all existing official mirrors flipped over to using the rsync server designated for their country. Visitors to php.net should see no difference and should experience no negative impact during the transition, but the benefits will be great: far less likelihood of a mirror being automatically dropped from rotation due to stale content; the ability for maintainers to shorten their synchronization interval to hourly, providing the freshest content as it becomes available; less latency and greater speeds for many of those who are far from the current central rsync server; and far, far less stress on our own network.
The immediate goal is ensuring that there are no snags, and that we can successfully synchronize all of the data to the website mirrors without omission. Beginning right away, I’ll be coordinating privately and directly with a few mirrors around the world to beta test the new layered design according to the rsync distribution plan. By 12 February of this year – a bit more than two weeks from now – I hope (and expect) to have all of the kinks straightened out. After that, we’ll begin migrating the rest of the network in its entirety to the new design.
All new mirrors from that point forward will be instructed to use their local rsync mirror as well, as defined by the map above.
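The country-to-region lookup can be pictured as a pair of small tables. This Python sketch is purely illustrative: the hostnames are placeholders rather than the real rsync servers, and the only country assignments included are the ones implied by where each regional box lives (the full assignment comes from the map, which is not reproduced here).

```python
# Placeholder hostnames; the real regional rsync servers are not named here.
RSYNC_SERVERS = {
    "EU": "rsync-eu.example.org",  # Europe
    "AA": "rsync-aa.example.org",  # Americas
    "AP": "rsync-ap.example.org",  # Asia-Pacific
}

# Illustrative subset only; the full country list is defined by the map.
COUNTRY_REGION = {"nl": "EU", "us": "AA", "hk": "AP"}


def rsync_server_for(country_code):
    """Return the regional rsync server a mirror in this country should use."""
    region = COUNTRY_REGION.get(country_code.lower())
    if region is None:
        raise KeyError(f"no region assigned for country code {country_code!r}")
    return RSYNC_SERVERS[region]


print(rsync_server_for("NL"))  # rsync-eu.example.org
```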
It’s no large task, of course, but I’m hoping that the addition of just three new servers will help to ensure the health and stability of the network as a whole for years to come. While I don’t expect anyone to notice any difference – good or bad – in the user experience, behind the scenes I think we’ll not only see some differences in operations, but also begin to come up with even more ways to improve performance in the future.
Elance “Skills Assessment” Tests — HA!
Posted by Dan in Computer Science, Random Moments of "Dude, What the @#$%?!?" on 28 December, 2010
Sometime in 2000 or 2001, I was asked to create a skills assessment for senior-level relational database management systems experts for a very young Brainbench. I was not alone: the company selected a total of four of us, all given the assignment of coming up with forty-five multiple-choice problems related to the general concepts of RDBMS. It took me about two days to complete the task, and a few weeks later, when the test went into a public beta, folks could take the test and vote on the quality of the questions. The results of the votes were not shown to the public, and we (the consultants who created the questions) were not privy to the voting statistics either. Several weeks after that, the test became official, and I aced it: every one of the forty-five pages was from my packet.
Despite being rather proud of myself for a relatively small accomplishment, I was actually really surprised; the quality of the submissions from the other consultants was all – in my opinion – very, very good. In fact, I felt more as though some of my own paled in comparison. It seemed that all of the other submissions were by folks who really understood the topic thoroughly, and were masters in their field. In fact, I did later learn that I was the only one of the four hired who did not have a computer science degree. Talk about humbling.
Today I decided to take a few skills assessment tests on Elance – a leading online freelance marketplace – on a variety of technical subjects. Included in the ones I took were tests to evaluate one’s comprehension of Linux and Amazon Web Services.
I was disgusted.
The grammar was horrible. The content was filled with fluff and trash. The questions weren’t representative of someone’s working knowledge on the subject; in fact, some even took text from “About Us” sections that described the company, not the service provided. And in the Linux test specifically, there were cases where multiple answers were correct, but only one could be chosen; other times, no answer was technically correct, but a choice had to be selected. The most appalling of it all: questions on obscure, unnecessary things like “Which of the following software packages is used to create a closed-circuit television system?” I had to look that one up after the fact. I didn’t take a test on how to set up a video system, I took a test on Linux skills. I highly doubt the MCSE or MVP tests ask for the steps of motion tweening in Flash.
It was quite obvious that the tests were created by folks with limited knowledge on the subject matter. In fact, it was probably completed – at least in majority – by the lowest bidder, who may very well have been a non-native-English administrative assistant. Hell, nowadays, anyone with Internet access thinks they have the skills and marketability to work as a professional freelancer. Some do…. most – and I really mean MOST – do not. These so-called “skills assessment” tests were proof-positive of that; they’re a joke, and folks serious about testing the skills of others would be ashamed to have them as the representation of their own knowledge on a given subject.
Granted, I can’t speak for all of the tests. There are many available, and on a wide variety of topics. I’m sure that some are much better than others, and that some of those may actually be very good at gauging an individual’s skill on the matter. Now they just need to try to get that same quality across the board.
Because if I can take a test on something of which I admittedly have almost zero knowledge, be more confused by the spelling and sentence structure of almost every single question and option, score a barely-passing 65%, yet still be in the “Top 10%” of all test-takers, something must be wrong.