The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Saturday, May 29, 2010

Death by a thousand SQL queries

The Company just got hired to take over the maintenance and development of a mid-sized Web 2.0 social website that's the next Big Thing™ on the Internet. For the past week we've been given access to the source code, set up a new development server and have been basically poking around both the code and the site.

The major problem with the site was performance—loads exceeding 50 were common on both the webserver and database server. The site apparently went live in January and has since grown quickly, straining the existing infrastructure. That's where we come in, to help with “Project: SocialSpace2.0” (running on the ubiquitous LAMP stack).

The site is written with PHP (of course), and one of the cardinal rules of addresssing performance issues is “profile, profile, profile.”—the bottle neck is almost never where you think it is. Now, I've profiled code before, but that was C, not PHP. I'm not even sure where one would begin to profile PHP code. And even if we had access to a PHP profiler, profiling the program on the development server may not be good enough (the development server has maybe half the data of the production server, which may not have the pathological cases the production server might encounter).

So what to do as the load increases on the webserver?

Well, this answer to profiling C++ code gave me an idea. In one window I ran top. In another window a command line. When a particular instance of Apache hit the CPU hard as seen in top, I quickly get a listing of open files in said process (listing the contents of /proc/pid/fd to find the ofending PHP file causing the load spike).

Laugh if you will, but it worked. About half a dozen checks lead to one particular script causing the issue—basically a “people who viewed this profile also viewed these profiles” script.

I checked the code in question and found the following bit of code (in pseudocode, to basically protect me):

for viewers in SELECT userID
	  	FROM people_who_viewed
	  	WHERE profileID = {userid} 
		ORDER BY RAND()
  for viewees in SELECT profileID
	  	FROM people_who_viewed
	  	WHERE userID = {viewers['userID']}
		ORDER BY RAND()
    ...
  end
end

Lovely!

An O(n2) algorithm—in SQL no less!

No wonder the site was dying.

Worse, the site only displayed about 10 results anyway!

A simple patch:

for viewers in SELECT userID
	  	FROM people_who_viewed
	  	WHERE profileID = {userid} 
		ORDER BY RAND() LIMIT 10
  for viewees in SELECT profileID
	  	FROM people_who_viewed
	  	WHERE userID = {viewers['userID']}
		ORDER BY RAND() LIMIT 10
    ...
  end
end

And what do you know? The site is actually usable now.


Alas, poor Clusty! I knew it, Horatio: a search engine of infinite results …

YIPPY is foremost the world's first fully-functioning virtual computer. A cloud-based worldwide LAN, YIPPY has turned every computer into a terminal for itself. On the surface, YIPPY is one-stop shopping for the web surfing needs of the average consumer. YIPPY is an all-inclusive media giant; incorporating television, gaming, news, movies, social networking, streaming radio, office applications, shopping, and much more—all on the fastest internet browser available today.

Yippy » About

I wish I could say the above was a joke (but you should read the rest of the above page purely for its entertainment value—the buzzword quotient on that page is pure comedy gold), but alas, it is not. Some company called Yippy has bought Clusty and turned what used to be my favorite (if occasionally used) search engine into some LSD-induced happy land of conservative values:

Yippy.com, its sub-domains and other web based products (such as but not limited to the Yippy Browser) may censor search results, web domains and IP addresses. That is, Yippy may remove from its output, in an ad-hoc manner, all but not limited to the following:

  1. Politically-oriented propaganda or agendas
  2. Pornographic Material
  3. Gambling content
  4. Sexual products or sites that sell same
  5. Anti-Semitic views or opinions
  6. Anti-Christian views or opinions
  7. Anti-Conservative views or opinions
  8. Anti-Sovereign USA views or opinions
  9. Sites deemed inappropriate for children

Yippy » Censorship

I cannot (even if I may agree with some of the above) in good conscience, endorse such censorship from a search engine, nor if I refuse to use it, force others to use it. Even my own site has gambling content on it so thus I too, could be censored (or at least my site from search results). Not to mention it encourages people to report “questionable content:”

Yippy users are our greatest defense against objectionable material. Should a keyword or website by found that returns this kind of material it may be reported in the CONTACT US tab located on the landing page of the Yippy search engine. Our staff will quickly evaluate all responses and reply back within 24 hours for a resolution notice. We thank you in advance for helping keep Yippy the greatest family friendly destination online today.

Yippy » Terms of Service

Thus, I'm going back to Google, and have removed Clusty (sigh) from my site.

Obligatory Picture

Trying to get into the festive mood this year

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

Obligatory AI Disclaimer

No AI was used in the making of this site, unless otherwise noted.

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.