MySQL using SELECT IN (x, y, z) vs one at a time

Finally deployed the Zentester heatmapping rewrite today and wanted to share three things I learnt along the way about optimization.

  1. Use “WHERE IN ()” instead of “WHERE =” wherever possible – The biggest thing I discovered was that when selecting individual rows from a database by doing something like:
SELECT * FROM database WHERE id = 5
SELECT * FROM database WHERE id = 6

it’s much, much faster to select them all at once rather than one row at a time. How much faster? In one section of the heatmapping I was grabbing visitor ids by IP address from the visitors table like so:

SELECT id FROM visitors WHERE ipaddress LIKE 'xxx.xxx.xxx.xxx'

The ipaddress column was a primary key, but even so, every one of these queries took approximately 0.1 seconds. I combined them all into one huge SQL query looking something like this:

SELECT id FROM visitors WHERE ipaddress IN ('xxx.xxx.xxx.xxx', 'yyy.yyy.yyy.yyy', 'zzz.zzz.zzz.zzz')

Loading 1000 different rows this way took just 0.7 seconds, roughly a 150x speed increase over the old approach.
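In practice you don’t want to paste raw values into the IN () list. A minimal sketch of building that query safely with placeholders (assuming PDO; the table and column names follow the example above):

```php
<?php
// Sketch: build one parameterized IN () query instead of N single-row
// queries. One "?" placeholder per value keeps it safe from injection.
function buildInQuery(string $table, string $column, array $values): string {
    $placeholders = implode(', ', array_fill(0, count($values), '?'));
    return "SELECT id FROM {$table} WHERE {$column} IN ({$placeholders})";
}

$ips = ['203.0.113.1', '203.0.113.2', '203.0.113.3'];
$sql = buildInQuery('visitors', 'ipaddress', $ips);
// With a real PDO handle:
// $stmt = $pdo->prepare($sql);
// $stmt->execute($ips);
echo $sql, "\n"; // SELECT id FROM visitors WHERE ipaddress IN (?, ?, ?)
```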

  2. PHP arrays are really fast – I used to think PHP associative arrays were slow and that it’d be better to just grab everything from MySQL one row at a time rather than caching it in PHP. Oh how wrong I was. I changed the overall structure of my data processing from the following (pseudocode):
while (data) {
$visitorData = SELECT * FROM visitors WHERE ipaddress = data->ipaddress
$pageData = SELECT * FROM pagedata WHERE page = data->page
$siteData = SELECT * FROM sitedata WHERE site = data->site
//Process and insert data into another database
}

To this:

while (data) {
$ipAddressArray[] = $data->ipaddress;
$pageArray[] = $data->page;
$siteArray[] = $data->site;
}

$ipData = SELECT * FROM visitors WHERE ipaddress IN ($ipAddressArray) 
while ($ipData) {
$ipAddressMap[$ipData->ipaddress] = $ipData->visitorid;
}

$pageData = SELECT * FROM pages WHERE page IN ($pageArray)
while ($pageData) {
$pageMap[$pageData->page] = $pageData->id;
}

$siteData = SELECT * FROM sites WHERE site IN ($siteArray)
while ($siteData) {
$siteMap[$siteData->site] = $siteData->id;
}

while (data) {
//process data using the maps instead of constantly querying the database
}

Instead of looping through all the data and processing it one row at a time, it figures out up front every piece of data that’s going to be needed from the database and grabs it in just a few queries.

All of this data is then stored in associative PHP arrays (which hash their keys, so a key can be literally anything; in one case IP addresses were used as keys). Everything is then processed by pulling data out of these associative arrays rather than querying the database every single time.
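The map-building step above can be sketched like this (the rows are hard-coded stand-ins for a fetched result set; column names follow the pseudocode):

```php
<?php
// Sketch: cache lookup rows in an associative array keyed by IP address,
// then resolve visitor ids from the array instead of re-querying MySQL.
$rows = [
    ['ipaddress' => '203.0.113.1', 'visitorid' => 42],
    ['ipaddress' => '203.0.113.2', 'visitorid' => 57],
];

// PHP hashes string keys, so an IP address works fine as an array key.
$ipAddressMap = [];
foreach ($rows as $row) {
    $ipAddressMap[$row['ipaddress']] = $row['visitorid'];
}

// Constant-time lookups during processing, no database round-trip per row.
echo $ipAddressMap['203.0.113.2'], "\n"; // prints 57
```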

I’m processing approximately 10,000 rows of data at a time, and after implementing this change processing went from taking 5–10 minutes down to 10–15 seconds.

  3. Batch your inserts – I also got a small speed increase by combining all my inserts into one large statement, though don’t make the statement too large or MySQL will run out of memory when you run it (I found about 5000 rows per insert works best for me). Batching the inserts took the insert time from about 60 seconds down to about 10, so it wasn’t as dramatic an increase as the batched SELECT, but it’s noticeable, especially when you’re dealing with lots of data.
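A minimal sketch of the batching idea: build one multi-row INSERT per chunk, capping the chunk size so the statement stays manageable. The table and column names here are illustrative, and `array_chunk` does the splitting:

```php
<?php
// Sketch: batch rows into multi-row INSERT statements, capped per batch
// so MySQL doesn't run out of memory on one giant statement.
function buildBatchInsert(string $table, array $columns, array $rows): string {
    $cols = implode(', ', $columns);
    // One "(?, ?, ...)" tuple of placeholders per row.
    $tuple = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $tuples = implode(', ', array_fill(0, count($rows), $tuple));
    return "INSERT INTO {$table} ({$cols}) VALUES {$tuples}";
}

$rows = [[1, '/home'], [2, '/about'], [3, '/contact']];
foreach (array_chunk($rows, 2) as $batch) {   // ~5000 per batch in practice
    $sql = buildBatchInsert('pagedata', ['id', 'page'], $batch);
    // With a real PDO handle:
    // $pdo->prepare($sql)->execute(array_merge(...$batch));
    echo $sql, "\n";
}
```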