Archive for April, 2006

Multi-group, multi-join, no-union, no-view mysql query

April 24th, 2006 by daryl

Chances are than unless you googled to find this post, both the title and the post will be pretty much nonsense. If so, don’t continue reading on my account. There are no hidden gems unless you’re looking for an answer to a mysql question.

So, my dilemma was that in a Drupal module I’m writing, I was asked to provide data sorting that requires groupings on multiple columns in different tables. Three tables are involved. Table “node” holds content; “votes” holds ratings users have given nodes; and “extension_dowloads” holds download counts for nodes of a given type. The challenge was to get average votes, vote counts, download counts, and node information for all nodes of a given type, sortable on each of these columns the aggregation of which requires grouping.

There are several possible approaches. The least favorite to my mind was post-processing. That is, I could have gotten all node info, all download info, and all voting info, tossed it into a multi-dimensional array and sorted on various keys in the array. Downsides to this approach include the fact that I have to retrieve all potentially relevant data from the database (rather than just the subset I need), and then there’s processing time to do the sorting. I can see this scaling terribly. It also keeps me from using the nice hooks Drupal has in place for paged queries, sorting, etc., and that just seems like flagrant waste.

Another approach is trying to do some complex multi-join table, but that turned out to have weird results. You have to use a left outer join in order to get nodes that have neither votes nor downloads, but grouping on the two satellite tables as you must in order to get counts and sums out of them throws the numbers off. If you group by multiple columns, you get redundant data (ie, nodes listed twice, once for the match in the downloads table and once for the match in the votes table). If you group by just the common key, strange things happen as well. In some cases, something like an unexpected multiplication (across columns!) seemed to be happening as a result of grouping. I never quite figured that one out. In any case, I couldn’t find a single multi-join query that would work.

Yet another approach is to try to use a union. A union lets you daisy-chain the results of multiple queries together. Column names and types are taken from the first query in the union, and as long as there are no major incompatibilities (e.g. different numbers of columns in subsequent queries), the results of the queries are dumped out as one result set. I was hoping I could use a union to combine output from the download count query and the vote calculation query, but a union returns distinct rows. If I selected NULL for the download count in the voting query and NULL for the voting calculation columns in the download count query, I did get back all the results I wanted, but there were two rows of data per content node because rows were not unique. This would require further post-processing.

And finally, I considered using either temporary tables (select both result sets into temporary tables and then join the two to the nodes table to select the results I wanted) or views. Temporary tables seem messy and inefficient, so I was reluctant to use those, and views require a more recent version of mysql than I’m using, so that was out.

Now on to the solution. I had heretofore used sub-selects only to get scalar lists of ids to select from: “SELECT foo from bar WHERE id IN (SELECT id from other_table)”. It turns out that you can get whole result sets from sub-selects. What you’re doing in this case is in essence to select some results and define them as a table to select from within your wrapping query. (Really, I guess it’s a temporary table scenario, though it seems less hacky and possibly somehow more efficient/optimized than issuing CREATE, INSERT, SELECT, and DROP statements per request to get some data.) Here’s what I arrived at to solve my problem:

SELECT node.*, votes.vote, votes.vote_count, votes.vote_ratio,
downloads.downloads
FROM node,

(SELECT node.nid, SUM(votes.vote) vote, COUNT(votes.vote) vote_count, (SUM(votes.vote) / COUNT(votes.vote)) vote_ratio
FROM node
LEFT OUTER JOIN votes
ON votes.content_id = node.nid
WHERE node.type = "extension"
GROUP BY node.nid) votes,

(SELECT node.nid, COUNT(extension_downloads.eid) downloads
FROM node
LEFT OUTER JOIN extension_downloads
ON extension_downloads.eid = node.nid
WHERE node.type = "extension"
GROUP BY node.nid) downloads

WHERE votes.nid = downloads.nid
AND node.type = "extension"
AND node.nid = votes.nid;

I get everything from the node table. Then I do my first sub-query, which left outer joins the votes table on the node table to get all ids and relevant vote stats. I alias that sub-select as “votes” so that in the wrapping query, I can refer to its columns using “votes” as a prefix. Next I do a similar query on the downloads table. Finally, I constrain my wrapping query by node type and id. Since I’m doing left outer joins in my sub-queries, the row count for all three “tables” I’m selecting from is the same, and a simple nid = nid correspondence makes my data line up.

I may look into improving this further. It’s possible that I can reduce my data transfer and query burden by avoiding the left outer joins in the sub-queries and doing a single left outer join onto the two virtual tables in the wrapping query. I’m not sure whether aliasing the sub-queries will allow this or not.

Update: I was able to tweak the query as I speculated I might be able to above. Here’s the new query:

SELECT node.*, votes.vote, votes.vote_count, votes.vote_ratio, downloads.downloads
FROM node

LEFT OUTER JOIN
(SELECT votes.content_id nid, SUM(votes.vote) vote, COUNT(votes.vote) vote_count, (SUM(votes.vote) / COUNT(votes.vote)) vote_ratio
FROM node, votes
WHERE node.nid = votes.content_id AND node.type = "extension"
GROUP BY votes.content_id)

votes ON votes.nid = node.nid

LEFT OUTER JOIN
(SELECT extension_downloads.eid nid, COUNT(extension_downloads.eid) downloads
FROM extension_downloads GROUP BY eid)

downloads on downloads.nid = node.nid

WHERE node.type = "extension"

What’s going on now is that I’m joining on the results of the sub-queries rather than within the sub-queries. If I join within, then each sub-query returns as many rows as there are relevant nodes. If I join outside the sub-queries, each sub-query returns a number of results equal to the subset of nodes for which there is relevant data. Say I’ve got 500 nodes and that 300 of them have download counts and 250 of them have been voted on. In the original query, each sub-query returned 500 rows of data. In the new query, the sub-queries return 300 and 250 rows, which are then merged back into the 500 rows selected from the node table. The change stands to provide a significantly more efficient query that will scale better as more content nodes are added over time. What was really at question at the end of my original posting was whether or not aliasing the sub-query and using that alias in the join would work (I’ve had issues with aliasing in joins before), and it did.

Caterpasta

April 21st, 2006 by daryl

A few weeks ago, I set myself a goal of writing around 5,000 words a week between this blog and my work blog. Travel, real life, and sloth have kept me from doing it, and a period of relative prolificacy has petered out into the recent stagnation both here and at my work blog. In keeping with the habit of late, this’ll be brief.

We’ve been eating a lot of pasta lately as part of our diet change, and we’ve had in particular a lot of the corkscrew pasta. The other day at lunch, Lennie ran out of pasta on her plate, looked at one of our plates, and promptly asked for more caterpillars. In the last couple of days, she’s started drinking a lot more cow milk and juice than previously.

I’ve noticed a trend lately when I’m giving her an early diaper change wherein she looks up at me and catalogues my facial features. Just out of the blue, “eye, nose, ear, eyebrow.”

She’s quite the singer now, and I think it’s pretty interesting that she can remember tunes more reliably than she can remember words. She pretty frequently sings “Twinkle, twinkle, little star,” and she gets the tune right but has to muddle through a few of the words. She does the same with the alphabet song (which is really a twofer, sung to the tune of “Twinkle, twinlke, little star”). She’s trying more and more to count, though she seems to have a particular aversion to the number four. Even in repeating numbers after me as we insert coins into her piggy bank or count out other items, she balks at four more than at any of the other numbers. In the non-musical arts, she’s been painting a lot with watercolors and coloring with various implements. She’s also taken up graffiti, doing several vertical scribbles on the wall in the hallway outside my office.

Lennie can almost jump (it’s a two-year-old skill, so she’s right on track), and she runs with more reckless abandon than ever now. She generally declines to sit in her booster seat anymore, preferring to sit on her butt or her knees while we eat.

In the people department, she asks for people by name all the time now and has greater object/person permanence. She and Mleeka went to Vanderbilt to visit Ashley while I was out of town last week, and she saw Ashley from a distance and started calling out her name excitedly. She frequently asks to look at pictures on Mleeka’s computer and has begun asking for particular people and particular pictures of them. For example, she’ll come up to Mleeka and say sweetly “Ella sleeping.” She’s also liked seeing pictures of herself and of a pretty recent visit with grandma and granddaddy. As for perfect strangers, she’s learned pretty reliably to identify general categories. She knows boys from girls and will point them out, and at a bookstore tonight, she saw a woman probably in her 60s and said “see grandma.” Most men are daddys. Lennie’s very good now about greeting people with a friendly “hi.”

She’s definitely started dreaming. One morning, she was talking in her sleep about sharing stickers. The morning I last left for California, Fleda stayed over to be around while Mleeka took me to the airport, and Lennie woke up early anyway, talking about Fleda, whom she had cried out for pitifully and at length the night before when she went off to bed. And then there’s the Easter Bunny fiasco and followup nightmare. Mleeka and Lennie and Stacia and Lowen were at the mall one morning after their baby/mommy group playtime meeting, and they saw an Easter Bunny kids could have their pictures taken with. From the second level of the mall, Lennie laughed and smiled and waved at the bunny, but when Mleeka took her down for a closer look, she cried and clung to Mleeka with no intention whatsoever of actually interacting with the bunny. That night, she woke up crying at some point, and it occurred to us later that what she had been saying when she woke up was “rabbit.”

We’re going on a beach vacation in a couple of weeks, and Mleeka was interested in renting bikes while we were there. A week or two ago, we had Dave take Lennie for a ride on his bike to see how she took to it, and her reaction was one of puzzlement and pleasure. So we went out and bought bikes and a seat for Lennie and helmets and a bike rack for the car. Her response to the bike ride itself so far has been pretty lukewarm, though she’s fascinated with bikes. She actually calls them bicycles. She’s interested in the idea but balks a bit at the fact of riding one. She’s not terribly fond of putting her helmet on, and on a couple of occasions, when we’ve tricked her into wearing it and have gotten her strapped into the seat, she’s lost interest. The last time we tried, we took a longish hilly ride around our neighborhood, and she seemed to like it. I think it helps when Mleeka and I are both around because one of us can ride along beside and provide encouragement and really try to sell the experience.