Archive for the ‘mysql’ Category

phpMyAdmin and Designer mode

May 17th, 2007 by daryl

phpMyAdmin Designer ViewAnybody who’s done any open source development knows about this nifty tool called phpMyAdmin that lets you manipulate mysql databases through a web interface. For Luddites like me, using the command line interface is usually preferable to using any sort of GUI tool that requires pesky mouse moves and clicks, so I’ve generally deployed phpMyAdmin to be used by others who needed to screw with databases but haven’t been comfortable with (or had access to) the command line.

One of my tasks for today is drawing up some diagrams to show relationships between tables in a big fancy dimensionally modeled database I’m trying to blunder my way through designing. The problem is that I utterly hate all the tools I’ve tried for this on Linux. What I wanted was something that would read in a mysql schema file and populate the basic diagram for me so that all I had left to do was to map out the relationships. There’s a UML-diagram application called Umbrello that does entity-relationship diagrams of the type I need to deliver, and a user has contributed a script that will read in a schema file (though it’s finicky). I don’t like the application itself very much, but it fit my basic requirements. This morning, it started crashing, though, and was unusable. Luckily, I hadn’t done much manipulation of the model in it yet. So the quest for a better tool was back on.

I tried various versions of DBDesigner4 and MySQL Workbench, even going so far as to try running the Windows versions in wine when I experienced problems with the Linux versions, but nothing worked out.

Finally, I decided to take another look at phpMyAdmin to see if it had been enhanced with any sort of modeling capabilities. And it has! In recent versions, you can enable a “Designer” view by uncommenting a couple of lines in the config file and slurping in the tables for the phpmyadmin database in the scripts directory. It’s not a perfect tool by any means, but when I click the “Designer” tab in the app, it shows me a nice DHTML view of the tables that lets me drag them around, specify relationships, toggle to show or hide tables that have no relationships defined, etc. It comes with a nifty little palette and a toggleable sidebar to handle these operations, and it’s really a pretty elegant little piece of work. What’s more, and what makes this really useful for my purposes, is that I can save the frame (phpMyAdmin keeps a navigation frame open on the left) to my local disk, zip it up, and send it to somebody, who can then perform the same DHTML manipulations I was able to perform, making it ideal for sending along a complex schema that can have portions of it disabled for ease of viewing. (To clarify, they can save none of the information back to the database, but there’s some degree of flexibility with respect to how they can control the static view.) And to top if all off, if I want to make changes to tables, I can do it at the command line or right there in phpMyAdmin, and there’s no re-importing of a schema — the Designer view will be up to date the next time I reload it. In Umbrello, it’s my impression that any changes I made after import (e.g. drawing relationships) could not be exported back out in a useful way for porting back to the database, so I would constantly have been updating the schema, importing, and redrawing relationships.

This is a great tool for my purposes, and of course phpMyAdmin’s core features are also very useful in many environments.

Multi-group, multi-join, no-union, no-view mysql query

April 24th, 2006 by daryl

Chances are than unless you googled to find this post, both the title and the post will be pretty much nonsense. If so, don’t continue reading on my account. There are no hidden gems unless you’re looking for an answer to a mysql question.

So, my dilemma was that in a Drupal module I’m writing, I was asked to provide data sorting that requires groupings on multiple columns in different tables. Three tables are involved. Table “node” holds content; “votes” holds ratings users have given nodes; and “extension_dowloads” holds download counts for nodes of a given type. The challenge was to get average votes, vote counts, download counts, and node information for all nodes of a given type, sortable on each of these columns the aggregation of which requires grouping.

There are several possible approaches. The least favorite to my mind was post-processing. That is, I could have gotten all node info, all download info, and all voting info, tossed it into a multi-dimensional array and sorted on various keys in the array. Downsides to this approach include the fact that I have to retrieve all potentially relevant data from the database (rather than just the subset I need), and then there’s processing time to do the sorting. I can see this scaling terribly. It also keeps me from using the nice hooks Drupal has in place for paged queries, sorting, etc., and that just seems like flagrant waste.

Another approach is trying to do some complex multi-join table, but that turned out to have weird results. You have to use a left outer join in order to get nodes that have neither votes nor downloads, but grouping on the two satellite tables as you must in order to get counts and sums out of them throws the numbers off. If you group by multiple columns, you get redundant data (ie, nodes listed twice, once for the match in the downloads table and once for the match in the votes table). If you group by just the common key, strange things happen as well. In some cases, something like an unexpected multiplication (across columns!) seemed to be happening as a result of grouping. I never quite figured that one out. In any case, I couldn’t find a single multi-join query that would work.

Yet another approach is to try to use a union. A union lets you daisy-chain the results of multiple queries together. Column names and types are taken from the first query in the union, and as long as there are no major incompatibilities (e.g. different numbers of columns in subsequent queries), the results of the queries are dumped out as one result set. I was hoping I could use a union to combine output from the download count query and the vote calculation query, but a union returns distinct rows. If I selected NULL for the download count in the voting query and NULL for the voting calculation columns in the download count query, I did get back all the results I wanted, but there were two rows of data per content node because rows were not unique. This would require further post-processing.

And finally, I considered using either temporary tables (select both result sets into temporary tables and then join the two to the nodes table to select the results I wanted) or views. Temporary tables seem messy and inefficient, so I was reluctant to use those, and views require a more recent version of mysql than I’m using, so that was out.

Now on to the solution. I had heretofore used sub-selects only to get scalar lists of ids to select from: “SELECT foo from bar WHERE id IN (SELECT id from other_table)”. It turns out that you can get whole result sets from sub-selects. What you’re doing in this case is in essence to select some results and define them as a table to select from within your wrapping query. (Really, I guess it’s a temporary table scenario, though it seems less hacky and possibly somehow more efficient/optimized than issuing CREATE, INSERT, SELECT, and DROP statements per request to get some data.) Here’s what I arrived at to solve my problem:

SELECT node.*, votes.vote, votes.vote_count, votes.vote_ratio,
downloads.downloads
FROM node,

(SELECT node.nid, SUM(votes.vote) vote, COUNT(votes.vote) vote_count, (SUM(votes.vote) / COUNT(votes.vote)) vote_ratio
FROM node
LEFT OUTER JOIN votes
ON votes.content_id = node.nid
WHERE node.type = "extension"
GROUP BY node.nid) votes,

(SELECT node.nid, COUNT(extension_downloads.eid) downloads
FROM node
LEFT OUTER JOIN extension_downloads
ON extension_downloads.eid = node.nid
WHERE node.type = "extension"
GROUP BY node.nid) downloads

WHERE votes.nid = downloads.nid
AND node.type = "extension"
AND node.nid = votes.nid;

I get everything from the node table. Then I do my first sub-query, which left outer joins the votes table on the node table to get all ids and relevant vote stats. I alias that sub-select as “votes” so that in the wrapping query, I can refer to its columns using “votes” as a prefix. Next I do a similar query on the downloads table. Finally, I constrain my wrapping query by node type and id. Since I’m doing left outer joins in my sub-queries, the row count for all three “tables” I’m selecting from is the same, and a simple nid = nid correspondence makes my data line up.

I may look into improving this further. It’s possible that I can reduce my data transfer and query burden by avoiding the left outer joins in the sub-queries and doing a single left outer join onto the two virtual tables in the wrapping query. I’m not sure whether aliasing the sub-queries will allow this or not.

Update: I was able to tweak the query as I speculated I might be able to above. Here’s the new query:

SELECT node.*, votes.vote, votes.vote_count, votes.vote_ratio, downloads.downloads
FROM node

LEFT OUTER JOIN
(SELECT votes.content_id nid, SUM(votes.vote) vote, COUNT(votes.vote) vote_count, (SUM(votes.vote) / COUNT(votes.vote)) vote_ratio
FROM node, votes
WHERE node.nid = votes.content_id AND node.type = "extension"
GROUP BY votes.content_id)

votes ON votes.nid = node.nid

LEFT OUTER JOIN
(SELECT extension_downloads.eid nid, COUNT(extension_downloads.eid) downloads
FROM extension_downloads GROUP BY eid)

downloads on downloads.nid = node.nid

WHERE node.type = "extension"

What’s going on now is that I’m joining on the results of the sub-queries rather than within the sub-queries. If I join within, then each sub-query returns as many rows as there are relevant nodes. If I join outside the sub-queries, each sub-query returns a number of results equal to the subset of nodes for which there is relevant data. Say I’ve got 500 nodes and that 300 of them have download counts and 250 of them have been voted on. In the original query, each sub-query returned 500 rows of data. In the new query, the sub-queries return 300 and 250 rows, which are then merged back into the 500 rows selected from the node table. The change stands to provide a significantly more efficient query that will scale better as more content nodes are added over time. What was really at question at the end of my original posting was whether or not aliasing the sub-query and using that alias in the join would work (I’ve had issues with aliasing in joins before), and it did.