Category: Technical

Anything that uses 1s and 0s

  • UNION and ORDER and LIMIT

    From a discussion on Freenode (and don’t feel bad if you don’t get it; I had to ask Sinisa and Monty to explain it to me. Sinisa explained how it worked, and Monty found a bug.):

    USE test;
    DROP TABLE IF EXISTS `t`;
    CREATE TABLE `t` (
    `a` int(11) DEFAULT NULL
    );

    INSERT INTO t VALUES (RAND() * 50);
    INSERT INTO t SELECT RAND() * 50 FROM t;

    Repeat that last INSERT a few times until you have about 1000 rows.
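
    Each INSERT ... SELECT doubles the table, so nine more repetitions of that last statement gets you 1024 rows:

    INSERT INTO t SELECT RAND() * 50 FROM t;  -- 2 rows become 4
    INSERT INTO t SELECT RAND() * 50 FROM t;  -- 4 rows become 8
    -- ...and so on. Then check:
    SELECT COUNT(*) FROM t;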

    What would you expect from the following?

    SELECT * FROM t WHERE false UNION SELECT * FROM t ORDER BY a LIMIT 19, 1;
    (SELECT * FROM t WHERE false) UNION (SELECT * FROM t ORDER BY a LIMIT 19, 1);
    (SELECT * FROM t WHERE false) UNION (SELECT * FROM t ORDER BY a) LIMIT 19, 1;

    Two things to remember:

    – UNION is equivalent to UNION DISTINCT
    – an ORDER BY with no LIMIT is optimized away in UNIONed queries

    The first query:

    SELECT * FROM t WHERE false UNION SELECT * FROM t ORDER BY a LIMIT 19, 1;

    is equivalent to:

    (SELECT * FROM t WHERE false UNION SELECT * FROM t) ORDER BY a LIMIT 19, 1;

    Since the UNION is by default DISTINCT, get each distinct value from the table, then order by `a`. Take the row at offset 19. Assuming even distribution of our random data, your value should be ’19’.

    The second query:

    (SELECT * FROM t WHERE false) UNION (SELECT * FROM t ORDER BY a LIMIT 19, 1);

    Order all the rows by `a`. That’s roughly 20 rows of value ‘1’, then 20 rows of value ‘2’, then 20 rows of value ‘3’, etc. Take the row at offset 19. Assuming even distribution of our random data, your value should be ‘1’.

    The third query:

    (SELECT * FROM t WHERE false) UNION (SELECT * FROM t ORDER BY a) LIMIT 19, 1;

    The ORDER BY is discarded. We get distinct values in random order. Take the row at offset 19. Since the values are in random order, the value is not predictable.
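
    If you actually want the predictable ordering here, hoist the ORDER BY outside the parentheses so it applies to the whole UNION result – which is just the first query again:

    (SELECT * FROM t WHERE false) UNION (SELECT * FROM t) ORDER BY a LIMIT 19, 1;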

  • Multi-master replication

    Implement multi-master to single-slave replication by periodically saving state and changing masters, and do it all in SQL.*

    Just a proof of concept. I suggest you add some error checking (like, don’t make the switch if the slave isn’t running).

    Remember that all this stuff goes on the SLAVE. If you absent-mindedly log in to the master, like I did a few times, you might wonder why your relay_log.info file isn’t read.

    Set up a table to keep track of each master’s log file and position. Mine uses the same user name and password for all hosts; you can add those fields and adjust the procedure accordingly if you like.

    USE mysql;
    DROP TABLE IF EXISTS rotate_master;
    CREATE TABLE rotate_master (
    id int auto_increment primary key,
    master_host varchar(255),
    master_port int unsigned,
    master_log_file varchar(255),
    master_log_pos int unsigned,
    in_use boolean DEFAULT 0
    );

    Insert the list of masters. The one that is currently master should have `in_use` set to 1; all the others should be 0. The order you insert them is the order they will rotate.

    INSERT INTO rotate_master VALUES (NULL, 'localhost', 3306, 'bin.000001', 0, 1);
    INSERT INTO rotate_master VALUES (NULL, 'localhost', 3308, 'bin.000001', 0, 0);

    Then, a procedure to grab the current position (by reading from the relay_log.info file, since the slave position is not otherwise exposed in any accessible manner), record it in the above table, and change masters to the next one in the list.
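
    If you're wondering about those nested SUBSTRING_INDEX calls: in the 5.1-era format, the first four lines of relay_log.info hold the relay log file, the relay log position, the master log file, and the master log position, so the pair of calls plucks lines 3 and 4. A standalone sketch with made-up file contents:

    SET @info = './relay-bin.000002\n4923\nbin.000001\n4711';  -- hypothetical relay_log.info
    SELECT
    SUBSTRING_INDEX(SUBSTRING_INDEX(@info, '\n', 3), '\n', -1) AS master_log_file,  -- 'bin.000001'
    SUBSTRING_INDEX(SUBSTRING_INDEX(@info, '\n', 4), '\n', -1) AS master_log_pos;   -- '4711'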

    DROP PROCEDURE IF EXISTS rotate_master;
    DELIMITER //

    CREATE PROCEDURE rotate_master()
    BEGIN
    DECLARE _info text;
    DECLARE _master_file varchar(255);
    DECLARE _master_pos int unsigned;
    DECLARE _master_host varchar(255);
    DECLARE _master_port int unsigned;
    DECLARE _id int;

    STOP SLAVE;

    -- fetch and store current position
    SELECT LOAD_FILE(@@relay_log_info_file) INTO _info;
    SELECT
    SUBSTRING_INDEX(SUBSTRING_INDEX(_info, '\n', 3), '\n', -1),
    SUBSTRING_INDEX(SUBSTRING_INDEX(_info, '\n', 4), '\n', -1)
    INTO _master_file, _master_pos;
    UPDATE mysql.rotate_master SET master_log_file = _master_file, master_log_pos = _master_pos, id = LAST_INSERT_ID(id) WHERE in_use = 1;

    -- fetch next host
    SELECT
    id,
    master_host,
    master_port,
    master_log_file,
    master_log_pos
    INTO _id, _master_host, _master_port, _master_file, _master_pos
    FROM rotate_master
    ORDER BY id <= LAST_INSERT_ID(), id
    LIMIT 1;

    -- advance to next host
    SET @sql := CONCAT(
    'CHANGE MASTER TO master_host=', QUOTE(_master_host),
    ', master_port=', _master_port,
    ', master_log_file=', QUOTE(_master_file),
    ', master_log_pos=', _master_pos);
    PREPARE myStmt FROM @sql;
    EXECUTE myStmt;

    -- mark host as changed
    UPDATE mysql.rotate_master SET in_use = 0 WHERE in_use = 1;
    UPDATE mysql.rotate_master SET in_use = 1 WHERE id = _id;

    START SLAVE;
    END//
    DELIMITER ;
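
    Two tricks in that procedure deserve a note: the first UPDATE smuggles the current master’s id out through LAST_INSERT_ID(id), and ORDER BY id <= LAST_INSERT_ID(), id sorts every id greater than the current one first (the comparison yields 0), so LIMIT 1 picks the next host and wraps around to the lowest id when the current master is last. A toy demonstration with made-up ids:

    -- suppose the hosts have ids 1, 2, 3 and the current master is id 2
    SELECT id FROM (SELECT 1 AS id UNION SELECT 2 UNION SELECT 3) hosts
    ORDER BY id <= 2, id LIMIT 1;  -- returns 3; with "id <= 3" it wraps around and returns 1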

    Finally, schedule it to rotate. I used 10 seconds just to play with; I imagine a few minutes would be more practical.

    *If you don't have 5.1 or later, you'll have to schedule this part externally, with cron or Task Scheduler.

    DROP EVENT IF EXISTS rotate_master;
    CREATE EVENT rotate_master
    ON SCHEDULE EVERY 10 SECOND
    DO CALL mysql.rotate_master();
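
    If the event never seems to fire, check that the event scheduler is actually on:

    SET GLOBAL event_scheduler = ON;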

  • LOAD DATA and recovery

    A little two-part quiz. If you get the first one without peeking, you’re worth your pay as a DBA. If you get the second one without peeking, you may tell your boss that some random guy on the Internet says you deserve a raise.

    Start with a text file, ‘test.txt’, with these three lines:

    1
    1
    2

    Set up the test in MySQL:

    DROP TABLE IF EXISTS t1;
    CREATE TABLE t1 (id int primary key);
    LOAD DATA INFILE 'test.txt' INTO TABLE t1;


  • Restore from mysqldump --all-databases in parallel

    csplit backup.sql '/^-- Current Database/' '{*}'; for f in xx*; do cat xx00 "$f" | mysql & done

    and then rm xx* when the background jobs have finished.

    I haven’t done any benchmarking, and I haven’t worked out what to do if the number of databases exceeds the number of available connections.

  • mysql’s --xml and some XSLT (now with JSON!)

    Somebody asked on Freenode the other day how to get their data out of MySQL into a specific XML format.

    Both mysqldump and the mysql client have --xml options. The output from the CLI looks like this:

    <resultset statement="SELECT here">
      <row>
        <field name="field1">value a</field>
        <field name="field2">value b</field>
      </row>
      <row>
        <field name="field1">value c</field>
        <field name="field2">value d</field>
      </row>
    </resultset>

    I wanted to change those <field name="fieldName"> into <fieldName> tags, and all it takes is a little XSLT, which I’ve uploaded here.

    An XSLT processor lets you do something like this:

    mysql --xml -e "SELECT here" | xsltproc mysql.xsl -

    The result will be something like:

    <resultset statement="SELECT here">
      <row>
        <field1>value a</field1>
        <field2>value b</field2>
      </row>
      <row>
        <field1>value c</field1>
        <field2>value d</field2>
      </row>
    </resultset>

    If JSON is your thing, you can use xml2json to take that output a step farther:

    mysql --xml -e "SELECT here" | xsltproc mysql.xsl - | xsltproc xml2json.xsl -

    Result (whitespace added):

    {"resultset" : 
      {
        "statement" : "SELECT here", 
        "row" : [
          {"field1" : "value a", "field2" : "value b"},
          {"field1" : "value c", "field2" : "value d"}
        ]
      }
    }

    That will suffice until the mysql client has a --json parameter.

  • Make an ALTER TABLE script with MySQL Workbench

    So you have two versions of a database sitting around – maybe you’ve been trying things out on your development box, and now you want to move it all over to production. If it’s just a new table here and an extra column there, you could do it by hand, but if you’ve made lots of changes, it would be nice to let the computer do it for you.

    Enter MySQL Workbench (or mysqldiff).

  • Trigger on firing table

    CREATE TABLE t1 (id int);
    CREATE TRIGGER t1_ai AFTER INSERT ON t1 FOR EACH ROW 
      INSERT INTO t1 VALUES (NEW.id);
    

    That’s not allowed; it won’t let you create the trigger.

    CREATE TABLE t1 (id int);
    CREATE TABLE f1 (id int) ENGINE=FEDERATED CONNECTION='localhost/test/t1';
    CREATE TRIGGER t1_ai AFTER INSERT ON t1 FOR EACH ROW 
      INSERT INTO f1 VALUES (NEW.id);

    The server lets that through. Don’t do it: the FEDERATED insert lands right back in t1, which fires the trigger again, looping until something gives out.

  • On unions and derivatives

    On occasion I need to generate a number of rows. I don’t care about the contents, just how many rows there are. Maybe I want to create a table filled with a sequence of numbers. One common way to do this is to join a bunch of subqueries built out of unions:


    SET @count := 0;
    INSERT INTO `table` SELECT @count := @count + 1
    FROM (SELECT 1 UNION SELECT 2) a
    JOIN (SELECT 1 UNION SELECT 2) b
    JOIN (SELECT 1 UNION SELECT 2) c;

    Of course, I could union more than two selects (e.g. SELECT 1 UNION SELECT 2 UNION SELECT 3). That makes each derived table longer, but reduces how many must be joined.

    Question: What’s the optimal size of each derived table?

    Defining “optimal”: since this is a one-off query, I don’t much care about performance (within reason), but I do care about how much typing I have to do. So I will favor “fewer characters” over “fewer joins”.

    Each derived table in the query takes up 15 * numSelects + 2 characters.

    To generate a given number of rows requires CEILING(LOG(numRows) / LOG(numSelects)) derived tables.

    If I want 1000 rows in the result, I’d need 10 derived tables of two rows each (320 characters), or 7 of three rows (329 characters), or 3 of ten rows (456 characters).

    For any given numRows, what’s the minimum of CEILING(LOG(numRows) / LOG(numSelects)) * (15 * numSelects + 2) ?

    The answer varies with numRows – 2, 3, or 4 are alternately best when generating fewer than 130 rows, but above that the best choice usually turns out to be 3.

    (SELECT 1 UNION SELECT 2 UNION SELECT 3)
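
    Putting it together: a sketch that fills a (hypothetical) nums table with the sequence 1 to 1000. Seven three-row derived tables give 3^7 = 2187 rows, and the LIMIT trims the excess:

    SET @count := 0;
    CREATE TABLE nums (n int primary key);
    INSERT INTO nums
    SELECT @count := @count + 1
    FROM (SELECT 1 UNION SELECT 2 UNION SELECT 3) a
    JOIN (SELECT 1 UNION SELECT 2 UNION SELECT 3) b
    JOIN (SELECT 1 UNION SELECT 2 UNION SELECT 3) c
    JOIN (SELECT 1 UNION SELECT 2 UNION SELECT 3) d
    JOIN (SELECT 1 UNION SELECT 2 UNION SELECT 3) e
    JOIN (SELECT 1 UNION SELECT 2 UNION SELECT 3) f
    JOIN (SELECT 1 UNION SELECT 2 UNION SELECT 3) g
    LIMIT 1000;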

  • group date

    A non-rigorous, non-scientific, totally off-the-cuff test of which function to pick when you need to group by year and month.

    I populated a table with 262K rows of random dates, and then ran

    SELECT %s, COUNT(*) FROM table GROUP BY %s ORDER BY NULL

    with various functions, which should all result in the same grouping. I repeated each query five times and report the average time for each of three column types: DATE, DATETIME, and TIMESTAMP.

    expression                         DATE   DATETIME  TIMESTAMP
    EXTRACT(YEAR_MONTH FROM d)         0.362  0.369     0.581
    LAST_DAY(d)                        0.374  0.389     0.582
    DATE_SUB(d, INTERVAL DAY(d) DAY)   0.429  0.882     1.452
    d - INTERVAL DAY(d) DAY            0.429  0.887     1.535
    SUBSTRING(d, 1, 7)                 0.454  0.488     0.719
    YEAR(d), MONTH(d)                  1.046  1.126     2.045
    MONTH(d), YEAR(d)                  1.116  1.196     2.112
    LEFT(d, 7)                         1.307  1.405     2.123
    DATE_FORMAT(d, '%Y%m')             1.480  1.565     2.112
    DATE_FORMAT(d, '%Y-%m')            1.514  1.615     2.564
    DATE_FORMAT(d, '%m%Y')             1.517  1.604     2.420
    DATE_FORMAT(d, '%m-%Y')            1.562  1.656     2.465
    MONTHNAME(d), YEAR(d)              1.613  1.713     2.812
    YEAR(d), MONTHNAME(d)              1.663  1.766     2.873
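
    So EXTRACT(YEAR_MONTH FROM d) wins across all three column types. Plugged into the template above (table and column names are placeholders):

    SELECT EXTRACT(YEAR_MONTH FROM d) AS ym, COUNT(*)
    FROM `table`
    GROUP BY ym
    ORDER BY NULL;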

    And just in case you want to extract a full date (which really only makes sense for datetime and timestamp):

    expression                                       DATE   DATETIME  TIMESTAMP
    DATE(d)                                          0.357  0.374     0.591
    EXTRACT(YEAR_MONTH FROM d), EXTRACT(DAY FROM d)  0.377  0.407     0.730
    EXTRACT(DAY FROM d), EXTRACT(YEAR_MONTH FROM d)  0.395  0.422     0.751
    MONTH(d), YEAR(d), DAY(d)                        0.395  0.426     0.870
    YEAR(d), DAY(d), MONTH(d)                        0.398  0.440     0.862
    YEAR(d), MONTH(d), DAY(d)                        0.406  0.441     0.867
    DAY(d), YEAR(d), MONTH(d)                        0.409  0.444     0.859
    LEFT(d, 10)                                      0.437  0.487     0.728
    SUBSTRING_INDEX(d, ' ', 1)                       0.439  0.475     0.743
    DAY(d), MONTH(d), YEAR(d)                        0.441  0.477     0.901
    MONTH(d), DAY(d), YEAR(d)                        0.442  0.472     0.914
    SUBSTRING(d, 1, 10)                              0.460  0.496     0.729
    DATE_FORMAT(d, '%d%Y%m')                         0.539  0.571     0.852
    DATE_FORMAT(d, '%m%Y%d')                         0.542  0.570     0.841
    DATE_FORMAT(d, '%m%d%Y')                         0.543  0.572     0.846
    DATE_FORMAT(d, '%Y%m%d')                         0.544  0.570     0.842
    DATE_FORMAT(d, '%Y%d%m')                         0.544  0.572     0.843
    DATE_FORMAT(d, '%d-%Y-%m')                       0.547  0.574     0.848
    DATE_FORMAT(d, '%d%m%Y')                         0.549  0.573     0.842
    DATE_FORMAT(d, '%m-%Y-%d')                       0.552  0.583     0.869
    DATE_FORMAT(d, '%Y-%m-%d')                       0.558  0.583     0.854
    DATE_FORMAT(d, '%Y-%d-%m')                       0.569  0.593     0.867
    d - INTERVAL HOUR(d) HOUR - INTERVAL MINUTE(d) MINUTE - INTERVAL SECOND(d) SECOND  0.573  0.653  1.249
    DATE_FORMAT(d, '%d-%m-%Y')                       0.598  0.637     0.920
    DATE_FORMAT(d, '%m-%d-%Y')                       0.601  0.633     0.904

  • GROUP UNCONCAT

    Dunno why you’d rather do this in SQL than in your application layer, but if you do, here’s one way to turn a delimited string of values back into multiple rows – just the opposite of GROUP_CONCAT:

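    The target table isn’t shown; any one-column table will do, something like:

    CREATE TABLE t (v varchar(10));
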
    SET @sourceString = 'a,b,c,d,e';
    SET @sql = CONCAT('INSERT INTO t VALUES (\'', REPLACE(@sourceString, ',', '\'),(\''), '\')');
    PREPARE myStmt FROM @sql;
    EXECUTE myStmt;

    Just to show what’s going on:

    mysql> SELECT @sql;
    +----------------------------------------------------+
    | @sql                                               |
    +----------------------------------------------------+
    | INSERT INTO t VALUES ('a'),('b'),('c'),('d'),('e') |
    +----------------------------------------------------+