A Little Noise

19Nov/0923

Fine-tuning FULLTEXT: Adding word characters

Perhaps you want "." or ":" or "-" or something to be treated as a word character so that you can do a fulltext match against an IP or MAC address.

If you want to change the set of characters that are considered word characters, you can do so in two ways...
* Modify the MySQL source ... and recompile MySQL.
* Modify a character set file: This requires no recompilation.

Thus sayeth the Manual. What it doesn't say is just what to modify in that character set file. (Edit: Now it does.) I had to ask MySQL developer Alexander Barkov for help to get it right.

In the user comments on that manual page, John Navratil has provided most of the answer, but there's an important step missing: adding a new collation.

The "normal" collations like latin1_swedish_ci are compiled in. But we don't really want to mess with those anyway, because then the parser might become confused: you still want "-" to indicate subtraction everywhere except inside your fulltext search, right?

First, find your character set files. Typically they're under [MySQL Install Path]\share\charsets, but you can double check that with SHOW VARIABLES LIKE 'character_sets_dir';

Edit Index.xml. Find the section for the character set you want to use (I'm using latin1), and add a new collation, with a new name and an unused id:

<collation name="latin1_ft_ci" id="62"/>

Edit the character set file (latin1.xml in my example). Near the top you'll find a <ctype> array. There's a leading 00 (which is there to offset the array to 257 characters for some legacy EOF convention, per the manual.) After that, you'll find 256 bytes which identify the type of each character in the set. Find the hex value for the character you want (MySQL's HEX() function is handy for this.) The value for ":" is 0x3A. Find that position in the array. Remember to start at 0x00, so 0x3A is the fourth row down, and the eleventh in from the left. Change the 10 there (which means spacing character) to 01 (which means upper case letter.) You'll find the rest of the possible character types in the manual.

Scroll down in the same file and find the collations. Copy and paste the whole map from whichever one you normally use (like latin1_swedish_ci) and change the name to match the one we created in the Index.xml file (like latin1_ft_ci).

Restart the server and check that the new collation appears:

SHOW COLLATION LIKE 'latin1_ft_ci';

Finally, alter your table to use the new collation for the fulltext column:

ALTER TABLE tblName MODIFY textColumn TEXT COLLATE latin1_ft_ci;

And now your fulltext search ought to work.

Before and after:

mysql> SELECT * FROM t1 WHERE MATCH(d) against ('00:12:34:56:78:9A');
Empty set (0.00 sec)

mysql> ALTER TABLE t1 MODIFY d TEXT COLLATE latin1_ft_ci;
Query OK, 4 row affected (0.03 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM t1 WHERE MATCH(d) against ('00:12:34:56:78:9A');
+-------------------+
| d                 |
+-------------------+
| 00:12:34:56:78:9A |
+-------------------+
1 row in set (0.01 sec)
Filed under: MySQL FAQ Leave a comment
Comments (23) Trackbacks (1)
  1. Nice advice, just wish it worked for me :(

  2. Same thing for me Peter.

    I get an error when trying to change the collation.

  3. Peter, Aaron: What errors do you get? What version are you using?

  4. Hi, thanks for this step-by-step guide. I got it working!

    Question: changing the ctype array at the top of the file will not make a general change to latin1 charset?
    Or is it because all the collations are compiled-in that it will only have effect on new collations?

    Thanks,
    Yoni

  5. Yoni: Correct, the compiled-in collation will remain unaffected.

  6. Thanks snoyes for your help in IRC and in this blog entry. Worked like a charm and was just what I needed.

  7. So how does this work if your’re using UTF8? There isn’t a charset xml file for that?

  8. Eric: You’ll have to modify the source for UTF8.

  9. Thanks for the great article.
    “Modify the source for UTF8″ Can you show what you mean? Also, would it be possible to put the collation directly into the Index.xml for UTF8 e.g. below here:
    http://bugs.mysql.com/file.php?id=6814

  10. One follow up – this line:
    ALTER TABLE tblName MODIFY textColumn TEXT COLLATE latin1_ft_ci;
    forces the DB to rebuild the fulltext index?

  11. Alas the Alter Table line blows up if the fulltext field is part of any index. Any suggestions? Error is:
    blob/text column used in key specification without a key length

  12. Great article, was just what I needed, really appreciate the full explanations.

  13. Actually is seems there is a caveat to doing this: the hyphen no longer represents ‘NOT’ for fulltext searches under the user defined collation. Is there a way round this? Or is it a choice between which functionality is more required for a particular dataset?

  14. @MSW – delete the indexes and then recreate them afterwards.

    Absolutely gutted about UTF8. Not really in a position to compile my own version, so I’m going to have to find another solution…

  15. Thanks, just what I was looking for! It worked on first try for ‘&’ and ‘/’. My db has data with those characters and it was required an ft search without interpreting those two characters as word separators.
    MySql workbench doesn’t show the new collate in gui, but not a big deal.
    I rate this article very useful.

  16. @Hector: Edit modules/data/mysql_rdbms_info.xml and add the new collation.

  17. Thanks Scott. The new collation is showing now in MySql workbench. The file was at /Applications/MySQLWorkbench.app/Contents/Resources/modules/data/mysql_rdbms_info.xml, as you indicated. I really appreciate your response.

  18. I’m having trouble getting the ‘_’ character to be a spacing character. I’m using the default latin1_swedish_ci and it has 5F as 10 but my queries use it as part of the text.

    >INSERT INTO t1 (title) VALUES (‘aaaa_aaaa’), (‘aaaa^aaaa’);
    >SELECT * FROM t1 WHERE MATCH title AGAINST(‘aaaa’);
    +——————-+
    | title |
    +——————-+
    | aaaa^aaaa |
    +——————-+
    1 row in set (0.01 sec)

  19. Hi there,

    Thanks a lot, nice article!

    I did the same on an “easyphp server” and it works…But on my dedicated serveur; the query show no results. (its the same DB); any idea ?

    Cheers,
    G

  20. I have the same problem as Rob, 5F is already set to 10 (spacing character). I think the only way to get 5F to be recognized as a non-word character is to modify the source code… wish there were a better way to do this with mysql 5.6.

  21. Using MySQL 5.6.
    I Get error
    “Error Code: 1283
    Column ‘mycol’ cannot be part of FULLTEXT index”
    when I try to alter the collation to the new latin_ft_ci.

  22. It seems I needed to drop the table and re-create it to have the FULLTEXT index with the new collation.
    It now works.

  23. *** IMPORTANT ***

    For newer versions of MySql, DON’T FCUKING USE 62 as your id.

    Use 249 instead.

    Cheers


Leave a comment