A Little Noise

November 19, 2009

Fine-tuning FULLTEXT: Adding word characters

Filed under: MySQL FAQ — snoyes @ 8:27 am

Perhaps you want “.” or “:” or “-” or something to be treated as a word character so that you can do a fulltext match against an IP or MAC address.

If you want to change the set of characters that are considered word characters, you can do so in two ways…
* Modify the MySQL source … and recompile MySQL.
* Modify a character set file: This requires no recompilation.

Thus sayeth the Manual. What it doesn’t say is just what to modify in that character set file. (Edit: Now it does.) I had to ask MySQL developer Alexander Barkov for help to get it right.

In the user comments on that manual page, John Navratil has provided most of the answer, but there’s an important step missing: adding a new collation.

The “normal” collations like latin1_swedish_ci are compiled in. But we don’t really want to mess with those anyway, because then the parser might become confused: you still want “-” to indicate subtraction everywhere except inside your fulltext search, right?

First, find your character set files. Typically they’re under [MySQL Install Path]\share\charsets, but you can double check that with SHOW VARIABLES LIKE 'character_sets_dir';

Edit Index.xml. Find the section for the character set you want to use (I’m using latin1), and add a new collation, with a new name and an unused id:

<collation name="latin1_ft_ci" id="62"/>

Edit the character set file (latin1.xml in my example). Near the top you’ll find a <ctype> array. There’s a leading 00 (which is there to offset the array to 257 characters for some legacy EOF convention, per the manual.) After that, you’ll find 256 bytes which identify the type of each character in the set. Find the hex value for the character you want (MySQL’s HEX() function is handy for this.) The value for “:” is 0x3A. Find that position in the array. Remember to start at 0x00, so 0x3A is the fourth row down, and the eleventh in from the left. Change the 10 there (which means spacing character) to 01 (which means upper case letter.) You’ll find the rest of the possible character types in the manual.

Scroll down in the same file and find the collations. Copy and paste the whole map from whichever one you normally use (like latin1_swedish_ci) and change the name to match the one we created in the Index.xml file (like latin1_ft_ci).

Restart the server and check that the new collation appears:

SHOW COLLATION LIKE 'latin1_ft_ci';

Finally, alter your table to use the new collation for the fulltext column:

ALTER TABLE tblName MODIFY textColumn TEXT COLLATE latin1_ft_ci;

And now your fulltext search ought to work.

Before and after:

mysql> SELECT * FROM t1 WHERE MATCH(d) against ('00:12:34:56:78:9A');
Empty set (0.00 sec)

mysql> ALTER TABLE t1 MODIFY d TEXT COLLATE latin1_ft_ci;
Query OK, 4 row affected (0.03 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM t1 WHERE MATCH(d) against ('00:12:34:56:78:9A');
+-------------------+
| d                 |
+-------------------+
| 00:12:34:56:78:9A |
+-------------------+
1 row in set (0.01 sec)

30 Comments »

  1. […] recompiling MySQL) is to define a new collation and modify the character set in the process. This post, linked from the bottom of the MySQL fine-tuning post, is a great practical example. Docs are hard […]

    Pingback by Further Fine-Tuning MySQL Full-Text Search « Chomp Technology Blog — January 14, 2010 @ 3:30 pm

  2. Nice advice, just wish it worked for me :(

    Comment by Peter — February 18, 2010 @ 7:38 am

  3. Same thing for me Peter.

    I get an error when trying to change the collation.

    Comment by Aaron — February 24, 2010 @ 8:43 am

  4. Peter, Aaron: What errors do you get? What version are you using?

    Comment by snoyes — February 24, 2010 @ 11:00 am

  5. Hi, thanks for this step-by-step guide. I got it working!

    Question: changing the ctype array at the top of the file will not make a general change to latin1 charset?
    Or is it because all the collations are compiled-in that it will only have effect on new collations?

    Thanks,
    Yoni

    Comment by Yoni B — June 3, 2010 @ 4:52 pm

  6. Yoni: Correct, the compiled-in collation will remain unaffected.

    Comment by snoyes — June 3, 2010 @ 5:05 pm

  7. Thanks snoyes for your help in IRC and in this blog entry. Worked like a charm and was just what I needed.

    Comment by Karamon — September 21, 2010 @ 6:00 am

  8. So how does this work if your’re using UTF8? There isn’t a charset xml file for that?

    Comment by Eric — October 15, 2010 @ 1:01 pm

  9. Eric: You’ll have to modify the source for UTF8.

    Comment by snoyes — October 15, 2010 @ 1:27 pm

  10. Thanks for the great article.
    “Modify the source for UTF8” Can you show what you mean? Also, would it be possible to put the collation directly into the Index.xml for UTF8 e.g. below here:
    http://bugs.mysql.com/file.php?id=6814

    Comment by MSW — October 21, 2010 @ 9:39 am

  11. One follow up – this line:
    ALTER TABLE tblName MODIFY textColumn TEXT COLLATE latin1_ft_ci;
    forces the DB to rebuild the fulltext index?

    Comment by MSW — October 21, 2010 @ 10:08 am

  12. Alas the Alter Table line blows up if the fulltext field is part of any index. Any suggestions? Error is:
    blob/text column used in key specification without a key length

    Comment by MSW — October 21, 2010 @ 3:24 pm

  13. Great article, was just what I needed, really appreciate the full explanations.

    Comment by Paul — November 4, 2010 @ 10:44 am

  14. Actually is seems there is a caveat to doing this: the hyphen no longer represents ‘NOT’ for fulltext searches under the user defined collation. Is there a way round this? Or is it a choice between which functionality is more required for a particular dataset?

    Comment by Paul — November 5, 2010 @ 11:41 am

  15. @MSW – delete the indexes and then recreate them afterwards.

    Absolutely gutted about UTF8. Not really in a position to compile my own version, so I’m going to have to find another solution…

    Comment by Matt Langley — November 11, 2010 @ 8:37 am

  16. Thanks, just what I was looking for! It worked on first try for ‘&’ and ‘/’. My db has data with those characters and it was required an ft search without interpreting those two characters as word separators.
    MySql workbench doesn’t show the new collate in gui, but not a big deal.
    I rate this article very useful.

    Comment by Hector — July 4, 2011 @ 7:10 pm

  17. @Hector: Edit modules/data/mysql_rdbms_info.xml and add the new collation.

    Comment by snoyes — July 5, 2011 @ 8:13 am

  18. Thanks Scott. The new collation is showing now in MySql workbench. The file was at /Applications/MySQLWorkbench.app/Contents/Resources/modules/data/mysql_rdbms_info.xml, as you indicated. I really appreciate your response.

    Comment by Hector — July 5, 2011 @ 10:20 am

  19. I’m having trouble getting the ‘_’ character to be a spacing character. I’m using the default latin1_swedish_ci and it has 5F as 10 but my queries use it as part of the text.

    >INSERT INTO t1 (title) VALUES (‘aaaa_aaaa’), (‘aaaa^aaaa’);
    >SELECT * FROM t1 WHERE MATCH title AGAINST(‘aaaa’);
    +——————-+
    | title |
    +——————-+
    | aaaa^aaaa |
    +——————-+
    1 row in set (0.01 sec)

    Comment by Rob — July 11, 2011 @ 6:37 pm

  20. Hi there,

    Thanks a lot, nice article!

    I did the same on an “easyphp server” and it works…But on my dedicated serveur; the query show no results. (its the same DB); any idea ?

    Cheers,
    G

    Comment by deltagm — June 2, 2012 @ 4:41 am

  21. I have the same problem as Rob, 5F is already set to 10 (spacing character). I think the only way to get 5F to be recognized as a non-word character is to modify the source code… wish there were a better way to do this with mysql 5.6.

    Comment by puff the magic — April 21, 2013 @ 3:33 am

  22. Using MySQL 5.6.
    I Get error
    “Error Code: 1283
    Column ‘mycol’ cannot be part of FULLTEXT index”
    when I try to alter the collation to the new latin_ft_ci.

    Comment by Barry Galbraith — February 25, 2014 @ 4:38 pm

  23. It seems I needed to drop the table and re-create it to have the FULLTEXT index with the new collation.
    It now works.

    Comment by Barry Galbraith — February 25, 2014 @ 5:14 pm

  24. *** IMPORTANT ***

    For newer versions of MySql, DON’T USE 62 as your id.

    Use 249 instead.

    Cheers

    Comment by Wilson — February 28, 2014 @ 6:08 pm

  25. I need to get fulltext search to recognize the forward slash (/) as a valid character. I have a table with part numbers and descriptions. The items are electrical wire and contain the gauge of the wire like “2/0” or “1/0”. I have followed this article as well as the information in the MySQL manual as well as I could but can’t seem to get it to work.

    I modified the Index.xml file and added the new “latin1_ft_ci” collation. Then I edited the latin1.xml file and modified the ctypes map. The forward slash is 0x2f. The value was set to 20 (Punctuation). I changed that to 01. Rebooted the server and altered my fulltext fields to use the new “latin1_ft_ci” collation.

    When I do my query, it appears that the forward slash still does not appear to be evaluated in the fulltext search results.

    Appreciate any help…Thanks

    Comment by Keith — March 23, 2015 @ 10:03 pm

  26. Keith: Is the table InnoDB or MyISAM? Since your words are only 3 letters, have you adjusted the ft_min_word_len?

    Comment by snoyes — March 23, 2015 @ 11:43 pm

  27. The table is MyISAM and yes, I have altered the ft_min_word_len to 2.

    Thanks for the reply.

    Comment by Keith — March 23, 2015 @ 11:56 pm

  28. I apologize. I was able to get it working. It was a mistake when editing the ctype map. I had changed the wrong value. I will blame it on lack of sleep and staring at the hex values for too long.

    Thank you so much. This article was a life saver.

    Comment by Keith — March 24, 2015 @ 12:16 am

  29. It helped me like a charm. Ig go step by step everything goes fine. Yse If you have some settings like added index key or forign key, on to be modify columns, tahn it would give error in gicing the command “ALTER TABLE tblName MODIFY textColumn TEXT COLLATE latin1_ft_ci;”, but of you remove your indexes or foriegn key from those columns than It will work perfetctly. :)

    Comment by Preeti — April 21, 2015 @ 7:46 pm

  30. Thanks for this still very useful post, it pointed me in the right direction. Interesting the process has not been made easier by now. There is also very useful mysql documentation on this, which I did not find beforehand: https://dev.mysql.com/doc/refman/5.5/en/full-text-adding-collation.html
    It documents precisely the case I was looking for, the hyphen.

    Comment by Johannes Sollner — April 28, 2018 @ 3:22 am

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress