EGOPOLY

Topics include: programming, Apple, Unix, gadgets, large-scale web sites and other nerdy stuff.

Dealing with accented character sets in MySQL

2006-03-08 13:36:16

Here's something I learned about today. Suppose you have a table of city names, and some of them are really the same city, but with alternate spellings like Skovde and Skövde. How do you find them?

Check out this awesome query:

select city, id, count(id) as cnt from city 
  group by city collate utf8_general_ci 
  order by cnt;

Anything with a "cnt" column > 1 is probably a dup. Then you can look at the individual entries and figure out what to do. (In my case, I have an alias table that I merge the dups into.)

The key to this is that you can change the collation rules in the table on the fly, which will decide that o == ö, for example. You could do the same thing with latin1 or other character sets too.