Dealing with accented character sets in MySQL
2006-03-08 13:36:16
Here's something I learned about today. Suppose you have a table of city names, and some of them are really the same city, but with alternate spellings like Skovde and Skövde. How do you find them?
Check out this awesome query:
select city, id, count(id) as cnt from city group by city collate utf8_general_ci order by cnt;
Anything with a "cnt" column > 1 is probably a dup. Then you can look at the individual entries and figure out what to do. (In my case, I have an alias table that I merge the dups into.)
The key to this is that you can change the collation rules in the table on the fly, which will decide that o == ö, for example. You could do the same thing with latin1 or other character sets too.