It has been a long time since I blogged something. I am not writing as much code these days as I used to...
Anyway, here is something I wrote yesterday, which I think might prove useful to other people.
asciiFilter is a Java function that converts the non-ASCII characters in a String to ASCII equivalents and removes the characters it cannot convert (including some non-printable characters in the ASCII range). Most solutions you find on the net are much simpler, because they just remove every character that is not in the ASCII set. My function keeps some characters that are not in the ASCII set but are 'translatable' to characters that are. For example, 'Ü' becomes 'U' and 'Ç' becomes 'C'.
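For comparison, the simpler approach mentioned above can be sketched in a single call. This is only an illustration of the strip-everything technique, not part of asciiFilter; the class and method names (StripOnlySketch, stripNonAscii) are made up for this example.

```java
final class StripOnlySketch {

    // Hypothetical helper: removes every character outside the 7-bit ASCII
    // range (U+0000 to U+007F) and does no transliteration at all.
    static String stripNonAscii(String input) {
        if (input == null) {
            return null;
        }
        return input.replaceAll("[^\\x00-\\x7F]", "");
    }
}
```

With this approach "Über" loses its first letter and becomes "ber", whereas asciiFilter is meant to turn it into "Uber".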
Non-printable characters are simply removed, i.e. replaced by the empty String "". The empty String is defined as a constant (named EMPTY), so you can easily replace it with any other String, such as the popular "?" or ".".
The convertToAscii function that converts individual characters is pretty straightforward, so you can easily customize it to your needs. To do this I would advise you to download the Unicode Charts. I have converted only characters in the 'Basic Latin' and 'Latin-1 Supplement' charts. But you can implement conversions for any character set you like. All characters that are not explicitly converted are replaced by EMPTY.
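If you add many conversions, a lookup table can be easier to maintain than a long chain of if statements. The following is a minimal sketch of that idea, not the code from this post: the class ExtraMappingsSketch, the method convertExtra and the choice of characters (a few entries from the 'Latin Extended-A' chart) are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class ExtraMappingsSketch {

    private static final String EMPTY = "";

    // Hypothetical table of additional conversions, keyed by the source char.
    private static final Map<Character, String> EXTRA = new HashMap<>();

    static {
        EXTRA.put('\u0141', "L");  // U+0141 capital L with stroke
        EXTRA.put('\u0142', "l");  // U+0142 small l with stroke
        EXTRA.put('\u0152', "OE"); // U+0152 capital ligature OE
        EXTRA.put('\u0153', "oe"); // U+0153 small ligature oe
        EXTRA.put('\u0160', "S");  // U+0160 capital S with caron
        EXTRA.put('\u0161', "s");  // U+0161 small s with caron
    }

    // Returns the mapped replacement, or EMPTY when the char is not in the table.
    static String convertExtra(char c) {
        String mapped = EXTRA.get(c);
        return mapped != null ? mapped : EMPTY;
    }
}
```

A method like this could be consulted from convertToAscii just before the final return of EMPTY, so that the existing conversions keep their behaviour.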
I have tried to make the function robust. For example, it should handle Unicode supplementary characters correctly. A supplementary character is stored in a String as two chars (a surrogate pair), and the Character.isHighSurrogate(pChar) and Character.isLowSurrogate(pChar) checks remove both halves. I am not sure it works, though, because I have not tested the function with supplementary characters.
private static final String EMPTY = "";
public static String asciiFilter(String pInput) {
if (pInput == null) {
return null;
}
// No intermediate char[] is needed; StringBuilder avoids StringBuffer's synchronization
StringBuilder buf = new StringBuilder(pInput.length());
for (int i = 0; i < pInput.length(); i++) {
// A single input char can yield zero, one or two output chars
buf.append(StringUtil.convertToAscii(pInput.charAt(i)));
}
return buf.toString();
}
private static String convertToAscii(char pChar) {
// A supplementary character arrives as two chars (a surrogate pair); drop both halves
if (Character.isHighSurrogate(pChar) || Character.isLowSurrogate(pChar)) {
return EMPTY;
}
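// Printable ASCII characters ('!' to '~') are kept as-is.
// Both bounds are exclusive, so the space (U+0020) is removed too; use >= to keep it.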
if (pChar > '\u0020' && pChar < '\u007F') {
return String.valueOf(pChar);
}
// All of these convert to 'A'
if (pChar >= '\u00C0' && pChar <= '\u00C5') {
return "A";
}
// All of these convert to 'a'
if (pChar >= '\u00E0' && pChar <= '\u00E5') {
return "a";
}
// converts to 'AE'
if (pChar == '\u00C6') {
return "AE";
}
// converts to 'ae'
if (pChar == '\u00E6') {
return "ae";
}
// converts to 'C'
if (pChar == '\u00C7') {
return "C";
}
// converts to 'c'
if (pChar == '\u00E7') {
return "c";
}
// All of these convert to 'E'
if (pChar >= '\u00C8' && pChar <= '\u00CB') {
return "E";
}
// All of these convert to 'e'
if (pChar >= '\u00E8' && pChar <= '\u00EB') {
return "e";
}
// All of these convert to 'I'
if (pChar >= '\u00CC' && pChar <= '\u00CF') {
return "I";
}
// All of these convert to 'i'
if (pChar >= '\u00EC' && pChar <= '\u00EF') {
return "i";
}
// converts to 'D'
if (pChar == '\u00D0') {
return "D";
}
// converts to 'd'
if (pChar == '\u00F0') {
return "d";
}
// converts to 'N'
if (pChar == '\u00D1') {
return "N";
}
// converts to 'n'
if (pChar == '\u00F1') {
return "n";
}
// All of these convert to 'O'
if (pChar >= '\u00D2' && pChar <= '\u00D6') {
return "O";
}
// All of these convert to 'o'
if (pChar >= '\u00F2' && pChar <= '\u00F6') {
return "o";
}
// converts to 'x'
if (pChar == '\u00D7') {
return "x";
}
// converts to '/'
if (pChar == '\u00F7') {
return "/";
}
// converts to 'O'
if (pChar == '\u00D8') {
return "O";
}
// converts to 'o'
if (pChar == '\u00F8') {
return "o";
}
// All of these convert to 'U'
if (pChar >= '\u00D9' && pChar <= '\u00DC') {
return "U";
}
// All of these convert to 'u'
if (pChar >= '\u00F9' && pChar <= '\u00FC') {
return "u";
}
// converts to 'Y'
if (pChar == '\u00DD') {
return "Y";
}
// converts to 'y'
if (pChar == '\u00FD' || pChar == '\u00FF') {
return "y";
}
// converts to 'ss'
if (pChar == '\u00DF') {
return "ss";
}
// All others are filtered out, i.e. converted to empty String
return EMPTY;
}
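Since the supplementary character handling is untested, here is a small sketch of how it could be checked. It does not call asciiFilter itself; the class SurrogateCheckSketch and its filter method are hypothetical and reproduce only the surrogate and printable-ASCII tests from convertToAscii. The test character U+1D11E (musical symbol G clef) is just one example of a character outside the Basic Multilingual Plane.

```java
final class SurrogateCheckSketch {

    // Hypothetical stand-in for asciiFilter: keeps printable ASCII
    // (excluding the space) and drops everything else, surrogates included.
    static String filter(String input) {
        StringBuilder buf = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                continue; // one half of a supplementary character
            }
            if (c > 0x20 && c < 0x7F) {
                buf.append(c);
            }
        }
        return buf.toString();
    }

    // Builds a String holding the single code point U+1D11E.
    static String clef() {
        return new String(Character.toChars(0x1D11E));
    }
}
```

The clef is stored as two chars, and both are dropped, so filtering "a" + clef + "b" gives "ab".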