Java Tutorial/Development/Unicode: differences between revisions

Current revision as of 15:30, 31 May 2010

Contents

1 Checks if the String contains only unicode digits. A decimal point is not a unicode digit and returns false.
2 Checks if the String contains only unicode digits or space
3 Checks if the String contains only unicode letters.
4 Checks if the String contains only unicode letters and space (" ").
5 Checks if the String contains only unicode letters, digits or space (" ").
6 Checks if the String contains only unicode letters or digits.
7 Convert from Unicode to UTF-8
8 Convert from UTF-8 to Unicode
9 Converts the string to the unicode format
10 Convert string to UTF8 bytes
11 Converts Unicode into something that can be embedded in a java properties file
12 Count the number of bytes included in the given char[].
13 Count the number of bytes needed to return an Unicode char. This can be from 1 to 6.
14 Count the number of chars included in the given byte[].
15 Decodes values of attributes in the DN encoded in hex into a UTF-8 String.
16 Display special character using Unicode
17 Get UTF String Size
18 Return an UTF-8 encoded String
19 Return an UTF-8 encoded String by length
20 Return the number of bytes that hold an Unicode char.
21 Return the Unicode char which is coded in the bytes at position 0.
22 Return the Unicode char which is coded in the bytes at the given position.
23 Return UTF-8 encoded byte[] representation of a String
24 Safe UTF: 64K serialized size
25 Using Unicode in String

Checks if the String contains only unicode digits. A decimal point is not a unicode digit and returns false.

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 * 
 *      http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
public class Main {
  /**
   * Checks if the String contains only unicode digits.
   * A decimal point is not a unicode digit and returns false.
   *
   * <code>null</code> will return <code>false</code>.
   * An empty String ("") will return <code>true</code>.
   *
   * <pre>
   * StringUtils.isNumeric(null)   = false
   * StringUtils.isNumeric("")     = true
   * StringUtils.isNumeric("  ")   = false
   * StringUtils.isNumeric("123")  = true
   * StringUtils.isNumeric("12 3") = false
   * StringUtils.isNumeric("ab2c") = false
   * StringUtils.isNumeric("12-3") = false
   * StringUtils.isNumeric("12.3") = false
   * </pre>
   *
   * @param str  the String to check, may be null
   * @return <code>true</code> if only contains digits, and is non-null
   */
  public static boolean isNumeric(String str) {
      if (str == null) {
          return false;
      }
      int sz = str.length();
      for (int i = 0; i < sz; i++) {
          if (Character.isDigit(str.charAt(i)) == false) {
              return false;
          }
      }
      return true;
  }
}
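
Character.isDigit follows the Unicode decimal-digit category, so a check like the one above also accepts digits from scripts other than ASCII. A minimal sketch of that behaviour; the class name UnicodeDigitDemo and the sample strings are chosen here for illustration and are not part of the original listing:

```java
class UnicodeDigitDemo {
    // Same logic as isNumeric above, repeated so the example is self-contained.
    static boolean isNumeric(String str) {
        if (str == null) {
            return false;
        }
        for (int i = 0; i < str.length(); i++) {
            if (!Character.isDigit(str.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isNumeric("123"));                // true: ASCII digits
        System.out.println(isNumeric("\u0661\u0662\u0663")); // true: Arabic-Indic digits 1, 2, 3
        System.out.println(isNumeric("12.3"));               // false: the decimal point is not a digit
    }
}
```

If only the ASCII digits 0-9 should pass, a range comparison on each char is one possible replacement for Character.isDigit.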

Checks if the String contains only unicode digits or space

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 * 
 *      http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
public class Main {
  /**
   * Checks if the String contains only unicode digits or space
   * (<code>" "</code>).
   * A decimal point is not a unicode digit and returns false.
   *
   * <code>null</code> will return <code>false</code>.
   * An empty String ("") will return <code>true</code>.
   *
   * <pre>
   * StringUtils.isNumericSpace(null)   = false
   * StringUtils.isNumericSpace("")     = true
   * StringUtils.isNumericSpace("  ")   = true
   * StringUtils.isNumericSpace("123")  = true
   * StringUtils.isNumericSpace("12 3") = true
   * StringUtils.isNumericSpace("ab2c") = false
   * StringUtils.isNumericSpace("12-3") = false
   * StringUtils.isNumericSpace("12.3") = false
   * </pre>
   *
   * @param str  the String to check, may be null
   * @return <code>true</code> if only contains digits or space,
   *  and is non-null
   */
  public static boolean isNumericSpace(String str) {
      if (str == null) {
          return false;
      }
      int sz = str.length();
      for (int i = 0; i < sz; i++) {
          if ((Character.isDigit(str.charAt(i)) == false) && (str.charAt(i) != ' ')) {
              return false;
          }
      }
      return true;
  }
}

Checks if the String contains only unicode letters.

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 * 
 *      http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
public class Main {
  //-----------------------------------------------------------------------
  /**
   * Checks if the String contains only unicode letters.
   *
   * <code>null</code> will return <code>false</code>.
   * An empty String ("") will return <code>true</code>.
   *
   * <pre>
   * StringUtils.isAlpha(null)   = false
   * StringUtils.isAlpha("")     = true
   * StringUtils.isAlpha("  ")   = false
   * StringUtils.isAlpha("abc")  = true
   * StringUtils.isAlpha("ab2c") = false
   * StringUtils.isAlpha("ab-c") = false
   * </pre>
   *
   * @param str  the String to check, may be null
   * @return <code>true</code> if only contains letters, and is non-null
   */
  public static boolean isAlpha(String str) {
      if (str == null) {
          return false;
      }
      int sz = str.length();
      for (int i = 0; i < sz; i++) {
          if (Character.isLetter(str.charAt(i)) == false) {
              return false;
          }
      }
      return true;
  }
}
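
The loop above inspects one char at a time, so a letter outside the Basic Multilingual Plane, which Java stores as a surrogate pair, is tested as two separate surrogates. A hedged sketch of a code-point-based variant for comparison; the class and method names are illustrative assumptions, and String.codePoints() requires Java 8 or later:

```java
class CodePointAlphaDemo {
    // Per-char check, as in isAlpha above: each surrogate is tested on its own.
    static boolean isAlphaByChar(String str) {
        if (str == null) {
            return false;
        }
        for (int i = 0; i < str.length(); i++) {
            if (!Character.isLetter(str.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Per-code-point check: a surrogate pair is tested as one character.
    static boolean isAlphaByCodePoint(String str) {
        return str != null && str.codePoints().allMatch(Character::isLetter);
    }

    public static void main(String[] args) {
        String deseret = "\uD801\uDC00"; // U+10400 DESERET CAPITAL LETTER LONG I
        System.out.println(isAlphaByChar(deseret));      // false
        System.out.println(isAlphaByCodePoint(deseret)); // true
    }
}
```

For text limited to the Basic Multilingual Plane the two checks agree.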

Checks if the String contains only unicode letters and space (" ").

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 * 
 *      http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
public class Main {
  /**
   * Checks if the String contains only unicode letters and
   * space (" ").
   *
   * <code>null</code> will return <code>false</code>.
   * An empty String ("") will return <code>true</code>.
   *
   * <pre>
   * StringUtils.isAlphaSpace(null)   = false
   * StringUtils.isAlphaSpace("")     = true
   * StringUtils.isAlphaSpace("  ")   = true
   * StringUtils.isAlphaSpace("abc")  = true
   * StringUtils.isAlphaSpace("ab c") = true
   * StringUtils.isAlphaSpace("ab2c") = false
   * StringUtils.isAlphaSpace("ab-c") = false
   * </pre>
   *
   * @param str  the String to check, may be null
   * @return <code>true</code> if only contains letters and space,
   *  and is non-null
   */
  public static boolean isAlphaSpace(String str) {
      if (str == null) {
          return false;
      }
      int sz = str.length();
      for (int i = 0; i < sz; i++) {
          if ((Character.isLetter(str.charAt(i)) == false) && (str.charAt(i) != ' ')) {
              return false;
          }
      }
      return true;
  }
}

Checks if the String contains only unicode letters, digits or space (" ").

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 * 
 *      http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
public class Main {
  /**
   * Checks if the String contains only unicode letters, digits
   * or space (<code>" "</code>).
   *
   * <code>null</code> will return <code>false</code>.
   * An empty String ("") will return <code>true</code>.
   *
   * <pre>
   * StringUtils.isAlphanumericSpace(null)   = false
   * StringUtils.isAlphanumericSpace("")     = true
   * StringUtils.isAlphanumericSpace("  ")   = true
   * StringUtils.isAlphanumericSpace("abc")  = true
   * StringUtils.isAlphanumericSpace("ab c") = true
   * StringUtils.isAlphanumericSpace("ab2c") = true
   * StringUtils.isAlphanumericSpace("ab-c") = false
   * </pre>
   *
   * @param str  the String to check, may be null
   * @return <code>true</code> if only contains letters, digits or space,
   *  and is non-null
   */
  public static boolean isAlphanumericSpace(String str) {
      if (str == null) {
          return false;
      }
      int sz = str.length();
      for (int i = 0; i < sz; i++) {
          if ((Character.isLetterOrDigit(str.charAt(i)) == false) && (str.charAt(i) != ' ')) {
              return false;
          }
      }
      return true;
  }
}

Checks if the String contains only unicode letters or digits.

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 * 
 *      http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
public class Main {
  /**
   * Checks if the String contains only unicode letters or digits.
   *
   * <code>null</code> will return <code>false</code>.
   * An empty String ("") will return <code>true</code>.
   *
   * <pre>
   * StringUtils.isAlphanumeric(null)   = false
   * StringUtils.isAlphanumeric("")     = true
   * StringUtils.isAlphanumeric("  ")   = false
   * StringUtils.isAlphanumeric("abc")  = true
   * StringUtils.isAlphanumeric("ab c") = false
   * StringUtils.isAlphanumeric("ab2c") = true
   * StringUtils.isAlphanumeric("ab-c") = false
   * </pre>
   *
   * @param str  the String to check, may be null
   * @return <code>true</code> if only contains letters or digits,
   *  and is non-null
   */
  public static boolean isAlphanumeric(String str) {
      if (str == null) {
          return false;
      }
      int sz = str.length();
      for (int i = 0; i < sz; i++) {
          if (Character.isLetterOrDigit(str.charAt(i)) == false) {
              return false;
          }
      }
      return true;
  }
}

Convert from Unicode to UTF-8

public class Main {
  public static void main(String[] argv) throws Exception {
    
    String string = "abc\u5639\u563b";
    byte[] utf8 = string.getBytes("UTF-8");
  }
}
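
The listing above produces the byte array but does not show what it contains. A small sketch, assuming Java 7 or later for java.nio.charset.StandardCharsets (which avoids the checked UnsupportedEncodingException); the class name Utf8EncodeDemo is illustrative:

```java
import java.nio.charset.StandardCharsets;

class Utf8EncodeDemo {
    // Encode a String (UTF-16 internally) as UTF-8 bytes.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = toUtf8("abc\u5639\u563b");
        // Three ASCII letters take 1 byte each, the two CJK characters 3 bytes each.
        System.out.println(utf8.length); // 9
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```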

Convert from UTF-8 to Unicode

public class Main {
  public static void main(String[] argv) throws Exception {
    String string = "abc\u5639";
    byte[] utf8 = string.getBytes("UTF-8");
    
    string = new String(utf8, "UTF-8");
    System.out.println(string);
  }
}
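
Decoding only restores the original text when the charset matches the one used for encoding. The following sketch contrasts a correct UTF-8 decode with a deliberate mismatch; the class and method names are assumptions made for this example:

```java
import java.nio.charset.StandardCharsets;

class Utf8DecodeDemo {
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Deliberately wrong charset: every byte becomes one char.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "abc\u5639";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeUtf8(utf8).equals(original)); // true
        System.out.println(decodeLatin1(utf8).length());       // 6: one char per byte
    }
}
```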

Converts the string to the unicode format

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * Operations on char primitives and Character objects.
 *
 * This class tries to handle <code>null</code> input gracefully.
 * An exception will not be thrown for a <code>null</code> input.
 * Each method documents its behaviour in more detail.
 * 
 * @author Stephen Colebourne
 * @since 2.1
 * @version $Id: CharUtils.java 437554 2006-08-28 06:21:41Z bayard $
 */
public class Main {
  //--------------------------------------------------------------------------
  /**
   * Converts the string to the unicode format "\u0020".
   * 
   * This format is the Java source code format.
   *
   * <pre>
   *   CharUtils.unicodeEscaped(' ') = "\u0020"
   *   CharUtils.unicodeEscaped('A') = "\u0041"
   * </pre>
   * 
   * @param ch  the character to convert
   * @return the escaped unicode string
   */
  public static String unicodeEscaped(char ch) {
      if (ch < 0x10) {
          return "\\u000" + Integer.toHexString(ch);
      } else if (ch < 0x100) {
          return "\\u00" + Integer.toHexString(ch);
      } else if (ch < 0x1000) {
          return "\\u0" + Integer.toHexString(ch);
      }
      return "\\u" + Integer.toHexString(ch);
  }
  
  /**
   * Converts the string to the unicode format "\u0020".
   * 
   * This format is the Java source code format.
   * 
   * If <code>null</code> is passed in, <code>null</code> will be returned.
   *
   * <pre>
   *   CharUtils.unicodeEscaped(null) = null
   *   CharUtils.unicodeEscaped(' ')  = "\u0020"
   *   CharUtils.unicodeEscaped('A')  = "\u0041"
   * </pre>
   * 
   * @param ch  the character to convert, may be null
   * @return the escaped unicode string, null if null input
   */
  public static String unicodeEscaped(Character ch) {
      if (ch == null) {
          return null;
      }
      return unicodeEscaped(ch.charValue());
  }
  

}
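
The zero-padding branches above can also be expressed with a format string. This is an alternative sketch, not the Commons Lang implementation; the class name UnicodeEscapeDemo is illustrative:

```java
class UnicodeEscapeDemo {
    // Pads the hex value to four lowercase digits, matching Integer.toHexString output.
    static String unicodeEscaped(char ch) {
        return String.format("\\u%04x", (int) ch);
    }

    public static void main(String[] args) {
        System.out.println(unicodeEscaped(' '));
        System.out.println(unicodeEscaped('A'));
        System.out.println(unicodeEscaped((char) 0x5639));
    }
}
```

The format-based version is shorter but slower than string concatenation, which may matter when escaping large inputs.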

Convert string to UTF8 bytes

public class MainClass {
  public static void main(String args[]) throws Exception {
    String s = "0123456789";
    byte ptext[] = s.getBytes("UTF8");
    for (int i = 0; i < ptext.length; i++) {
      System.out.print(ptext[i] + ",");
    }
  }
}

Converts Unicode into something that can be embedded in a java properties file

/*
    JSPWiki - a JSP-based WikiWiki clone.
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at
       http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.    
 */
public class StringUtils
{
  /**
   *  Converts a string from the Unicode representation into something that can be
   *  embedded in a java properties file.  All references outside the ASCII range
   *  are replaced with \\uXXXX.
   *
   *  @param s The string to convert
   *  @return the ASCII string
   */
  public static String native2Ascii(String s)
  {
      StringBuffer sb = new StringBuffer();
      for(int i = 0; i < s.length(); i++)
      {
          char aChar = s.charAt(i);
          if ((aChar < 0x0020) || (aChar > 0x007e))
          {
              sb.append("\\");
              sb.append("u");
              sb.append(toHex((aChar >> 12) & 0xF));
              sb.append(toHex((aChar >>  8) & 0xF));
              sb.append(toHex((aChar >>  4) & 0xF));
              sb.append(toHex( aChar        & 0xF));
          }
          else
          {
              sb.append(aChar);
          }
      }
      return sb.toString();
  }
  private static char toHex(int nibble)
  {
      final char[] hexDigit =
      {
          '0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'
      };
      return hexDigit[nibble & 0xF];
  }

}
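
To illustrate why this escaping is useful, the sketch below escapes a value in the same way and reads it back through java.util.Properties, which decodes such escapes on load. The class name, the key "greeting" and the sample value are assumptions for this example; values containing backslashes or leading whitespace would need further escaping that is not shown:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesEscapeDemo {
    // Same rule as native2Ascii above: anything outside 0x20..0x7e is escaped.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x0020 || c > 0x007e) {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Writes "greeting=<escaped value>" and lets Properties decode it again.
    static String roundTrip(String value) {
        Properties props = new Properties();
        try {
            props.load(new StringReader("greeting=" + escape(value)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("greeting");
    }

    public static void main(String[] args) {
        String value = "Gr\u00fc\u00dfe";
        System.out.println(escape(value));
        System.out.println(roundTrip(value).equals(value)); // true
    }
}
```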

Count the number of bytes included in the given char[].

/*
 *  Licensed to the Apache Software Foundation (ASF) under one
 *  or more contributor license agreements.  See the NOTICE file
 *  distributed with this work for additional information
 *  regarding copyright ownership.  The ASF licenses this file
 *  to you under the Apache License, Version 2.0 (the
 *  "License"); you may not use this file except in compliance
 *  with the License.  You may obtain a copy of the License at
 *  
 *    http://www.apache.org/licenses/LICENSE-2.0
 *  
 *  Unless required by applicable law or agreed to in writing,
 *  software distributed under the License is distributed on an
 *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 *  KIND, either express or implied.  See the License for the
 *  specific language governing permissions and limitations
 *  under the License. 
 *  
 */

/**
 * Various string manipulation methods that are more efficient than chaining
 * string operations: all is done in the same buffer without creating a bunch of
 * string objects.
 * 
 * @author 
 */
public class Main {
  private static final int CHAR_ONE_BYTE_MASK = 0xFFFFFF80;
  private static final int CHAR_TWO_BYTES_MASK = 0xFFFFF800;
  private static final int CHAR_THREE_BYTES_MASK = 0xFFFF0000;
  private static final int CHAR_FOUR_BYTES_MASK = 0xFFE00000;
  private static final int CHAR_FIVE_BYTES_MASK = 0xFC000000;
  private static final int CHAR_SIX_BYTES_MASK = 0x80000000;
  
  /**
   * Count the number of bytes included in the given char[].
   * 
   * @param chars
   *            The char array to decode
   * @return The number of bytes in the char array
   */
  public static final int countBytes( char[] chars )
  {
      if ( chars == null )
      {
          return 0;
      }
      int nbBytes = 0;
      int currentPos = 0;
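      while ( currentPos < chars.length )
      {
          char car = chars[currentPos];
          if ( Character.isHighSurrogate( car ) && ( currentPos + 1 < chars.length )
              && Character.isLowSurrogate( chars[currentPos + 1] ) )
          {
              // A surrogate pair is a single supplementary character,
              // which takes 4 bytes in UTF-8 and two UTF-16 chars
              nbBytes += 4;
              currentPos += 2;
          }
          else
          {
              nbBytes += countNbBytesPerChar( car );
              currentPos++;
          }
      }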
      return nbBytes;
  }
  
  
  /**
   * Return the number of bytes that hold an Unicode char.
   * 
   * @param car
   *            The character to be decoded
   * @return The number of bytes to hold the char. TODO : Should stop after
   *         the third byte, as a char is only 2 bytes long.
   */
  public static final int countNbBytesPerChar( char car )
  {
      if ( ( car & CHAR_ONE_BYTE_MASK ) == 0 )
      {
          return 1;
      }
      else if ( ( car & CHAR_TWO_BYTES_MASK ) == 0 )
      {
          return 2;
      }
      else if ( ( car & CHAR_THREE_BYTES_MASK ) == 0 )
      {
          return 3;
      }
      else if ( ( car & CHAR_FOUR_BYTES_MASK ) == 0 )
      {
          return 4;
      }
      else if ( ( car & CHAR_FIVE_BYTES_MASK ) == 0 )
      {
          return 5;
      }
      else if ( ( car & CHAR_SIX_BYTES_MASK ) == 0 )
      {
          return 6;
      }
      else
      {
          return -1;
      }
  }
}
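
A byte count of this kind can be cross-checked against the JDK's own encoder. The sketch below counts UTF-8 bytes per code point and compares the result with String.getBytes; the class and method names are illustrative, and well-formed input (no unpaired surrogates) is assumed:

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Counts the UTF-8 bytes needed for s without allocating a byte array.
    // Assumes well-formed UTF-16; an unpaired surrogate is counted as 3 bytes
    // here, whereas getBytes would replace it with a 1-byte '?'.
    static int utf8Length(CharSequence s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            if (cp < 0x80) {
                bytes += 1;
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3;
            } else {
                bytes += 4;
            }
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "a\u00e9\u5639\uD801\uDC00"; // 1 + 2 + 3 + 4 bytes
        System.out.println(utf8Length(s));                              // 10
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 10
    }
}
```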

Count the number of bytes needed to return an Unicode char. This can be from 1 to 6.

/*
 *  Licensed to the Apache Software Foundation (ASF) under one
 *  or more contributor license agreements.  See the NOTICE file
 *  distributed with this work for additional information
 *  regarding copyright ownership.  The ASF licenses this file
 *  to you under the Apache License, Version 2.0 (the
 *  "License"); you may not use this file except in compliance
 *  with the License.  You may obtain a copy of the License at
 *  
 *    http://www.apache.org/licenses/LICENSE-2.0
 *  
 *  Unless required by applicable law or agreed to in writing,
 *  software distributed under the License is distributed on an
 *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 *  KIND, either express or implied.  See the License for the
 *  specific language governing permissions and limitations
 *  under the License. 
 *  
 */

/**
 * Various string manipulation methods that are more efficient than chaining
 * string operations: all is done in the same buffer without creating a bunch of
 * string objects.
 * 
 * @author 
 */
public class Main {
  private static final int UTF8_MULTI_BYTES_MASK = 0x0080;
  private static final int UTF8_TWO_BYTES_MASK = 0x00E0;
  private static final int UTF8_TWO_BYTES = 0x00C0;
  private static final int UTF8_THREE_BYTES_MASK = 0x00F0;
  private static final int UTF8_THREE_BYTES = 0x00E0;
  private static final int UTF8_FOUR_BYTES_MASK = 0x00F8;
  private static final int UTF8_FOUR_BYTES = 0x00F0;
  private static final int UTF8_FIVE_BYTES_MASK = 0x00FC;
  private static final int UTF8_FIVE_BYTES = 0x00F8;
  private static final int UTF8_SIX_BYTES_MASK = 0x00FE;
  private static final int UTF8_SIX_BYTES = 0x00FC;
  /**
   * Count the number of bytes needed to return an Unicode char. This can be
   * from 1 to 6.
   * 
   * @param bytes
   *            The bytes to read
   * @param pos
   *            Position to start counting. It must be a valid start of an
   *            encoded char.
   * @return The number of bytes to create a char, or -1 if the encoding is
   *         wrong. TODO : Should stop after the third byte, as a char is only
   *         2 bytes long.
   */
  public static final int countBytesPerChar( byte[] bytes, int pos )
  {
      if ( bytes == null )
      {
          return -1;
      }
      if ( ( bytes[pos] & UTF8_MULTI_BYTES_MASK ) == 0 )
      {
          return 1;
      }
      else if ( ( bytes[pos] & UTF8_TWO_BYTES_MASK ) == UTF8_TWO_BYTES )
      {
          return 2;
      }
      else if ( ( bytes[pos] & UTF8_THREE_BYTES_MASK ) == UTF8_THREE_BYTES )
      {
          return 3;
      }
      else if ( ( bytes[pos] & UTF8_FOUR_BYTES_MASK ) == UTF8_FOUR_BYTES )
      {
          return 4;
      }
      else if ( ( bytes[pos] & UTF8_FIVE_BYTES_MASK ) == UTF8_FIVE_BYTES )
      {
          return 5;
      }
      else if ( ( bytes[pos] & UTF8_SIX_BYTES_MASK ) == UTF8_SIX_BYTES )
      {
          return 6;
      }
      else
      {
          return -1;
      }
  }
}
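
The listing above allows sequences of up to six bytes, as early UTF-8 definitions did. Current UTF-8 (RFC 3629) stops at four. A reduced sketch of the same lead-byte classification under that assumption; the class and method names are illustrative:

```java
class Utf8LeadByteDemo {
    // Returns the sequence length announced by a lead byte, or -1 for a
    // continuation byte (10xxxxxx) or a lead byte announcing more than 4 bytes.
    // This is only a prefix check: it does not reject overlong leads
    // (0xC0, 0xC1) or leads above 0xF4.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF; // view the signed byte as 0..255
        if (b < 0x80) {
            return 1;        // 0xxxxxxx
        } else if ((b & 0xE0) == 0xC0) {
            return 2;        // 110xxxxx
        } else if ((b & 0xF0) == 0xE0) {
            return 3;        // 1110xxxx
        } else if ((b & 0xF8) == 0xF0) {
            return 4;        // 11110xxx
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 'a'));  // 1
        System.out.println(sequenceLength((byte) 0xC3)); // 2
        System.out.println(sequenceLength((byte) 0xE5)); // 3
        System.out.println(sequenceLength((byte) 0xF0)); // 4
        System.out.println(sequenceLength((byte) 0xA9)); // -1
    }
}
```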

Count the number of chars included in the given byte[].

/*
 *  Licensed to the Apache Software Foundation (ASF) under one
 *  or more contributor license agreements.  See the NOTICE file
 *  distributed with this work for additional information
 *  regarding copyright ownership.  The ASF licenses this file
 *  to you under the Apache License, Version 2.0 (the
 *  "License"); you may not use this file except in compliance
 *  with the License.  You may obtain a copy of the License at
 *  
 *    http://www.apache.org/licenses/LICENSE-2.0
 *  
 *  Unless required by applicable law or agreed to in writing,
 *  software distributed under the License is distributed on an
 *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 *  KIND, either express or implied.  See the License for the
 *  specific language governing permissions and limitations
 *  under the License. 
 *  
 */

/**
 * Various string manipulation methods that are more efficient than chaining
 * string operations: all is done in the same buffer without creating a bunch of
 * string objects.
 * 
 * @author 
 */
public class Main {
  private static final int UTF8_MULTI_BYTES_MASK = 0x0080;
  private static final int UTF8_TWO_BYTES_MASK = 0x00E0;
  private static final int UTF8_TWO_BYTES = 0x00C0;
  private static final int UTF8_THREE_BYTES_MASK = 0x00F0;
  private static final int UTF8_THREE_BYTES = 0x00E0;
  private static final int UTF8_FOUR_BYTES_MASK = 0x00F8;
  private static final int UTF8_FOUR_BYTES = 0x00F0;
  private static final int UTF8_FIVE_BYTES_MASK = 0x00FC;
  private static final int UTF8_FIVE_BYTES = 0x00F8;
  private static final int UTF8_SIX_BYTES_MASK = 0x00FE;
  private static final int UTF8_SIX_BYTES = 0x00FC;
  /**
   * Count the number of bytes needed to return an Unicode char. This can be
   * from 1 to 6.
   * 
   * @param bytes
   *            The bytes to read
   * @param pos
   *            Position to start counting. It must be a valid start of an
   *            encoded char.
   * @return The number of bytes to create a char, or -1 if the encoding is
   *         wrong. TODO : Should stop after the third byte, as a char is only
   *         2 bytes long.
   */
  public static final int countBytesPerChar( byte[] bytes, int pos )
  {
      if ( bytes == null )
      {
          return -1;
      }
      if ( ( bytes[pos] & UTF8_MULTI_BYTES_MASK ) == 0 )
      {
          return 1;
      }
      else if ( ( bytes[pos] & UTF8_TWO_BYTES_MASK ) == UTF8_TWO_BYTES )
      {
          return 2;
      }
      else if ( ( bytes[pos] & UTF8_THREE_BYTES_MASK ) == UTF8_THREE_BYTES )
      {
          return 3;
      }
      else if ( ( bytes[pos] & UTF8_FOUR_BYTES_MASK ) == UTF8_FOUR_BYTES )
      {
          return 4;
      }
      else if ( ( bytes[pos] & UTF8_FIVE_BYTES_MASK ) == UTF8_FIVE_BYTES )
      {
          return 5;
      }
      else if ( ( bytes[pos] & UTF8_SIX_BYTES_MASK ) == UTF8_SIX_BYTES )
      {
          return 6;
      }
      else
      {
          return -1;
      }
  }
  
  /**
   * Count the number of chars included in the given byte[].
   * 
   * @param bytes
   *            The byte array to decode
   * @return The number of char in the byte array
   */
  public static final int countChars( byte[] bytes )
  {
      if ( bytes == null )
      {
          return 0;
      }
      int nbChars = 0;
      int currentPos = 0;
      while ( currentPos < bytes.length )
      {
          currentPos += countBytesPerChar( bytes, currentPos );
          nbChars++;
      }
      return nbChars;
  }
  
}
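
The loop above relies on every visited position being the valid start of an encoded character. Another way to count encoded characters, sketched here as an alternative rather than as the original method, is to count every byte that is not a continuation byte. The class and method names are illustrative, and well-formed UTF-8 input is assumed:

```java
import java.nio.charset.StandardCharsets;

class Utf8CodePointCountDemo {
    // Counts code points by skipping continuation bytes (10xxxxxx).
    // Each encoded character has exactly one non-continuation byte.
    static int countCodePoints(byte[] utf8) {
        if (utf8 == null) {
            return 0;
        }
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] bmp = "abc\u5639".getBytes(StandardCharsets.UTF_8);
        System.out.println(countCodePoints(bmp)); // 4

        // One supplementary character: 4 bytes, 1 code point, 2 Java chars.
        byte[] supplementary = "\uD801\uDC00".getBytes(StandardCharsets.UTF_8);
        System.out.println(countCodePoints(supplementary)); // 1
    }
}
```

The result is a code point count, so it differs from String.length() whenever the text contains supplementary characters.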

Decodes values of attributes in the DN encoded in hex into a UTF-8 String.

import java.io.UnsupportedEncodingException;
import javax.naming.InvalidNameException;
/*
 *  Licensed to the Apache Software Foundation (ASF) under one
 *  or more contributor license agreements.  See the NOTICE file
 *  distributed with this work for additional information
 *  regarding copyright ownership.  The ASF licenses this file
 *  to you under the Apache License, Version 2.0 (the
 *  "License"); you may not use this file except in compliance
 *  with the License.  You may obtain a copy of the License at
 *  
 *    http://www.apache.org/licenses/LICENSE-2.0
 *  
 *  Unless required by applicable law or agreed to in writing,
 *  software distributed under the License is distributed on an
 *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 *  KIND, either express or implied.  See the License for the
 *  specific language governing permissions and limitations
 *  under the License. 
 *  
 */

/**
 * Various string manipulation methods that are more efficient than chaining
 * string operations: all is done in the same buffer without creating a bunch of
 * string objects.
 * 
 * @author 
 */
public class Main {
  /** &lt;hex> ::= [0x30-0x39] | [0x41-0x46] | [0x61-0x66] */
  private static final byte[] HEX_VALUE =
      { 
          -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 00 -> 0F
          -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 10 -> 1F
          -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 20 -> 2F
           0,  1,  2,  3,  4,  5,  6,  7,  8,  9, -1, -1, -1, -1, -1, -1, // 30 -> 3F ( 0, 1,2, 3, 4,5, 6, 7, 8, 9 )
          -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 40 -> 4F ( A, B, C, D, E, F )
          -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, // 50 -> 5F
          -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1  // 60 -> 6F ( a, b, c, d, e, f )
      };
  /**
   * Decodes values of attributes in the DN encoded in hex into a UTF-8 
   * String.  RFC2253 allows a DN's attribute to be encoded in hex.
   * The encoded value starts with a # then is followed by an even 
   * number of hex characters.  
   */
  public static final String decodeHexString( String str ) throws InvalidNameException
  {
      if ( str == null || str.length() == 0 )
      {
          throw new InvalidNameException( "Expected string to start with a '#' character.  " +
              "Invalid hex encoded string for empty or null string."  );
      }
      
      char[] chars = str.toCharArray();
      if ( chars[0] != '#' )
      {
          throw new InvalidNameException( "Expected string to start with a '#' character.  " +
                  "Invalid hex encoded string: " + str  );
      }
      
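      // '#' must be followed by an even number of hex characters
      if ( ( chars.length & 1 ) == 0 )
      {
          throw new InvalidNameException( "Expected an even number of hex characters.  " +
                  "Invalid hex encoded string: " + str  );
      }
      
      // the bytes representing the encoded string of hex
      // this should be ( length - 1 )/2 in size
      byte[] decoded = new byte[ ( chars.length - 1 ) >> 1 ];
      for ( int ii = 1, jj = 0 ; ii < chars.length; ii+=2, jj++ )
      {
          // reject characters outside the lookup table or mapped to -1
          if ( chars[ii] >= HEX_VALUE.length || chars[ii+1] >= HEX_VALUE.length ||
              HEX_VALUE[chars[ii]] < 0 || HEX_VALUE[chars[ii+1]] < 0 )
          {
              throw new InvalidNameException( "Invalid hex encoded string: " + str );
          }
          int ch = ( HEX_VALUE[chars[ii]] << 4 ) + 
              HEX_VALUE[chars[ii+1]];
          decoded[jj] = ( byte ) ch;
      }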
      
      return utf8ToString( decoded );
  }
  /**
   * Return the String decoded from a UTF-8 encoded byte array
   * 
   * @param bytes
   *            The byte array to be transformed to a String
   * @return A String.
   */
  public static final String utf8ToString( byte[] bytes )
  {
      if ( bytes == null )
      {
          return "";
      }
      try
      {
          return new String( bytes, "UTF-8" );
      }
      catch ( UnsupportedEncodingException uee )
      {
          return "";
      }
  }
}
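
The hex form described above ("#" followed by an even number of hex digits) can be tried out with a small standalone sketch. The class name `HexDnDemo` and the use of `Character.digit` and `IllegalArgumentException` are assumptions made for this illustration; the listing above uses its own lookup table and `InvalidNameException`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original utility.
class HexDnDemo {
    /** Decodes '#' followed by an even number of hex digits into a String, reading the bytes as UTF-8. */
    static String decode(String str) {
        if (str == null || str.isEmpty() || str.charAt(0) != '#' || (str.length() - 1) % 2 != 0) {
            throw new IllegalArgumentException("Invalid hex encoded string: " + str);
        }
        byte[] decoded = new byte[(str.length() - 1) / 2];
        for (int i = 1, j = 0; i < str.length(); i += 2, j++) {
            // Character.digit returns -1 for anything that is not a base-16 digit.
            // Note: it also accepts non-ASCII digits, which the lookup table above would reject.
            int hi = Character.digit(str.charAt(i), 16);
            int lo = Character.digit(str.charAt(i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Not a hex digit in: " + str);
            }
            decoded[j] = (byte) ((hi << 4) | lo);
        }
        return new String(decoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("#4869")); // bytes 0x48 0x69
        System.out.println(decode("#c3a9")); // bytes 0xC3 0xA9, one two-byte UTF-8 sequence
    }
}
```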

Display special character using Unicode

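import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.SwingUtilities;
public class Main {
  
  public static void main(String[] argv) {
    // U+00A9 is the copyright sign
    final String COPYRIGHT = "\u00a9";
    SwingUtilities.invokeLater(() -> {
      JFrame frame = new JFrame("Unicode");
      frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
      frame.add(new JLabel(COPYRIGHT));
      frame.pack();
      frame.setVisible(true);
    });
  }
}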

Get UTF String Size

/* Copyright (c) 1995-2000, The Hypersonic SQL Group.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 * Redistributions of source code must retain the above copyright notice, this
 * list of conditions and the following disclaimer.
 *
 * Redistributions in binary form must reproduce the above copyright notice,
 * this list of conditions and the following disclaimer in the documentation
 * and/or other materials provided with the distribution.
 *
 * Neither the name of the Hypersonic SQL Group nor the names of its
 * contributors may be used to endorse or promote products derived from this
 * software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE HYPERSONIC SQL GROUP,
 * OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * This software consists of voluntary contributions made by many individuals
 * on behalf of the Hypersonic SQL Group.
 *
 *
 * For work added by the HSQL Development Group:
 *
 * Copyright (c) 2001-2009, The HSQL Development Group
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 * Redistributions of source code must retain the above copyright notice, this
 * list of conditions and the following disclaimer.
 *
 * Redistributions in binary form must reproduce the above copyright notice,
 * this list of conditions and the following disclaimer in the documentation
 * and/or other materials provided with the distribution.
 *
 * Neither the name of the HSQL Development Group nor the names of its
 * contributors may be used to endorse or promote products derived from this
 * software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL HSQL DEVELOPMENT GROUP, HSQLDB.ORG,
 * OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */
/**
 * Collection of static methods for converting strings between different formats
 * and to and from byte arrays.
 * 
 * 
 * Includes some methods based on Hypersonic code as indicated.
 * 
 * @author Thomas Mueller (Hypersonic SQL Group)
 * @author Fred Toussi (fredt@users dot sourceforge.net)
 * @version 1.9.0
 * @since 1.7.2
 */
public class Main {
  private static final byte[] HEXBYTES = { (byte) '0', (byte) '1', (byte) '2', (byte) '3',
      (byte) '4', (byte) '5', (byte) '6', (byte) '7', (byte) '8', (byte) '9', (byte) 'a',
      (byte) 'b', (byte) 'c', (byte) 'd', (byte) 'e', (byte) 'f' };

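  /**
   * Returns the number of bytes needed to encode the string in the modified
   * UTF-8 format used by DataOutputStream.writeUTF: U+0000 takes two bytes
   * and each char of a surrogate pair takes three.
   */
  public static int getUTFSize(String s) {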
      int len = (s == null) ? 0
                            : s.length();
      int l   = 0;
      for (int i = 0; i < len; i++) {
          int c = s.charAt(i);
          if ((c >= 0x0001) && (c <= 0x007F)) {
              l++;
          } else if (c > 0x07FF) {
              l += 3;
          } else {
              l += 2;
          }
      }
      return l;
  }
}
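
As a rough check, the counting rule above can be compared with what `DataOutputStream.writeUTF` actually writes, since both follow the modified UTF-8 format. The class name `UtfSizeDemo` and the helper `writeUtfPayloadSize` are hypothetical names introduced only for this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of the original utility.
class UtfSizeDemo {
    // Same counting rule as getUTFSize above.
    static int getUTFSize(String s) {
        int len = (s == null) ? 0 : s.length();
        int size = 0;
        for (int i = 0; i < len; i++) {
            int c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                size += 1;
            } else if (c > 0x07FF) {
                size += 3;
            } else {
                size += 2; // U+0000 and U+0080..U+07FF
            }
        }
        return size;
    }

    // Bytes produced by writeUTF, excluding its 2-byte length prefix.
    static int writeUtfPayloadSize(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), e-acute (2), euro sign (3), NUL (2 in modified UTF-8)
        String sample = "a\u00e9\u20ac\u0000";
        System.out.println(getUTFSize(sample));
        System.out.println(writeUtfPayloadSize(sample));
    }
}
```

Both numbers should agree for any string short enough for `writeUTF` to accept.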

Return an UTF-8 encoded String

import java.io.UnsupportedEncodingException;
/*
 *  Licensed to the Apache Software Foundation (ASF) under one
 *  or more contributor license agreements.  See the NOTICE file
 *  distributed with this work for additional information
 *  regarding copyright ownership.  The ASF licenses this file
 *  to you under the Apache License, Version 2.0 (the
 *  "License"); you may not use this file except in compliance
 *  with the License.  You may obtain a copy of the License at
 *  
 *    http://www.apache.org/licenses/LICENSE-2.0
 *  
 *  Unless required by applicable law or agreed to in writing,
 *  software distributed under the License is distributed on an
 *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 *  KIND, either express or implied.  See the License for the
 *  specific language governing permissions and limitations
 *  under the License. 
 *  
 */

/**
 * Various string manipulation methods that are more efficient than chaining
 * string operations: all is done in the same buffer without creating a bunch of
 * string objects.
 * 
 * @author 
 */
public class Main {
  /**
   * Return the String decoded from a UTF-8 encoded byte array
   * 
   * @param bytes
   *            The byte array to be transformed to a String
   * @return A String.
   */
  public static final String utf8ToString( byte[] bytes )
  {
      if ( bytes == null )
      {
          return "";
      }
      try
      {
          return new String( bytes, "UTF-8" );
      }
      catch ( UnsupportedEncodingException uee )
      {
          return "";
      }
  }
}
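
On Java 7 and later the same decoding can be written without the checked exception by passing a `Charset` instead of a charset name. This is a minimal sketch, assuming `java.nio.charset.StandardCharsets` is available; the class name `Utf8DecodeDemo` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original utility.
class Utf8DecodeDemo {
    /** Decodes UTF-8 bytes into a String; returns "" for null, like the method above. */
    static String utf8ToString(byte[] bytes) {
        if (bytes == null) {
            return "";
        }
        // The Charset overload does not throw UnsupportedEncodingException.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] pi = { (byte) 0xCF, (byte) 0x80 }; // UTF-8 encoding of U+03C0
        System.out.println(utf8ToString(pi));
    }
}
```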

Return an UTF-8 encoded String by length

import java.io.UnsupportedEncodingException;
/*
 *  Licensed to the Apache Software Foundation (ASF) under one
 *  or more contributor license agreements.  See the NOTICE file
 *  distributed with this work for additional information
 *  regarding copyright ownership.  The ASF licenses this file
 *  to you under the Apache License, Version 2.0 (the
 *  "License"); you may not use this file except in compliance
 *  with the License.  You may obtain a copy of the License at
 *  
 *    http://www.apache.org/licenses/LICENSE-2.0
 *  
 *  Unless required by applicable law or agreed to in writing,
 *  software distributed under the License is distributed on an
 *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 *  KIND, either express or implied.  See the License for the
 *  specific language governing permissions and limitations
 *  under the License. 
 *  
 */

/**
 * Various string manipulation methods that are more efficient than chaining
 * string operations: all is done in the same buffer without creating a bunch of
 * string objects.
 * 
 * @author 
 */
public class Main {
  /**
   * Return the String decoded from the first length bytes of a UTF-8 encoded byte array
   * 
   * @param bytes
   *            The byte array to be transformed to a String
   * @param length
   *            The length of the byte array to be converted
   * @return A String.
   */
  public static final String utf8ToString( byte[] bytes, int length )
  {
      if ( bytes == null )
      {
          return "";
      }
      try
      {
          return new String( bytes, 0, length, "UTF-8" );
      }
      catch ( UnsupportedEncodingException uee )
      {
          return "";
      }
  }
}
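
One thing to watch with a byte length: if it cuts a multi-byte sequence in half, the JDK decoder substitutes the replacement character U+FFFD for the incomplete tail. The sketch below illustrates this and one possible way to back up to a sequence boundary; `Utf8PrefixDemo` and `safeLength` are hypothetical names, and the helper assumes the input is well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original utility.
class Utf8PrefixDemo {
    static String utf8Prefix(byte[] bytes, int length) {
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    /** Moves the cut point back while it lands on a continuation byte (10xxxxxx). */
    static int safeLength(byte[] bytes, int length) {
        while (length > 0 && length < bytes.length && (bytes[length] & 0xC0) == 0x80) {
            length--;
        }
        return length;
    }

    public static void main(String[] args) {
        byte[] bytes = "a\u00e9".getBytes(StandardCharsets.UTF_8); // 0x61 0xC3 0xA9
        System.out.println(utf8Prefix(bytes, 3));                       // whole string
        System.out.println(utf8Prefix(bytes, 2));                       // 'a' plus U+FFFD
        System.out.println(utf8Prefix(bytes, safeLength(bytes, 2)));    // just 'a'
    }
}
```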

Return the number of bytes that hold an Unicode char.

/*
 *  Licensed to the Apache Software Foundation (ASF) under one
 *  or more contributor license agreements.  See the NOTICE file
 *  distributed with this work for additional information
 *  regarding copyright ownership.  The ASF licenses this file
 *  to you under the Apache License, Version 2.0 (the
 *  "License"); you may not use this file except in compliance
 *  with the License.  You may obtain a copy of the License at
 *  
 *    http://www.apache.org/licenses/LICENSE-2.0
 *  
 *  Unless required by applicable law or agreed to in writing,
 *  software distributed under the License is distributed on an
 *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 *  KIND, either express or implied.  See the License for the
 *  specific language governing permissions and limitations
 *  under the License. 
 *  
 */

/**
 * Various string manipulation methods that are more efficient than chaining
 * string operations: all is done in the same buffer without creating a bunch of
 * string objects.
 * 
 * @author 
 */
public class Main {
  private static final int CHAR_ONE_BYTE_MASK = 0xFFFFFF80;
  private static final int CHAR_TWO_BYTES_MASK = 0xFFFFF800;
  private static final int CHAR_THREE_BYTES_MASK = 0xFFFF0000;
  private static final int CHAR_FOUR_BYTES_MASK = 0xFFE00000;
  private static final int CHAR_FIVE_BYTES_MASK = 0xFC000000;
  private static final int CHAR_SIX_BYTES_MASK = 0x80000000;
  /**
   * Return the number of bytes that hold an Unicode char.
   * 
   * @param car
   *            The character to be decoded
   * @return The number of bytes to hold the char. TODO : Should stop after
   *         the third byte, as a char is only 2 bytes long.
   */
  public static final int countNbBytesPerChar( char car )
  {
      if ( ( car & CHAR_ONE_BYTE_MASK ) == 0 )
      {
          return 1;
      }
      else if ( ( car & CHAR_TWO_BYTES_MASK ) == 0 )
      {
          return 2;
      }
      else if ( ( car & CHAR_THREE_BYTES_MASK ) == 0 )
      {
          return 3;
      }
      else if ( ( car & CHAR_FOUR_BYTES_MASK ) == 0 )
      {
          return 4;
      }
      else if ( ( car & CHAR_FIVE_BYTES_MASK ) == 0 )
      {
          return 5;
      }
      else if ( ( car & CHAR_SIX_BYTES_MASK ) == 0 )
      {
          return 6;
      }
      else
      {
          return -1;
      }
  }
}
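
As the TODO above notes, a Java `char` is 16 bits, so the four-, five- and six-byte branches can never be reached from a `char` argument. A sketch that works on `int` code points instead (and so covers supplementary characters) might look like the following; `Utf8LengthDemo` and its methods are hypothetical names, and the result is cross-checked against `String.getBytes`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original utility.
class Utf8LengthDemo {
    /** Number of UTF-8 bytes for a code point, or -1 if it is outside U+0000..U+10FFFF. */
    static int utf8Length(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return -1;
        }
        if (codePoint < 0x80) {
            return 1;
        }
        if (codePoint < 0x800) {
            return 2;
        }
        if (codePoint < 0x10000) {
            return 3; // includes lone surrogates, which are not encodable on their own
        }
        return 4;
    }

    /** Length reported by the JDK encoder, used here only as a cross-check. */
    static int jdkLength(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 0x03C0, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            System.out.println(String.format("U+%04X: %d / %d", cp, utf8Length(cp), jdkLength(cp)));
        }
    }
}
```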

Return the Unicode char which is coded in the bytes at position 0.

/*
 *  Licensed to the Apache Software Foundation (ASF) under one
 *  or more contributor license agreements.  See the NOTICE file
 *  distributed with this work for additional information
 *  regarding copyright ownership.  The ASF licenses this file
 *  to you under the Apache License, Version 2.0 (the
 *  "License"); you may not use this file except in compliance
 *  with the License.  You may obtain a copy of the License at
 *  
 *    http://www.apache.org/licenses/LICENSE-2.0
 *  
 *  Unless required by applicable law or agreed to in writing,
 *  software distributed under the License is distributed on an
 *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 *  KIND, either express or implied.  See the License for the
 *  specific language governing permissions and limitations
 *  under the License. 
 *  
 */

/**
 * Various string manipulation methods that are more efficient than chaining
 * string operations: all is done in the same buffer without creating a bunch of
 * string objects.
 * 
 * @author 
 */
public class Main {
  private static final int UTF8_MULTI_BYTES_MASK = 0x0080;
  private static final int UTF8_TWO_BYTES_MASK = 0x00E0;
  private static final int UTF8_TWO_BYTES = 0x00C0;
  private static final int UTF8_THREE_BYTES_MASK = 0x00F0;
  private static final int UTF8_THREE_BYTES = 0x00E0;
  private static final int UTF8_FOUR_BYTES_MASK = 0x00F8;
  private static final int UTF8_FOUR_BYTES = 0x00F0;
  private static final int UTF8_FIVE_BYTES_MASK = 0x00FC;
  private static final int UTF8_FIVE_BYTES = 0x00F8;
  private static final int UTF8_SIX_BYTES_MASK = 0x00FE;
  private static final int UTF8_SIX_BYTES = 0x00FC;
  /**
   * Return the Unicode char which is coded in the bytes at position 0.
   * 
   * @param bytes
   *            The byte[] representation of a Unicode string.
   * @return The first char found.
   */
  public static final char bytesToChar( byte[] bytes )
  {
      return bytesToChar( bytes, 0 );
  }
  /**
   * Return the Unicode char which is coded in the bytes at the given
   * position.
   * 
   * @param bytes
   *            The byte[] representation of a Unicode string.
   * @param pos
   *            The current position to start decoding the char
   * @return The decoded char, or -1 if no char can be decoded TODO : Should
   *         stop after the third byte, as a char is only 2 bytes long.
   */
  public static final char bytesToChar( byte[] bytes, int pos )
  {
      if ( bytes == null )
      {
          return ( char ) -1;
      }
      if ( ( bytes[pos] & UTF8_MULTI_BYTES_MASK ) == 0 )
      {
          return ( char ) bytes[pos];
      }
      else
      {
          if ( ( bytes[pos] & UTF8_TWO_BYTES_MASK ) == UTF8_TWO_BYTES )
          {
              // Two bytes char
              return ( char ) ( ( ( bytes[pos] & 0x1C ) << 6 ) + // 110x-xxyy
                                                                  // 10zz-zzzz
                                                                  // ->
                                                                  // 0000-0xxx
                                                                  // 0000-0000
                  ( ( bytes[pos] & 0x03 ) << 6 ) + // 110x-xxyy 10zz-zzzz
                                                      // -> 0000-0000
                                                      // yy00-0000
              ( bytes[pos + 1] & 0x3F ) // 110x-xxyy 10zz-zzzz -> 0000-0000
                                          // 00zz-zzzz
              ); // -> 0000-0xxx yyzz-zzzz (07FF)
          }
          else if ( ( bytes[pos] & UTF8_THREE_BYTES_MASK ) == UTF8_THREE_BYTES )
          {
              // Three bytes char
              return ( char ) (
              // 1110-tttt 10xx-xxyy 10zz-zzzz -> tttt-0000-0000-0000
              ( ( bytes[pos] & 0x0F ) << 12 ) +
              // 1110-tttt 10xx-xxyy 10zz-zzzz -> 0000-xxxx-0000-0000
                  ( ( bytes[pos + 1] & 0x3C ) << 6 ) +
                  // 1110-tttt 10xx-xxyy 10zz-zzzz -> 0000-0000-yy00-0000
                  ( ( bytes[pos + 1] & 0x03 ) << 6 ) +
              // 1110-tttt 10xx-xxyy 10zz-zzzz -> 0000-0000-00zz-zzzz
              ( bytes[pos + 2] & 0x3F )
              // -> tttt-xxxx yyzz-zzzz (FF FF)
              );
          }
          else if ( ( bytes[pos] & UTF8_FOUR_BYTES_MASK ) == UTF8_FOUR_BYTES )
          {
              // Four bytes char
              return ( char ) (
              // 1111-0ttt 10uu-vvvv 10xx-xxyy 10zz-zzzz -> 000t-tt00
              // 0000-0000 0000-0000
              ( ( bytes[pos] & 0x07 ) << 18 ) +
              // 1111-0ttt 10uu-vvvv 10xx-xxyy 10zz-zzzz -> 0000-00uu
              // 0000-0000 0000-0000
                  ( ( bytes[pos + 1] & 0x30 ) << 12 ) +
                  // 1111-0ttt 10uu-vvvv 10xx-xxyy 10zz-zzzz -> 0000-0000
                  // vvvv-0000 0000-0000
                  ( ( bytes[pos + 1] & 0x0F ) << 12 ) +
                  // 1111-0ttt 10uu-vvvv 10xx-xxyy 10zz-zzzz -> 0000-0000
                  // 0000-xxxx 0000-0000
                  ( ( bytes[pos + 2] & 0x3C ) << 6 ) +
                  // 1111-0ttt 10uu-vvvv 10xx-xxyy 10zz-zzzz -> 0000-0000
                  // 0000-0000 yy00-0000
                  ( ( bytes[pos + 2] & 0x03 ) << 6 ) +
              // 1111-0ttt 10uu-vvvv 10xx-xxyy 10zz-zzzz -> 0000-0000
              // 0000-0000 00zz-zzzz
              ( bytes[pos + 3] & 0x3F )
              // -> 000t-ttuu vvvv-xxxx yyzz-zzzz (1FFFFF)
              );
          }
          else if ( ( bytes[pos] & UTF8_FIVE_BYTES_MASK ) == UTF8_FIVE_BYTES )
          {
              // Five bytes char
              return ( char ) (
              // 1111-10tt 10uu-uuuu 10vv-wwww 10xx-xxyy 10zz-zzzz ->
              // 0000-00tt 0000-0000 0000-0000 0000-0000
              ( ( bytes[pos] & 0x03 ) << 24 ) +
              // 1111-10tt 10uu-uuuu 10vv-wwww 10xx-xxyy 10zz-zzzz ->
              // 0000-0000 uuuu-uu00 0000-0000 0000-0000
                  ( ( bytes[pos + 1] & 0x3F ) << 18 ) +
                  // 1111-10tt 10uu-uuuu 10vv-wwww 10xx-xxyy 10zz-zzzz ->
                  // 0000-0000 0000-00vv 0000-0000 0000-0000
                  ( ( bytes[pos + 2] & 0x30 ) << 12 ) +
                  // 1111-10tt 10uu-uuuu 10vv-wwww 10xx-xxyy 10zz-zzzz ->
                  // 0000-0000 0000-0000 wwww-0000 0000-0000
                  ( ( bytes[pos + 2] & 0x0F ) << 12 ) +
                  // 1111-10tt 10uu-uuuu 10vv-wwww 10xx-xxyy 10zz-zzzz ->
                  // 0000-0000 0000-0000 0000-xxxx 0000-0000
                  ( ( bytes[pos + 3] & 0x3C ) << 6 ) +
                  // 1111-10tt 10uu-uuuu 10vv-wwww 10xx-xxyy 10zz-zzzz ->
                  // 0000-0000 0000-0000 0000-0000 yy00-0000
                  ( ( bytes[pos + 3] & 0x03 ) << 6 ) +
              // 1111-10tt 10uu-uuuu 10vv-wwww 10xx-xxyy 10zz-zzzz ->
              // 0000-0000 0000-0000 0000-0000 00zz-zzzz
              ( bytes[pos + 4] & 0x3F )
              // -> 0000-00tt uuuu-uuvv wwww-xxxx yyzz-zzzz (03 FF FF FF)
              );
          }
          else if ( ( bytes[pos] & UTF8_SIX_BYTES_MASK ) == UTF8_SIX_BYTES )
          {
              // Six bytes char
              return ( char ) (
              // 1111-110s 10tt-tttt 10uu-uuuu 10vv-wwww 10xx-xxyy 10zz-zzzz
              // ->
              // 0s00-0000 0000-0000 0000-0000 0000-0000
              ( ( bytes[pos] & 0x01 ) << 30 ) +
              // 1111-110s 10tt-tttt 10uu-uuuu 10vv-wwww 10xx-xxyy 10zz-zzzz
              // ->
                  // 00tt-tttt 0000-0000 0000-0000 0000-0000
                  ( ( bytes[pos + 1] & 0x3F ) << 24 ) +
                  // 1111-110s 10tt-tttt 10uu-uuuu 10vv-wwww 10xx-xxyy
                  // 10zz-zzzz ->
                  // 0000-0000 uuuu-uu00 0000-0000 0000-0000
                  ( ( bytes[pos + 2] & 0x3F ) << 18 ) +
                  // 1111-110s 10tt-tttt 10uu-uuuu 10vv-wwww 10xx-xxyy
                  // 10zz-zzzz ->
                  // 0000-0000 0000-00vv 0000-0000 0000-0000
                  ( ( bytes[pos + 3] & 0x30 ) << 12 ) +
                  // 1111-110s 10tt-tttt 10uu-uuuu 10vv-wwww 10xx-xxyy
                  // 10zz-zzzz ->
                  // 0000-0000 0000-0000 wwww-0000 0000-0000
                  ( ( bytes[pos + 3] & 0x0F ) << 12 ) +
                  // 1111-110s 10tt-tttt 10uu-uuuu 10vv-wwww 10xx-xxyy
                  // 10zz-zzzz ->
                  // 0000-0000 0000-0000 0000-xxxx 0000-0000
                  ( ( bytes[pos + 4] & 0x3C ) << 6 ) +
                  // 1111-110s 10tt-tttt 10uu-uuuu 10vv-wwww 10xx-xxyy
                  // 10zz-zzzz ->
                  // 0000-0000 0000-0000 0000-0000 yy00-0000
                  ( ( bytes[pos + 4] & 0x03 ) << 6 ) +
              // 1111-110s 10tt-tttt 10uu-uuuu 10vv-wwww 10xx-xxyy 10zz-zzzz
              // ->
              // 0000-0000 0000-0000 0000-0000 00zz-zzzz
              ( bytes[pos + 5] & 0x3F )
              // -> 0stt-tttt uuuu-uuvv wwww-xxxx yyzz-zzzz (7F FF FF FF)
              );
          }
          else
          {
              return ( char ) -1;
          }
      }
  }
}
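
Because the method above casts its result to `char`, anything decoded from a four-byte (or longer) sequence is truncated to 16 bits. A minimal sketch that returns the code point as an `int` and checks array bounds and continuation bytes is shown below. The class and method names are hypothetical, and the sketch deliberately skips overlong-form and surrogate validation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original utility.
class Utf8CodePointDemo {
    /** Decodes the code point starting at pos; returns -1 on malformed or truncated input. */
    static int codePointAt(byte[] bytes, int pos) {
        if (bytes == null || pos < 0 || pos >= bytes.length) {
            return -1;
        }
        int b0 = bytes[pos] & 0xFF;
        int extra; // number of continuation bytes expected
        int cp;    // payload bits collected so far
        if (b0 < 0x80) {
            return b0;
        } else if ((b0 & 0xE0) == 0xC0) {
            extra = 1;
            cp = b0 & 0x1F;
        } else if ((b0 & 0xF0) == 0xE0) {
            extra = 2;
            cp = b0 & 0x0F;
        } else if ((b0 & 0xF8) == 0xF0) {
            extra = 3;
            cp = b0 & 0x07;
        } else {
            return -1; // continuation byte or an obsolete 5/6-byte lead
        }
        if (pos + extra >= bytes.length) {
            return -1; // sequence runs past the end of the array
        }
        for (int i = 1; i <= extra; i++) {
            int b = bytes[pos + i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                return -1; // not of the form 10xxxxxx
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }

    public static void main(String[] args) {
        byte[] euro = "\u20ac".getBytes(StandardCharsets.UTF_8);
        byte[] face = "\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(String.format("U+%04X", codePointAt(euro, 0)));
        System.out.println(String.format("U+%04X", codePointAt(face, 0)));
    }
}
```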

Return the Unicode char which is coded in the bytes at the given position.

/*
 *  Licensed to the Apache Software Foundation (ASF) under one
 *  or more contributor license agreements.  See the NOTICE file
 *  distributed with this work for additional information
 *  regarding copyright ownership.  The ASF licenses this file
 *  to you under the Apache License, Version 2.0 (the
 *  "License"); you may not use this file except in compliance
 *  with the License.  You may obtain a copy of the License at
 *  
 *    http://www.apache.org/licenses/LICENSE-2.0
 *  
 *  Unless required by applicable law or agreed to in writing,
 *  software distributed under the License is distributed on an
 *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 *  KIND, either express or implied.  See the License for the
 *  specific language governing permissions and limitations
 *  under the License. 
 *  
 */

/**
 * Various string manipulation methods that are more efficient than chaining
 * string operations: all is done in the same buffer without creating a bunch of
 * string objects.
 * 
 * @author 
 */
public class Main {
  private static final int CHAR_ONE_BYTE_MASK = 0xFFFFFF80;
  private static final int CHAR_TWO_BYTES_MASK = 0xFFFFF800;
  private static final int CHAR_THREE_BYTES_MASK = 0xFFFF0000;
  private static final int CHAR_FOUR_BYTES_MASK = 0xFFE00000;
  private static final int CHAR_FIVE_BYTES_MASK = 0xFC000000;
  private static final int CHAR_SIX_BYTES_MASK = 0x80000000;
  
  /**
   * Return the UTF-8 encoded bytes of the given Unicode char.
   * 
   * @param car The character to be transformed to an array of bytes
   * 
   * @return The byte array representing the char 
   * 
   * TODO : Should stop after the third byte, as a char is only 2 bytes long.
   */
  public static final byte[] charToBytes( char car )
  {
      byte[] bytes = new byte[countNbBytesPerChar( car )];
      if ( ( car | 0x7F ) == 0x7F )
      {
          // Single byte char
          bytes[0] = ( byte ) car;
          return bytes;
      }
      else if ( ( car | 0x7FF ) == 0x7FF )
      {
          // two bytes char
          bytes[0] = ( byte ) ( 0x00C0 + ( ( car & 0x07C0 ) >> 6 ) );
          bytes[1] = ( byte ) ( 0x0080 + ( car & 0x3F ) );
      }
      else
      {
          // Three bytes char
          bytes[0] = ( byte ) ( 0x00E0 + ( ( car & 0xF000 ) >> 12 ) );
          bytes[1] = ( byte ) ( 0x0080 + ( ( car & 0x0FC0 ) >> 6 ) );
          bytes[2] = ( byte ) ( 0x0080 + ( car & 0x3F ) );
      }
      return bytes;
  }
  
  
  /**
   * Return the number of bytes that hold an Unicode char.
   * 
   * @param car
   *            The character to be decoded
   * @return The number of bytes to hold the char. TODO : Should stop after
   *         the third byte, as a char is only 2 bytes long.
   */
  public static final int countNbBytesPerChar( char car )
  {
      if ( ( car & CHAR_ONE_BYTE_MASK ) == 0 )
      {
          return 1;
      }
      else if ( ( car & CHAR_TWO_BYTES_MASK ) == 0 )
      {
          return 2;
      }
      else if ( ( car & CHAR_THREE_BYTES_MASK ) == 0 )
      {
          return 3;
      }
      else if ( ( car & CHAR_FOUR_BYTES_MASK ) == 0 )
      {
          return 4;
      }
      else if ( ( car & CHAR_FIVE_BYTES_MASK ) == 0 )
      {
          return 5;
      }
      else if ( ( car & CHAR_SIX_BYTES_MASK ) == 0 )
      {
          return 6;
      }
      else
      {
          return -1;
      }
  }
}
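
The encoding direction can be sketched for full code points as well, with the JDK encoder as a reference to compare against. `Utf8EncodeDemo`, `encode` and `matchesJdk` are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original utility.
class Utf8EncodeDemo {
    /** Encodes one code point as UTF-8; rejects invalid code points and lone surrogates. */
    static byte[] encode(int cp) {
        if (!Character.isValidCodePoint(cp) || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not an encodable code point: " + cp);
        }
        if (cp < 0x80) {
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    /** True if the hand-written encoder agrees with String.getBytes for this code point. */
    static boolean matchesJdk(int cp) {
        byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(expected, encode(cp));
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 0x03C0, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            System.out.println(String.format("U+%04X matches JDK: %b", cp, matchesJdk(cp)));
        }
    }
}
```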

Return UTF-8 encoded byte[] representation of a String

import java.io.UnsupportedEncodingException;
/*
 *  Licensed to the Apache Software Foundation (ASF) under one
 *  or more contributor license agreements.  See the NOTICE file
 *  distributed with this work for additional information
 *  regarding copyright ownership.  The ASF licenses this file
 *  to you under the Apache License, Version 2.0 (the
 *  "License"); you may not use this file except in compliance
 *  with the License.  You may obtain a copy of the License at
 *  
 *    http://www.apache.org/licenses/LICENSE-2.0
 *  
 *  Unless required by applicable law or agreed to in writing,
 *  software distributed under the License is distributed on an
 *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 *  KIND, either express or implied.  See the License for the
 *  specific language governing permissions and limitations
 *  under the License. 
 *  
 */

/**
 * Various string manipulation methods that are more efficient than chaining
 * string operations: all is done in the same buffer without creating a bunch of
 * string objects.
 * 
 * @author 
 */
public class Main {
  /**
   * Return UTF-8 encoded byte[] representation of a String
   * 
   * @param string
   *            The string to be transformed to a byte array
   * @return The transformed byte array
   */
  public static final byte[] getBytesUtf8( String string )
  {
      if ( string == null )
      {
          return new byte[0];
      }
      try
      {
          return string.getBytes( "UTF-8" );
      }
      catch ( UnsupportedEncodingException uee )
      {
          return new byte[]
              {};
      }
  }
}
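
To see what the returned array contains, it can help to print the bytes in hex. The following is a small sketch, assuming Java 7 or later for `StandardCharsets`; `Utf8BytesDemo` and `toHex` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original utility.
class Utf8BytesDemo {
    static byte[] getBytesUtf8(String string) {
        return (string == null) ? new byte[0] : string.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(getBytesUtf8("\u03c0"))); // Greek small letter pi
        System.out.println(toHex(getBytesUtf8("A\u20ac"))); // 'A' followed by the euro sign
    }
}
```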

Safe UTF: 64K serialized size

/*
  * JBoss, Home of Professional Open Source
  * Copyright 2005, JBoss Inc., and individual contributors as indicated
  * by the @authors tag. See the copyright.txt in the distribution for a
  * full listing of individual contributors.
  *
  * This is free software; you can redistribute it and/or modify it
  * under the terms of the GNU Lesser General Public License as
  * published by the Free Software Foundation; either version 2.1 of
  * the License, or (at your option) any later version.
  *
  * This software is distributed in the hope that it will be useful,
  * but WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  * Lesser General Public License for more details.
  *
  * You should have received a copy of the GNU Lesser General Public
  * License along with this software; if not, write to the Free
  * Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
  * 02110-1301 USA, or see the FSF site: http://www.fsf.org.
  */
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
/**
 * 
 * A SafeUTF
 * 
 * @author 
 * @version $Revision: 1174 $
 *
 * $Id: SafeUTF.java 1174 2006-08-02 14:14:32Z timfox $
 * 
 * There is a "bug" in JDK1.4 / 1.5 DataOutputStream.writeUTF()
 * which means it does not work with Strings >= 64K serialized size.
 * See http://bugs.sun.ru/bugdatabase/view_bug.do?bug_id=4806007
 * 
 * We work around this by chunking larger strings into smaller pieces.
 * 
 * Note we only support TextMessage and ObjectMessage bodies with serialized length >= 64K
 * We DO NOT support Strings written to BytesMessages or StreamMessages or written as keys or values
 * in MapMessages, or as String properties or other String fields having serialized length >= 64K
 * This is for performance reasons since there is an overhead in coping with large
 * Strings
 * 
 */
public class SafeUTF
{      
   //Default is 16K chunks
   private static final int CHUNK_SIZE = 16 * 1024;
   
   private static final byte NULL = 0;
   
   private static final byte NOT_NULL = 1;
   
   public static SafeUTF instance = new SafeUTF(CHUNK_SIZE);
   
   private int chunkSize;
   
   private int lastReadBufferSize;
   
   public int getLastReadBufferSize()
   {
      return lastReadBufferSize;
   }
   
   public SafeUTF(int chunkSize)
   {
      this.chunkSize = chunkSize;
   }
      
   public void safeWriteUTF(DataOutputStream out, String str) throws IOException
   {        
      if (str == null)
      {
         out.writeByte(NULL);
      }
      else
      {         
         int len = str.length();
          
         short numChunks;
         
         if (len == 0)
         {
            numChunks = 0;
         }
         else
         {
            numChunks = (short)(((len - 1) / chunkSize) + 1);
         }         
         
         out.writeByte(NOT_NULL);
         
         out.writeShort(numChunks);
              
         int i = 0;
         while (len > 0)
         {
            int beginCopy = i * chunkSize;
            
            int endCopy = len <= chunkSize ? beginCopy + len : beginCopy + chunkSize;
     
            String theChunk = str.substring(beginCopy, endCopy);
               
            out.writeUTF(theChunk);
            
            len -= chunkSize;
            
            i++;
         }
      }
   }
   
   public String safeReadUTF(DataInputStream in) throws IOException
   {   
      boolean isNull = in.readByte() == NULL;
      
      if (isNull)
      {
         return null;
      }
      
      short numChunks = in.readShort();
      
      int bufferSize = chunkSize * numChunks;
      
      // special handling for single chunk
      if (numChunks == 1)
      {
         // The text size is likely to be much smaller than the chunkSize
         // so set bufferSize to the min of the input stream available
         // and the maximum buffer size. Since the input stream
         // available() can be <= 0 we check for that and default to
         // a small msg size of 256 bytes.
         
         int inSize = in.available();
               
         if (inSize <= 0)
         {
            inSize = 256;
         }
         bufferSize = Math.min(inSize, bufferSize);
         
         lastReadBufferSize = bufferSize;
      }
        
      StringBuffer buff = new StringBuffer(bufferSize);
            
      for (int i = 0; i < numChunks; i++)
      {
         String s = in.readUTF();
         buff.append(s);
      }
      
      return buff.toString();
   }
      
}
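
The chunking idea can be exercised in memory with a stripped-down sketch. It is not the SafeUTF class above: the class `ChunkedUtfDemo`, its `int` chunk count and its method names are assumptions made for this illustration. It first shows that a plain `writeUTF` rejects a string whose encoded form exceeds 65535 bytes, then round-trips the same string in 16K-char chunks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class, a simplified stand-in for the SafeUTF idea.
class ChunkedUtfDemo {
    // 16K chars encode to at most 48K bytes, which stays below the 64K writeUTF limit.
    static final int CHUNK_CHARS = 16 * 1024;

    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    /** True if DataOutputStream.writeUTF refuses the string as too long. */
    static boolean plainWriteUtfFails(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] writeChunked(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt((s.length() + CHUNK_CHARS - 1) / CHUNK_CHARS); // chunk count
            for (int start = 0; start < s.length(); start += CHUNK_CHARS) {
                out.writeUTF(s.substring(start, Math.min(s.length(), start + CHUNK_CHARS)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static String readChunked(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int chunks = in.readInt();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < chunks; i++) {
                sb.append(in.readUTF());
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String big = repeat('x', 100000);
        System.out.println(plainWriteUtfFails(big));
        System.out.println(readChunked(writeChunked(big)).equals(big));
    }
}
```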

Using Unicode in String

The Greek letter π (pi), for example, is \u03C0.

public class MainClass {
  public static void main(String[] arg) {
    String s = "\u03C0";
    
    System.out.println(s);
  }
}
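
A `\uXXXX` escape names a single 16-bit `char`, so characters above U+FFFF need two escapes (a surrogate pair). The sketch below lists the code points of a string to make the difference visible; `UnicodeEscapeDemo` and `describe` are hypothetical names, and `CharSequence.codePoints()` assumes Java 8 or later.

```java
// Hypothetical demo class, not part of the original tutorial.
class UnicodeEscapeDemo {
    /** Lists the code points of a string in U+XXXX notation, separated by spaces. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String pi = "\u03C0";          // one char, one code point
        String face = "\uD83D\uDE00";  // two chars (a surrogate pair), one code point
        System.out.println(describe(pi) + " length=" + pi.length());
        System.out.println(describe(face) + " length=" + face.length()
            + " codePoints=" + face.codePointCount(0, face.length()));
    }
}
```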
