 unac(3)                                                             unac(3)
                                    local



 NAME
      unac - remove accents from string or character


 SYNOPSIS
      #include <unac.h>

      const char* unac_version();

      int unac_string(const char* charset,
                const char* in, int in_length,
                char** out, int* out_length);

      unsigned short c = UTF16_A_GRAVE;
      unsigned short* unaccented;
      int length;
      unac_char_utf16(c, unaccented, length);

      int unac_string_utf16(const char* in, int in_length,
                char** out, int* out_length);



 DESCRIPTION
      unac is a C library that removes accents from a string or character.
      unac_string converts the input string from the specified charset to
      UTF-16 and calls unac_string_utf16 to return the unaccented
      equivalent. The conversion from and to UTF-16 is done with iconv(3).
      unac_char_utf16 is a macro that efficiently returns a pointer to the
      unaccented equivalent of any UTF-16 character. A UTF-16 character
      such as the ligature fi (U+FB01) is expanded into two characters, f
      and i. unac_string_utf16 repeatedly applies the unac_char_utf16
      macro to each character of a UTF-16 string.
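
      The per-character loop can be pictured with the sketch below. It is
      not the library's code: toy_unac_char is a hypothetical stand-in for
      unac_char_utf16 that knows only two characters, and toy_unac_utf16be
      assumes big-endian input and an output buffer of at least
      2 * in_length bytes supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for unac_char_utf16: knows U+00E9 and U+FB01 only. */
static size_t toy_unac_char(unsigned short c, const unsigned short** p)
{
    static const unsigned short e[] = { 'e' };
    static const unsigned short fi[] = { 'f', 'i' };
    switch (c) {
    case 0x00E9: *p = e;    return 1;   /* e with acute accent -> e */
    case 0xFB01: *p = fi;   return 2;   /* ligature fi -> f, i */
    default:     *p = NULL; return 0;   /* no replacement known */
    }
}

/* Unaccent a UTF-16BE byte string; returns the number of bytes written.
   out must hold at least 2 * in_length bytes (toy worst case). */
static size_t toy_unac_utf16be(const unsigned char* in, size_t in_length,
                               unsigned char* out)
{
    size_t i, k, n = 0;
    assert(in_length % 2 == 0);         /* no partially formed characters */
    for (i = 0; i < in_length; i += 2) {
        unsigned short c = (unsigned short)((in[i] << 8) | in[i + 1]);
        const unsigned short* p;
        size_t l = toy_unac_char(c, &p);
        if (l == 0) {                   /* keep the character unchanged */
            p = &c;
            l = 1;
        }
        for (k = 0; k < l; k++) {       /* emit big-endian code units */
            out[n++] = (unsigned char)(p[k] >> 8);
            out[n++] = (unsigned char)(p[k] & 0xFF);
        }
    }
    return n;
}
```

      The output can be longer than the input, which is why the buffer is
      sized for the worst case before the loop starts.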

      The UTF-16 strings manipulated by unac must always be big endian.
      When using iconv(3) to translate strings, use UTF-16BE instead of
      UTF-16 to ensure that the result is big endian (BE). For more
      information see RFC 2781 (http://www.faqs.org/rfcs/rfc2781.html).
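
      A minimal sketch of such a conversion with iconv(3) follows. The
      helper name to_utf16be and its buffer sizing (four output bytes per
      input byte) are assumptions made for illustration; they are not part
      of unac.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>

/* Convert in (in_length bytes encoded as charset) to UTF-16BE.
   Returns a malloc'ed buffer and stores its byte length in *out_length,
   or returns NULL on error. */
static char* to_utf16be(const char* charset, const char* in,
                        size_t in_length, size_t* out_length)
{
    iconv_t cd;
    char* src = (char*)in;              /* iconv takes char**, input is only read */
    char* buf;
    char* dst;
    size_t src_left = in_length;
    size_t cap = 4 * in_length + 2;     /* assumed upper bound, never zero */
    size_t dst_left = cap;

    assert(in != NULL && out_length != NULL);
    cd = iconv_open("UTF-16BE", charset);   /* target first, then source */
    if (cd == (iconv_t)-1)
        return NULL;
    buf = malloc(cap);
    if (buf == NULL) {
        iconv_close(cd);
        return NULL;
    }
    dst = buf;
    if (iconv(cd, &src, &src_left, &dst, &dst_left) == (size_t)-1) {
        free(buf);
        iconv_close(cd);
        return NULL;
    }
    iconv_close(cd);
    *out_length = cap - dst_left;       /* bytes actually produced */
    return buf;
}
```

      Naming UTF-16BE explicitly fixes the byte order, so the high byte of
      each 16-bit unit comes first. For example, the five UTF-8 bytes of
      the word used in EXAMPLES come out as six bytes: 00 E9 00 74 00 E9.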

      The unac library uses the Unicode database to map accented letters to
      their unaccented equivalents. Mapping tables are generated from the
      UnicodeData-3.0.1.txt file by the builder perl script. Those tables
      are inserted in the unac.h and unac.c files, replacing the existing
      ones. Instead of one simple table mapping each Unicode character value
      to the sequence of unaccented characters, three tables are used in
      order to reduce the space used. The library occupies less than
      25 Kbytes, where a single table would occupy around 512 Kbytes. The
      compression relies on the fact that many Unicode characters have no
      unaccented equivalent. The Unicode charset is divided into blocks of
      32 entries and each block is stored separately. An index is used to
      map a given block number to a block variable containing data for the
      unaccentuation mapping. Because many blocks contain only zeros, they
      all point to the same variable, which reduces the space used. Besides
      this simple optimization, a table listing the actual position of each
      unaccented replacement within a block is necessary because the
      replacements are not of fixed length. Some characters, such as the
      ligature fi, are replaced by two characters, f and i. The unaccented
      equivalent of a UTF-16 character is calculated by applying a
      compatibility decomposition and then stripping all characters that
      belong to the mark category. For a precise definition see the
      Unicode-3.0 normalization forms at
      http://www.unicode.org/unicode/reports/tr15/.
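
      The three-table layout can be sketched as follows. Everything here
      is a toy written for illustration: the toy_ names, the table
      contents and the two populated blocks are assumptions, not the
      tables that the builder script generates.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLOCK_SHIFT 5                       /* 2^5 = 32 entries per block */
#define TOY_BLOCK_MASK  ((1u << TOY_BLOCK_SHIFT) - 1)

/* Table 1: block data, the replacements of one block laid end to end. */
static const unsigned short toy_data_empty[] = { 0 };
static const unsigned short toy_data_00e0[] = { 'e' };     /* U+00E9 only */
static const unsigned short toy_data_fb00[] = {            /* U+FB00..U+FB04 */
    'f', 'f',  'f', 'i',  'f', 'l',  'f', 'f', 'i',  'f', 'f', 'l'
};
static const unsigned short* const toy_data[] = {
    toy_data_empty, toy_data_00e0, toy_data_fb00
};

/* Table 2: positions; entry i of a block spans [pos[i], pos[i + 1]). */
static const unsigned char toy_positions[3][33] = {
    { 0 },                                      /* all zero: nothing to replace */
    { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,             /* slot 9 (U+00E9) spans [0, 1) */
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 },
    { 0, 2, 4, 6, 9, 12,                        /* variable-length entries */
      12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12 }
};

/* Table 3: index from block number to block id; unlisted blocks share id 0. */
static const unsigned char toy_index[0x10000 >> TOY_BLOCK_SHIFT] = {
    [0x00E0 >> TOY_BLOCK_SHIFT] = 1,
    [0xFB00 >> TOY_BLOCK_SHIFT] = 2
};

/* Store a pointer to the replacement of c in *p and return its length;
   the length is 0 and *p is NULL when c has no replacement. */
static size_t toy_lookup(unsigned short c, const unsigned short** p)
{
    unsigned block, slot, start, end;
    assert(p != NULL);
    block = toy_index[c >> TOY_BLOCK_SHIFT];
    slot = c & TOY_BLOCK_MASK;
    start = toy_positions[block][slot];
    end = toy_positions[block][slot + 1];
    *p = (end > start) ? toy_data[block] + start : NULL;
    return end - start;
}
```

      All blocks without replacements resolve to block id 0, so their
      cost is one index entry each rather than 32 table entries.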


 FUNCTIONS
      const char* unac_version()

           Return the version number of unac.


      int unac_string(const char* charset, const char* in, int in_length,
                char** out, int* out_length)

           Return the unaccented equivalent of the string in, of length
           in_length. The returned string is stored in the pointer pointed
           to by the out argument and its length is stored in the integer
           pointed to by the out_length argument. If the *out pointer is
           not null, it must point to an area allocated by malloc(3) and
           the length of that area must be specified in *out_length. Both
           *out and *out_length are replaced by return values before the
           function returns. The *out area may be reallocated by the
           function, so the caller must not assume that the pointer value
           remains the same. If the *out pointer is null, unac_string
           allocates a new area using malloc(3). It is the responsibility
           of the caller to deallocate the area returned in the *out
           pointer.

           The return value is 0 on success and -1 on error, in which case
           the errno variable is set to the corresponding error code. See
           iconv(3) for the meaning of the error codes.
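
           The buffer contract can be illustrated with a stand-in that
           follows the same rules. toy_copy_string is hypothetical and
           only copies its input; it shows how *out and *out_length
           behave across calls, not how unac_string is implemented.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in that obeys the out/out_length contract described
   above. It copies in unchanged and adds a terminating NUL for convenience. */
static int toy_copy_string(const char* in, int in_length,
                           char** out, int* out_length)
{
    char* buf = *out;
    assert(in != NULL && in_length >= 0);
    if (buf == NULL || *out_length < in_length + 1) {
        /* realloc(NULL, n) behaves like malloc(n) */
        buf = realloc(*out, (size_t)in_length + 1);
        if (buf == NULL)
            return -1;                  /* *out is left untouched */
    }
    memcpy(buf, in, (size_t)in_length);
    buf[in_length] = '\0';
    *out = buf;                         /* may differ from the value passed in */
    *out_length = in_length;            /* length of the result */
    return 0;
}
```

           A caller can therefore start with a null pointer, pass the
           same out and out_length variables to several calls, reread
           *out after each call, and call free(3) once at the end.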


      int unac_string_utf16(const char* in, int in_length,
                char** out, int* out_length)

           Alias of unac_string("UTF-16", in, in_length, out, out_length).
           Because unac_string_utf16 is the backend function of
           unac_string, it is more efficient: no conversion of the input
           string is necessary.


      void unac_char_utf16(const unsigned short c, unsigned short* p,
                unsigned short l)

           Warning: this is a macro; each argument may be evaluated more
           than once. Return the unaccented equivalent of the UTF-16
           character c in the pointer p. The length of the unsigned short
           array pointed to by p is returned in the l argument.


 ERRORS
      EINVAL

           the requested conversion pair is not available.


 EXAMPLES
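      Convert the ISO-8859-1 string été into ete.
      char* out = 0;
      int out_length = 0;
      /* "\xe9t\xe9" is été encoded in ISO-8859-1 (3 bytes) */
      if(unac_string("ISO-8859-1", "\xe9t\xe9", 3, &out, &out_length) < 0)
        perror("unac_string");
      free(out); /* the caller owns the area returned in out */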


 BUGS
      The input string must not contain partially formed characters; there
      is no support for this case.

      UTF-16 surrogates are not handled.
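
      A caller that may receive characters outside the Basic Multilingual
      Plane can screen its input first. The helper below is a sketch with
      a hypothetical name, assuming a UTF-16BE byte string of even length.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the UTF-16BE byte string contains a surrogate code unit
   (U+D800..U+DFFF), 0 otherwise. */
static int utf16be_has_surrogate(const unsigned char* s, size_t length)
{
    size_t i;
    assert(s != NULL && length % 2 == 0);
    for (i = 0; i < length; i += 2) {
        /* In big-endian order the first byte of a unit is its high byte,
           and surrogates are exactly the units whose high byte is D8..DF. */
        if (s[i] >= 0xD8 && s[i] <= 0xDF)
            return 1;
    }
    return 0;
}
```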


 SEE ALSO
      unaccent(1), iconv(3)
      http://www.unicode.org/
      http://www.sourceforge.net/projects/ustring/
      http://www.alphaworks.ibm.com/tech/icu/
      http://www.gnu.org/manual/glibc-2.0.6/libc.html


 AUTHOR
      Loic Dachary loic@senga.org
      http://www.senga.org/unac/

                                    - 3 -      Formatted:  November 14, 2024