unac(3)                                                        unac(3)
                                 local

NAME
       unac - remove accents from string or character

SYNOPSIS
       #include <unac.h>

       const char* unac_version();

       int unac_string(const char* charset,
                       const char* in, int in_length,
                       char** out, int* out_length);

       unsigned short c = UTF16_A_GRAVE;
       unsigned short* unaccented;
       int length;
       unac_char_utf16(c, unaccented, length);

       int unac_string_utf16(const char* in, int in_length,
                             char** out, int* out_length);

DESCRIPTION
       unac is a C library that removes accents from a string or
       character.

       unac_string converts the input string from the specified charset
       to UTF-16 and calls unac_string_utf16 to return the unaccented
       equivalent. The conversion from and to UTF-16 is done with
       iconv(3).

       unac_char_utf16 is a macro that efficiently returns a pointer to
       the unaccented equivalent of any UTF-16 character. A UTF-16
       character such as ﬁ will be expanded into the two characters f
       and i.

       unac_string_utf16 repeatedly applies the unac_char_utf16 macro to
       each character of a UTF-16 string.

       The endianness of the UTF-16 strings manipulated by unac must
       always be big endian. When using iconv(3) to translate strings,
       one should use UTF-16BE instead of UTF-16 to ensure that the
       result is big endian (BE). For more information check RFC2781
       (http://www.faqs.org/rfcs/rfc2781.html).

       The unac library uses the Unicode database to map accented
       letters to their unaccented equivalents. Mapping tables are
       generated from the UnicodeData-3.0.1.txt file by the builder perl
       script. Those tables are inserted in the unac.h and unac.c files,
       replacing the existing ones.

       Instead of one simple table mapping each Unicode character value
       to the sequence of unaccented characters, three tables are used
       in order to reduce the space used. The library occupies less than
       25 Kbytes, where a single table would occupy around 512 Kbytes.
       The idea used to compress the tables is that many Unicode
       characters do not have an unaccented equivalent.
       The Unicode charset is divided into blocks of 32 entries and each
       block is stored separately. An index is used to map a given block
       number to a block variable containing data for the unaccentuation
       mapping. As many blocks contain only zeros, they all point to the
       same variable, which reduces the space used. Besides this simple
       optimization, a table listing the actual position of the
       unaccented replacement within a block is necessary because
       replacements are not of fixed length. Some characters such as the
       ligature ﬁ will be replaced by the two characters f and i.

       The unaccented equivalent of a UTF-16 character is calculated by
       applying a compatibility decomposition and then stripping all
       characters that belong to the mark category. For a precise
       definition see the Unicode-3.0 normalization forms at
       http://www.unicode.org/unicode/reports/tr15/.

FUNCTIONS
       const char* unac_version()
              Return the version number of unac.

       int unac_string(const char* charset, const char* in,
                       int in_length, char** out, int* out_length)
              Return the unaccented equivalent of the string in of
              length in_length. The returned string is stored in the
              pointer pointed to by the out argument and the length of
              the string is stored in the integer pointed to by the
              out_length argument. If the *out pointer is not null, it
              must point to an area allocated by malloc(3) and the
              length of that area must be specified in the *out_length
              argument. Both *out and *out_length will be replaced by
              return values before the function returns. The *out area
              may be reallocated by the function and the caller must
              not assume that the pointer value remains the same. If
              the *out pointer is null, unac_string allocates a new
              area using malloc(3). It is the responsibility of the
              caller to deallocate the area returned in the *out
              pointer. The return value is 0 on success and -1 on
              error, in which case the errno variable is set to the
              corresponding error code. See iconv(3) for the meaning of
              the error codes.

       int unac_string_utf16(const char* in, int in_length,
                             char** out, int* out_length)
              Alias of unac_string("UTF-16", in, in_length, out,
              out_length). Since unac_string_utf16 is the backend
              function of unac_string, it is more efficient because no
              conversion of the input string is necessary.

       void unac_char_utf16(const unsigned short c, unsigned short* p,
                            unsigned short l)
              Warning: this is a macro, each argument may be evaluated
              more than once. Return the unaccented equivalent of the
              UTF-16 character c in the pointer p. The length of the
              unsigned short array pointed to by p is returned in the l
              argument.

ERRORS
       EINVAL the requested conversion pair is not available.

EXAMPLES
       Convert the string été into ete.

       char* out = 0;
       int out_length = 0;
       /* "\xe9t\xe9" is "été" encoded in ISO-8859-1 (3 bytes) */
       unac_string("ISO-8859-1", "\xe9t\xe9", 3, &out, &out_length);
       free(out);  /* the caller owns the returned area */

BUGS
       The input string must not contain partially formed characters;
       there is no support for this case.

       UTF-16 surrogates are not handled.

SEE ALSO
       unaccent(1), iconv(3)

       http://www.unicode.org/
       http://www.sourceforge.net/projects/ustring/
       http://www.alphaworks.ibm.com/tech/icu/
       http://www.gnu.org/manual/glibc-2.0.6/libc.html

AUTHOR
       Loic Dachary loic@senga.org

       http://www.senga.org/unac/

                     Formatted: November 14, 2024
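
The three-table layout described under DESCRIPTION (an index from block number to block variable, a per-block position table, and the replacement data) can be illustrated with a small sketch. This is not the code generated by the builder script: the names toy_index, toy_positions, toy_data and toy_unac_char are hypothetical, the tables are written by hand and hold only three mappings (à, é and the ligature ﬁ), and a zero-length result is assumed here to mean "no unaccented equivalent".

```c
#include <stddef.h>

#define BLOCK_SHIFT 5                          /* 32 entries per block */
#define BLOCK_MASK  ((1 << BLOCK_SHIFT) - 1)
#define BLOCK_COUNT (0x10000 >> BLOCK_SHIFT)   /* 2048 blocks of 32 */

/* Table 1 (hypothetical): block number -> block variable.
 * Every block not listed defaults to 0, the shared empty block. */
static const unsigned char toy_index[BLOCK_COUNT] = {
    [0x00E0 >> BLOCK_SHIFT] = 1,   /* U+00E0..U+00FF */
    [0xFB00 >> BLOCK_SHIFT] = 2,   /* U+FB00..U+FB1F */
};

/* Table 2 (hypothetical): for each block variable, 33 offsets into its
 * data. Entry i starts at [i] and ends at [i + 1], so replacements may
 * have different lengths and most entries have length 0. */
static const unsigned char toy_positions[3][33] = {
    { 0 },                                   /* empty block: all zero */
    { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,       /* entry 0 (U+00E0), entry 9 (U+00E9) */
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 },
    { 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2,       /* entry 1 (U+FB01) */
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 },
};

/* Table 3 (hypothetical): replacement characters of each block variable. */
static const unsigned short toy_data[3][2] = {
    { 0, 0 },        /* empty block, never read */
    { 'a', 'e' },    /* U+00E0 -> a, U+00E9 -> e */
    { 'f', 'i' },    /* U+FB01 -> f i */
};

/* Store a pointer to the replacement of c in *p and return its length.
 * Returns 0 and sets *p to NULL when c has no entry in the toy tables. */
static int toy_unac_char(unsigned short c, const unsigned short **p)
{
    unsigned block = toy_index[c >> BLOCK_SHIFT];
    unsigned entry = c & BLOCK_MASK;
    unsigned start = toy_positions[block][entry];
    unsigned end = toy_positions[block][entry + 1];

    *p = (end > start) ? &toy_data[block][start] : NULL;
    return (int)(end - start);
}
```

Calling toy_unac_char(0xFB01, &p) yields a length of 2 with p pointing at f and i, while an unaccented letter such as x falls into the shared empty block and yields a length of 0. The saving comes from the index: all 2046 blocks without any mapping cost one byte each instead of a full block of entries.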