We saw how using scanf and char to read UTF-8 strings led us to some strange answers. So now we need to discuss the solution provided by C: wide characters.
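/**
 * Program to read a single character of different languages using a wchar_t
 * array and scanf. The program prints back the string along with its length.
 */
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    wchar_t string[100];

    /* Adopt the locale configured in the user's environment. */
    setlocale(LC_ALL, "");

    printf("Enter a single character and press enter: ");
    /* Limit the input to 99 wide characters plus the terminator. */
    if (scanf("%99ls", string) != 1)
        return 1;

    /* wcslen returns size_t, so print it with %zu. */
    printf("String Entered: %ls: length: %zu\n", string, wcslen(string));
    return 0;
}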
Let’s see the various aspects of this program.
- Use of wchar_t instead of char. wchar_t is the type C uses for wide characters, which lets a program deal with the characters of various locales. Note that UTF-8 is not the only encoding a locale can use, but most of the other encodings focus on a particular language. A wchar_t is wider than a char (1 byte); on Linux with glibc it is typically 4 bytes, so a single wchar_t can represent a character from any of a large number of languages.
- To read and print a wide character string, we use the %ls format instead of %s. This directs scanf to convert the multibyte (here UTF-8) input into wide characters, and printf to convert the wide characters back into multibyte output, according to the current locale.
- Use of wcslen instead of strlen to get the length of the string. The C library provides wcslen to count the wide characters in a wide character string, whereas strlen counts bytes.
- Use of setlocale. A locale is divided into categories, and a program does not have to adopt all of them. For example, in some cases the locale treatment only involves the date or currency representation. Here we used LC_ALL to adopt all the locale-specific features, and the empty string "" tells setlocale to take the locale from the environment.
Let’s see it in action. First, I need to show that my locale uses UTF-8.
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
On executing the program, I first enter the English character a.
$ ./a.out
Enter a single character and press enter: a
String Entered: a: length: 1
The output is as expected: we entered a single character, so the length is 1.
Next, I use the French character é
$ ./a.out
Enter a single character and press enter: é
String Entered: é: length: 1
So we got the length 1, as expected.
Let’s try the same experiment with the Chinese character 诶.
$ ./a.out
Enter a single character and press enter: 诶
String Entered: 诶: length: 1
Once again, we got what we were looking for, the length 1.
Thus, we must ensure that our software handles UTF-8 strings correctly. It is also important to use the right functions: wchar_t, %ls and wcslen in place of char, %s and strlen, together with a call to setlocale.