Localization | Programming in Linux

We saw how using scanf and char to read UTF-8 strings led us to some strange answers. So now we need to discuss the solution provided by C.

/** Program to read a single character of different language
using wchar_t array and scanf. The program prints back the
string along with its length
*/

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main() {

    wchar_t string[100];

    setlocale(LC_ALL, "");

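    printf ("Enter a single character and press enter: ");
    /* limit the input to 99 wide characters so the array cannot overflow */
    scanf("%99ls", string);

    /* wcslen returns size_t, so print it with %zu */
    printf("String Entered: %ls: length: %zu\n", string, wcslen(string));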

    return 0;
}

Let’s see the various aspects of this program.

Use of wchar_t instead of char. C uses wchar_t to handle the characters of different locales. Note that UTF-8 is only one of the many character encodings a locale can use, and most of the others focus on a particular language. wchar_t is a wide character type: it is wider than char (1 byte), typically 4 bytes on Linux, so a single wchar_t can hold a character from almost any language.
To read and print a wide character string, we use the %ls format instead of %s. This directs scanf to convert the multibyte (here UTF-8) input into wide characters, and printf to convert them back on output, according to the current locale.
Use of wcslen instead of strlen to get the length of the string. The C library provides wcslen to count the wide characters in a wide character string, whereas strlen counts bytes.
A locale covers several categories, and each can be set separately. For example, in some cases only the date or currency representation needs locale-specific treatment (LC_TIME, LC_MONETARY). Here we call setlocale(LC_ALL, "") to take all the locale-specific settings from the environment.

Let’s see more. I need to first show that I am using UTF-8
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

On executing the program, I will first enter the English character a
$ ./a.out
Enter a single character and press enter: a
String Entered: a: length: 1
The output is as expected: we entered a single character, so the length is 1

Next, I use the French character é
$ ./a.out
Enter a single character and press enter: é
String Entered: é: length: 1
So we got the length 1 as expected

Let’s try the same experiment with a Chinese character 诶
$ ./a.out
Enter a single character and press enter: 诶
String Entered: 诶: length: 1

Once again, we got what we were looking for, the length 1.

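Thus, for our software to handle UTF-8 input correctly, we must set the locale and use wide character strings. It is also important to use the matching functions and formats (%ls, wcslen) rather than their char counterparts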

For comparison, here is the char-based program from the earlier post:

/** Program to read a single character of different language
using char array and scanf and printing the string along with
its length
*/

#include <stdio.h>
#include <string.h>

int main() {

    char string[10];

    printf ("Enter a single character and press enter: ");
    scanf("%9s", string);

    printf("String Entered: %s: length: %zu\n", string, strlen(string));

    return 0;
}
