C: Using scanf and wchar_t to read and print UTF-8 strings

We saw that reading UTF-8 strings with scanf and char gave some strange answers. Now let's look at the solution that C provides.

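/** Program to read a single character of a different language
using a wchar_t array and scanf. The program prints back the
string along with its length
*/

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void) {

    wchar_t string[100];

    /* Adopt the locale of the environment (here en_US.UTF-8) */
    setlocale(LC_ALL, "");

    printf("Enter a single character and press enter: ");
    /* Width 99 leaves room for the terminating null wide character */
    if (scanf("%99ls", string) != 1) {
        return 1;
    }

    /* wcslen returns size_t, which is printed with %zu */
    printf("String Entered: %ls: length: %zu\n", string, wcslen(string));

    return 0;
}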

Let’s see the various aspects of this program.

  • Use of wchar_t instead of char. C provides wchar_t, the wide character type, to deal with the characters of various locales. Note that UTF-8 is only one of many encodings a locale can use; most of the others focus on a particular language. A wchar_t is wider than a char (1 byte); on Linux with glibc, for example, it is 4 bytes, so it can represent the characters of many languages.
  • To read and print a wide character string, we use the %ls format instead of %s. This directs scanf to convert the incoming bytes (UTF-8 here) to wide characters, and printf to convert them back, according to the current locale.
  • Use of wcslen instead of strlen to get the length of the string. The C library provides wcslen, which counts wide characters rather than bytes.
  • Use of setlocale. A locale has several categories, and a program can choose which of them to apply. For example, in some cases only the date or currency representation matters. A C program starts in the "C" locale, so here we call setlocale(LC_ALL, "") to adopt all the locale-specific features of the environment, including the character encoding.

Let’s see more. First, I need to show that I am using UTF-8:
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

On executing the program, I first enter the English character a:
$ ./a.out
Enter a single character and press enter: a
String Entered: a: length: 1

The output is as expected: we entered a single character, so the length is 1.

Next, I use the French character é:
$ ./a.out
Enter a single character and press enter: é
String Entered: é: length: 1

So we got the length 1, as expected.

Let’s try the same experiment with a Chinese character 诶:
$ ./a.out
Enter a single character and press enter: 诶
String Entered: 诶: length: 1

Once again, we got what we were looking for, the length 1.

Thus, we must make sure that our software runs in a UTF-8 locale. It is equally important to use the right type and functions: wchar_t, the %ls format and wcslen.

C: Using scanf and char to read UTF-8 strings

As businesses turn global, software is increasingly written for customers around the world. UTF-8 has become the de facto standard on the web. A C programmer naturally asks whether C supports UTF-8 and whether a program can read UTF-8 content. In this example, I show how scanf and char are used to read a UTF-8 string. By the end of the post, you will understand why char is not a good option for working with UTF-8.

/** Program to read a single character of a different language
  using a char array and scanf, and to print the string
  along with its length
*/

#include <stdio.h>
#include <string.h>

int main(void) {

    char string[10];

    printf("Enter a single character and press enter: ");
    /* Width 9 leaves room for the terminating null byte */
    if (scanf("%9s", string) != 1) {
        return 1;
    }

    /* strlen returns size_t, which is printed with %zu */
    printf("String Entered: %s: length: %zu\n", string, strlen(string));

    return 0;
}

In the program, we declare a char array of length 10, read a string into it, and then print the string along with its length.

First, I need to show that I am using UTF-8:
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

On executing the program, I first enter the English character a:
$ ./a.out
Enter a single character and press enter: a
String Entered: a: length: 1

The output is as expected: we entered a single character, so the length is 1.

Next, I use the French character é:
$ ./a.out
Enter a single character and press enter: é
String Entered: é: length: 2

Here comes the difficult part: even though we entered a single character, we get a length of 2.

Let’s try the same experiment with a Chinese character 诶:
$ ./a.out
Enter a single character and press enter: 诶
String Entered: 诶: length: 3

The result is bizarre: the length is 3.
How can we explain this?

The first thing to recall is that a char is 1 byte (8 bits), so it can hold only 256 distinct values. Given the vast number of languages and scripts in the world, a single char cannot represent every character. UTF-8 gets around this by encoding each character as a sequence of one to four bytes: a takes 1 byte, é takes 2 bytes (C3 A9) and 诶 takes 3 bytes (E8 AF B6). strlen counts bytes, not characters, which is why it reported 2 and 3.

As already discussed, my terminal uses UTF-8, so it can handle characters from different languages, but my program cannot: it reports very strange lengths for the characters entered. So we need a better option.