C: Using scanf and char to read UTF-8 strings

As businesses go global, software is increasingly written for customers around the world. UTF-8 has become the de facto standard encoding on the web. A C programmer will naturally ask whether C supports UTF-8 and whether it is possible to read UTF-8 content. In this example, I show how scanf and char are used to read a UTF-8 string. By the end of the post you will understand why char is not a good option for working with UTF-8.

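/** Program to read a single character of different languages
  using a char array and scanf, then print the string
  along with its length
*/
#include <stdio.h>
#include <string.h>

int main(void) {

    char string[10];

    printf("Enter a single character and press enter: ");

    /* %9s leaves room for the terminating '\0' in the 10-byte array */
    if (scanf("%9s", string) != 1)
        return 1;

    /* strlen returns a size_t, so it is printed with %zu */
    printf("String Entered: %s: length: %zu\n", string, strlen(string));

    return 0;
}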
In the program, we declare a char array of length 10, read a string into it, and then print the string along with its length.

First, I need to show that I am using UTF-8:
$ locale

On executing the program, I first enter the English character a:
$ ./a.out
Enter a single character and press enter: a
String Entered: a: length: 1

The output is as expected: we entered a single character, so the length is 1.

Next, I use the French character é
$ ./a.out
Enter a single character and press enter: é
String Entered: é: length: 2

Here comes the difficult part: even though we entered a single character, the reported length is 2.

Let’s try the same experiment with the Chinese character 诶
$ ./a.out
Enter a single character and press enter: 诶
String Entered: 诶: length: 3

The result is bizarre: this time the length is 3.
How can we explain this?

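The first thing we should recall is that the size of char is 1 byte, which is 8 bits on virtually every platform. It means a char can only carry 256 distinct values. Considering the vast number of languages and scripts in the world, a single char is not enough to represent every character. UTF-8 works around this by encoding each character as a sequence of one to four bytes: a takes 1 byte, é takes 2 bytes and 诶 takes 3 bytes. strlen counts bytes, not characters, and that is exactly what the lengths above are showing.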
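
To see where the lengths 2 and 3 come from, it helps to look at the raw bytes stored in the char array. The following is only a sketch: dump_bytes is a hypothetical helper I made up for illustration (it is not a standard library function), and it assumes the input really is UTF-8 encoded.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the C library): writes the bytes of s
   into out as uppercase hex pairs separated by spaces, e.g. "C3 A9".
   Returns the number of bytes in s, or 0 if out is too small
   (or if s is empty). */
size_t dump_bytes(const char *s, char *out, size_t outsz)
{
    size_t n = strlen(s);
    /* each byte needs 3 chars: two hex digits plus a space or the final '\0' */
    size_t needed = (n == 0) ? 1 : 3 * n;

    if (outsz < needed)
        return 0;

    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        /* cast to unsigned char so bytes >= 0x80 are not sign-extended */
        sprintf(out + 3 * i, (i + 1 < n) ? "%02X " : "%02X",
                (unsigned char)s[i]);
    }
    return n;
}
```

Under these assumptions, passing the string read by scanf to dump_bytes gives 61 for a, C3 A9 for é and E8 AF B6 for 诶, which matches the lengths 1, 2 and 3 reported by strlen.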

As already discussed, my terminal uses UTF-8, so it can handle characters from different languages, but my program cannot: it gives very strange answers about the length of the character entered. So we need a better option than a plain char.
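
One possible direction, shown here only as a rough sketch and not as the definitive fix, is to keep the char array but count characters instead of bytes. In UTF-8 every byte after the first byte of a character has the bit pattern 10xxxxxx, so skipping those bytes gives the number of code points. The name utf8_length is my own invention for this example, and the function assumes the input is valid UTF-8 (it does no validation).

```c
#include <stddef.h>

/* Hypothetical helper: counts Unicode code points in a UTF-8 string.
   Assumption: s is valid UTF-8. Continuation bytes (10xxxxxx) are
   skipped, so only the first byte of each character is counted. */
size_t utf8_length(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* cast to unsigned char before masking to avoid sign issues */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

If strlen(string) in the program were replaced with utf8_length(string), the three inputs a, é and 诶 would each be reported with length 1. Note that this counts code points, which is not always the same as what a user sees as one character (for example, a letter followed by a combining accent counts as two).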
