C: Using scanf and wchar_t to read and print UTF-8 strings

We saw that reading UTF-8 strings with scanf and char gave us some strange answers. So now we need to discuss the solution provided by C: wide characters.

/** Program to read a single character of different language
using wchar_t array and scanf. The program prints back the
string along with its length
*/

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main() {

    wchar_t string[100];

    setlocale(LC_ALL, "");

    printf ("Enter a single character and press enter: ");
    scanf("%99ls",string); /* width limit keeps the input inside string[100] */

    printf("String Entered: %ls: length: %zu\n", string, wcslen(string));

    return 0;
}

Let’s see the various aspects of this program.

  • Use of wchar_t instead of char. wchar_t is C's wide character type, used to deal with the characters of various locales. Note that many locales use encodings other than UTF-8, and most of those focus on a particular language. A wchar_t is wider than a char (1 byte; wchar_t is typically 2 or 4 bytes, depending on the platform), so a single wchar_t can hold characters from a large number of languages.
  • To read and print a wide character string, we use the %ls format instead of %s. This directs scanf to convert the multibyte (here UTF-8) input into wide characters, and printf to convert the wide characters back into multibyte output, according to the current locale.
  • Use of wcslen instead of strlen to get the length of the string. The C library provides wcslen, which counts wide characters rather than bytes.
  • A locale is divided into categories, and a program can ask for only some of them. For example, in some cases the locale treatment involves only date or currency representation. Here we call setlocale with LC_ALL and an empty string, which picks up all the locale-specific features from the environment. For this program the character-handling category, LC_CTYPE, is the one that matters.

Let’s see it in action. First, I need to show that I am using a UTF-8 locale:
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

On executing the program, I first enter the English character a:
$ ./a.out
Enter a single character and press enter: a
String Entered: a: length: 1

The output is as expected: we entered a single character, so the length is 1.

Next, I use the French character é:
$ ./a.out
Enter a single character and press enter: é
String Entered: é: length: 1

So we got the length 1, as expected.

Let’s try the same experiment with a Chinese character, 诶:
$ ./a.out
Enter a single character and press enter: 诶
String Entered: 诶: length: 1

Once again, we got what we were looking for, the length 1.

Thus, we must make sure that our software runs in a UTF-8 locale when it handles UTF-8 strings. It is equally important to use the right functions: wchar_t with %ls and wcslen, instead of char with %s and strlen.


C: Printing data types using printf : short, wchar_t, long double

printf is probably the most commonly used function in C programming (even experienced developers rely on it, at least while a program is being designed). But we often run into the question: “Which conversion character do I use for char, short, pointers… and their unsigned counterparts?” Here is a program that prints each of the different data types.

#include <stdio.h>
#include <wchar.h>

/*A program to see how different data types can be printed using printf
*See the conversion characters with % for the various data types
*/
int main()
{
    //Character strings
   char *message="Printing different data types";
   wchar_t *wmessage=L"Wide Character string";

   //Characters
   char ca='A';
   wchar_t wca=L'A';

   //Integer Data types
   unsigned short usa=65535; short ssb=-32768;
   unsigned int usi=4294967295; int ssi=-2147483648;
   unsigned long long int ulli=4294967295L;long long int lli=-2147483647L;

   //Floating point Data types
   float fa=1e+37,fb=1e-37;
   double da=1e+37,db=1e-37;
   long double lda=1e+37L,ldb=1e-37L;

   //Pointers
   int *p=&ssi;

   //Character strings
   printf("%s\n",message);
   printf("%ls\n\n",wmessage);

   //Characters
   printf("%c\n",ca);
   printf("%lc\n\n",wca);

   printf("%hu %hi\n",usa,ssb); //short
   printf("%hx %hx\n\n",usa,ssb);

   printf("%u %i\n",usi,ssi); //integer/long integers
   printf("%x %x\n\n",usi,ssi);

   printf("%llu %lli\n",ulli,lli); //long long integers
   printf("%llx %llx\n\n",ulli,lli);

   printf("%f %f\n",fa,fb); //float

   printf("%f %f\n",da,db); //double
   printf("%e %e\n\n",da,db);

   printf("%Lf %Lf\n",lda,ldb); //long double (the length modifier is L, not ll)
   printf("%Le %Le\n\n",lda,ldb);

   printf("%p\n\n",(void *)p); //pointer of any data type (%p expects a void *)

}

Here’s the output:

[Image: printf output]

C: Ellipsis operator (…) : printf

Ever wondered how printf works, given that we can pass any number of arguments to it? If we write a function that takes two parameters and call it with three arguments, we are bound to get the error “too many arguments to function”. That is, suppose we have a function
    int fun2(int a, int b)

and we call the function

    fun2(2,3,4)

we are sure to get the above error. So the question is: how do printf and scanf work with a variable number of arguments? The answer is that C has a feature called the ellipsis (…), which lets a function accept a variable number of arguments.

So the prototype of printf is

    int printf(const char *str,...)

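But the next question is how we can access those arguments inside the function. On calling conventions that pass every argument on the stack (32-bit x86, for example), this can be done by the power of pointers. We take a pointer to the last named argument before the … and, depending on the types of the arguments we expect next, increment it to step through them. The C standard does not guarantee this layout (on x86-64, for instance, the first few arguments travel in registers), so portable code uses the va_list macros from <stdarg.h> instead.

Below is a simple program that shows how the pointer approach can be done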

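#include <stdio.h>
#include <stdlib.h>

/* NOTE: both functions assume that all arguments sit next to each other
   on the stack (e.g. 32-bit x86 with the cdecl convention). The C standard
   does not guarantee this, and it fails on targets such as x86-64. */

/* str holds the number of integers passed, as text */
void print(const char *str,...)
{
        int i;
        int num_count=atoi(str);
        /* treat the address of the last named argument as an int array;
           this also assumes a pointer occupies one int-sized stack slot */
        int *num=(int *)&str;

        for(i=1;i<=num_count;i++)
                printf("%d ",*(num+i));
}

/* num_count contains the number of integers passed */
void print_num(int num_count,...)
{
        int i;
        int *num=&num_count;

        for(i=1;i<=num_count;i++)
                printf("%d ",*(num+i));
}

int main()
{
        print_num(3,2,3,4);
        print("3",2,3,4);
        return 0;
}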