C: Using scanf and wchar_t to read and print UTF-8 strings

We saw how using scanf and char to read UTF-8 strings led to some strange answers. Now we discuss the solution provided by C.

/** Program to read a single character of different language
using wchar_t array and scanf. The program prints back the
string along with its length
*/

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main() {

    wchar_t string[100];

    setlocale(LC_ALL, "");

    printf("Enter a single character and press enter: ");
    if (scanf("%99ls", string) != 1) {   /* limit input to the buffer size */
        return 1;
    }

    printf("String Entered: %ls: length: %zu\n", string, wcslen(string));

    return 0;
}

Let’s see the various aspects of this program.

  • Use of wchar_t instead of char. wchar_t is the wide character type that C provides for characters outside the basic character set. Note that UTF-8 is an encoding, not a locale; many other locale encodings exist, but most of them cover only a particular language or group of languages. A char is 1 byte, while wchar_t is wider (4 bytes on Linux with glibc, for example), so a single wchar_t can hold a character of almost any language.
  • To read and print a wide character string, we use the %ls format instead of %s. This directs scanf to convert the multibyte (here UTF-8) input into wide characters, and printf to convert the wide characters back into multibyte output, according to the current locale.
  • Use of wcslen instead of strlen to get the length of the string. The C library provides wcslen to get the length of a wide character string, counted in wide characters rather than bytes.
  • A locale has several categories. For example, some categories cover only date or currency representation. Here we call setlocale with LC_ALL and an empty string so that every category is taken from the environment. Without this call the program runs in the default "C" locale, in which the %ls conversions are not guaranteed to handle UTF-8 input.

Let’s see more. First, I need to show that I am using a UTF-8 locale:
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

On executing the program, I first enter the English character a:
$ ./a.out
Enter a single character and press enter: a
String Entered: a: length: 1

The output is as expected: we entered a single character, so the length is 1.

Next, I use the French character é
$ ./a.out
Enter a single character and press enter: é
String Entered: é: length: 1

So we got the length 1, as expected.

Let’s try the same experiment with a Chinese letter 诶
$ ./a.out
Enter a single character and press enter: 诶
String Entered: 诶: length: 1

Once again, we got what we were looking for, the length 1.

Thus, to handle UTF-8 input in our software, it is not enough for the terminal to use UTF-8. The program must also set the locale and use the wide character type and functions (wchar_t, %ls, wcslen).

C: Using scanf and char to read UTF-8 strings

As businesses turn global, software is increasingly built for customers around the world. UTF-8 has become the de facto standard on the web. An obvious question for a C programmer is whether C supports UTF-8 and whether it is possible to read UTF-8 content. In this example, I show how scanf and char are used to read a UTF-8 string. By the end of the post you will understand why char is not a good option for working with UTF-8.

/** Program to read a single character of different language
  using char array and scanf and printing the string
  along with its length
*/

#include <stdio.h>
#include <string.h>

int main() {

    char string[10];

    printf ("Enter a single character and press enter: ");
    scanf("%9s", string);   /* at most 9 bytes plus the terminating null */

    printf("String Entered: %s: length: %zu\n", string, strlen(string));

    return 0;
}

We see that in the program, we declare a char array of length 10, we read a string and then print the string along with its length.

I need to first show that I am using UTF-8
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

On executing the program, I will first enter the English character a
$ ./a.out
Enter a single character and press enter: a
String Entered: a: length: 1

The output is as expected, we entered a single character. So the length is 1

Next, I use the French character é
$ ./a.out
Enter a single character and press enter: é
String Entered: é: length: 2

Here comes the difficult part: even though we entered a single character, the reported length is 2.

Let’s try the same experiment with a Chinese letter 诶
$ ./a.out
Enter a single character and press enter: 诶
String Entered: 诶: length: 3

The result is bizarre: the length is now 3.
How can we explain this?

The first thing to recall is that a char is 1 byte (8 bits on practically every current platform), so it can hold only 256 distinct values. Given the vast number of languages and scripts in the world, a single char is not enough to represent every character. UTF-8 solves this by encoding each character as a sequence of one to four bytes: a takes 1 byte, é takes 2 bytes and 诶 takes 3 bytes. strlen counts bytes, not characters, which is exactly what the program printed.

As already discussed, my terminal uses UTF-8, so it can handle characters from different languages. My program, however, treats the input as a plain sequence of bytes, which is why it reports such strange lengths. So we need a better option.

Piping: A simple means of data transfer between commands and programs

When we discussed data processing, we saw that a data processing machine consists of a number of data processors working in unison (in series or in parallel) to generate a meaningful output. To see such data processing in action, we can look at the idea of piping in Linux (or Unix). A pipe reminds me of the cylindrical tubes used to transfer water or oil from one place to another. A pipe does not modify the data (or entity) passed through it; it only transfers it. In other words, the output of a pipe is the same as its input.

When we discussed commands, we saw that there was a time when users had to enter the name of a command on the terminal to get things done. To list the contents of a directory, the command ‘ls’ was used. To change the current directory, the command ‘cd’ was used. To count the number of words, characters or lines, the command ‘wc’ was used. Each of these commands was meant for a specific purpose. But what if I want to get some other work done that no single command can do? Should I program a new command? If I am not a programmer, creating a new command will be really difficult, so there must be a simpler option. Here comes the idea of piping. As we saw earlier, a pipe simply gives out what is fed into it. So what if the output of one command is fed as the input to another command? We need a mechanism to transfer the output of one command to the input of the other, and this is what piping provides in Linux and other Unix variants.

So a pipe transfers the output of one command to the input of another command.

For example, suppose I want to know the number of entries in a directory. As we saw earlier, ‘ls’ shows the contents of a directory and ‘wc’ counts lines. If we combine these two commands using a pipe, we get our work done: we feed the output of ls as the input to wc.

In Linux, the symbol | is used for piping:

ls | wc -l

The option -l makes wc print only the number of lines. Because ls prints one entry per line when its output goes to a pipe, the line count equals the number of entries. Thus we get the number of entries in a directory without creating a new command. It is a good way to reuse existing resources to get things done.

The idea of piping means that users need not learn a large number of commands, only a few. With this limited set of commands, they can get their daily jobs done.

Compare this to the present-day state of software and applications: a large number of programs do the same thing with different user interfaces. What’s the result? A regular user is baffled by the choice and ends up reading multiple review sites to decide. Sometimes it is good to look at what people did in the past, when resources were scarce. Those lessons help in making good design decisions.

Commands: Simple text-based terminal tools

There was a time, long before window managers became ubiquitous on our desktops, when people used commands to get things done on the computer. These commands were entered on the terminal. There were no beautiful interfaces such as menus or buttons. To see the contents of a directory, they went to the terminal, entered ‘ls’, and saw a list of the names of files and directories, with no fancy figures. Similarly, for every action we now do on the computer, there was an associated command.

To command means to order somebody to do something. In the world of computers, there is a fixed set of commands that the computer can understand. There is a lot of research on making computers understand a much larger range of commands, but as of 2012, computers can understand only a limited set.

A command is a piece of software. Normally, though, the word ‘command’ refers to software that runs on the terminal. These commands take textual information from the user as input (like the name of a file or an image) and generate textual information as output.

Some commonly used commands in Linux
• ls (to list the contents of a directory)
• cd (change directory)
• pwd (prints current working directory)
• grep (to search for a pattern in a file)

Data Processing

One way to look at the computer is as “a machine that performs data processing.” From this point of view, every device associated with the computer is seen in the same way. Before delving into the topic, let’s first understand what is meant by the term ‘data processing’. The word processing means ‘a series of actions, changes, or functions bringing about a result’. Data processing means that you act upon data to turn it into other, meaningful data. So there is something before and something after the act of processing. The something before processing is referred to as the input, and the something after processing is called the output. In data processing, both input and output are data. The input and output data may or may not be the same. If they are the same, the data has not been acted upon. An intuitive example is a clean tube used to transfer water: water (input) enters at one end of the tube and you get the same water at the other end (output). In most cases, though, we use the computer or any other device to get an output that differs from the input we fed in. In the case of a computer, we often call the output data (meaningful) information.

Let’s take some real-world examples to see what processing means. Take a juicer. What you feed into the juicer are peeled fruits, and what you get as output is the juice extracted from their pulp. You can also see a change of state: the input was solid and the output is liquid. Thus a juicer is also a processing machine. It processed the fruits and returned their juice as output.

A computer is a data processing machine. A computer has many devices attached to it. When you press a key on your keyboard, it generates electric signals that must be transformed into pixels on the screen. The computer carries out this transformation to display the character at the right place on the screen. So a set of electric signals (input) is transformed into another set of electric signals (output) that displays the pixels in the desired color. We have abstracted away many inner details of this processing and presented an overall picture. What actually happens is that a number of data processors (any entities capable of data processing) work in series (or in parallel) to carry out the transformation. Each data processor does its intended duty of transforming the data into another form. When all these data processors are arranged in the proper manner, we see the character displayed on the screen.

Take any other example from our daily lives and we see (data) processing: a camera, a mobile phone, a vacuum cleaner, a dishwasher, a washing machine. All of these take something as input and give another meaningful or helpful output. All these devices process what is fed into them, and the output is in the form that you want.

PHP: How to set the value of the include_path: set_include_path

Various functions in PHP search for files in a fixed set of directories known as the include_path. When you include a PHP file in another PHP program, the interpreter searches for the file in the directories listed in the include_path and reports an error if it is not found. The include_path value can be overwritten by a program, but it is better to extend the current value with additional directories. The following program does that: it gets the current value and appends a new directory to it. Note the use of PATH_SEPARATOR, a predefined constant that holds the separator between two directories (':' on Unix-like systems, ';' on Windows).

<?php
print get_include_path()."\n";
$path = "../../config";
set_include_path(get_include_path().PATH_SEPARATOR.$path);
print get_include_path()."\n";
?>

The output of the program is as follows

$ php value.php
.:/usr/share/pear:/usr/share/php
.:/usr/share/pear:/usr/share/php:../../config

PHP: How to get the value of the include_path: get_include_path

Various functions in PHP search for files in a fixed set of directories known as the include_path. When you include a PHP file in another PHP program, the interpreter searches for the file in the directories listed in the include_path and reports an error if it is not found. The include_path value can be overwritten by a program. To get the current value of the include_path:

<?php
print get_include_path()."\n";
?>

The output of the program is as follows

$ php value.php
.:/usr/share/pear:/usr/share/php