Think Regionally Act Locally

The first step in making computers work for South Asia is to make the computers print out the letters of the local language. It's not as easy as it looks.

Writing systems and languages
All the writing systems of South Asia and South East Asia, except those of Pakistan, are traceable back to the ancient Brahmi phonetic writing system of the 4th century BC. But over the centuries, the letters of different systems took different shapes as the languages diverged along their respective evolutionary paths.

Around the first century AD the scripts of the South Indian Dravidian languages diverged from those of the North Indian and Sinhala Indo-Aryan languages. Later, around the 6th century, the Tibeto-Burmese and Austro-Asiatic languages of Southeast Asia also adopted Brahmic writing systems, later branching off into the particular systems of Burmese, Thai, and so on. It was also around this time that Tibetan and other languages of the Himalaya adopted a Brahmic writing system. Today, the letters of the different systems look very different, but each letter has continued to carry its unique phonetic value.

In order for computers to print in any of these languages all that needs to be done is to identify the alphabet, and then fix things so that you can print that alphabet. In the old days of impact printers this meant creating metal type for the new alphabet. Today we have laser printers and ink-jet printers, and the type is described in software, but the idea is the same: replace the Roman type with the type of the new alphabet. Type faces (called fonts) are designed for the new alphabet, and installed in your computer.

This sounds simple, but it is not. Even deciding what the letters of the alphabet are can be a problem. The act of computerisation forces us to consider what is in the alphabet, and what is not, and there are many issues to confront. For instance, in the Devanagari script, are vowels and consonants with the modifiers chandrabindu ( ँ ) and anuswar ( ं ) themselves letters, or do these modifiers have some lesser status as diacritics? Such matters have to be decided with the help of language experts; they cannot be left to technologists alone.

Technological limitations
Then there is the technology of the computer, which limits the number of printable characters. Current personal computers allow 224 printable characters. (The computer allows one byte per character, giving a total of 256 codes, of which some 32 are used by the computer for its own purposes.) Of these 224, some 30 to 40 are needed for punctuation and mathematical symbols, leaving perhaps 180 for alphabet characters. That seems like plenty, until one starts looking at what needs to be printed. It also needs to be remembered that an ideal of the print industry has always been to reproduce in print something as good as a calligrapher's writing.
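The arithmetic behind these figures can be checked in a few lines. This is only a sketch of the counting described above; the names are illustrative, and the 30 to 40 range for punctuation and symbols is the estimate given in the text.

```python
TOTAL_CODES = 2 ** 8      # one byte per character gives 256 codes
RESERVED_CODES = 32       # used by the computer for its own purposes
PRINTABLE = TOTAL_CODES - RESERVED_CODES


def alphabet_slots(symbol_codes: int) -> int:
    """Codes left for letters once punctuation and symbols are set aside."""
    return PRINTABLE - symbol_codes


print(PRINTABLE)             # codes available for printable characters
print(alphabet_slots(40))    # if 40 codes go to punctuation and symbols
print(alphabet_slots(30))    # if only 30 do
```

Either way the budget for letters stays under 200, which is the constraint the rest of this section turns on.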

Take the example of Devanagari, the writing system used for Nepali, Hindi and some other North Indian languages. When printing was being introduced in the early 19th century in Bombay, it was found that Devanagari required 1800 different type elements for one print size. There are three reasons for this. First, the writing is cursive and adjacent letters usually join, so a range of printable characters with properly positioned ligatures is needed for seamless joints. Secondly, since Devanagari uses many diacritics, extra forms of the letters with the diacritics are required. Thirdly, the letters are sometimes stacked on top of each other or otherwise combined in what are called conjuncts. Separate type elements are needed for each of these conjuncts. Today, far fewer than the 1800 characters originally proposed are used, but the 44 consonants and 11 vowels still require many hundreds of distinct printable characters. Since computers allow a mere 180, compromises have to be made.

Similarly the cursive Arabic writing system used for Urdu, with some 28 to 35 letters in up to four different forms and many diacritics, requires more than a hundred characters for basic quality printing, and many times that to vary the width and to stack the letters as is found in quality calligraphy.

Compromises will have to be made in Devanagari and other writing systems of South Asia, just as they were made in European languages and the Roman scripts. European languages when written by hand are cursive, but this cursiveness has not been replicated in print. Alternative forms of letters and awkward composite letters and diacritics have disappeared or are disappearing. Some might argue that this has simplified the Roman alphabet to great benefit, while others feel that much of the beauty of writing has been removed.

Compromises and chaos
For Devanagari and similar alphabets, the usual compromise is to drop a lot of the conjuncts and the full variety of characters needed for quality printing. And then, to produce a font, the remaining characters are each given a unique internal code number between 33 and 255 to form a code table, and the shapes designed using a computer software package like Fontographer. After installing this new font in the usual software, such as a word processor, one can select it and begin to type in the selected writing system.

Creators of the font will have determined which keys need to be pressed for typing. They will probably have chosen one of two keyboard layouts – typewriter (following a well-known typewriter layout like Remington) or phonetic (placing characters on Roman keys of similar sound).

This exercise, of squeezing Devanagari into a code table of limited size and aligning characters with keys containing implied internal coding, has been done by many people and organisations in South Asia, and each has compromised in different ways to end up with different sets of characters and different internal codes. The result is chaos. A choice of writing styles is desirable, but what has happened is a total incompatibility of fonts – text prepared using one font cannot be replaced with another font unless the character repertoires and internal codes are identical.

Take the example of Devanagari fonts developed for Nepali in Kathmandu. A word typed in the Anusha font turns into a different string of characters, including an empty box, when the font is changed to Barood. This happens because the internal code for one character in Anusha is the internal code for a different character in Barood, while the next internal code has no meaning in Barood, hence the box to indicate this. It is not the same with Roman fonts, for in changing from the Berkeley Medium font used in this article to Helvetica Regular, the text remains readable. Easy convertibility of texts is the major advance in standardisation and is essential for South Asian fonts today.
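The mechanism can be sketched with two invented code tables. The tables below are hypothetical and do not reproduce the real Anusha or Barood assignments; they only show how the same stored codes come out differently, and how an unassigned code becomes a box.

```python
# Hypothetical code tables, invented for illustration only.
FONT_A = {65: "\u0915", 66: "\u0916", 67: "\u0917"}   # ka, kha, ga
FONT_B = {65: "\u092E", 66: "\u0915"}                 # ma, ka; code 67 unassigned

MISSING = "\u25A1"  # the empty box shown for a code the font does not define


def type_text(text, font):
    """Return the internal codes stored when text is typed in the given font."""
    reverse = {glyph: code for code, glyph in font.items()}
    return [reverse[ch] for ch in text]


def render(codes, font):
    """Look each stored code up in the font's own table."""
    return "".join(font.get(code, MISSING) for code in codes)


stored = type_text("\u0915\u0917", FONT_A)   # ka + ga typed in font A
print(stored)                  # the codes actually saved in the file
print(render(stored, FONT_A))  # readable in the original font
print(render(stored, FONT_B))  # a different letter, then a box
```

Only the codes are saved, so the file has no way of knowing which table they were meant for.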

International Standards
There is compatibility among the Roman fonts for the simple reason that the character sets and internal codes underlying the fonts have been standardised, a process that emerged way back in the 1960s with what is known as ASCII (American Standard Code for Information Interchange). The ISO (International Organisation for Standardisation) equivalent of ASCII is ISO 646, which includes some simple devices for national variants, and since then standards for character encodings have included methods for handling non-Roman scripts as well, whereby national bodies register their scripts with ISO. The multi-script standard that is now rapidly becoming established, however, is Unicode.

Unicode began in the American company Xerox in the late 1970s to solve the problem of ideographic scripts such as Chinese and Japanese with their tens of thousands of characters. Each distinct character in each writing system was provided with its own unique code, which required thousands of codes. This was achieved by using two bytes instead of the usual one, which allowed 65,536 (256 times 256) distinct codes. Unicode has been adopted by ISO as standard ISO 10646, and since it now uses four bytes, the four thousand million possible codes are clearly enough for all eventualities.
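The code counts quoted here follow directly from the number of bytes used. A minimal check, with an illustrative function name:

```python
def code_count(n_bytes: int) -> int:
    """Number of distinct codes that fit in n_bytes bytes of 8 bits each."""
    return 256 ** n_bytes


for n in (1, 2, 4):
    print(n, "byte(s):", code_count(n))
```

Four bytes give a little over four thousand million codes, which is the figure cited for ISO 10646.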

But there are some critically important issues in the way Unicode is managed. Unicode is not regulated by ISO but by the independent Unicode Consortium in California. This is very much a US organisation, with very limited European and East Asian affiliates. Unicode already contains character sets and encodings for many writing systems, including the Roman or Latin, Greek, Cyrillic, Arabic systems of West Asia, many of the systems of South Asia and South East Asia, as well as the ideographic systems of China, Japan and Korea. These have mostly been derived from national standards but not necessarily with the participation and approval of the national agencies concerned. Unicode has been adopted for the next generation of most major computer platforms, but it is not clear whose interests Unicode serves.

While current systems are based in the ASCII tradition of character encoding using one byte per character, the future clearly lies with Unicode and its multi-byte representations. This means that South Asia must either simply accept what Unicode and computer suppliers give it, or actively engage with Unicode and make it serve the interests of the people of South Asia.

It might be expected that there have been standard encodings for single byte fonts in the ASCII tradition in South Asia, and that efforts are being directed towards the multi-byte Unicode standard. (Languages like Urdu and Pushtu using the Arabic writing system have long been served by ASMO 449, a single byte standard which grew out of an Arab League initiative.) But the fact is that almost nothing has been done to standardise single byte fonts for the Brahmi scripts of South Asia. Whatever is being done is of very recent origin: in India, the BharatBhasha initiative, started in December 1997, aims at establishing a de facto standard for all Indian languages; and in Nepal, a standardisation committee is in the final stages of drawing up a national single byte standard.

Advancing the technology
To understand what has happened with South Asian computer fonts, a little knowledge of how font systems work is necessary. Computers work internally using only numbers, and bit patterns correspond to these numbers. When you press a key on your keyboard, an internal code is sent along the cable to the computer where it is stored, and then sent to your display where the internal code number is used to select the actual character to be displayed. When you ask your computer to print, the sequence of stored internal codes is sent to the printer, which uses the internal codes to select in turn each character to be printed.

So far, we have seen a direct correspondence between the key you press, the code that is generated for transmission to and storage in the computer, and the character that is printed. We need to recognise that this direct correspondence need not be so, and that we could have three independent components: the entry system including keyboard layouts, the internal codes, and the rendering system which includes the fonts. Here is what happens using a simple example from Devanagari.

In the diagram below, on the left is the sequence of keys to press, but the actual keys to press depend upon which keyboard layout is being used. These then generate internal codes using key-mapping software: some changes, even some reordering, may take place. When this internal code sequence comes to be printed or displayed, it is rendered using the chosen font to determine the style.
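The separation of entry, internal codes and rendering can be sketched as follows. Both keyboard layouts and the key assignments are hypothetical; the internal codes are Unicode code points used as a stand-in. The example uses the Devanagari vowel sign i, which is written to the left of its consonant, so a typewriter-style typist presses it first and the key-mapping step reorders it.

```python
KA = "\u0915"       # DEVANAGARI LETTER KA
SIGN_I = "\u093F"   # DEVANAGARI VOWEL SIGN I, drawn to the left of its consonant

# Hypothetical layouts: which key produces which letter is an assumption.
PHONETIC = {"k": KA, "i": SIGN_I}
TYPEWRITER = {"s": KA, "l": SIGN_I}


def map_keys(keys, layout, visual_order=False):
    """Entry system: turn key presses into the internal code sequence."""
    codes = [layout[key] for key in keys]
    if not visual_order:
        return "".join(codes)
    # The typist pressed sign I before the consonant, as it is written;
    # swap each such pair so the consonant is stored first.
    out, i = [], 0
    while i < len(codes):
        if codes[i] == SIGN_I and i + 1 < len(codes):
            out.extend([codes[i + 1], codes[i]])
            i += 2
        else:
            out.append(codes[i])
            i += 1
    return "".join(out)


def render(codes, font_name):
    """Rendering system stand-in: the font sets the style, not the codes."""
    return f"[{font_name}] {codes}"


from_phonetic = map_keys("ki", PHONETIC)
from_typewriter = map_keys("ls", TYPEWRITER, visual_order=True)
print(from_phonetic == from_typewriter)   # both layouts store the same codes
print(render(from_phonetic, "any installed font"))
```

Because both layouts end in the same stored sequence, the text can be rendered with any font that understands the shared codes.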

The critical feature is the rendering system, which effectively gives you your own calligrapher. Tell the calligrapher what you want written, and the calligrapher will render it with all the finesse of the calligraphic art. The internal codes can then be designed to focus on the essential features of the writing system. This is what was done in the 1970s for the Arabic writing systems and in the 1980s for the Brahmic writing systems.

Advanced encodings for Arabic and Brahmic
In Arabic itself there are 28 letters, several of which can appear in up to four forms, depending on how they fit into the cursive flow of the writing, as shown below for the Arabic letter heh.

From left to right, the isolated form stands unconnected to other letters, the final form connects to some letter on its right, the medial form connects to both left and right, while the initial form connects to the left. In extensions of the Arabic writing system given by Unicode to cover all languages like Urdu, Pushtu and Sindhi, which use the Arabic writing system, the number of letters has been more than doubled.

The selection of the actual form of the character to be used (called a "glyph") is correctly left to the rendering system, since this can be precisely determined by context. But rendering systems of even this simplicity are not provided by current standard operating systems, and thus the rendering has to be built into the specialist applications built for Arabic.
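A minimal sketch of context-driven glyph selection follows. It covers only three letters and uses the code points of Unicode's Arabic presentation forms as stand-ins for glyphs; a real rendering system needs a full joining table and handles many more cases. The function names are illustrative.

```python
# Glyph stand-ins: Unicode Arabic presentation forms for heh, beh and alef.
FORMS = {
    "\u0647": {"isolated": "\uFEE9", "final": "\uFEEA",
               "initial": "\uFEEB", "medial": "\uFEEC"},   # heh
    "\u0628": {"isolated": "\uFE8F", "final": "\uFE90",
               "initial": "\uFE91", "medial": "\uFE92"},   # beh
    "\u0627": {"isolated": "\uFE8D", "final": "\uFE8E"},   # alef: never joins onward
}


def joins_forward(letter):
    """True if the letter can connect to the letter that follows it."""
    return "initial" in FORMS.get(letter, {})


def joins_backward(letter):
    """True if the letter can connect to the letter before it."""
    return "final" in FORMS.get(letter, {})


def choose_forms(word):
    """Pick a form name for each stored letter from its neighbours."""
    forms = []
    for i, letter in enumerate(word):
        prev = word[i - 1] if i > 0 else None
        nxt = word[i + 1] if i + 1 < len(word) else None
        to_prev = prev is not None and joins_forward(prev) and joins_backward(letter)
        to_next = nxt is not None and joins_forward(letter) and joins_backward(nxt)
        if to_prev and to_next:
            forms.append("medial")
        elif to_prev:
            forms.append("final")
        elif to_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


def shape(word):
    """Replace each letter by the glyph for its chosen form."""
    return "".join(FORMS.get(ch, {}).get(form, ch)
                   for ch, form in zip(word, choose_forms(word)))


print(choose_forms("\u0647\u0647\u0647"))   # three heh in a row
print(choose_forms("\u0647\u0627\u0647"))   # heh, alef, heh
```

The stored text holds one code per letter throughout; only the output of `shape` differs with context.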

The way Unicode has encoded the Arabic writing system and its extensions still leaves some problems. The principle of one internal code per letter is right, but the situation with some characters, shown in the table below, needs urgent review.

The hamza is represented in six forms, each of which has its own encoding, whereas one code should have been enough. Similarly, two characters, taa marboota and alif maksoora, are encoded without any clear argument as to why they have been included and are not just context-dependent forms of taa and alif respectively. These aspects of the Arabic writing system, and the way in which the extensions have been handled, clearly need the further analysis of professional linguists.

In India during the 1980s an encoding for the Brahmic writing systems was produced. This was developed through research grants from the Indian Department of Electronics initially at the Indian Institute of Technology, Kanpur, and then at C-DAC (Centre for Development of Advanced Computing) in Pune. It led to the 1991 Indian Standard IS 13194, the Indian Script Code for Information Interchange (ISCII). A 1988 version of the standard was used as the basis for the Unicode encodings of the writing systems of India.

ISCII is an advanced encoding in two ways. Firstly, ISCII aims to encode all the writing systems of India within a single code table, building upon their common origin in Brahmi. The letters of the alphabets that have the same phonetic value or sound (derived from the same original Brahmi letter) are given the same internal code, so that changing fonts between writing systems yields a crude but useful transliteration.
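The effect can be imitated with Unicode, whose Indic blocks keep the parallel ISCII-derived layout: a letter sits at the same offset within each script's block. The sketch below shifts code points from one block to another. It is an illustration rather than ISCII itself (where the byte codes are shared and only the script selection changes), and not every position is assigned in every script.

```python
# Start of each script's 128-code block in Unicode.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
}
BLOCK_SIZE = 0x80


def transliterate(text, source, target):
    """Move each letter to the same offset in the target script's block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            out.append(chr(cp - src + dst))
        else:
            out.append(ch)   # leave spaces, digits and other scripts alone
    return "".join(out)


word = "\u0915\u092E\u0932"   # Devanagari ka, ma, la
print(transliterate(word, "devanagari", "bengali"))
print(transliterate(word, "devanagari", "gujarati"))
```

The result is crude in just the way the text describes: the letters correspond by sound, but spelling conventions of the target language are ignored.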

Secondly, ISCII aims to represent the letters and not the particular written forms or glyphs of the letters, just as had been done previously for Arabic. This means that in particular the conjuncts of the writing systems do not need to be given internal codes, since these can be generated by the rendering system which will recognise when a sequence of consonants should be joined, and how they should be joined as a horizontal conjunct or as a stacked conjunct or perhaps as some special different character. The figure illustrates: the pairs on the left are stored, while the conjunct character or glyph on the right is generated and printed or displayed as needed.
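A toy version of this division of labour is sketched below. It uses Unicode code points as the stored letters, with the joining sign (virama) between the two consonants that Unicode uses to request a conjunct. The glyph table and the glyph names are hypothetical; a real font carries hundreds of such entries.

```python
import unicodedata

KA = "\u0915"       # DEVANAGARI LETTER KA
SSA = "\u0937"      # DEVANAGARI LETTER SSA
VIRAMA = "\u094D"   # DEVANAGARI SIGN VIRAMA, joins the consonants around it

# Hypothetical font data: which consonant pairs have a ready-made conjunct.
CONJUNCT_GLYPHS = {(KA, SSA): "glyph:kssa"}


def to_glyphs(codes):
    """Rendering step: turn stored letters into a list of glyph names."""
    glyphs, i = [], 0
    while i < len(codes):
        pair = (codes[i], codes[i + 2]) if i + 2 < len(codes) else None
        if pair in CONJUNCT_GLYPHS and codes[i + 1] == VIRAMA:
            glyphs.append(CONJUNCT_GLYPHS[pair])   # one glyph for the cluster
            i += 3
        else:
            name = unicodedata.name(codes[i], "unknown").split()[-1].lower()
            glyphs.append("glyph:" + name)
            i += 1
    return glyphs


stored = KA + VIRAMA + SSA
print([unicodedata.name(ch) for ch in stored])   # what is kept in the file
print(to_glyphs(stored))                         # what the renderer draws
print(to_glyphs(KA + SSA))                       # no joining sign, no conjunct
```

The conjunct never receives an internal code of its own; it exists only in the font and in the output of the rendering step.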

The rendering requirement for Brahmic writing systems is more sophisticated than that needed for Arabic, and is well beyond the technologies available in current operating systems. This has meant that special software has had to be produced, and C-DAC has developed a rendering system that is relatively simple.

But the drawback of this level of sophistication for both the Arabic and Brahmic writing systems is that the rendering system required is packaged and sold with other software like a word processor, thus making both ASMO 449 and ISCII (and their Unicode derivatives) expensive. In India C-DAC has a monopoly, buttressed by the state requirement that all government information be supplied in ISCII. Of course, others could enter the market, but the entry cost is very high.

Problems and benefits
ISCII has recently been reviewed and has been the subject of much criticism, but the official report on this review has not yet been made public. Some of the criticism stems from the requirement of advanced rendering to use ISCII but there is also substantive linguistic criticism of some features of ISCII. While due recognition must be given to the very significant contributions of ISCII and C-DAC to the development of encoding systems and standards for Indian writing systems, it is time that these defects were corrected. The standardisation process which is underway in Nepal is trying to do so.

Taking this more abstract approach of representing the letters and not their written forms in the internal codes can have enormous benefits for the input side of the system. The keyboard need only have key positions for the letters, not their forms, reducing very significantly the number of distinct keys required. This in turn reduces the number of 'shift presses' required. Both of these lead to a very significant increase in typing speed and accuracy, and a reduction in learning time for new typists.

The next generation of operating systems must give computer users the ability to render fonts with the flexibility required for the writing systems of South Asia. If this change comes about, it will take place through the adoption of Unicode and the incorporation of the key mapping and font rendering facilities needed by Unicode. Further sophistication may come with the development of the TrueType Open font system by Microsoft and Adobe; this will allow arbitrary reorganisation of characters during rendering to give quality results.

Unicode has already been adopted into current Unix systems. Microsoft appears to have adopted Unicode for Windows 98 and related systems, and will implement whatever Unicode mandates, though Microsoft is saying little publicly. Apple's intentions are even less clear, but they have implemented ISCII in their Indian Language Kit. It is still possible, however, that Unicode may fail to represent the writing systems of South Asia adequately, and that the platform providers may fail to provide the technology necessary to render the scripts properly.

The way forward
There is some state of confusion in South Asia, with much happening, but with little co-ordination. Standards are necessary for computers to work in the languages of South Asia. The extant standards in Unicode are not adequate and need the expert attention of linguists from the nations concerned to remove linguistic misunderstandings. More scripts need to be added. This cannot be left to well-intentioned but inappropriately qualified groups in the US, and must involve the participation and leadership of linguistic and technical experts in South Asia.

Furthermore, the commercial interests of organisations like C-DAC in India, and major concerns like Microsoft, cannot be left to determine whether or not the peoples of South Asia get access to computing in their own languages and conventions, nor what form South Asian languages should take.

What is needed is a regional conference under the auspices of an organisation like SAARC to resolve these matters at the linguistic, technical and political levels. C-DAC is organising such a conference in the first week of September 1998 in Pune, a welcome and timely initiative. This writer most sincerely hopes that one aim of this conference will be to resolve the very basic issues covered in this article. A resolution of the issues must be arrived at respecting the interests of all the countries of South Asia (which have languages and scripts that are shared across borders) and drawing upon expertise from across the region; it would be most unfortunate if India dominated this meeting. Which is why it is important that the organisers of the conference formally constitute the meeting within a regional framework and seek neutral chairpersons for the strategically crucial sessions.

Himal Southasian