Tracking modern Chinese language with LIVAC

01 May 2001

Which individuals in the Chinese-speaking communities of Hong Kong, Taiwan, and Beijing have had the most media exposure over the last two weeks? Which words were most frequently used? You may think these are questions to which there are no definite answers, only subjective guesses. But in fact precise, statistics-based answers to these and other questions are only a click away in the Synchronous Linguistic Variation in Chinese Speech Communities (LIVAC) Corpus (www.rcl.cityu.edu.hk/livac/sample), developed by the Language Information Sciences Research Centre (LISRC), a CityU University Research Centre.

The three key indices of the LISRC, "Celebrity Roster", "Place Name Rank", and "Common Word List", were compiled from the Synchronous LIVAC Corpus. First launched in 1994 by LISRC Director and Chair Professor of Linguistics and Asian Languages, Professor Benjamin T'sou, the LIVAC Corpus is one of the Competitive Earmarked Research Grants projects supported by Hong Kong's Research Grants Council.

A ten-year research project

Since July 1995, the LIVAC database has been regularly compiled with linguistic data from the major newspapers and electronic media of six Chinese-speaking communities: Hong Kong, Taiwan, Beijing, Shanghai, Macau, and Singapore. Words and phrases are first automatically selected by computer and then manually proofread and categorized. From this, a database organized by linguistic structure (character, phrase, sentence, and text) is constructed. This database is very useful for linguists and people interested in exploring linguistic phenomena, social organizations, culture and other developments in Chinese communities.

In early 2001, the size of the corpus exceeded 70 million characters and 400,000 phrases. It is continuously expanding. Currently, the part of the corpus database that has been put on the web comprises approximately 16 million characters and 190,000 phrases. It consists mainly of linguistic data compiled from July 1995 to June 1997. According to the LISRC schedule, the database will be expanded and renewed until June 2005. The total number of characters and phrases compiled at the end of the project is estimated to be 100 million and 600,000, respectively.

A Chinese language time capsule

"The corpus is like a time capsule, capturing the social, cultural, and linguistic developments of the six Chinese-speaking communities within a decade," Professor T'sou explained. "This provides valuable primary research materials for linguists and those interested in studying Chinese societies." One of the many important objectives of the corpus is to explore in depth the dynamics in the development of modern Chinese vocabulary. This includes examining the origins and subsequent forms of new-concept words, the development of meaning in words, the transference of old phrases, and phrases with local colour.

Guess how many common Chinese translations can be found for the term "Internet" in the six targeted communities? According to LIVAC records between 1995 and 2000, there are at least 13, and the most frequently used translation varies between the different Chinese-speaking communities. For instance, in Hong Kong, "互聯網" (pronounced hu lian wang in Putonghua) is often used; in Taiwan, "網際網路" (wang ji wang lu); in Singapore, "网际网络" (wang ji wang luo); in Macau, "互聯網絡" (hu lian wang luo); and in Shanghai and Beijing, "因特网" (yin te wang).
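For readers curious how such a comparison might be computed, here is a minimal sketch in Python. It is not the LISRC's actual method: the records, the counts, and the function name most_frequent_by_community are all hypothetical, invented purely for illustration. It simply tallies each translation per community and reports the most frequent one.

```python
from collections import Counter, defaultdict

# Hypothetical toy records of (community, translation in pinyin).
# These counts are invented for illustration and are NOT LIVAC data.
records = [
    ("Hong Kong", "hu lian wang"),
    ("Hong Kong", "hu lian wang"),
    ("Hong Kong", "wang ji wang lu"),
    ("Taiwan", "wang ji wang lu"),
    ("Taiwan", "wang ji wang lu"),
    ("Taiwan", "hu lian wang"),
    ("Beijing", "yin te wang"),
    ("Beijing", "yin te wang"),
    ("Beijing", "hu lian wang"),
]


def most_frequent_by_community(rows):
    """Return {community: (term, count)} for the top term in each community."""
    counts = defaultdict(Counter)
    for community, term in rows:
        counts[community][term] += 1
    # most_common(1) yields a one-element list holding the top (term, count) pair.
    return {c: ctr.most_common(1)[0] for c, ctr in counts.items()}


top_terms = most_frequent_by_community(records)
for community, (term, count) in top_terms.items():
    print(f"{community}: {term} ({count})")
```

A real corpus would first need the word segmentation and manual proofreading steps described earlier; this sketch assumes the terms have already been identified.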

Professor T'sou said, "The Chinese language is diverse, not a single entity. It carries different local colour in different communities. People often criticize the Chinese written language used by young people in Hong Kong as being mingled with Cantonese colloquial expressions. This is in fact a value judgment. The same language of the same locale develops differences over the passage of time. Language never stops evolving. The corpus lets us see the developments and variations of modern Chinese language in different Chinese communities over the last 10 years."

Unlimited application potential

The process of building the database is long, laborious and tedious, similar to "cultivating a barren continent" or "moving a huge mountain", Professor T'sou said. "However, when the task is completed and the result is a 'feast' to be shared by all who are interested, we forget about the hardship and feel rewarded."

Apart from academic research, a database with a huge linguistic corpus and built-in search and statistical functions has enormous potential for application. It is increasingly common now for Hong Kong's law courts to use Cantonese, and the Synchronous LIVAC Corpus can be used in the process of recording litigation. Mobile phones designed for Chinese input also need to be supported by a huge linguistic database. In fact, as Professor T'sou pointed out, some network and IT product development companies, such as the Japanese telecom giant NTT, Hong Kong's leading web content provider, tom.com, and a subsidiary of AOL, have already started applying the LIVAC database.
