Saturday, July 12, 2008

Suggestions to overcome China’s Great Firewall

This is intended for webmasters who would like to celebrate the natural beauty of freedom of information on the Internet during the 2008 Olympics. Since the censorship technology in China is based partly on keywords and partly on URLs, my suggestions here are about exploiting the weakness of the former.

Suggested Directions:
1. Mischaracterization: Declare a character set in the webpage’s META data that does not match the one the page is actually written in. Most Chinese readers only visit websites with Chinese characters on them, and the software that does the censoring does NOT understand the meaning of Chinese words. There are five commonly adopted character sets for displaying Chinese text, from Big5 to UTF. If a webpage that is meant to be displayed in UTF is misread by the censor software as Big5, what the censor program sees is nothing but gibberish. It is natural for Netizens inside China to adjust the encoding setting when what is displayed doesn’t make sense, but it is NOT natural for censorship software to do that. Having to try all five character sets would increase the load on the censorship software 5 times.
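As a rough illustration of suggestion 1, the sketch below models a keyword filter that trusts the declared character set. The filter function `scan_as`, the sample text, and the choice of UTF-8 versus Big5 are assumptions made for the example; real censorship software may well behave differently.

```python
# A minimal sketch, assuming a filter that decodes the page with the
# charset declared in its META data and then does a plain substring match.
PAGE_TEXT = "民主"   # hypothetical page content, really written in UTF-8
KEYWORD = "民主"     # hypothetical keyword on the filter's list


def scan_as(raw: bytes, declared_charset: str, keyword: str) -> bool:
    """Decode the raw bytes with the declared charset and look for the keyword."""
    decoded = raw.decode(declared_charset, errors="replace")
    return keyword in decoded


raw_bytes = PAGE_TEXT.encode("utf-8")

# The filter believes the (wrong) Big5 declaration and sees gibberish.
filter_hit = scan_as(raw_bytes, "big5", KEYWORD)

# The reader switches the browser to the real encoding and sees the text.
reader_hit = scan_as(raw_bytes, "utf-8", KEYWORD)

print("filter finds keyword:", filter_hit)
print("reader finds keyword:", reader_hit)
```

The same bytes give two different readings; only the one decoded with the real character set contains the keyword.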

2. A more advanced idea is to break the webpage into several character sets. For instance, break a webpage into partitions of Simplified Chinese, UTF and Traditional Chinese (HK style). It would be troublesome for Netizens to adjust the encoding for individual partitions, but a single webpage divided into 3 parts would increase the difficulty of the task for censorship software by 125 times (5 possible character sets for each of 3 partitions gives 5 × 5 × 5 = 125 combinations). To alleviate the frustration of Netizens, we should develop software that can automate the task of ‘deciphering’ the encoding of each webpage.
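The arithmetic behind the figures in suggestions 1 and 2 can be checked with a short sketch. The five codec names in `CANDIDATES` are an assumption for illustration; the post does not say which five character sets it has in mind.

```python
from itertools import product

# Assumed list of five candidate encodings (Python codec names).
CANDIDATES = ["utf-8", "big5", "big5hkscs", "gb2312", "gbk"]


def count_combinations(num_partitions: int, candidates=CANDIDATES) -> int:
    """Count the ways to assign one candidate encoding to each partition."""
    return sum(1 for _ in product(candidates, repeat=num_partitions))


print(count_combinations(1))  # whole page in one unknown encoding
print(count_combinations(3))  # page split into three partitions
```

A brute-force scanner that does not know the encoding of each partition would, in the worst case, have to try every one of these assignments.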

3. A related idea is to display the taboo word or phrase only as a picture. The censor software can’t make any decision about pictures; it can only deal with raw text. Many webmasters already do this to display Chinese characters that are not covered by the character set declared for the webpage. How difficult is it to change only a few characters on a webpage?
Moreover, to make censorship harder, the webmaster shouldn’t just turn all the taboo items into pictures; the webmaster should do so randomly on both taboo and non-taboo items. Doing so may require software that automatically transforms the required characters into pictures. That is not difficult for a webmaster, but it would be VERY DIFFICULT for the censorship software to transform the pictures back into Chinese characters that may or may not be relevant to the censorship process.

4. Add meaningless numbers, symbols, characters from another language (like English), pictures or spaces in between the characters of the phrase that the censor software is looking for. Chinese text makes sense only through specific combinations of characters, and this method is meant to disrupt that relationship for the censor software. For instance, democracy (民主) is made of two characters. The censorship software can’t block everything that starts with 民 or everything that ends with 主, nor anything like 民 主, since the software doesn’t UNDERSTAND; it works only by inflexible mechanical rules.
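Suggestion 4 can be sketched as follows. The names `break_up`, `naive_filter` and `reader_view`, and the set of filler characters, are hypothetical; the filter here is assumed to do nothing more than an exact substring match, which is the kind of mechanical rule described above.

```python
KEYWORD = "民主"        # hypothetical keyword on the filter's list
FILLERS = " *1x"        # assumed filler characters a reader would ignore


def break_up(phrase: str, filler: str) -> str:
    """Insert the filler between every pair of characters in the phrase."""
    return filler.join(phrase)


def naive_filter(text: str, keyword: str = KEYWORD) -> bool:
    """Mechanical rule: flag the text only if the exact keyword appears."""
    return keyword in text


def reader_view(text: str, fillers: str = FILLERS) -> str:
    """Model a human reader who mentally skips the filler characters."""
    return "".join(ch for ch in text if ch not in fillers)


disguised = break_up(KEYWORD, "*")
print(disguised)
print("filter flags it:", naive_filter(disguised))
print("reader recovers it:", reader_view(disguised) == KEYWORD)
```

The sketch only covers an exact-match rule; a filter that strips the same filler characters before matching would behave like `reader_view`.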

5. Deliberately misalign or mis-indent the text to break up the taboo words that the censorship software targets. It is easy for a human to adjust the webpage in front of them, but it would be very difficult for a censorship program to try all possible indentations and alignments to recover the intended reading of the webpage. Besides, the censorship program itself does NOT understand anything of the content, so it has to test every possible combination mechanically to look for taboo words. Even then, it can never tell which is the intended way of displaying the content of the webpage.

6. Translate part of the taboo word into another language like English, for instance democracy (民主) into people 主, or people master. Translation software is widely available on the Internet, and it is not against the law in China to look for it. Therefore it is easy for Netizens to read the correct meaning, but not for the censorship software.

7. Use only the pronunciation to represent the whole or part of the taboo item. The censorship software is NOT equipped with the ability to recognize a character through its pronunciation. This does, however, require an intimate understanding of how Chinese characters are pronounced by different groups of Chinese speakers. Moreover, the censorship software may confuse different types of data. For instance, some Chinese Netizens use 1314 to represent ‘my whole life’ (一生一世); imagine if a Chinese Netizen used (1生1世). How can the censorship software tell whether the numbers are intended as pronunciation or as stand-ins for Chinese characters?

If every webpage that Communist China wants to censor adopted all or some of the above techniques, the cost and time of censorship would go up more than 1000 times. Let’s see if it wants to slow down the Internet 1000 times during the Beijing Olympics!
