Tuesday, October 14, 2008

Pnyin - Basic Python module for adso -



















imron -

Stemming from the discussion in this thread, here is a basic python module that will perform
web-based queries against the adsotrans website and return the results as a list of tuples.

There are 3 files:

adso.py - the main module
adsotatepage.py - class that handles processing of the adsotrans webpage
test.py - simple test harness

If anyone is interested, it probably wouldn't be too hard to add a translatepage or a pinyinpage
class that would process and return the results from a translate or pinyin query.

To use the module, import it, and create an object of the Adso class.

I decided to write an Adso class rather than just having functions in the module, so that all the
different adso options (conjugation, grammar, encoding, encoding_out, numeric_pinyin and quality)
can easily be preserved across multiple calls. These values are set in the constructor, and are
simply strings that correspond to the values passed to the adso url.

Default values are:

conjugation='on'
grammar='on'
encoding='UTF-8S'
encoding_out='UTF-8S'
numeric_pinyin='off'
quality='high'
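
For readers without the attached files, here is a minimal sketch of how a class might keep these options across calls and turn them into a query string. It is not the code from adso.py: the class name AdsoOptions, the BASE_URL placeholder and the "text" parameter name are all assumptions made for illustration, and it is written for Python 3.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; the real query URL is not given in this post.
BASE_URL = "http://example.invalid/adso"


class AdsoOptions:
    """Stores the adso options once so every query reuses them."""

    def __init__(self, conjugation="on", grammar="on", encoding="UTF-8S",
                 encoding_out="UTF-8S", numeric_pinyin="off", quality="high"):
        # The values are plain strings, passed through to the URL unchanged.
        self.params = {
            "conjugation": conjugation,
            "grammar": grammar,
            "encoding": encoding,
            "encoding_out": encoding_out,
            "numeric_pinyin": numeric_pinyin,
            "quality": quality,
        }

    def build_url(self, text):
        # "text" as the parameter name is an assumption.
        query = dict(self.params, text=text)
        return BASE_URL + "?" + urlencode(query)


options = AdsoOptions(numeric_pinyin="on")
print(options.build_url("你好"))
```

Because the options live on the object, a second call to build_url with different text reuses the same settings without passing them again.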

To use, simply import the module, create an Adso object, and call the adsotate member function
with the text that you want.

from adso import Adso

adso = Adso()
result = adso.adsotate( '你好世界' )

result will be a list of tuples containing the values (chinese, pinyin, translation), with one
tuple per segment of text, ordered by the same order the segments appear in the original text.
e.g. the above example produces the result:

[ ( '你好', 'nǐhǎo', 'hello' ), ( '世界', 'shìjiè', 'world' ) ]

Note: the text you pass in should be encoded in whatever encoding you specified when creating
the Adso object (UTF-8 by default).
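
As a rough illustration of working with that return value, the sketch below hard-codes the sample result shown above instead of calling the website. The helper names format_segments and rejoin_chinese are made up for this example and are not part of the module.

```python
# Sample result copied from the example above; no network access needed.
result = [("你好", "nǐhǎo", "hello"), ("世界", "shìjiè", "world")]


def format_segments(segments):
    """Return one tab-separated line per (chinese, pinyin, translation) tuple."""
    lines = []
    for chinese, pinyin, translation in segments:
        lines.append(f"{chinese}\t{pinyin}\t{translation}")
    return "\n".join(lines)


def rejoin_chinese(segments):
    """Rebuild the original text; this works because segments keep their order."""
    return "".join(chinese for chinese, _, _ in segments)


print(format_segments(result))
print(rejoin_chinese(result))
```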

Anyway, it's all pretty basic at the moment, and doesn't really do anything more advanced than
generate a query to the main adsotrans webpage, and then parse the resulting html file. There's
also very little in the way of error checking, so you'll get exceptions if, for example, you
can't connect to the internet. It was done more as a proof of concept than anything else. Is
this the sort of thing you had in mind, Kudra?

BTW speaking of errors, I don't know if this is of interest to you Trevelyan, but the python
HTMLParser says the output generated by Adso has malformed start tags at various places in the
html. The w3.org validator reports errors in the same lines/columns, but it seems to be because
it's treating the < operator in some of the javascript as a start tag.
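
One possible workaround on the client side, sketched here under assumptions, is to drop the script blocks before handing the page to the parser. This is not what adsotatepage.py does; the names strip_scripts, TextCollector and visible_text are hypothetical, and the code targets Python 3's html.parser, which is more tolerant than the 2008 HTMLParser.

```python
import re
from html.parser import HTMLParser

# Matches a whole <script>...</script> block, including its contents.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Remove script blocks so a '<' inside JavaScript never reaches the parser."""
    return SCRIPT_RE.sub("", html)


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of a page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def visible_text(html):
    parser = TextCollector()
    parser.feed(strip_scripts(html))
    parser.close()
    return parser.chunks


sample = "<html><body><script>if (a < b) { x(); }</script><p>你好</p></body></html>"
print(visible_text(sample))
```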












kudra -

Haven't played with it yet, but from all appearances, in the words of Will Smith in Men in Black
I, "Now that's what I'm talking about!"

thanks.










bogleg -

Hi Imron,

Awesome work. Would you mind if I ported something like this over to Java? I'd love to be able to
use it in the ZDT and I'm sure others would use it as well.

Chris










trevelyan -

Looks good. Let me know if any changes are necessary on this end to help out. It would be possible
to create a script that just spat out the information delimited in a more convenient way for
parsing/processing if that would help or be faster.










kudra -

@trevelyan -- that would be convenient. In my experience of parsing yahoo pages, it is always a
pain when they change the html format. If you essentially provide an api, we python (or other
language) programmers won't have to worry when you change stuff around in the html.










imron -

@bogleg - go for it, it's not even 100 lines of code, so I can't imagine it'd take too long.
Though you might want to wait until trevelyan can produce a page with a more streamlined output.

@trevelyan - yeah, a more suitable format would be nice, and would certainly be more future-proof.
Maybe just a simple XML file along the lines of:

你好nǐhǎohello

(or less verbosely

)

You could of course add any extra other info that was relevant/useful (part of speech,
simplified/traditional conversion etc). All of which (including the 3 listed above) could be
toggled by parameters.

This format would also lend itself nicely to the other styles of queries (translation/pinyin),
which would simply just have one segment containing the entire body of text with the appropriate
pinyin/translation.
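
To make the idea concrete, here is a sketch of how a client could read such a file. The element names (adso, segment, chinese, pinyin, translation) are purely hypothetical, since no format had been agreed on at this point in the thread, and the sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample; the element names are assumptions, not an actual adso format.
SAMPLE = """<adso>
  <segment><chinese>你好</chinese><pinyin>nǐhǎo</pinyin><translation>hello</translation></segment>
  <segment><chinese>世界</chinese><pinyin>shìjiè</pinyin><translation>world</translation></segment>
</adso>"""


def parse_segments(xml_text):
    """Return a list of (chinese, pinyin, translation) tuples, one per segment."""
    root = ET.fromstring(xml_text)
    segments = []
    for seg in root.iter("segment"):
        segments.append((
            seg.findtext("chinese", default=""),
            seg.findtext("pinyin", default=""),
            seg.findtext("translation", default=""),
        ))
    return segments


print(parse_segments(SAMPLE))
```

A missing child element simply yields an empty string, so fields toggled off by a parameter would not break the client.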










trevelyan -

Currently takes GB2312 as input, but it will make sense to switch to UTF8. I'm not sure which
server to put it on. Probably the new one. Ping me if anyone is clamouring to set anything up
using it and I'll jump on supporting UTF sooner rather than later.

http://www.adsotate.com/adso/api.pl?text=%CB%FB%C3%C7
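
The text parameter in that URL is GB2312 bytes, percent-encoded. A small Python 3 sketch of encoding and decoding such a value follows; the helper names are made up for illustration.

```python
from urllib.parse import quote, unquote


def encode_gb2312(text):
    """Percent-encode text as GB2312 bytes for use in a query string."""
    return quote(text, encoding="gb2312")


def decode_gb2312(value):
    """Reverse of encode_gb2312."""
    return unquote(value, encoding="gb2312")


print(decode_gb2312("%CB%FB%C3%C7"))  # the text parameter from the URL above
print(encode_gb2312("他们"))
```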










bogleg -

I'm clamouring! Hook us up!

Chris










imron -

That's great! Thanks for that










trevelyan -

Ok. First file here takes in GB2312. The second takes in UTF8. Because of the need to support both
simplified and traditional, both files return content in UTF8.

http://www.adsotate.com/adso/api-gb2312.pl?text=TEXT
http://www.adsotate.com/adso/api-utf8.pl?text=TEXT

There's no guarantee these files will stay online here. So if you set up anything using
them send me an email so I can notify you if they move.
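
A minimal sketch of a client for the two files, assuming Python 3 and that TEXT is simply the percent-encoded input in the matching encoding. The function names are hypothetical, the response format is not described here so the body is returned undecoded beyond UTF-8, and the URLs may no longer be online.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Endpoints as posted above, keyed by the input encoding each one expects.
ENDPOINTS = {
    "gb2312": "http://www.adsotate.com/adso/api-gb2312.pl",
    "utf-8": "http://www.adsotate.com/adso/api-utf8.pl",
}


def build_api_url(text, encoding="utf-8"):
    """Pick the endpoint for the encoding and percent-encode the text to match."""
    return f"{ENDPOINTS[encoding]}?text={quote(text, encoding=encoding)}"


def fetch(text, encoding="utf-8", timeout=10):
    """Fetch the raw response; both endpoints are said to return UTF-8."""
    with urlopen(build_api_url(text, encoding), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(build_api_url("他们", "gb2312"))
print(build_api_url("他们"))
```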











