One Small Site for a Kurd, one Giant Leap for Kurdistan

Could you be the person who goes down in history as the William Caxton of Kurdistan?  William Caxton, you may remember, is reckoned to be the man who introduced the printing press to England and who, as a printer, became the first retailer of printed books there.  I believe a transliteration tool would be a huge technological and educational leap forward for Kurds and Kurdish culture worldwide.

Why is this needed? Most Kurds can only properly read and write their language in one script, either the Latin script or the Arabic script.  And foreigners, even those who sweat away learning the Arabic script, will almost always be able to read Kurdish in Latin script many times faster.

چاوانی باشی؟ → Çawan î baş î? (The standard greeting: How are you, are you well?)

But some people, myself included, are constantly converting one script into the other to meet the needs of a varied readership, in presentations and transnational Facebook groups, for example.  Kurdish Wikipedia is a good example: it is probably being used more and more, but many people are hindered from reading it because they can’t read, or can’t read at all well, the Kurmanji site in Latin script or the Sorani site in modified Arabic script (sometimes called the ‘Kurdish’ script).

It is true that there are tools that do this with about 90% accuracy: Chawg and Lexilogos, for example, the former marginally better than the latter, it seems.  And Kurmanji Wikipedia itself allows you to switch between scripts, though, again, erratically.  So my question in this blog post is: is there a volunteer, or even a professional IT engineer, who could blaze a trail and design an online tool that does this with near 100% accuracy?

At present there is no way to get ‘technology to do it all for you’.  For those not familiar with the scripts, here is one reason why: ‘و’ in Arabic script is transliterated as either u or w.  You just have to know the word and then decide whether it’s the ‘u’ sound or the ‘w’ sound.
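To make the ambiguity concrete, here is a minimal Python sketch of what a purely letter-by-letter converter is faced with. The letter table is my own tiny, hypothetical one (four letters, nowhere near a full alphabet) and the function name is invented for illustration; it is not how Chawg or Lexilogos work, as far as I know.

```python
from itertools import product

# Hypothetical, deliberately tiny letter table (NOT a full Sorani alphabet).
# Each Arabic-script letter maps to every Latin reading it can have.
CANDIDATES = {
    "\u06a9": ["k"],       # ک
    "\u0648": ["u", "w"],  # و : the vowel u or the consonant w
    "\u0631": ["r"],       # ر
    "\u062f": ["d"],       # د
}

def candidate_readings(word):
    """Return every Latin spelling the letter table allows for `word`."""
    # Letters missing from the table are passed through unchanged.
    options = [CANDIDATES.get(ch, [ch]) for ch in word]
    return ["".join(combo) for combo in product(*options)]

kurd = "\u06a9\u0648\u0631\u062f"  # کورد
print(candidate_readings(kurd))    # ['kurd', 'kwrd']
```

The machine can list both candidates, but only someone (or something) that knows the word can say that ‘kurd’ is the right one.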

My input into this conversation is as follows:

  • What is needed is a word-bank which contains every word in every possible inflection, listed with its correct transcription from one alphabet to the other.  This is where our language consultancy has a lot of experience.  Whenever we edit dictionary pages on ku.wiktionary.org we enter a word in both scripts.  So, the data is there: see for example this dictionary page where a giraffe is spelt in Arabic script midway down the page.  Many entries on Kurdish Wiktionary exist in only one script, so we would need a bot to roughly transliterate them and insert a flag asking the reader to correct the transcription if it is flawed.
  • Furthermore, many more word-forms need to be created so as to cover plurals and different tenses and cases.  Some of that has already been automated.  See, for example, the box containing all verb conjugations of the Kurdish word for ‘put’.  But in Wiktionary many of these word-forms have not yet been given pages of their own, defining a word as e.g. “second-person singular past perfect of the verb ‘to put’”.  So, that’s another big project.
  • Another question is whether the data within ku.wiktionary.org can easily be mined and placed at the disposal of a machine transliteration tool.
  • If it can, the tool should be set up so that users can submit their corrections to the site’s draft transliteration.  That way, a lot of creases will be ironed out for future users.
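For any engineer reading, the workflow in the points above might be sketched roughly as follows in Python. Everything here is an assumption of mine for illustration: the `WordBank` class, the `rough_guess` fallback and its four-letter table are hypothetical, and a real tool would load its entries from mined Wiktionary data rather than start empty.

```python
# Hypothetical, deliberately tiny letter table for the rough fallback.
ROUGH_LETTERS = {
    "\u06a9": "k",  # ک
    "\u0648": "w",  # و : naively always w, which is sometimes wrong
    "\u0631": "r",  # ر
    "\u062f": "d",  # د
}

def rough_guess(word):
    """Letter-by-letter draft; unknown letters are passed through."""
    return "".join(ROUGH_LETTERS.get(ch, ch) for ch in word)

class WordBank:
    """Maps whole Arabic-script words to checked Latin spellings."""

    def __init__(self):
        self.entries = {}

    def lookup(self, word):
        """Return (latin, verified). Unverified drafts should be flagged."""
        if word in self.entries:
            return self.entries[word], True
        return rough_guess(word), False

    def submit_correction(self, word, latin):
        """Store a reader's correction so later users get it directly."""
        self.entries[word] = latin

bank = WordBank()
kurd = "\u06a9\u0648\u0631\u062f"      # کورد
print(bank.lookup(kurd))               # ('kwrd', False): flagged draft
bank.submit_correction(kurd, "kurd")
print(bank.lookup(kurd))               # ('kurd', True)
```

The point of the design is that the rough converter only ever produces drafts; once a reader has corrected a word, the checked spelling always wins.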

That would be a fine tool for all those who read and write seriously in Kurdish.  30 million people worldwide could stand to benefit.  But I’ll leave you with a final example of how fiddly this procedure is:

  • One of the most tedious aspects of transliterating Kurdish from Arabic to Latin script is that Arabic script has no capitals.  To be sure, پاریس would obviously be Paris, not paris.  But the real fine-tuning comes when it has to be decided whether ئازاد (azad) is the adjective meaning ‘free’ or capital-A Azad, a boy’s name.
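A tool could at least separate the easy cases from the hard ones. Here is a small Python sketch under my own assumptions: the two word lists and the `capitalise` function are hypothetical, and it works on words that have already been converted to lower-case Latin script.

```python
# Hypothetical word lists, keyed by the lower-case Latin spelling.
ALWAYS_CAPITALISED = {"paris"}  # only ever a proper noun
AMBIGUOUS = {"azad"}            # adjective 'free' or the name Azad

def capitalise(latin_words):
    """Return (word, needs_review) pairs for lower-case Latin words."""
    result = []
    for word in latin_words:
        if word in ALWAYS_CAPITALISED:
            # Safe to capitalise without asking anyone.
            result.append((word[:1].upper() + word[1:], False))
        elif word in AMBIGUOUS:
            # Leave as-is and flag it for a human to decide.
            result.append((word, True))
        else:
            result.append((word, False))
    return result

print(capitalise(["paris", "azad"]))
# [('Paris', False), ('azad', True)]
```

Deciding the flagged cases automatically would need the surrounding sentence, which is exactly the fiddly part.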

 

 
