Vietnamese disyllabic word tokenization issues

Posts: 6 | Likes: 0 | Joined: 21/6/2025 | Location: AX
Native
English
Learning Vietnamese

I have noticed that many disyllabic Vietnamese words are split into separate syllables when they should be glossed as a single unit.


For example: hoành tráng (imposing; monumental; glorious; grand)

is glossed in the text as:


hoành (diaphragm)

and

tráng (rinse)


hoành tráng is a Sino-Vietnamese loanword from Chinese (宏壯), and its meaning has nothing to do with the glosses given for the individual syllables above.


Vietnamese is typically described as an isolating language, and while individual syllables do have meaning, disyllabic words like this make up the majority of Vietnamese vocabulary.


This could cause issues down the line: compounds in other texts will be split, and their syllables marked as 'known', when they are in fact parts of a compound word that has not been learned yet.
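
To make the concern concrete, here is a minimal Python sketch. It assumes a reader that tokenises on whitespace and tracks known words as a plain set; the function name `mark_known` and the sample data are hypothetical and are not the site's actual implementation.

```python
def mark_known(text: str, known: set[str]) -> list[tuple[str, bool]]:
    """Split on whitespace and flag each token that is in the known set."""
    return [(tok, tok.lower() in known) for tok in text.split()]


# Hypothetical learner who has marked both syllables as known on their own,
# but has never met the compound.
known = {"hoành", "tráng"}
result = mark_known("hoành tráng", known)
print(result)
```

Both syllables come back flagged as known, so the compound never shows up as a new word even though its meaning is unrelated to either syllable.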

Posted 
0
#1
Posts1753Likes1148Joined18/3/2018LocationBellingham / US
Native
English
Learning Lao
Other Chinese - Mandarin, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swahili, Tagalog, Thai

There's a join and split tool that works for Vietnamese (it doesn't work for all languages), but it's a premium feature.

Learning Isaan every day!

Posted 
0
#2
Posts: 6 | Likes: 0 | Joined: 21/6/2025 | Location: AX
Native
English
Learning Vietnamese

Thank you for your reply. I am currently trialling the site before paying for membership. It looks really cool, and I want to support the project, but I need to check its functionality thoroughly. I have added nearly 3000 known words, but I have found that the majority of unknown words are split up and therefore not recognised or marked as known words.


For example, the Wikipedia page for Vietnam consists almost entirely of Sino-Vietnamese compounds, like any other formal, academic or technical text. In this case every single word has been cut into single morphemes. I have attached a screenshot for reference. As a result, the vocabulary function for nearly every word is inaccurate or unusable.


For further reference, this list of the 11,000 most common Vietnamese words shows how the majority are disyllabic compounds: https://github.com/duyet/vietnamese-wordlist/blob/master/Viet11K.txt
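
A rough way to check that claim against any wordlist is sketched below in Python. It assumes one entry per line with syllables separated by spaces (the exact format of the linked file is not checked here); `multisyllable_share` is a hypothetical helper and the sample entries are purely illustrative, not taken from the file.

```python
def multisyllable_share(entries: list[str]) -> float:
    """Return the fraction of non-empty entries with more than one syllable.

    Vietnamese orthography separates syllables with spaces, so counting
    whitespace-separated parts approximates the syllable count.
    """
    words = [e.strip() for e in entries if e.strip()]
    if not words:
        return 0.0
    multi = sum(1 for w in words if len(w.split()) > 1)
    return multi / len(words)


# Tiny illustrative sample, not taken from the linked wordlist.
sample = ["hoành tráng", "Việt Nam", "và", "quốc gia"]
share = multisyllable_share(sample)
print(f"{share:.0%}")  # 75%
```

To run it on a real list, the lines of the downloaded file would be passed in place of `sample`.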


Chinese texts would likely have the exact same tokenisation problems.


How does the join tool work? It would be quite tedious if each compound had to be typed out in a separate window and added one by one. If it could be done graphically with just a couple of clicks or hotkeys, that would speed things up considerably.

Edited 
0
#3
Posts: 6 | Likes: 0 | Joined: 21/6/2025 | Location: AX
Native
English
Learning Vietnamese

Suggestion:


A hotkey to link single Vietnamese morphemes into a disyllabic compound


For example:


control + right click word 1

control + right click word 2


turns them into a single word


The word would then be recognised by the standard dictionary that is used.
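
As a sketch of what such a join could do behind the scenes, the Python snippet below merges two adjacent tokens into one. It assumes the text is held as a simple list of tokens and that the two clicks supply the token positions; `join_tokens` is a hypothetical name, not an existing function of the site.

```python
def join_tokens(tokens: list[str], first: int, second: int) -> list[str]:
    """Return a new token list with two adjacent tokens merged into one word."""
    if second != first + 1:
        raise ValueError("only adjacent tokens can be joined")
    if first < 0 or second >= len(tokens):
        raise IndexError("token index out of range")
    # Keep the space so the merged token matches the dictionary spelling.
    merged = f"{tokens[first]} {tokens[second]}"
    return tokens[:first] + [merged] + tokens[second + 1:]


tokens = ["rất", "hoành", "tráng"]
joined = join_tokens(tokens, 1, 2)  # positions of the two clicked words
print(joined)
```

The merged token keeps the space between the syllables, so it can be looked up in a dictionary under its usual spelling.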

Edited 
0
#4
Posts1753Likes1148Joined18/3/2018LocationBellingham / US
Native
English
Learning Lao
Other Chinese - Mandarin, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swahili, Tagalog, Thai

palace3416.3416 wrote:
How does the join tool work?
Like this. I just realized that it's your first week, so you are on a premium trial and can try it out.


Posted 
0
#5
Posts: 6 | Likes: 0 | Joined: 21/6/2025 | Location: AX
Native
English
Learning Vietnamese

Thank you! That function is very useful.


I have gone through another text and worked through a couple of pages. Rather than having to click the 'join' button, is there a hotkey that I can press?


Basically 90% of the words need to be joined. I guess the number would reduce over time if these new words are identified as such in future texts? Are words that I join also marked as such for other users? 


Ultimately, the parser needs to pick up on syllables that commonly go together. For example, if Việt is next to Nam, 99.99% of the time it will be one word, Việt Nam. This website is miles ahead of LingQ for Vietnamese support, but with only a handful of Vietnamese learners, will there be any upgrades in the future?
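
One simple way a parser could do this is greedy longest-match segmentation against a word list, sketched below in Python. This is only an illustration under stated assumptions: the `segment` function, the `max_len` limit, and the two-entry lexicon are hypothetical, and nothing here reflects how the site's parser actually works.

```python
def segment(syllables: list[str], lexicon: set[str], max_len: int = 4) -> list[str]:
    """Greedy longest-match: at each position, take the longest lexicon entry."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate.lower() in lexicon:
                words.append(candidate)
                i += n
                break
    return words


lexicon = {"việt nam", "hoành tráng"}  # hypothetical mini-lexicon
words = segment("Việt Nam rất hoành tráng".split(), lexicon)
print(words)
```

With a lexicon built from a large wordlist, most common compounds would be joined automatically. A greedy match still gets ambiguous sequences wrong, so a manual join and split tool would remain useful for corrections.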

Posted 
0
#6
Posts1753Likes1148Joined18/3/2018LocationBellingham / US
Native
English
Learning Lao
Other Chinese - Mandarin, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swahili, Tagalog, Thai

I'll ask the tech team to consider your suggestions. Maybe there is another parser they can use or something, but I can't promise anything.


Posted 
1
#7