Vietnamese Language Graph 🇻🇳

Xin chào các bạn!

I'm in the process of learning Vietnamese, and something I want to focus on is expanding my vocabulary! Vietnamese has quite a few differences from English, but an important one is its low morpheme-to-word ratio. Vietnamese forms more complicated words (or words in different tenses) by combining morphemes into compound words. When you learn a single-morpheme word in Vietnamese, you can usually build on it to create more complex words around the same idea. This project uses Python and Obsidian to visualize these compound words, as a way to improve vocabulary while learning the language.

Video

To see how this project was created, and get a glimpse of my problem-solving method, check out the video here:

Project Summary

This Python project takes a Vietnamese to English dictionary in the form of an XML file. It manipulates the XML data into a Pandas DataFrame, and then passes it to a search function. There are two search functions in the project:

reverse_search_method: The reverse search method receives a DataFrame of words. It splits each word into morphemes and reverses the list. It then iterates over the morphemes, builds a sub-word whose morpheme length equals the current index, and reverses that sub-word to restore the proper order. The sub-word is then added to the connection list for the main word. This method is deprecated in favor of dict_match_search_method for two reasons. It creates false positives by linking to words that are not in the dictionary (i.e. morphemes that add meaning to a compound word but are not words in their own right), and it misses connections to morpheme combinations formed from the center morphemes of a word with four or more morphemes.
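The description above can be read in more than one way, so the following is only a minimal sketch of one reading, not the code from main.py. The function name reverse_search and the choice to stop before the full word (so a word is not linked to itself) are assumptions, and it works on a single string rather than a DataFrame.

```python
def reverse_search(word: str) -> list[str]:
    """Return the trailing sub-words of a space-separated compound word."""
    # Split into morphemes and reverse the list, as the description says.
    reversed_morphemes = word.split()[::-1]
    sub_words = []
    # Assumption: stop before the full length so the word never links to itself.
    for length in range(1, len(reversed_morphemes)):
        # Take the first `length` reversed morphemes, then restore reading order.
        candidate = reversed_morphemes[:length][::-1]
        sub_words.append(" ".join(candidate))
    return sub_words


print(reverse_search("sinh hóa học"))  # ['học', 'hóa học']
```

Under this reading, only sub-words anchored at the end of the word are produced and nothing is checked against the dictionary, which matches the two weaknesses described above: a four-morpheme word "a b c d" never yields the center pair "b c", and every candidate is linked whether or not it is a real word.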

dict_match_search_method: The dictionary match search method receives a DataFrame of words and iterates over it. For each word, it creates an array of all the possible sub-words that could exist in it. For instance, the word sinh hóa học (biochemistry) contains five possible sub-words:

One Morpheme
- sinh
- hóa
- học
Two Morphemes
- sinh hóa
- hóa học

Each sub-word is compared against a filtered dictionary DataFrame containing only words with the same morpheme length as the sub-word. If the sub-word exists in the dictionary, a connection is made between the root word and the sub-word. This method is slightly slower than reverse_search_method, but every connection points to a word that exists in the dictionary, so there are no dead links.
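The sub-word enumeration and dictionary lookup can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the function names are hypothetical, a plain dict of sets keyed by morpheme length stands in for the filtered Pandas DataFrame, and the four-entry dictionary is toy data rather than the real word list.

```python
from collections import defaultdict


def possible_sub_words(word: str) -> list[str]:
    """List every contiguous run of morphemes shorter than the word itself."""
    morphemes = word.split()
    total = len(morphemes)
    sub_words = []
    for length in range(1, total):  # 1 .. total - 1 morphemes
        for start in range(total - length + 1):
            sub_words.append(" ".join(morphemes[start:start + length]))
    return sub_words


def build_length_index(dictionary_words):
    """Group dictionary entries by morpheme count (stand-in for the filtered DataFrame)."""
    index = defaultdict(set)
    for entry in dictionary_words:
        index[len(entry.split())].add(entry)
    return index


def dict_match_search(word: str, index) -> list[str]:
    """Keep only the sub-words that exist in the dictionary at the same length."""
    return [
        sub_word
        for sub_word in possible_sub_words(word)
        if sub_word in index.get(len(sub_word.split()), set())
    ]


# Toy dictionary for illustration only.
toy_index = build_length_index(["sinh", "học", "hóa học", "sinh hóa học"])

print(possible_sub_words("sinh hóa học"))
# ['sinh', 'hóa', 'học', 'sinh hóa', 'hóa học']
print(dict_match_search("sinh hóa học", toy_index))
# ['sinh', 'học', 'hóa học']
```

With the toy data, hóa and sinh hóa are dropped because they are not dictionary entries, so no link is created to a note that does not exist.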

Running the script outputs 23,000 Markdown files, which can then be imported into an Obsidian Vault. This will result in a graph view like the following:
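As a rough idea of how one word could become one note, here is a minimal sketch. The note layout, the function names render_note and write_note, and the sample definition are all assumptions; the only thing relied on from Obsidian is that a [[note name]] wikilink resolves to the note with that file name and appears as an edge in the graph view.

```python
from pathlib import Path


def render_note(definition: str, connections: list[str]) -> str:
    """Build the note body: the definition, then one wikilink per connected sub-word."""
    lines = [definition, ""]
    lines += [f"- [[{sub_word}]]" for sub_word in connections]
    return "\n".join(lines) + "\n"


def write_note(vault_dir: Path, word: str, definition: str, connections: list[str]) -> Path:
    """Write <word>.md into the vault folder and return its path."""
    vault_dir.mkdir(parents=True, exist_ok=True)
    note_path = vault_dir / f"{word}.md"
    note_path.write_text(render_note(definition, connections), encoding="utf-8")
    return note_path


print(render_note("biochemistry", ["sinh", "học", "hóa học"]))
```

Calling write_note(Path("ObsidianVault"), "sinh hóa học", "biochemistry", [...]) once per dictionary word would fill the vault folder with linked notes under these assumptions.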

How to use

You can use this project to visualize Vietnamese as well!

Required Software:

Obsidian
Python
source.txt

Instructions:

Clone this repository to your computer
Run python main.py
Open Obsidian
Click Open another vault (bottom of left navigation bar)
Click Open folder as Vault
Select the folder ObsidianVault in this repository

Obsidian will take some time to index and import this vault. That time is greatly decreased by not opening the Graph View until the indexing is complete.

Sources

Quang Hiển's Vietnamese Dictionary

DavidASix/vietnamese-language-graph
