The Story of Farhang, A Bilingual Dictionary Production System

More than nine years ago, while I was studying for a B.A. in German Language Translation Studies at the Azad University of Tehran, I was a student of Prof. Saied Firuzabadi, an author, translator and professor with a long track record of published books and articles.

In the summer of that year, he offered an extracurricular class on pragmatic translation skills. I enrolled right away, since I knew I would learn something valuable.

On the first day, he told us about his new project and achievement: he had written a German-Persian dictionary that ran to more than 10,000 handwritten A4 pages of words and meanings.

At that time, his biggest struggle was that he couldn’t find good typists to enter that information into a more manageable format that could be printed and published. He often told us about his frustration.

The next day, I kept thinking about ways to manage that amount of information. Then I found some software that could handle it.

One option was IDM DPS, a proprietary product. The other was TshwaneLex from TshwaneDJe Software. The former was used by very large dictionary publishers and was very expensive. The latter was affordable, but didn’t have good support for right-to-left languages (like Persian) at that time.

It seemed that I would have to create a custom solution to handle data entry, typesetting, generating publishing-ready PDFs and so on. I did some initial research on ways to manage this information and transform it into usable formats. This was the start of the story.

Later, I went to the class and told the professor that I had a solution to his data-management problem. He was astounded. I went to the whiteboard and sketched a very basic data-entry application, connected to a SQL database that would store the information for later use. Almost everyone was astounded.

There were two obstacles along the way. The first was that all that information had to be entered into the application manually; the second was the application itself. It had to be developed in a forward-looking way so that the data could be reused later, which meant making the data as structured as possible. For the first obstacle, a group of five students volunteered to do the job. The second I had to handle alone.

Later that month, I sketched out a set of UIs for the data-entry application. By analyzing well-regarded mono- and bilingual dictionaries in other languages, I was able to discern how they separate, label and manage that vast amount of information, which at first looked like an unmanageable haystack.

As an avid advocate and user of FOSS, I used Python, Qt4 and PostgreSQL to build the UI and the database tables. The very first MVP of the application was good enough for the team to do the data entry. Since they were all Windows users, I cross-compiled the application for them.

The team managed to do the data entry using the application. It took around six months to finish. At the same time, I was trying to devise a way to generate a PDF that looked like a bilingual dictionary, with all the details (CMYK, good-quality tables and figures, margins, …) that a publishing company needs in order to take the PDF and prepare, print and produce the hard-cover version.

I tried different solutions, from generating PDFs with various libraries to producing files that could be read by applications like Adobe InDesign. Almost none of them worked, again because of poor support for RTL languages in those libraries and applications.

Then I came across LaTeX and XeTeX. By creating a fairly simple class file in TeX, I was able to produce bilingual dictionaries with the quality of well-known dictionaries such as Cambridge, Oxford, Longman, Duden and Langenscheidt.

TeX File Used to Produce PDF Output
First Page of the Dictionary in PDF (Letter A)
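
The post does not show how the stored entries reached the TeX file, so the following is only a minimal sketch of one way it could be done: a Python function that turns one entry into a line of TeX source. The `\entry` macro, the function names and the entry fields are all hypothetical and would have to match whatever the real class file defines.

```python
# Characters that have a special meaning in TeX and must be escaped in plain text.
TEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def tex_escape(text):
    """Escape TeX special characters one character at a time."""
    return "".join(TEX_SPECIALS.get(ch, ch) for ch in text)


def render_entry(headword, pos, translations):
    """Format one entry as a call to a hypothetical \\entry macro."""
    # Join the Persian translations with the Persian comma.
    joined = "، ".join(tex_escape(t) for t in translations)
    return "\\entry{%s}{%s}{%s}" % (tex_escape(headword), tex_escape(pos), joined)


# Example: one German headword with a single Persian translation.
line = render_entry("Haus", "n.", ["خانه"])
```

Keeping the layout in the class file and emitting only macro calls like this means the data export stays simple, and the look of the dictionary can change without regenerating the data.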

The dictionary was typeset with the first version of the Farhang DPS, which was developed on Linux. Later, while editing the dictionary, I faced many problems with the SQL approach, so I migrated the data into a document-oriented NoSQL database, MongoDB. This choice gave me better locality of information: related details are nested inside a single document, which avoids SQL JOINs and their constraints.
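
To illustrate the difference, here is a minimal sketch in plain Python (no database involved). The field names and the three-table split are assumptions made for the example, not Farhang's actual schema. It shows the same entry stored relationally, where it has to be reassembled by joining rows, and as one nested document.

```python
# Relational style: one dictionary entry spread over three "tables",
# linked by ids. (Hypothetical schema, for illustration only.)
entries = [{"id": 1, "headword": "Haus"}]
senses = [{"id": 10, "entry_id": 1, "pos": "noun"}]
translations = [
    {"sense_id": 10, "text": "خانه"},
    {"sense_id": 10, "text": "منزل"},
]


def assemble_entry(entry_id):
    """Rebuild one entry by joining the three tables in application code."""
    entry = next(e for e in entries if e["id"] == entry_id)
    return {
        "headword": entry["headword"],
        "senses": [
            {
                "pos": s["pos"],
                "translations": [
                    t["text"] for t in translations if t["sense_id"] == s["id"]
                ],
            }
            for s in senses
            if s["entry_id"] == entry_id
        ],
    }


# Document style: the same entry kept as a single nested document,
# which is the shape a document database stores and returns directly.
entry_document = {
    "headword": "Haus",
    "senses": [{"pos": "noun", "translations": ["خانه", "منزل"]}],
}
```

In the relational layout every read of an entry pays for the join in `assemble_entry`; in the document layout the entry is already in the shape an editor or typesetter needs.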

I also changed the development environment to Windows and Visual Studio, and used C# and MongoDB to develop the second version of Farhang. I used Mercurial (hg) for version control and later migrated the repository to Git, so that I could make it FOSS and publish it on GitHub.

Second Version of Farhang DPS

In 2018, the hard-cover version was published by Langenscheidt Verlag in Germany. It is also planned to be published as a mobile app and in hard cover in my own country. The dictionary is around 2,700 pages.

There were days of sweat and tears spent reviewing and editing the data and generating new files, which were then used by different applications to produce the output PDF files. It was a painstaking and arduous set of tasks, all to make this treasure available to language learners and translators so that they can find their way through texts.