Pootle Metadata Storage
This is an attempt to crystallize some of the discussions we have had recently on improving the Pootle architecture. If you have a different idea for how things should work, the mailing list is the right place to discuss it, but clarifications are welcome on this page.
These are some of the interacting issues we need to consider in parallel:
Scaling to multiple processes (e.g. when running under Apache) requires better locking of our file interaction, changes etc
Scaling to larger numbers of files (e.g. 180000 for Debian) will probably require faster statistics generation etc (in general, translation metadata)
Moving to a generic API for translation storage (the base classes we use for PO, XLIFF, etc) requires reworking our storage interaction
Locking and Base Classes are already being worked on in the Pootle-locking-branch. In addition, because of the Base Classes work, we have been factoring out the statistics generation, which was horribly intertwined with PO file interaction.
It may be helpful to review the terms used in the base classes - they are outlined under terminology
Current status
Information Stored
Metadata here includes the following information that we currently store or use and that is either not stored in the PO file or is summary information:
Counts of translation units (messages) in translation stores including
Number of strings in a translation store, and number of translated/untranslated strings
Number of words in original and translation of each string
Which strings in a file (referenced by position in the file) are translated/untranslated
Which strings in a file have suggestions waiting for processing
Results of passing strings through checks
Assignment information
Quick Statistics for a translation project (which is a set of translation stores translating a project into a language)
Goal information for a translation project
A number of goals can be defined for a project
each goal has a list of files or directories (implying all files within that directory) categorised in that goal
each goal has a list of users assigned to that goal
Rights for a translation project
There are default rights, rights for a ‘nobody’ user (not logged in), and rights that can be assigned to specific users
These rights currently include view, suggest, translate, review, download archive, compile to mo, assign strings/goals, and administrate
Users for Pootle
Authentication info: username, email address, hash of password, activation status
Site-wide rights (project administrator)
user preferences - selected projects and languages (for shortcuts)
Storage Formats
This information is currently stored in text files.
Counts and checks are stored in a text file called xxx.po.stats. This file also contains timestamps for the PO file and the suggestions file it depends on, recorded when the stats were last updated
Assignments are stored in another text file called xxx.po.assigns
Quick Statistics are stored in a translation project stats file in CSV format - pootle-$project-$language.stats
Goals and Rights are stored in a project prefs file (also a text file) - pootle-$project-$language.prefs
Users are stored in a project-wide users prefs file (also a text file) - users.prefs
This is all far too messy and we need to clean it up properly.
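To make the timestamp dependency concrete, here is a minimal sketch of how cached stats could be checked for staleness. It assumes the stats record keeps the modification times of the PO file and the pending file; the function names and the way the recorded times are passed in are hypothetical, since the actual layout of the .stats file is not described on this page.

```python
import os


def current_mtime(path):
    """Return the file's modification time, or None if it does not exist."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def stats_are_stale(po_path, pending_path, recorded_po_mtime, recorded_pending_mtime):
    """Return True if either dependency changed since the stats were written.

    The recorded_* values are whatever was saved with the stats
    (hypothetical parameters; None means the file did not exist then).
    """
    return (current_mtime(po_path) != recorded_po_mtime
            or current_mtime(pending_path) != recorded_pending_mtime)
```

Any replacement storage would still need an equivalent check, because the translation files themselves can change outside Pootle.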
Other Data
Other data that is not stored in the actual translation file (but isn’t strictly metadata):
Suggestions
These are suggestions that are waiting to be accepted / rejected
currently stored in a po file alongside the original po file called xxx.po.pending
for synchronization it is important that pending changes include the original source string, the original target translation and the new target translation; otherwise we cannot detect conflicts
Currently we only store the original source and new target, but this is really a topic for a separate page.
Text Indexes
We currently index all the strings and translations in a Lucene text index (if PyLucene enabled / available)
This really helps for fast text searching; Lucene is world class in this regard
Indexes are stored in one Lucene index per translation project.
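The point made under Suggestions about needing the original target can be illustrated with a small sketch. It assumes a suggestion record that carries all three strings; the Suggestion type, its field names and the result labels are hypothetical and only serve to show why the original target matters.

```python
from collections import namedtuple

# Hypothetical record for a pending suggestion; field names are illustrative.
Suggestion = namedtuple("Suggestion", "source old_target new_target")


def classify_suggestion(suggestion, current_source, current_target):
    """Compare a pending suggestion with the unit as it is now in the store."""
    if suggestion.source != current_source:
        # The source string changed after the suggestion was made.
        return "source-changed"
    if current_target == suggestion.new_target:
        # The store already contains the suggested translation.
        return "already-applied"
    if current_target != suggestion.old_target:
        # The translation changed since the suggestion was made.
        return "conflict"
    return "clean"
```

Without old_target, the "conflict" case cannot be told apart from "clean", which is the limitation of storing only the original source and the new target.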
Plan for Relational Database
This is a proposal to move all of the above metadata (Counts, Checks, Quick Statistics, Assignment, Goals, Rights and User information) to a backend relational database. This move would also give us an opportunity to clean up exactly what metadata we need to store, how it interacts with changes and locking, etc.
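As a rough illustration of what this could look like, the sketch below stores per-unit counts in two tables and derives the Quick Statistics with a query instead of a separate stats file. Everything here is an assumption made for the example: the table and column names are invented, and SQLite is used only because it ships with Python, not because it is the proposed backend.

```python
import sqlite3

# Hypothetical schema: one row per translation store, one row per unit.
SCHEMA = """
CREATE TABLE store (
    id INTEGER PRIMARY KEY,
    project TEXT NOT NULL,
    language TEXT NOT NULL,
    path TEXT NOT NULL,
    UNIQUE (project, language, path)
);
CREATE TABLE unit_stats (
    store_id INTEGER NOT NULL REFERENCES store(id),
    position INTEGER NOT NULL,      -- position of the unit in the file
    translated INTEGER NOT NULL,    -- 1 if translated, 0 otherwise
    source_words INTEGER NOT NULL,
    target_words INTEGER NOT NULL,
    PRIMARY KEY (store_id, position)
);
"""


def quick_stats(conn, project, language):
    """Return (total units, translated units, source words) for a project."""
    row = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(u.translated), 0),
               COALESCE(SUM(u.source_words), 0)
        FROM unit_stats AS u
        JOIN store AS s ON s.id = u.store_id
        WHERE s.project = ? AND s.language = ?
        """,
        (project, language),
    ).fetchone()
    return tuple(row)
```

A caller would open a connection, run `conn.executescript(SCHEMA)` once, and then call `quick_stats(conn, project, language)`; how the rows are kept in step with file changes and locking is exactly the design question this page still has to answer.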
Contentious Issues
Discussions that this would raise:
Discussions about how we connect to the database, which databases we support, etc, etc
Discussions about whether we should store all the translations in the database as well rather than the current file-based system
These are the easiest things for people to suggest, without getting into the nitty-gritty of solving problems. Discussion of these should take place separately to this discussion and planning. Reasons for this:
Database Support is really up to the developers who actually implement this, although it is important that the right choice is made and we could give criteria here
Storing all translations in the Database basically amounts to a redesign of Pootle, if proposed as the only way of storing translations. There are also complex issues that relate to synchronising with version control, allowing download/upload of files etc. And it is a great advantage of Pootle that you can currently just run it on a bunch of translation files. As has been pointed out on the Debian list, it makes more sense to later consider implementing TranslationStores that use the database as a backend, if anything.
We can only handle so much change at one time and we already have 3 or 4 major changes going on, so let's make sure some of our current improvements land before we take up too much time discussing the above.
Other options considered
Other options we looked at for how to store metadata:
As they currently are (text files) - too clumsy, difficult to extend, complex to handle locking etc etc
In XML files - more extensible but otherwise all the above problems
Within the translation files - creates problems with working with upstream versions of files, etc
In a more simple non-relational database like bsddb (included in Python) - not much advantage over relational database except inclusion in Python, probably less scalable
Try and store within Lucene Text Index - not really designed for this purpose, makes Lucene a hard requirement
An Object DB / Python Persistence engine - not as standard, not necessarily as open to other tools
In the end, this is the kind of thing relational databases were designed for, so it seems a clear choice
Issues for Design