
po4a-gettextize

Convert a file to a PO file.

TLDR

Convert a text file to a PO file

$ po4a-gettextize --format [text] --master [path/to/master.txt] --po [path/to/result.po]


Get a list of available formats
$ po4a-gettextize --help-format


Convert a text file along with a translated document to a PO file (-l option can be provided multiple times)
$ po4a-gettextize --format [text] --master [path/to/master.txt] --localized [path/to/translated.txt] --po [path/to/result.po]

SYNOPSIS

po4a-gettextize -f fmt -m master.doc [-l XX.doc] -p XX.po

(XX.po is the output, all others are inputs)

DESCRIPTION

po4a (PO for anything) eases the maintenance of documentation translation using the classical gettext tools. The main feature of po4a is that it decouples the translation of content from its document structure. Please refer to the page po4a (7) for a gentle introduction to this project.

The po4a-gettextize script is in charge of converting documentation files into PO files. You only need it to set up your translation project with po4a, never afterward.

If you start from scratch, po4a-gettextize will extract the translatable strings from the documentation and write a POT file. If you provide a previously existing translated file with the -l flag, po4a-gettextize will try to use the translations that it contains in the produced PO file. This process remains tedious and manual, as explained in Section 'Converting a manual translation to po4a' below.

If the master document has non-ASCII characters, the newly generated PO file will be in UTF-8. Otherwise (if the master document is entirely ASCII), the generated PO file will use the encoding of the translated input document, or UTF-8 if no translated document is provided.

OPTIONS

-f, --format

Format of the documentation you want to handle. Use the --help-format option to see the list of available formats.

-m, --master

File containing the master document to translate. You can use this option multiple times if you want to gettextize multiple documents.

-M, --master-charset

Charset of the file containing the document to translate.

-l, --localized

File containing the localized (translated) document. If you provided multiple master files, you may wish to provide multiple localized files by using this option more than once.
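As a small sketch of repeating -m and -l, the loop below builds the argument list for several master/translation pairs and prints the resulting command instead of running it. The chapter names, the `.fr.txt` suffix, and `fr.po` are hypothetical placeholders, and the sketch assumes the localized files are given in the same order as their masters.

```shell
#!/bin/sh
# Dry run: assemble the po4a-gettextize arguments for several documents.
# intro/usage and the .fr.txt suffix are placeholder names, not po4a conventions.
set -- -f text
for chapter in intro usage; do
  # One -m (master) and one -l (localized) per document, kept in the same order.
  set -- "$@" -m "$chapter.txt" -l "$chapter.fr.txt"
done
set -- "$@" -p fr.po

# Print the command for review; replace echo with the real call once it looks right.
echo po4a-gettextize "$@"
```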

-L, --localized-charset

Charset of the file containing the localized document.

-p, --po

File where the message catalog should be written. If not given, the message catalog will be written to the standard output.

-o, --option

Extra option(s) to pass to the format plugin. See the documentation of each plugin for more information about the valid options and their meanings. For example, you could pass '-o tablecells' to the AsciiDoc parser, while the text parser would accept '-o tabs=split'.

-h, --help

Show a short help message.

--help-format

List the documentation formats understood by po4a.

-V, --version

Display the version of the script and exit.

-v, --verbose

Increase the verbosity of the program.

-d, --debug

Output some debugging information.

--msgid-bugs-address email@address

Set the report address for msgid bugs. By default, the created POT files have no Report-Msgid-Bugs-To fields.

--copyright-holder string

Set the copyright holder in the POT header. The default value is Free Software Foundation, Inc.

--package-name string

Set the package name for the POT header. The default is PACKAGE.

--package-version string

Set the package version for the POT header. The default is VERSION.
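The four header options can be combined in one call. The following is a minimal sketch, assuming a throwaway master file and placeholder metadata (`docs@example.org`, `Example Project`, `example-manual`, `1.0`); none of these values come from po4a itself. The call is skipped when po4a-gettextize is not installed.

```shell
#!/bin/sh
# Sketch only: the master file content and all metadata values are placeholders.
workdir=$(mktemp -d)
printf 'A first paragraph.\n\nA second paragraph.\n' > "$workdir/manual.txt"

if command -v po4a-gettextize >/dev/null 2>&1; then
  po4a-gettextize --format text \
    --master "$workdir/manual.txt" \
    --po "$workdir/manual.pot" \
    --msgid-bugs-address "docs@example.org" \
    --copyright-holder "Example Project" \
    --package-name "example-manual" \
    --package-version "1.0"
  # Show the header lines that the options above are expected to fill in.
  grep -E 'Report-Msgid-Bugs-To|Project-Id-Version|Copyright' "$workdir/manual.pot"
else
  echo "po4a-gettextize is not installed; skipping" >&2
fi
```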

Converting a manual translation to po4a

po4a-gettextize will try to extract the content of any provided translation file, and use this content as msgstr in the produced PO file. Be warned that this process is very fragile: the Nth string of the translated file is supposed to be the translation of the Nth string in the original. This will naturally not work unless both files share exactly the same structure.

Internally, each po4a parser reports the syntactical type of each extracted string. This is how desynchronizations are detected during the gettextization. For example, if the files have the following structure, it is very unlikely that the 4th string in the translation (of type 'chapter') is the translation of the 4th string in the original (of type 'paragraph'). It is more likely that a new paragraph was added to the original, or that two original paragraphs were merged together in the translation.

Original      Translation
chapter       chapter
paragraph     paragraph
paragraph     paragraph
paragraph     chapter
chapter       paragraph
paragraph     paragraph
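The idea of comparing string types position by position can be illustrated with standard tools. This is only a sketch of the principle using the example structure above, not po4a's actual implementation; the two type lists are written by hand here, whereas po4a obtains them from its parsers.

```shell
#!/bin/sh
# Hypothetical type sequences, one string type per line, mirroring the example above.
orig_types=$(mktemp)
trans_types=$(mktemp)
printf '%s\n' chapter paragraph paragraph paragraph chapter paragraph > "$orig_types"
printf '%s\n' chapter paragraph paragraph chapter paragraph paragraph > "$trans_types"

# Walk both sequences in parallel and report the first position where the types differ.
first_mismatch=$(paste "$orig_types" "$trans_types" |
  awk -F '\t' '$1 != $2 {
    printf "string %d: original=%s translation=%s\n", NR, $1, $2
    exit
  }')

echo "${first_mismatch:-no desynchronization detected}"
rm -f "$orig_types" "$trans_types"
```

With the sequences above, the first difference is reported at the 4th string, matching the discussion in the text.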

po4a-gettextize will verbosely diagnose any detected structure desynchronization. When this happens, you should manually edit the files (this probably requires that you have some notions of the target language). You must add fake paragraphs or remove some content in one of the documents (or both) to fix the reported disparities, until the structures of both documents match perfectly. Some tricks are given in the next section.

Even when the document is successfully processed, undetected disparities and silent errors are still possible. That is why any translation associated automatically by po4a-gettextize is marked as fuzzy to require a manual inspection by humans. One has to check that each retrieved msgstr is actually the translation of the associated msgid, and not the string before or after.
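To keep track of how much review remains, the fuzzy flags in the PO file can be counted with grep. The sketch below works on a small hand-written sample; the entries are invented for illustration and are not real po4a output.

```shell
#!/bin/sh
# Hand-written sample PO content; the entries are illustrative only.
po_file=$(mktemp)
cat > "$po_file" <<'EOF'
#, fuzzy
msgid "Introduction"
msgstr "Introduction"

msgid "Second paragraph."
msgstr ""

#, fuzzy
msgid "Usage"
msgstr "Utilisation"
EOF

# Flag lines start with "#," and may list several flags, so match "fuzzy" anywhere on them.
fuzzy_count=$(grep -c '^#,.*fuzzy' "$po_file")
echo "$fuzzy_count entries still need review"
rm -f "$po_file"
```

Once an entry has been checked by hand, removing its fuzzy flag in a PO editor lowers this count.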

As you can see, the key here is to have the exact same structure in the translated document and in the original one. The best is to do the gettextization on the exact version of master.doc that was used for the translation, and only update the PO file against the latest master file once the gettextization was successful.
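This two-step order can be sketched as a small shell function. All file names are hypothetical, the text format is only an example, and the po4a-updatepo call assumes that it accepts the same -f, -m and -p options as po4a-gettextize; check po4a-updatepo (1) for your version before relying on it.

```shell
#!/bin/sh
# Sketch only: arguments are placeholders supplied by the caller.
#   $1  master version that the translation was written against
#   $2  existing translated document
#   $3  latest master version
#   $4  PO file to create and then update
gettextize_then_update() {
  old_master=$1
  translation=$2
  new_master=$3
  po=$4
  # Step 1: pair the translation with the master version it was made from.
  po4a-gettextize -f text -m "$old_master" -l "$translation" -p "$po" || return 1
  # Step 2: only after step 1 succeeds, refresh the PO file against the latest master.
  po4a-updatepo -f text -m "$new_master" -p "$po"
}

# Example call (hypothetical files):
#   gettextize_then_update master.old.txt translated.txt master.txt fr.po
```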

If you are lucky enough to have a perfect match in the file structures, building a correct PO file is a matter of seconds. Otherwise, you will soon understand why this process has such an ugly name :) But remember that this grunt work is the price to pay to get the comfort of po4a afterward. Once converted, the synchronization between master documents and translations will always be fully automatic.

Even when things go wrong, gettextization often remains faster than translating everything again. I was able to gettextize the existing French translation of the whole Perl documentation in one day, even though the structures of many documents were desynchronized. That was more than two megabytes of original text (2 million characters): restarting the translation from scratch would have required several months of work.

Hints and tricks for the gettextization process

The gettextization stops as soon as a desynchronization is detected. In theory, it should probably be possible to resynchronize the gettextization later in the documents using, for example, the same algorithm as the diff (1) utility. But manual intervention would still be needed to match the elements that couldn't be matched automatically, which explains why automatic resynchronization is not implemented (yet?).

When this happens, the whole game comes down to realigning the structures of these damn files through manual edits. po4a-gettextize is rather verbose about what went wrong when it happens. It reports the strings that don't match, their positions in the text, and the type of each of them. Moreover, the PO file generated so far is dumped as gettextization.failed.po for further inspection.
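Since the dump contains everything matched before the failure, its last entries usually point to where the two documents drift apart. The helper below is a sketch with a hypothetical function name; it assumes that entries in the dump are separated by blank lines, as is usual for PO files.

```shell
#!/bin/sh
# Print the last few entries of a PO file (default: gettextization.failed.po).
# show_last_entries is a hypothetical helper name, not part of po4a.
show_last_entries() {
  po_file=${1:-gettextization.failed.po}
  count=${2:-3}
  # Paragraph mode (RS = ""): each blank-line-separated entry is one record.
  awk -v n="$count" '
    BEGIN { RS = ""; ORS = "\n\n" }
    { rec[NR] = $0 }
    END {
      for (i = (NR > n ? NR - n + 1 : 1); i <= NR; i++) print rec[i]
    }' "$po_file"
}

# Example call: show_last_entries gettextization.failed.po 5
```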

Here are some other tricks to help you in this tedious process:

AUTHORS

Denis Barbier <barbier@linuxfr.org>
Nicolas Francois <nicolas.francois@centraliens.net>
Martin Quinson (mquinson#debian.org)

COPYRIGHT AND LICENSE

Copyright 2002-2020 by SPI, inc.

This program is free software; you may redistribute it and/or modify it under the terms of GPL (see the COPYING file).

SEE ALSO

po4a (1), po4a-normalize (1), po4a-translate (1), po4a-updatepo (1), po4a (7).
