Opened 23 months ago

Last modified 8 months ago

#4487 new change

Auto-generation of translation string ids and comments in CMS

Reported by: juliandoucette
Assignee:
Priority: Unknown
Milestone: Websites editing service
Module: Sitescripts
Keywords:
Cc: kvas, saroyanm, fhd, greiner, wspee, jsonesen, lisabielik, Shikitita, ire
Blocked By:
Blocking:
Platform: Unknown / Cross platform
Ready: no
Confidential: no
Tester: Unknown
Verified working: no
Review URL(s):

CMS documentation: https://github.com/adblockplus/cms/#markdown-format-md
Translation string format review: https://codereview.adblockplus.org/29422615/

Description (last modified by juliandoucette)

Background

Our CMS requires a translatable string to consist of:

  1. id
    • [Required] provides a unique id for developers and translators that can be seen in pages and locale JSON files
      • Can be re-used per page to prevent duplicate text / translations
  2. comment
    • [Optional] provides location, context, and intention details to translators
  3. content
    • [Optional] provides default text
      • Can be omitted if text for the given id has already been defined on the page

See Review URL(s) for more information.
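
For illustration, the three parts above can be pulled out of a page with a small parser. This is a minimal sketch, assuming the {{ id[comment] content }} form; the regular expression and the names TranslatableString and find_strings are hypothetical and are not taken from the CMS source, whose real parser may accept more or less than this.

```python
import re
from typing import List, NamedTuple, Optional


class TranslatableString(NamedTuple):
    string_id: str            # required
    comment: Optional[str]    # optional, for translators
    content: Optional[str]    # optional default text


# Assumed pattern for "{{ id[comment] content }}" (not the CMS's own regex).
STRING_RE = re.compile(
    r"\{\{\s*"
    r"(?P<id>[\w-]+)"                  # required id
    r"(?:\[(?P<comment>[^\]]*)\])?"    # optional [comment]
    r"(?:\s+(?P<content>.*?))?"        # optional default text
    r"\s*\}\}",
    re.DOTALL,
)


def find_strings(source: str) -> List[TranslatableString]:
    """Return every translatable string found in a page source."""
    return [
        TranslatableString(
            m.group("id"),
            m.group("comment"),
            m.group("content") or None,  # empty content means "re-use the id"
        )
        for m in STRING_RE.finditer(source)
    ]
```

With this sketch, "{{ intro[Page heading] Hello world }}" yields an id, a comment, and default text, while a later "{{ intro }}" on the same page yields only the id.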

Although this system is effective, it is inconvenient because:

  • We must create unique string ids per string
  • We must create repetitive string comments per string
  • The format breaks formatting in some editors
  • The format is non-trivial for non-developers
  • The format is easily broken
  • The format adds overhead to training and review

What to change

  1. Auto-generate unique string ids
    1. Auto-detect string addition/subtraction/changes
  2. Auto-generate string comments
    1. Based on heuristics, e.g. {$HEADLINE} {$TAG_NAME} {$INDEX}
    2. Must be manually overridable
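
The comment heuristic in item 2 could look roughly like the sketch below. It assumes the page has already been reduced to a flat list of blocks (tag name plus text); the Block type, HEADLINE_TAGS, and fill_comments are hypothetical names used only for illustration. A manually written comment is always kept, which covers the "manually overridable" requirement.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

HEADLINE_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


@dataclass
class Block:
    tag: str                       # e.g. "h2", "p", "li"
    text: str
    comment: Optional[str] = None  # manually written comment, if any


def fill_comments(blocks: List[Block]) -> List[Block]:
    """Fill missing comments as "<headline> <tag> <index>"."""
    headline = ""
    counts: Dict[str, int] = {}
    for block in blocks:
        if block.tag in HEADLINE_TAGS:
            headline = block.text  # a new section starts here
            counts = {}            # indexes restart under each headline
        counts[block.tag] = counts.get(block.tag, 0) + 1
        if block.comment is None:  # never overwrite a manual comment
            block.comment = f"{headline} {block.tag} {counts[block.tag]}".strip()
    return blocks
```

Note that the index counts every block of that tag under the headline, including ones with manual comments, so the generated comment reflects the position on the page.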

OR

Provide a means for translating entire pages, e.g. by putting them in the locales/${LANG}/pages directory

Change History (15)

comment:1 Changed 23 months ago by juliandoucette

  • Summary changed from Automate auto-generation of translation string ids and comments to Auto-generation of translation string ids and comments in CMS

comment:2 Changed 23 months ago by trev

The string IDs serve multiple purposes:

  • Give translators an idea about the purpose: Is it a label for some element (foo_label)? What does it describe (foo_introduction)?
  • Make sure that minor formulation changes (adding an article, fixing a typo) don't invalidate translations, as the string ID doesn't change. At the same time, changes which affect the meaning of a string should change the string ID and invalidate translations.
  • Make sure that reordering of text doesn't affect translations - if we move a text section to the beginning of the text its string IDs won't change, meaning no new translations.

I seriously wonder how you plan to achieve this via automatically generated string IDs. We already have automatically generated IDs in the web.adblockplus.org repository (produced in the migration from Anwiki), and it is anything but great.

comment:3 Changed 23 months ago by kvas

During the meeting I proposed the following id autogeneration scheme:

  1. Unique ids (based on the location or not) are generated for each string that doesn't have an id yet. It is saved back into the original document.
  2. If a string is moved, the id is never re-generated automatically, so once generated it never changes.
  3. When new strings are added, ids are generated for them making sure that the uniqueness is preserved (this might break the numbering according to order so perhaps we could just skip that altogether).
  4. It's possible to manually specify an id from the start or to change an autogenerated id to a manual one.

This solves the problems with insertion, reordering and uniqueness by sacrificing the numbering-follows-order property. Later I also realised that we would need to mark strings without an id (where we want it to be generated) because otherwise {{This is a string}} is not robustly distinguishable from {{MyId This is a string}}. I suppose we could add a syntax for this, e.g. {{* This is a string that needs an autogenerated id}}.

However, now after reading Julian's list of problems with the current scheme, I see that what I was perceiving as the main goal of this change (saving brain cycles by removing the need to invent string ids) is only part of it. My proposal doesn't address the issues with syntax being complicated (although I don't agree that it is) or with syntax breaking the display in editors (although it's compatible with other proposals to fix this, for example switching to an HTML-tag based syntax, e.g. <string id="foo" comment="bar">Default value</string> or <span cms:id="foo" cms:comment="bar">Default value</span>).
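
The scheme proposed in this comment (steps 1 to 4 plus the {{* ...}} marker) can be sketched as follows. This is only an illustration under stated assumptions: the function name assign_ids, the counter-based ids ("s1", "s2", ...), and both regular expressions are hypothetical, and a real tool would presumably generate more meaningful ids.

```python
import re

# "{{* text}}" marks a string that still needs an id (syntax proposed above).
MARKER_RE = re.compile(r"\{\{\*\s*")
# Ids that are already present, written manually or generated earlier.
EXISTING_ID_RE = re.compile(r"\{\{\s*([\w-]+)")


def assign_ids(source: str, prefix: str = "s") -> str:
    """Replace each "{{* " marker with "{{<new id> ", keeping ids unique.

    Strings that already have an id are left untouched, so an id never
    changes once it has been written back into the document.
    """
    used = set(EXISTING_ID_RE.findall(source))
    counter = 0

    def next_id(_match: "re.Match[str]") -> str:
        nonlocal counter
        while True:
            counter += 1
            candidate = f"{prefix}{counter}"
            if candidate not in used:  # skip ids already used on the page
                used.add(candidate)
                return "{{" + candidate + " "

    return MARKER_RE.sub(next_id, source)
```

Running the function a second time on its own output changes nothing, and moving a string around the page does not touch its id. The price, as noted above, is that the numbering no longer follows the order of the strings.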

comment:4 Changed 23 months ago by juliandoucette

Give translators an idea about the purpose: Is it a label for some element (foo_label)? What does it describe (foo_introduction)?

I don't see the string id in crowdin. Do we give other translators the JSON file?

I seriously wonder how you plan to achieve this via automatically generated string IDs. We already have automatically generated IDs in the web.adblockplus.org repository (produced in the migration from Anwiki), and it is anything but great.

My intention was to report the issue and ask the sitescripts team to investigate.

If there is no good sitescripts solution, I might also suggest a codingtools solution that auto-fills missing string ids/comments in a similar way.

comment:5 Changed 23 months ago by trev

It is saved back into the original document.

This would need to happen before commit, otherwise it will add unnecessary noise to the commit history. Other than that - yes, autogenerating IDs for new strings wouldn't be too bad, assuming that the algorithm would provide sufficient context that the ID could be kept for most strings. However, I understood the proposal in such a way that the document wasn't supposed to contain any IDs at all and these would be generated completely dynamically.

I don't see the string id in crowdin.

It's visible under "Context" - collapsed by default.

comment:6 Changed 23 months ago by juliandoucette

However, I understood the proposal in such a way that the document wasn't supposed to contain any IDs at all and these would be generated completely dynamically.

Not necessarily. My goal was to list all of the problems that I have encountered and ask for a solution that helps as much as possible.

Last edited 23 months ago by juliandoucette

comment:7 Changed 23 months ago by trev

Note that a perfect implementation would recognize translation units automatically. In other words, it would be able to detect text paragraphs and split them up into sentences. A sentence could have an optional ID at the start, something not too heavy on the syntax like ~foo_label~ or ~foo_label~comment~ - if the ID is missing then some tool could autogenerate a meaningful ID and insert it before the text is committed.

This would be great but technically this is non-trivial to say the least...
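
To make the idea concrete, here is a deliberately naive sketch of such a splitter, assuming the ~id~ and ~id~comment~ prefixes suggested above. The names split_units, SENTENCE_SPLIT_RE, and PREFIX_RE are hypothetical. Sentences are split on whitespace after ".", "!" or "?", which already fails on abbreviations such as "e.g." and on comments that contain a full stop, and that is part of why a robust version is non-trivial.

```python
import re
from typing import List, Optional, Tuple

# Naive sentence boundary: whitespace that follows ".", "!" or "?".
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")
# Optional "~id~" or "~id~comment~" prefix at the start of a sentence.
PREFIX_RE = re.compile(r"^~(?P<id>[\w-]+)~(?:(?P<comment>[^~]*)~)?\s*")

Unit = Tuple[Optional[str], Optional[str], str]  # (id, comment, text)


def split_units(paragraph: str) -> List[Unit]:
    """Split a paragraph into translation units, one per sentence."""
    units: List[Unit] = []
    for sentence in SENTENCE_SPLIT_RE.split(paragraph.strip()):
        if not sentence:
            continue
        match = PREFIX_RE.match(sentence)
        if match:
            text = sentence[match.end():]
            units.append((match.group("id"), match.group("comment"), text))
        else:
            # No id yet: a separate tool would generate one before commit.
            units.append((None, None, sentence))
    return units
```

Units that come back with an id of None are the ones for which an id would still have to be generated and inserted before the text is committed.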

comment:8 follow-up: ↓ 9 Changed 23 months ago by juliandoucette

It's visible under "Context" - collapsed by default.

Have you checked this recently?

I don't see an id under "Context" when expanded.

comment:9 in reply to: ↑ 8 Changed 23 months ago by trev

Replying to juliandoucette:

I don't see an id under "Context" when expanded.

It's there for other projects - but it seems that the CMS upload script has a bug here.

comment:10 Changed 23 months ago by greiner

  • Cc greiner added

comment:11 Changed 13 months ago by juliandoucette

  • Cc wspee jsonesen lisabielik Shikitita ire added
  • Description modified
  • Review URL(s) modified

Updates:

  • I created a proof of concept script that wraps content in Markdown with translation string markup: {{ id[context] content }}
    • It was relatively painless to write using Node.js and marked
  • I think I can do the same using the DOM in Node.js
    • But I haven't tried it yet

I added a second option to "What to change" that I think is particularly relevant, considering that our translators do not currently use crowdin and Shikitita seems to be exporting the translations into Word documents anyway. I don't know yet whether this option makes sense or is technically feasible. Feedback would be greatly appreciated @kvas, @jsonesen, @wspee.

comment:12 Changed 13 months ago by juliandoucette

Note:

I think it's necessary for Websites Infrastructure to solve this problem in order to realistically provide the means for non-developers to contribute to websites.

comment:13 Changed 13 months ago by juliandoucette

I can also imagine solving this problem using a custom ~WYSIWYG editor UI (e.g. clicking on a paragraph would show its id and comment in a sidebar, auto-generated for new content or populated from existing content), but that is even more hypothetical and further down the road (not likely to happen soon).

comment:14 Changed 11 months ago by juliandoucette

  • Milestone set to Websites editing service

comment:15 Changed 8 months ago by fhd

  • Cc trev removed