Under Community Review

Change to default Japanese segmentation rules regarding double quotation marks after full stops

Currently, with the default segmentation rule for Japanese, any quotation mark (") after a full stop (such as 。) is included at the end of the previous segment, rather than at the beginning of the next segment.

However, this behavior is often inappropriate, as in the case below:

Source

これは1つ目の例文です。"2つ目の例文"

Current default segmentation results

これは1つ目の例文です。"
2つ目の例文"

Unlike European languages, Japanese places no spaces between words, so spaces cannot be used to decide where one segment ends and the next begins. Straight single and double quotes (' and ") also give no indication of whether they are opening or closing quotation marks. It is therefore inappropriate to always place such a quote at the end of the previous segment when it follows a full stop (such as 。).

So, it would be better to change the default Japanese segmentation rule so that double quotation marks following a full stop are not included in the previous segment.
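
The difference between the two behaviors can be sketched with a small regular-expression model. This is only an illustration and not the actual segmentation engine: the function names, the set of terminating characters, and the patterns are all assumptions made for this example.

```python
import re

# Assumed for this sketch: a small set of Japanese terminating characters
# and the two straight quotes that can be either opening or closing marks.
TERMINATORS = "。！？"
QUOTES = "\"'"


def segment_current(text):
    """Model of the behavior described above: quotes that directly follow
    a terminator are kept at the end of the previous segment."""
    pattern = f"[^{TERMINATORS}]*[{TERMINATORS}]+[{QUOTES}]*|[^{TERMINATORS}]+$"
    return re.findall(pattern, text)


def segment_proposed(text):
    """Model of the requested behavior: break right after the terminator,
    so a following quote starts the next segment."""
    pattern = f"[^{TERMINATORS}]*[{TERMINATORS}]+|[^{TERMINATORS}]+$"
    return re.findall(pattern, text)


SOURCE = 'これは1つ目の例文です。"2つ目の例文"'

print(segment_current(SOURCE))   # ['これは1つ目の例文です。"', '2つ目の例文"']
print(segment_proposed(SOURCE))  # ['これは1つ目の例文です。', '"2つ目の例文"']
```

In this model the only difference between the two rules is whether the optional run of quotes after the terminator is consumed by the previous segment.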

  • Hello Paul,
    Thank you for your comment.
    However, when a double quotation mark is added to the "After break" field of the default Japanese "Terminating punctuation" rule in the TM's Language Resource settings, as shown below, the segmentation results are strange.

    Segmentation result with the default Terminating punctuation rule

    Segmentation result with the customized Terminating punctuation rule

    This behavior seems to be programmatically improper. Even if it is decided that the default rule should not be changed, it would be nice to at least let the user optionally change the rule so that quotation marks after a full stop or other terminating characters are not included in the previous segment.

  • I have not tested this yet. But why would this be better than the current behaviour in a situation like this?

    これは1つ目の例文です。"2つ目の例文"。これは1つ目の例文です。

    In this case, could the closing quote now end up in the wrong place?

  • Incidentally, as far as we tested, adjusting the "Terminating punctuation (full stop, ...)" rule for Japanese in the Language Resources settings of the translation memory did not change this behavior.
    Please note that this issue is based on Japanese-specific linguistic characteristics, rather than a merely technical issue.
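
To make the question about the closing quote concrete, here is a rough sketch that runs both rules over the longer example from this thread and over one extra input. It is a simplified model, not the product's segmentation engine: `split_segments`, its flag, and the terminator set are hypothetical names chosen for this example, and the second input (where the quote after 。 really is a closing quote) is an invented case, not one taken from the thread.

```python
import re

TERMINATORS = "。！？"  # assumed terminating characters for this sketch
QUOTES = "\"'"          # straight quotes, ambiguous as opening or closing


def split_segments(text, quote_stays_with_previous):
    """Split after each run of terminators.

    If quote_stays_with_previous is True, quotes directly after the
    terminator are kept in the previous segment (current behavior as
    described in this thread); otherwise they start the next segment
    (requested behavior).
    """
    tail = f"[{QUOTES}]*" if quote_stays_with_previous else ""
    pattern = f"[^{TERMINATORS}]*[{TERMINATORS}]+{tail}|[^{TERMINATORS}]+$"
    return re.findall(pattern, text)


# Example from the thread: the quote after the first 。 is an opening quote.
thread_example = 'これは1つ目の例文です。"2つ目の例文"。これは1つ目の例文です。'
print(split_segments(thread_example, True))
# ['これは1つ目の例文です。"', '2つ目の例文"。', 'これは1つ目の例文です。']
print(split_segments(thread_example, False))
# ['これは1つ目の例文です。', '"2つ目の例文"。', 'これは1つ目の例文です。']

# Hypothetical input: here the quote after 。 is a closing quote.
closing_example = '"1つ目の例文。"2つ目の例文。'
print(split_segments(closing_example, True))
# ['"1つ目の例文。"', '2つ目の例文。']
print(split_segments(closing_example, False))
# ['"1つ目の例文。', '"2つ目の例文。']
```

Under these assumptions, the requested rule keeps both quotes with the second sentence in the thread's example, while the hypothetical second input shows the opposite case, where a closing quote would move to the next segment. Neither fixed rule handles both inputs, which is an argument for at least making the behavior configurable.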