How to improve Japanese segmentation

I translate JP-EN and often find that the segmentation is poor. Specifically I find that segments often consist of more than one sentence, with multiple periods in them. Official documents often use very long sentences, so this makes it even more important that text is divided cleanly at terminal marks.

This is the existing segmentation rule for JP-EN. Before break:

[。︒﹒.。︖﹖?︕﹗!]+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*

After break:

\s

That whitespace token looks odd to me, because spaces are not used in Japanese. I thought I would use a simpler form, like this:

.[。︒﹒.。︖﹖?︕﹗!]+

Followed by anything:

.

This does a better job of segmenting sentences at Japanese periods (。) in my current job; I cannot find a single instance where it has put two sentences in a segment. However, I am finding that it is segmenting at line feeds or end-of-line characters - that is, it is cutting valid sentences in half. How can I adjust my segmentation rule to prevent this?

Regards

Dan

  • Hi Dan,

    >Specifically I find that segments often consist of more than one sentence, with multiple periods in them
    Would you be able to show an example of this?

    >That whitespace token looks odd to me, because spaces are not used in Japanese.
    Be careful here, as the whitespace character doesn't necessarily mean just spaces.
    I believe `\s` covers the following ASCII characters:
    ' ' (0x20) space (SPC)
    '\t' (0x09) horizontal tab (TAB)
    '\n' (0x0a) newline (LF)
    '\v' (0x0b) vertical tab (VT)
    '\f' (0x0c) form feed (FF)
    '\r' (0x0d) carriage return (CR)

    and also the Unicode space categories, including characters such as IDEOGRAPHIC SPACE (U+3000).
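To check which characters `\s` accepts, a quick test outside the translation tool can help. The sketch below uses Python's `re` module, which is an assumption on my part: your tool's regex engine may classify a few characters differently, so treat the result as indicative only. The `candidates` dictionary is just an illustrative name.

```python
import re

# Characters to test against \s. The last entry is the Japanese full stop,
# included as a control that should NOT count as whitespace.
candidates = {
    "SPACE": "\u0020",
    "TAB": "\t",
    "LF": "\n",
    "VT": "\v",
    "FF": "\f",
    "CR": "\r",
    "IDEOGRAPHIC SPACE": "\u3000",
    "IDEOGRAPHIC FULL STOP": "\u3002",
}

# For str patterns, Python's \s is Unicode-aware by default.
matches = {name: bool(re.fullmatch(r"\s", ch)) for name, ch in candidates.items()}

for name, hit in matches.items():
    print(f"{name}: {hit}")
```

In Python, everything in the list is reported as whitespace except the full stop, so a rule ending in `\s` can fire on a full-width space or a line ending as well as on an ordinary space.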

    > I am finding that it is segmenting at line feeds or end-of-line characters - that is, it is cutting valid sentences in half.
    I think the catch-all "anything" pattern (the period) may be causing this.
    Perhaps try the opposite approach, i.e. break only when the next character is not one of the whitespace characters listed above?
    [^\s]
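To see how the "after break" pattern changes the result, here is a rough simulation in Python. It is only a sketch under assumptions: `split_segments` is a hypothetical helper that mimics a before-break/after-break rule, `TERMINALS` is a shortened version of the terminal-mark class from the original rule, and the real tool may apply its rules (and any paragraph-level breaks) differently.

```python
import re

def split_segments(text, before_break, after_break):
    """Break after each before_break match that is directly followed by after_break."""
    before = re.compile(before_break)
    after = re.compile(after_break)
    segments, start = [], 0
    for m in before.finditer(text):
        cut = m.end()
        # Only break if the text right after the terminal mark matches after_break.
        if after.match(text, cut):
            segments.append(text[start:cut])
            start = cut
    if start < len(text):
        segments.append(text[start:])
    return segments

# Shortened terminal-mark class (assumption: a subset of the original rule).
TERMINALS = r"[。？！?!]+"

sample = "これは一文目です。これは二文目です。\n次の行です。"

with_whitespace = split_segments(sample, TERMINALS, r"\s")
with_non_whitespace = split_segments(sample, TERMINALS, r"[^\s]")

print(with_whitespace)
print(with_non_whitespace)
```

With `\s` as the after-break pattern, the first two sentences stay in one segment because nothing whitespace-like follows the first 。. With `[^\s]`, the text breaks between those two sentences but not at the line feed, so how line endings are handled still depends on the tool's own paragraph settings.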
