I translate JP-EN and often find the segmentation poor. Specifically, segments often contain more than one sentence, with multiple periods in them. Official documents often use very long sentences, which makes it even more important that text is divided cleanly at terminal marks.
This is the existing segmentation rule for JP-EN. Before break:
[。︒﹒.。︖﹖?︕﹗!]+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*
After break:
\s
That whitespace token looks odd to me, because spaces are not used in Japanese. I thought I would use a simpler form, like this:
.[。︒﹒.。︖﹖?︕﹗!]+
After break (any character):
.
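To make the comparison concrete, here is a minimal Python sketch of how I understand the two rule pairs to behave: a break is placed wherever the "before break" pattern ends and the "after break" pattern matches immediately afterwards. This is a toy model, not the tool's actual engine. The names `segment`, `SAMPLE`, `EXISTING` and `SIMPLE` are made up for illustration, and because Python's `re` module does not support `\p{...}` or character-class subtraction, the closing-punctuation class is replaced by a short hypothetical stand-in.

```python
import re

# Terminal marks copied from the character class in the rule above.
TERMINALS = "[。︒﹒.。︖﹖?︕﹗!]"

# Hypothetical stand-in for the closing-punctuation class; Python's `re`
# cannot express \p{Pe}\p{Pf}\p{Po} with subtraction, so this is an assumption.
CLOSERS = r'[」』)"]'

# (before break, after break) pairs.
EXISTING = (TERMINALS + "+" + CLOSERS + "*", r"\s")
SIMPLE = ("." + TERMINALS + "+", ".")

# Two sentences on one line, a line feed, then a third sentence.
SAMPLE = "これは文です。これも文です。\n次の行です。"


def segment(text, before_break, after_break):
    """Cut `text` wherever `before_break` ends and `after_break` matches next."""
    before_re = re.compile(before_break)
    after_re = re.compile(after_break)
    cuts = [
        m.end()
        for m in before_re.finditer(text)
        if after_re.match(text, m.end())  # must match right at the cut point
    ]
    bounds = [0] + cuts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    print(segment(SAMPLE, *EXISTING))
    print(segment(SAMPLE, *SIMPLE))
```

In this toy model the existing rule keeps the first two sentences together, because no whitespace follows the first 。, while the simpler rule separates them. The simpler rule does not itself cut at the line feed here, since `.` does not match `\n` by default in Python's `re`. Whether the tool's regex engine and line handling behave the same way is an assumption I have not verified.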
This simpler rule does a better job of segmenting sentences at Japanese periods (。) in my current job; I cannot find a single instance where it has put two sentences in one segment. However, it is also segmenting at line feeds or end-of-line characters; that is, it is cutting valid sentences in half. How can I adjust my segmentation rule to prevent this?
Regards
Dan