First off, I admit I don't have a clue how the new fragment matching feature works; I suspect it's a cross between a TM, machine translation and segment comparison. But given the size of my TM and termbase, it behaves in sometimes puzzling ways that make me wonder whether an element of manual tweaking would not greatly improve things.
1. An easy starter: it seems to insert spaces in really weird places. I do a lot of software localization and it certainly does not like placeholders and punctuation. Some common headscratchers are:
- blabla. » blabla.. (multiple ending full stops)
- <b> » < b> (spaces being added)
- items '{0}' » nithean'{0}' (spaces being deleted)
- & » something&;
2. It also does not like hyphens (at least not when they are word characters); for instance, the word a-rithist (again) often gets reduced to rithist when fragments to its left or right have been matched.
3. CrazyCaps, as I think of them: it uplifts a fragment from the start of some sentence and drops it into the middle of another, uppercase initial and all.
4. Language stuff
- it does not seem capable of learning that a language's word order can be anything other than Subject-Verb-Object
- it does not cope well with inflection across word boundaries, like the patterns between a definite article and a noun
- it does not cope well with other inflections, such as past tense marking (which in my case is marked word-initially, e.g. seas (stand) » sheas (stood))
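For what it's worth, the hyphen problem smells like a tokenization setting. A minimal Python sketch of the difference (the regexes are my own illustration, not anything the product actually uses):

```python
import re

# Illustration only: how a tokenizer's view of "-" changes what a-rithist
# looks like to a fragment matcher.
segment = "a-rithist"

# Hyphen treated as punctuation: the word falls apart
split_tokens = re.findall(r"\w+", segment)

# Hyphen treated as a word character: the word survives whole
whole_tokens = re.findall(r"\w+(?:-\w+)*", segment)

print(split_tokens)  # ['a', 'rithist']
print(whole_tokens)  # ['a-rithist']
```

If the matcher only ever sees the split tokens, it can never learn that a-rithist is one unit, which would explain both the mangling and the missed termbase hits.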
So I'm just musing whether there might not be a benefit in adding a way of manually specifying certain language-specific things, such as the predominant word order of a language, whether hyphens and apostrophes are usually word characters or punctuation, whether a language tends towards lowercasing or not, and perhaps even some inflectional patterns. Something like a locale grammar cribsheet that would nudge whatever black magic drives fragment matching in the right direction?
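To make the cribsheet idea concrete, it could be as small as a per-locale settings block. A hypothetical sketch for Gaelic (every key and value here is invented for illustration; no such setting exists in the product as far as I know):

```python
import re

# Hypothetical "locale grammar cribsheet" for Scottish Gaelic.
GAELIC_CRIBSHEET = {
    "word_order": "VSO",             # Gaelic is verb-initial, not SVO
    "hyphen_is_word_char": True,     # keep a-rithist together
    "apostrophe_is_word_char": True,
    "prefers_lowercase": True,       # discourage CrazyCaps mid-sentence
    "inflection_hints": [
        # past tense is marked word-initially by lenition: seas » sheas
        {"pattern": r"^([bcdfgmpst])", "replacement": r"\1h"},
    ],
}

# A matcher that knew about the hint could relate seas/sheas:
rule = GAELIC_CRIBSHEET["inflection_hints"][0]
print(re.sub(rule["pattern"], rule["replacement"], "seas"))  # sheas
```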
And something like a gross error check that prevents .. in the target unless the source has it, or that prevents ripping apart words like a-rithist, certainly if they're in the termbase as whole units.
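Such a gross error check could be a few lines of QA script. A rough sketch, assuming plain source/target strings and a flat list of target-language termbase entries (the function name and shape are invented for illustration):

```python
import re

def gross_errors(source, target, termbase):
    """Flag a few of the failure modes described above (illustrative only)."""
    problems = []
    # 1. Doubled full stop introduced in the target
    if ".." in target and ".." not in source:
        problems.append("doubled full stop introduced")
    # 2. Space wedged into a tag, e.g. <b> » < b>
    if re.search(r"<\s+\w", target):
        problems.append("space inserted inside a tag")
    # 3. Hyphenated termbase unit torn apart, e.g. a-rithist » rithist
    #    (naive: short parts like "a" need the length threshold below)
    for term in termbase:
        if "-" in term and term not in target:
            for part in term.split("-"):
                if len(part) > 2 and re.search(rf"\b{re.escape(part)}\b", target):
                    problems.append(f"termbase term {term!r} torn apart")
                    break
    return problems

print(gross_errors("blabla.", "blabla..", []))
# ['doubled full stop introduced']
print(gross_errors("again", "rithist", ["a-rithist"]))
# ["termbase term 'a-rithist' torn apart"]
```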
How is it working out for other inflecting languages?