Some musings on fragment matching

First off, I admit to not having a clue how the new fragment matching feature works, but I suspect it's a cross between a TM, machine translation and segment comparison. Given the size of my TM and termbase, though, it sometimes behaves in puzzling ways, which makes me wonder whether an element of manual tweaking would greatly improve things.

1. An easy starter: it seems to insert spaces in really weird places. I do a lot of software localization and it certainly does not like placeholders and punctuation. Some common head-scratchers are:

blabla. » blabla.. (multiple ending fullstops)

<b> » < b> (spaces being added)

items '{0}' »  nithean'{0}' (spaces being deleted)

&amp; » something&;

 

2. It also does not like hyphens (at least not when they are word characters). For instance, the word a-rithist (again) often gets reduced to rithist when it has been matching fragments to the left or right.

 

3. CrazyCaps, as I think of them: when it lifts a fragment from the start of a sentence somewhere and puts it in the middle of a sentence, still with an uppercase initial.

 

4. Language stuff

- it does not seem capable of learning that there is word order beyond Subject-Verb-Object

- it does not cope well with inflection across word boundaries, like the patterns between a definite article and a noun

- it does not cope well with other inflections such as past tense marking (which in my case is marked word-initially, e.g. seas (stand) » sheas (stood))

 

So I'm wondering whether there might be a benefit in adding a way of manually specifying certain language-specific things, such as the predominant word order of a language, whether hyphens and apostrophes are usually word characters or punctuation, whether a language tends towards lowercasing or not, and perhaps even some inflectional patterns. Something like a locale grammar crib sheet that would nudge whatever black magic operates fragment matching in the right direction?
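To make the crib sheet idea a bit more concrete, here is a rough Python sketch of what such per-language hints might look like. Everything in it (LocaleHints, its fields, tokenize, place_fragment) is made up for illustration and says nothing about how fragment matching or MatchRepair is actually implemented; it only shows two of the hints being applied, word characters and mid-sentence casing.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LocaleHints:
    """Hypothetical per-language crib sheet; not a real product setting."""
    word_order: str = "SVO"               # predominant word order
    word_chars: str = ""                  # extra characters treated as part of a word
    lowercase_mid_sentence: bool = True   # drop the capital when a fragment moves mid-sentence

    def tokenize(self, text):
        # Words are runs of letters/digits plus any extra word characters.
        pattern = r"[\w" + re.escape(self.word_chars) + r"]+"
        return re.findall(pattern, text)

    def place_fragment(self, fragment, at_sentence_start):
        # Keep the fragment as-is at the start of a sentence,
        # otherwise lowercase its first letter if the locale asks for it.
        if at_sentence_start or not self.lowercase_mid_sentence:
            return fragment
        return fragment[:1].lower() + fragment[1:]


# Assumed hints for Scottish Gaelic: verb-first, hyphen and apostrophe inside words.
GAELIC = LocaleHints(word_order="VSO", word_chars="-'")

print(GAELIC.tokenize("blabla a-rithist"))         # a-rithist stays one token
print(LocaleHints().tokenize("blabla a-rithist"))  # default hints split it at the hyphen
print(GAELIC.place_fragment("Blabla", at_sentence_start=False))
```

The point is only that a handful of declarative settings per language could be enough to stop a-rithist being treated as two words, without anyone having to touch the matching logic itself.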

And something like a gross error check that prevents .. unless the source has it, or stops it ripping apart words like a-rithist, certainly if they're in the termbase as whole units.
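The kind of check I have in mind could be as simple as the following Python sketch. The function name gross_errors and its arguments are hypothetical, and this is just a guess at a post-repair sanity pass, not anything the product does: it flags a doubled full stop that the source does not contain, and flags hyphenated termbase entries whose pieces turn up in the repaired target without the whole term.

```python
import re


def gross_errors(source, repaired, termbase_terms):
    """Return a list of warnings for a repaired target segment (illustrative only)."""
    warnings = []

    # A doubled full stop is only acceptable if the source has one too.
    if ".." in repaired and ".." not in source:
        warnings.append("doubled full stop not present in source")

    # Tokenize with hyphens and apostrophes counted as word characters.
    tokens = set(re.findall(r"[\w'-]+", repaired.lower()))

    for term in termbase_terms:
        term_l = term.lower()
        if "-" not in term_l or term_l in tokens:
            continue  # not hyphenated, or present as a whole unit
        # Ignore very short pieces (like "a") to avoid false alarms.
        parts = [p for p in term_l.split("-") if len(p) > 2]
        if any(p in tokens for p in parts):
            warnings.append(f"termbase unit '{term}' appears to have been split")

    return warnings


for warning in gross_errors("blabla again.", "blabla rithist..", ["a-rithist"]):
    print(warning)
```

A segment that trips either warning could simply fall back to the unrepaired fuzzy match, which would at least be no worse than before.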

How is it working out for other inflecting languages?

  • Here's another weird one. Close match: the original source has blabla blabla blabla. with the translation blabla blabla bla-bla. The new source is blabla blabla blabla (i.e. only the full stop at the end has gone), but instead of removing the . in the target, it kills the hyphen??

  • Hi Michael

    Thanks for describing some of the issues you've been seeing, which I'd say primarily concern MatchRepair, which strictly speaking is something separate from fragment recall (you can turn them on and off separately) though of course they do work together when both are enabled. MatchRepair doesn't actually use MT unless you want it to (you can switch that on in Project Settings, under MatchRepair sources).

    The "multiple ending full stops" issue is one that's already on our list, and I'd hope it'll be addressed at the same time as some other MatchRepair enhancements. What you've not unreasonably called "CrazyCaps" is tricky to deal with (it can be hard to know whether a word should be capitalised when it's in the middle of a sentence, which depends on the language, the part of speech, etc.) but there are some available strategies. (Spacing is also difficult - change the end of a sentence from a full stop to an exclamation mark in English, and spacing doesn't change; do it in French, and you have to add a space before the exclamation mark - but seeing cases where this should be improved can only help refine it.)

    With some of the other issues you've mentioned, though, it's less clear to me exactly what the behaviour is. For instance, when I use a new, empty test TM for English->Scots Gaelic, and translate a document that has only two "sentences", the "blabla blabla blabla" with and without full stop, I don't get the behaviour you've described. (I get the expected result of the full stop being removed.)

    If you have time to do so, it would be really helpful if you can provide (say) TMs and source texts to reproduce that and other examples you'd like to see improved. Any data provided is of course kept confidential and deleted after use. Do feel free to send anything you might be able to in that regard to me on kflanagan@sdl.com. This might help determine if there's more language metadata (e.g. SVO or otherwise) that could usefully be applied.

    Kevin