In Progress

Extend reporting datasets for segment origin insights and edit distance metrics

We are interested in segment-level reporting on edit distance that differentiates between machine translation (MT) matches and translation memory (TM) matches. This would let us evaluate how much translators have to modify both fuzzy matches and MT matches, and draw conclusions about the quality of MT per language.

Currently, it seems impossible to get this data without exporting packages, parsing them offline, and building an external data processing pipeline.

Please make these insights available via QuickSight and exportable via the DataBridge.

  • Hi Bernhard,

    I believe some of the insights you summarise here can be extracted from the Translation Productivity dataset. From this dataset, we can report on the edit distance based on the initial translation origin (MT or TM) along with the origin system (the name of the corresponding TM or MT engine). Based on the initial/final origin pair, we provide the number of segments alongside the average edit distance for those segments.

    I have included the table below for reference. It is filtered to a specific project and to segments whose final origin is "Interactive" (meaning the segments were edited by a human). For the first French file listed in the table, we can see that 3 segments originating from MT were edited by a linguist, with an average edit distance of 2.67. The detailed edit distance breakdown can also be extracted from this dataset, if required.

    [Image: a table showing translation data with columns: Short ID, Project Name, Source File Name, Tgt Lang, Task Name, Initial Translation Origin, Initial Translation Origin System, Final Translation Origin, # Segments, Average Edit Distance, and ED0. Example rows include data for the 'ionela default test' project with source files '2SamplePhotoPrinter.docx', 'SamplePhotoPrinter.docx', and '3SamplePhotoPrinter.docx'.]

    I believe this data would help compare the quality of multiple MT providers, or the quality of matches originating from a TM.

    One point to note is that we aren't able to differentiate TM matches by match percentage. For example, we can't compare the edit distance for fuzzy matches versus 100% matches.

    Cheers,

    Ian
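
For anyone who wants to reproduce the kind of summary described in the reply from exported segment-level rows, here is a minimal sketch in Python. The field names (`initial_origin`, `final_origin`, `edit_distance`) and the sample values are invented for illustration; they are not the actual column names or data of the Translation Productivity dataset, and the function `summarise_by_origin` is hypothetical.

```python
from collections import defaultdict

# Hypothetical segment-level rows. Field names and values are assumptions
# made for this sketch, not taken from the real dataset.
segments = [
    {"initial_origin": "MT", "final_origin": "Interactive", "edit_distance": 2},
    {"initial_origin": "MT", "final_origin": "Interactive", "edit_distance": 3},
    {"initial_origin": "MT", "final_origin": "Interactive", "edit_distance": 3},
    {"initial_origin": "TM", "final_origin": "Interactive", "edit_distance": 1},
    {"initial_origin": "TM", "final_origin": "TM", "edit_distance": 0},
]


def summarise_by_origin(rows, final_origin=None):
    """Return {(initial, final): (segment_count, average_edit_distance)}.

    If final_origin is given, only rows with that final origin are counted.
    """
    totals = defaultdict(lambda: [0, 0])  # pair -> [count, sum of distances]
    for row in rows:
        if final_origin is not None and row["final_origin"] != final_origin:
            continue
        key = (row["initial_origin"], row["final_origin"])
        totals[key][0] += 1
        totals[key][1] += row["edit_distance"]
    return {
        key: (count, round(total / count, 2))
        for key, (count, total) in totals.items()
    }


# Keep only segments that a human edited, as in the filtered table.
summary = summarise_by_origin(segments, final_origin="Interactive")
for (initial, final), (count, average) in sorted(summary.items()):
    print(f"{initial} -> {final}: {count} segments, average edit distance {average}")
```

Grouping on the origin pair rather than on the initial origin alone keeps human-edited segments separate from segments that were accepted unchanged, which is the distinction the "Interactive" filter relies on.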