Dedicated term extract workflow

Good afternoon:

I would like to know if there is a process or workflow to extract terms from a document using an existing MultiTerm termbase, i.e. the workflow would detect in the original document only the terms already registered in the termbase and extract them into a new termbase, for example to send to a collaborator.

I have done my own research on this several times to no avail.

Thank you very much.

Pablo Dittrich

  • What about doing a “normal” term extraction on your document, then exporting all the terms in your termbase, then deleting all extracted terms that don't match a term in your termbase? Any standard text editor like EditPad Pro should allow you to do that.

    The problem I see here is fuzzy term recognition, or rather the lack thereof. (If your TB contains “shelf” and your document contains “shelves”, it would take a bit to get that to match.) You could overcome that if you convert your TB into a TM with identical source term and target term, then “pre-translate” your list of extracted terms using that TM. You determine which level of match you will accept and end up with a certain number of extracted terms that have a translation (your TB entry) and a certain number without. Using Export to Excel or something similar, you export the whole list and delete all rows that have no translation (i.e. no match in your TB).

    Now you have a spreadsheet of terms extracted from your document and the matching TB entries.

    What is your hoped-for result of this operation? A list of terms? A MultiTerm termbase that contains all the additional information that your current termbase contains? If you want to arrive at a slimmed-down version of your TB, you'd have to export it with Glossary Converter and use the aforementioned spreadsheet to filter out entries that don't match a TB entry in your new spreadsheet.

    Then you use Glossary Converter to create a new TB from that. I know it's a few steps, but it's scalable. The entire process should take less than 30 minutes.

    Daniel
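
For anyone who prefers a script to a text editor for the filtering step, the matching described above can be sketched in Python. This is only a minimal sketch under stated assumptions: both term lists have already been exported to plain text or spreadsheet columns, the function names (`normalize`, `best_match`, `filter_extracted`) are made up for illustration, and the similarity score comes from Python's standard `difflib`, which is a character-based ratio and not the fuzzy-match algorithm a TM uses. The 0.6 threshold is an arbitrary starting point that would need tuning on real data.

```python
from difflib import SequenceMatcher


def normalize(term: str) -> str:
    """Lower-case the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def best_match(term, termbase_terms):
    """Return (termbase_term, score) for the closest termbase entry.

    The score is difflib's similarity ratio between 0.0 and 1.0.
    """
    best, best_score = None, 0.0
    for tb_term in termbase_terms:
        score = SequenceMatcher(None, normalize(term), normalize(tb_term)).ratio()
        if score > best_score:
            best, best_score = tb_term, score
    return best, best_score


def filter_extracted(extracted, termbase_terms, threshold=0.6):
    """Keep only extracted terms that resemble a termbase entry.

    Returns a dict mapping each kept extracted term to its termbase entry.
    The threshold is a hypothetical default, not a recommended value.
    """
    matches = {}
    for term in extracted:
        tb_term, score = best_match(term, termbase_terms)
        if tb_term is not None and score >= threshold:
            matches[term] = tb_term
    return matches


if __name__ == "__main__":
    extracted = ["shelves", "Bracket", "warehouse"]
    termbase = ["shelf", "bracket", "pallet"]
    for found, entry in filter_extracted(extracted, termbase).items():
        print(f"{found}\t{entry}")
```

Setting `threshold=1.0` gives the exact-match filtering from the first paragraph (ignoring case and extra spaces); lowering it lets inflected forms such as “shelves” pair with “shelf”, at the cost of more false positives that need a manual check. The resulting term pairs can then be used to filter the termbase export before rebuilding it with Glossary Converter.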
