RegEx for Language Recognition

Question

Hello guys ... I am just starting to use a bit of regex to find strings with digits (^\d), URLs (^\www), telephone numbers, etc., BUT ... 
 I mostly translate from Russian into Italian, and quite often I find segments in the Russian source files that are NOT in Russian. It happens, for instance, with brands, company names, etc. Let's say I have a long list of brands in the Russian files, like "Ferrari", "Lamborghini", "Alfa Romeo", etc. 
 My question is: is there a way to build a regex that lets me find segments NOT written in Russian? If I have, for example, this list in the Cyrillic and Latin alphabets: 
 КОМПАНИЯ 1 
 FERRARI 
 КОМПАНИЯ 2 
 LAMBORGHINI 
 ALENIA SPAZIO 
 КОМПАНИЯ 3 
 Is there any way to EXTRACT from that source document ALL the segments like "Ferrari", "Lamborghini" and "Alenia Spazio", so that I can simply COPY them from source to target and only AFTER THAT start translating the REAL Russian source? Sorry if it sounds strange! MANY THANKS! Pietro

Paul · Accepted Answer

And just in case... for the Russian: [А-Яа-я]
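
One caveat worth checking with this class: the range follows Unicode code-point order, and the letters Ё/ё (U+0401 and U+0451) sit outside А-Я and а-я. The sketch below is only an illustration in Python, used here as a stand-in for the .NET engine in Studio; the sample words and the `all_cyrillic` helper are made up for the example.

```python
import re

# The class as posted, plus a variant that also lists Ё/ё explicitly.
BASIC = re.compile(r"[А-Яа-я]")
WITH_YO = re.compile(r"[А-Яа-яЁё]")


def all_cyrillic(word: str, pattern: "re.Pattern[str]") -> bool:
    """Hypothetical helper: True if every character of `word` matches the class."""
    return all(pattern.fullmatch(ch) for ch in word)


print(all_cyrillic("автомобиль", BASIC))  # every letter is inside а-я
print(all_cyrillic("ёлка", BASIC))        # ё falls outside the posted range
print(all_cyrillic("ёлка", WITH_YO))      # covered once Ёё is added
```

If the source files contain words with ё, [А-Яа-яЁё] may be the safer class to use.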

Paul · Answer

PIETRO SFERRINO 
 Time to get the textbooks out! 
 Try this and see if it works better: 
 ^(?=[A-Za-z]+) 
 Would have been much better if you provided some sample text in a file and not just a screenshot.
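
To see what the pattern above selects, here is a rough sketch that runs it over Pietro's list. It uses Python's re module rather than Studio's .NET engine, and it assumes each item of the list is a separate segment; for this simple pattern the two engines should behave the same.

```python
import re

# The pattern from the answer: a zero-width check that the segment
# starts with a Latin letter (the "+" adds nothing inside the lookahead).
STARTS_LATIN = re.compile(r"^(?=[A-Za-z]+)")

# Pietro's sample list, one segment per entry (segmentation assumed).
segments = [
    "КОМПАНИЯ 1",
    "FERRARI",
    "КОМПАНИЯ 2",
    "LAMBORGHINI",
    "ALENIA SPAZIO",
    "КОМПАНИЯ 3",
]

latin_segments = [s for s in segments if STARTS_LATIN.search(s)]
print(latin_segments)  # ['FERRARI', 'LAMBORGHINI', 'ALENIA SPAZIO']
```

Note that only the first character is tested, so a segment that starts with a Latin letter and continues in Russian would be selected as well.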

Anthony Rudd · Answer

\p{IsCyrillic} can also be used as a range.

Anthony Rudd · Answer

I am returning to this question, because it was not fully answered and the solution includes some important considerations. 
 Because the \P{IsCyrillic} block is too inclusive (it contains more than just alphabetic characters), the list of excluded non-Cyrillic punctuation characters would need to be adapted. A better approach is to define the set of word characters (all alphabetic characters + numbers + underscore) that excludes Cyrillic characters. 
 The following regex matches a segment that contains Cyrillic and non-Cyrillic characters (while excluding digits and various miscellaneous characters): 
 (?=.*[\w-[\p{IsCyrillic}\d]]+)(?=.*[\p{IsCyrillic}]+) 
 The regex matches the first line, but not the last two lines: 
 White automobile "белый автомобиль" in Russian 
 White automobile in Russian 
 "белый автомобиль"
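
The behaviour on those three lines can be checked outside Studio with a rough Python approximation. Python's re module has neither \p{IsCyrillic} nor character class subtraction, so this sketch assumes the Cyrillic block is U+0400 to U+04FF and rewrites the subtraction as a negated class; it is an emulation, not the .NET regex itself.

```python
import re

# Assumed stand-in for the .NET named block \p{IsCyrillic}.
CYR = r"\u0400-\u04FF"

# [^\W\d<CYR>] = a word character that is neither a digit nor Cyrillic,
# emulating the .NET subtraction [\w-[\p{IsCyrillic}\d]].
MIXED = re.compile(rf"(?=.*[^\W\d{CYR}])(?=.*[{CYR}])")

lines = [
    'White automobile "белый автомобиль" in Russian',
    "White automobile in Russian",
    '"белый автомобиль"',
]

for line in lines:
    # Only the first line contains both scripts.
    print(bool(MIXED.search(line)), line)
```

Both lookaheads must succeed from the same position, which is why a line written in a single script never matches.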

Nora Díaz · Answer

Hi Pietro, 
 Not strange at all! Try using this regex in the filter: [A-Za-z]
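
A filter on [A-Za-z] finds every segment that contains at least one Latin letter, including segments that mix both scripts. If only the segments with no Cyrillic at all are wanted, a negative lookahead can be put in front. The combined pattern below is a suggestion for illustration, not something posted in this thread, and it is sketched in Python rather than tested in Studio.

```python
import re

# Nora's filter: any Latin letter anywhere in the segment.
HAS_LATIN = re.compile(r"[A-Za-z]")

# Illustrative variant: at least one Latin letter and no Cyrillic letter.
LATIN_ONLY = re.compile(r"^(?!.*[А-Яа-яЁё]).*[A-Za-z]")

samples = [
    "FERRARI",
    "КОМПАНИЯ 1",
    'White automobile "белый автомобиль" in Russian',
]

for s in samples:
    print(bool(HAS_LATIN.search(s)), bool(LATIN_ONLY.search(s)), s)
```

The mixed segment passes the plain [A-Za-z] filter but is rejected by the variant, so only segments that can be copied to the target unchanged remain.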
