RegEx for Language Recognition

Hello guys,
I am just starting to use a bit of regex to find strings with digits (^\d), URLs (^www), telephone numbers, etc., BUT ...

I translate mainly from Russian into Italian, and quite often I find segments in Russian source files that are NOT in Russian. This happens, for instance, with brands, company names, etc. Let's say I have a long list of brands in a Russian file, such as "Ferrari", "Lamborghini", "Alfa Romeo", etc.

My question is: is there a way to build a regex that lets me find segments that are NOT written in Russian?

Suppose, for example, I have this list in the Cyrillic and Latin alphabets:

КОМПАНИЯ 1
FERRARI
КОМПАНИЯ 2
LAMBORGHINI
ALENIA SPAZIO
КОМПАНИЯ 3

Is there any way to EXTRACT from that source document ALL the segments like "Ferrari", "Lamborghini" and "Alenia Spazio", so that I can simply COPY them from source to target and only AFTER THAT start translating the REAL Russian source?

Sorry if it sounds strange!

MANY THANKS!

Pietro

  • I am returning to this question because it was not fully answered and the solution involves some important considerations.

    Because \P{IsCyrillic} is too inclusive (it matches everything outside the Cyrillic block, so punctuation, whitespace, and digits as well as letters), any list of non-Cyrillic punctuation characters to exclude would need to be adapted again and again. A better approach is to start from the set of word characters (\w: all alphabetic characters, digits, and the underscore) and subtract the Cyrillic characters from it.

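    The following regex matches a segment that contains both Cyrillic and non-Cyrillic word characters (digits, punctuation, and other non-word characters do not count as non-Cyrillic). Note that the class subtraction syntax [\w-[...]] is supported by the .NET regex flavor: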

    (?=.*[\w-[\p{IsCyrillic}\d]]+)(?=.*[\p{IsCyrillic}]+)

    Of the following three lines, the regex matches the first but not the last two:

    White automobile "белый автомобиль" in Russian

    White automobile in Russian

    "белый автомобиль"
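
For anyone who wants to try the same idea outside a .NET-based tool, here is a rough Python sketch, not a definitive implementation. Python's built-in re module supports neither \p{IsCyrillic} nor class subtraction, so the sketch assumes the range U+0400 to U+04FF (the basic Cyrillic block) as a stand-in and emulates the subtraction with a negated class. The names is_mixed and is_non_cyrillic are hypothetical helpers invented for this illustration. The second helper addresses the original question: picking out segments such as "FERRARI" that contain letters but no Cyrillic at all.

```python
import re

# Assumption: the basic Cyrillic block U+0400-U+04FF stands in for
# .NET's \p{IsCyrillic}; extended Cyrillic blocks are not covered.
CYRILLIC = r"\u0400-\u04FF"

# A word character that is neither a digit nor Cyrillic. This emulates
# [\w-[\p{IsCyrillic}\d]], since Python's re has no class subtraction.
NON_CYRILLIC_WORD = rf"[^\W\d{CYRILLIC}]"

# Both lookaheads must succeed: one non-Cyrillic word character and
# one Cyrillic character somewhere in the segment.
MIXED = re.compile(rf"(?=.*{NON_CYRILLIC_WORD})(?=.*[{CYRILLIC}])")

HAS_CYRILLIC = re.compile(rf"[{CYRILLIC}]")
HAS_OTHER_LETTERS = re.compile(NON_CYRILLIC_WORD)


def is_mixed(segment: str) -> bool:
    """True if the segment contains Cyrillic and non-Cyrillic word characters."""
    return MIXED.match(segment) is not None


def is_non_cyrillic(segment: str) -> bool:
    """True if the segment has letters (or underscore) but no Cyrillic at all."""
    return (HAS_OTHER_LETTERS.search(segment) is not None
            and HAS_CYRILLIC.search(segment) is None)


if __name__ == "__main__":
    source = [
        "КОМПАНИЯ 1",
        "FERRARI",
        "КОМПАНИЯ 2",
        "LAMBORGHINI",
        "ALENIA SPAZIO",
        "КОМПАНИЯ 3",
    ]
    # Segments that could be copied from source to target unchanged.
    for segment in source:
        if is_non_cyrillic(segment):
            print(segment)
```

Run on the list from the question, the sketch prints FERRARI, LAMBORGHINI and ALENIA SPAZIO. A segment made up only of digits or punctuation is reported by neither helper, which mirrors the exclusion of digits in the regex above.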
