bilingual data extraction

Dear all,

I recently found a free, publicly available language resource that looks very interesting:

http://optima.jrc.it/Resources/DCEP-2013/DCEP-extract-README.html2

The readme page says the sentence-pair extraction has to be performed with Python, so I tried it (with no results; I'm a newbie with Python).

Is there anybody among you who is familiar with that programming language?

This is what the readme page says:

"

Download and extract the alignment information:

wget optima.jrc.it/.../DCEP-DA-LV.tar.bz2
tar jxf DCEP-DA-LV.tar.bz2

Now we download, extract, and run the tool that generates the bicorpus from the above data:

wget optima.jrc.it/.../DCEP-extract-scripts.tar.bz2
tar jxvf DCEP-extract-scripts.tar.bz2
./src/languagepair.py DA-LV > DA-LV-bisentences.txt
"

With Python I've just managed to download the files.
The "tar" command didn't work so I extracted data simply with 7zip.
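
For reference, the "tar jxf" step from the readme can also be reproduced on Windows with Python's standard tarfile module, without 7zip. This is only a minimal sketch: the function name extract_tar_bz2 is made up for illustration, and it assumes the archive has already been downloaded to a local path.

```python
import os
import tarfile


def extract_tar_bz2(archive_path, dest_dir="."):
    """Extract a .tar.bz2 archive into dest_dir and return the member names.

    Hypothetical helper, not part of the DCEP scripts.
    """
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    # "r:bz2" opens a bzip2-compressed tar archive, like the "j" flag of tar.
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(path=dest_dir)
    return names


# Example call, assuming DCEP-DA-LV.tar.bz2 sits in the current folder:
#     extract_tar_bz2("DCEP-DA-LV.tar.bz2")
```

The call is left as a comment because it only works once the archive is actually present in the current folder.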

Basically, I managed to work around steps 1, 2, 3 and 4.
The problem is the final sentence-pair extraction in the last step.

In the command prompt, I've tried to run python first, then I typed the command (following the instructions of the readme page):
./src/languagepair.py DA-LV > DA-LV-bisentences.txt

but I receive the following error message:
"
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
./src/languagepair.py EN-IT > EN-IT-bisentences.txt
File "", line 1
./src/languagepair.py EN-IT > EN-IT-bisentences.txt
^
SyntaxError: invalid syntax
"

I tried that command outside python, in the command prompt with a similar result:

"
'.' is not recognized as an internal or external command,
operable program or batch file.
"

I don't know how to get those pairs of sentences extracted...
I wrote to the file owners (the EU) but they couldn't help me with that issue.
I also posted the issue on Codecademy here (in case you would like more information on it):
https://discuss.codecademy.com/t/tar-file-extraction/74560/5

Is there anybody among you who can help me with the sentence extraction?
Thanks

Davide
  • Hi Davide,

    The readme file assumes you are using a Unix-based OS like Mac OS X or Ubuntu.
    "wget" and "tar" are available out of the box on those systems, but not on Windows.

    Now, there are two separate problems. First, you have the wrong version of Python installed: Python 3 and Python 2 are not 100% compatible.
    As the readme file says, you need Python 2 (e.g. download Python 2.7.13 from the following link):

    https://www.python.org/downloads/

    Once you have it installed, if Python 2 is not on your command line path, use the following command in the command prompt:

    C:\Python27\python ./src/languagepair.py EN-IT > EN-IT-bisentences.txt

    * The above assumes "C:\Python27\" is where python.exe is located.

    Second, the "SyntaxError: invalid syntax" you got does not come from the version: the line "./src/languagepair.py EN-IT > EN-IT-bisentences.txt" is a shell command, not Python code, so it has to be typed in the command prompt itself and not after starting the interpreter.
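
To make the difference between a shell command and Python code concrete: the line from the readme asks the shell to start a Python interpreter on a script and to redirect what the script prints (the ">" part) into a file. A rough Python equivalent, as a sketch only, could use the standard subprocess module. The function name run_script_to_file is made up for illustration, and the interpreter path in the usage comment is the same assumption as above.

```python
import subprocess
import sys


def run_script_to_file(script_path, args, output_path, interpreter=sys.executable):
    """Run `interpreter script_path args...` and write its stdout to output_path.

    Hypothetical helper; returns the exit code of the script.
    """
    command = [interpreter, script_path] + list(args)
    # Opening the file ourselves plays the role of "> output_path" in the shell.
    with open(output_path, "wb") as output_file:
        return subprocess.call(command, stdout=output_file)


# Example call, assuming Python 2 lives in C:\Python27 and the current
# folder contains the extracted "src" directory:
#     run_script_to_file("./src/languagepair.py", ["EN-IT"],
#                        "EN-IT-bisentences.txt",
#                        interpreter=r"C:\Python27\python.exe")
```

By default the helper reuses whichever interpreter runs it, so the interpreter argument only matters when the script needs a different Python version.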

  • Hi again Jesse,
    I have not had any luck with this issue so far.
    I uninstalled Python 3 and installed Python 2, then tried two ways:
    - In the command prompt, from the folder where I saved the files (the one containing the bilingual folder), I copied and pasted your suggested command with no luck. I receive this message:
    " can't open file './src/languagepair.py': [Errno 2] No such file or directory"

    - I tried installing Cygwin. I ran the "wget" and "tar" commands (without any error message), but when I ran the command "./src/languagepair.py EN-IT > EN-IT-bisentences.txt" I received the following message:
    "
    $ ./src/languagepair.py EN-IT > EN-IT-bisentences.txt
    -bash: ./src/languagepair.py: /usr/bin/python: bad interpreter: No such file or directory"

    I don't know what else I could try.

    Davide
  • Davide, do you actually UNDERSTAND what the scripts do, what the difference is between a Linux environment (where they are apparently expected to be used) and a Windows environment, what "wget" and "tar" are, what the "No such file or directory" messages actually mean, etc.?

    If you really UNDERSTOOD what the "./src/languagepair.py" path means (i.e. that it looks for the "languagepair.py" script in a "src" subdirectory of the CURRENT directory), you would adapt either your paths or the command accordingly.

    You really need to understand what the whole thing is about, what the GOAL of the entire procedure is... and then achieve the same goal in your own way, with the tools you are familiar with.
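
The point about relative paths can be checked directly: "./src/languagepair.py" is resolved against the current directory, so a "No such file or directory" message usually means the command was run from the wrong folder. The sketch below is only an illustration (the helper name locate_script is made up). It shows which absolute path a relative one actually refers to, and fails with a readable message when nothing is there.

```python
import os


def locate_script(relative_path, start_dir=None):
    """Return the absolute path that relative_path refers to from start_dir.

    Hypothetical helper; start_dir defaults to the current directory.
    Raises IOError with a hint when no file exists at that location.
    """
    base = start_dir if start_dir is not None else os.getcwd()
    candidate = os.path.abspath(os.path.join(base, relative_path))
    if not os.path.isfile(candidate):
        raise IOError("No file at %s (looked relative to %s)" % (candidate, base))
    return candidate


# Example call from the folder where the scripts archive was extracted:
#     locate_script("./src/languagepair.py")
```

If the call raises, either change to the folder that contains "src" before running the command, or pass the full path of the script instead.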