Input And Output Data

Ganga tries to make sending input files and getting output files back as simple as possible. You can specify not only which files you want but also where they should be retrieved from or written to. Three fields of a job are relevant:

  1. Input Files (j.inputfiles)
    Files that are sent with the job and are available in the same directory on the worker node that runs it.
  2. Input Data (j.inputdata)
    A dataset or list of files that the job will run over but which are NOT transferred to the worker. Typically the running job will stream this data.
  3. Output Files (j.outputfiles)
    The name, type and location of the job output.

Basic Input/Output File Usage

To start with, here is a job that sends an input text file to the worker and returns an output text file:

# create a script to send
open('my_script2.sh', 'w').write("""#!/bin/bash
ls -ltr
more "my_input.txt"
echo "TESTING" > my_output.txt
""")
import os
os.system('chmod +x my_script2.sh')

# create an input file to send
open('my_input.txt', 'w').write('Input Testing works!')

j = Job()
j.application.exe = File('my_script2.sh')
j.inputfiles = [ LocalFile('my_input.txt') ]
j.outputfiles = [ LocalFile('my_output.txt') ]
j.submit()
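
The preparation steps above write the script and then shell out to chmod through os.system. As a minimal sketch, assuming a POSIX filesystem, the same preparation can be done with the standard library alone. The helper write_executable is hypothetical and not part of Ganga, and the temporary directory is there only so the sketch does not touch real files.

```python
import stat
import tempfile
from pathlib import Path


def write_executable(path: Path, text: str) -> Path:
    """Write text to path and add execute permission for user, group and others.

    Hypothetical helper for illustration; it is not part of Ganga.
    """
    path.write_text(text)  # opens, writes and closes the file in one call
    mode = path.stat().st_mode
    path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path


SCRIPT = """#!/bin/bash
ls -ltr
more "my_input.txt"
echo "TESTING" > my_output.txt
"""

# Work in a throwaway directory so nothing in the current directory changes.
workdir = Path(tempfile.mkdtemp())
script = write_executable(workdir / "my_script2.sh", SCRIPT)

# True when the owner execute bit is set on the new file.
is_executable = bool(script.stat().st_mode & stat.S_IXUSR)
```

Using write_text also guarantees the file is closed before the job is submitted, which the bare open(...).write(...) form leaves to the interpreter.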

After the job completes, you can then view the output directory and see the output file:

j.peek()   # list output dir contents
j.peek('my_output.txt')

If the job doesn’t produce the output Ganga was expecting, it will mark the job as failed:

# This job will fail
j = Job()
j.application.exe = File('my_script2.sh')
j.inputfiles = [ LocalFile('my_input.txt') ]
j.outputfiles = [ LocalFile('my_output_FAIL.txt') ]
j.submit()
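
To picture why this job fails, the following is a rough sketch of an "expected output" check written in plain Python. It is not Ganga's implementation: missing_outputs is a hypothetical helper, and the use of fnmatch for matching is an assumption made for illustration.

```python
import fnmatch
import tempfile
from pathlib import Path


def missing_outputs(directory, patterns):
    """Return the patterns that match no regular file in directory.

    Hypothetical helper; Ganga's own check may work differently.
    """
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    return [pat for pat in patterns if not fnmatch.filter(names, pat)]


# Simulate a working directory after my_script2.sh has run.
workdir = Path(tempfile.mkdtemp())
(workdir / "my_output.txt").write_text("TESTING\n")

ok = missing_outputs(workdir, ["my_output.txt"])        # nothing missing
bad = missing_outputs(workdir, ["my_output_FAIL.txt"])  # never produced
```

A non-empty result, as in the second call, corresponds to the situation in which the job is marked as failed.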

You can also use wildcards in the file names:

# This job will pick up both 'my_input.txt' and 'my_output.txt'
j = Job()
j.application.exe = File('my_script2.sh')
j.inputfiles = [LocalFile('my_input.txt')]
j.outputfiles = [LocalFile('*.txt')]
j.submit()

After completion, the matched output files are copied back as above, and each one is also recorded in the job:

j.outputfiles
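
As an illustration of what a pattern such as '*.txt' expands to, here is a small plain-Python sketch using shell-style globbing. It assumes the matching behaves like standard glob patterns, which the source does not state explicitly, and the empty files are stand-ins created only for the demonstration.

```python
import tempfile
from pathlib import Path

# Stand-in for the job's working directory after the script has run.
workdir = Path(tempfile.mkdtemp())
for name in ("my_script2.sh", "my_input.txt", "my_output.txt"):
    (workdir / name).write_text("")

# '*.txt' matches both text files but not the script.
matched = sorted(p.name for p in workdir.glob("*.txt"))
print(matched)  # ['my_input.txt', 'my_output.txt']
```

Because the input file also ends in .txt, it is picked up as output too, which is why a narrower pattern is often preferable.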

This also works for all backends; Ganga handles the changes in protocol behind the scenes, e.g.:

j = Job()
j.application.exe = File('my_script2.sh')
j.inputfiles = [ LocalFile('my_input.txt') ]
j.outputfiles = [ LocalFile('my_output.txt') ]
j.backend = Dirac()
j.submit()

Input Data Usage

Generally, input data for a job is quite experiment specific. However, Ganga provides some basic input data functionality by default that can be used to process a set of remotely stored files without copying them to the worker. This is done with the GangaDataset object, which takes a list of GangaFiles (as you would supply to the inputfiles field). Instead of copying the files, Ganga creates a flat text file on the worker (__GangaInputData.txt__) that lists the paths of the given input data. This is useful for accessing files on mass or shared storage using the mechanisms within the running program, e.g. opening them directly with ROOT.

As an example:

# Create a test script
open('my_script3.sh', 'w').write("""#!/bin/bash
echo $PATH
ls -ltr
more __GangaInputData.txt__
echo "MY TEST FILE" > output_file.txt
""")
import os
os.system('chmod +x my_script3.sh')

# Submit a job
j = Job()
j.application.exe = File('my_script3.sh')
j.inputdata = GangaDataset(files=[LocalFile('*.sh')])
j.backend = Local()
j.submit()
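
A program running on the worker can read the listing file itself rather than printing it. The sketch below assumes the file holds one path per line, which the description above suggests but does not spell out. The helper read_input_paths and the example paths are hypothetical.

```python
import tempfile
from pathlib import Path


def read_input_paths(listing="__GangaInputData.txt__"):
    """Return the non-empty lines of the listing file as a list of paths.

    Assumes one path per line (an assumption, not a documented format).
    """
    with open(listing) as handle:
        return [line.strip() for line in handle if line.strip()]


# Stand-in listing with made-up paths, written only so the sketch
# can run outside a real job.
workdir = Path(tempfile.mkdtemp())
listing = workdir / "__GangaInputData.txt__"
listing.write_text("/data/run1.root\n/data/run2.root\n\n")

paths = read_input_paths(listing)
```

Inside a real job the default argument would be used, since the listing sits in the job's working directory.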

File Types Available

Ganga provides several File types for accessing data from various sources. To find out what’s available, do:

plugins('gangafiles')

LocalFile

This is a basic file type that refers to a file on the submission host that Ganga runs on. As an input file, it is picked up and sent with your job; as an output file, it is returned with your job and placed in the j.outputdir directory.

DiracFile

This stores files in, and retrieves files from, Dirac data storage. It requires a little configuration in ~/.gangarc to set the correct LFN paths and where you want the data to go:

config.DIRAC.DiracLFNBase
config.DIRAC.DiracOutputDataSE

To use a DiracFile, do something similar to:

j = Job()
j.application.exe = File('my_script2.sh')
j.inputfiles = [ LocalFile('my_input.txt') ]
j.outputfiles = [ DiracFile('my_output.txt') ]
j.backend = Dirac()
j.submit()

Ganga won’t retrieve the output to the submission node, so if you need it locally you will have to fetch it yourself:

j.outputfiles.get()

Often it is better to simply stream the data from its remote location. You can get the URL for this with:

j.outputfiles[0].accessURL()

GoogleFile

This will store files in the user’s Google Drive. The user must authenticate and grant Ganga restricted access to Google Drive. To use a GoogleFile, do something similar to:

# GoogleFile is the type name shown in the printed output
j = GoogleFile("mydata.txt")
j.localDir = "~/temp"
j.put()
print(j)

which prints something like:

GoogleFile (
    namePattern = mydata.txt,
    localDir = /home/dumbmachine/temp,
    failureReason = ,
    compressed = False,
    downloadURL = https://drive.google.com/file/d/1dS_XqANroclWAqgIvLU7q5rbzen17mSf
)

The download URL is generated from the ID of the file.

This will upload the local file “~/temp/mydata.txt” to the user’s Google Drive inside a folder named Ganga. The file object also supports glob patterns, which can be supplied as j.namePattern = '*.ROOT'.

Upon first use, the user will be asked to authenticate and to allow Ganga to create new files and to edit only those files. While the default client ID of Ganga can be used, it is recommended to create your own client ID. This will prevent you from being rate limited because of other users. See Setting up your own Google client ID for how to do this.

Only files created by Ganga can be deleted (or restored after deletion).

j = GoogleFile("mydata.txt")
j.localDir = "~/temp"
j.put()

# to delete the file: by default it is moved to the trash;
# pass permanent=True to delete it permanently
j.remove()
# to restore the file from the trash
j.restore()

To download files previously uploaded by Ganga, use the get method:

# assume the file "mydata.txt" was previously uploaded by Ganga
j = GoogleFile("mydata.txt")
j.localDir = "~/temp"  # folder the file should be downloaded to
j.get()