Input And Output Data¶
Ganga tries to make sending input files and retrieving output files as simple as possible. You can specify not only which files you want but also where they should be retrieved from or put. Three fields are relevant for your job:
- Input Files: files that are sent with the job and are available in the same directory on the worker node that runs it.
- Input Data: a dataset or list of files that the job will run over but which are NOT transferred to the worker. Typically the running job will stream this data.
- Output Files: the name, type and location of the job output.
Basic Input/Output File usage¶
To start with, we’ll show a job that sends an input text file with a job and then sends an output text file back:
# create a script to send
open('my_script2.sh', 'w').write("""#!/bin/bash
ls -ltr
more "my_input.txt"
echo "TESTING" > my_output.txt
""")
import os
os.system('chmod +x my_script2.sh')
# create an input file to send
open('my_input.txt', 'w').write('Input Testing works!')
j = Job()
j.application.exe = File('my_script2.sh')
j.inputfiles = [ LocalFile('my_input.txt') ]
j.outputfiles = [ LocalFile('my_output.txt') ]
j.submit()
After the job completes, you can then view the output directory and see the output file:
j.peek() # list output dir contents
j.peek('my_output.txt')
If the job doesn’t produce the output Ganga was expecting, it will mark the job as failed:
# This job will fail
j = Job()
j.application.exe = File('my_script2.sh')
j.inputfiles = [ LocalFile('my_input.txt') ]
j.outputfiles = [ LocalFile('my_output_FAIL.txt') ]
j.submit()
You can also use wildcards in the file names:
# This job will pick up both 'my_input.txt' and 'my_output.txt'
j = Job()
j.application.exe = File('my_script2.sh')
j.inputfiles = [LocalFile('my_input.txt')]
j.outputfiles = [LocalFile('*.txt')]
j.submit()
After completion, the matching output files are copied back as above and are also recorded in the job:
j.outputfiles
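To see which files a pattern such as '*.txt' would select, the matching can be tried out in plain Python. This is only a sketch and not Ganga's own implementation: it assumes shell-style matching as provided by the standard fnmatch module, and match_outputs and the list of produced files are hypothetical names made up for the illustration.

```python
from fnmatch import fnmatchcase

def match_outputs(patterns, produced):
    """Map each output pattern to the sorted file names it matches."""
    return {p: sorted(n for n in produced if fnmatchcase(n, p)) for p in patterns}

# Hypothetical working directory contents after my_script2.sh has run
produced = ['my_script2.sh', 'my_input.txt', 'my_output.txt']

matches = match_outputs(['*.txt', 'my_output_FAIL.txt'], produced)

# Patterns that matched nothing, i.e. output that was expected but not produced
missing = [p for p, names in matches.items() if not names]

print(matches)
print(missing)
```

Here '*.txt' selects both text files, while 'my_output_FAIL.txt' selects nothing, which corresponds to the failing job shown earlier.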
The same inputfiles and outputfiles settings work on all backends; Ganga handles the changes in protocol behind the scenes, e.g.:
j = Job()
j.application.exe = File('my_script2.sh')
j.inputfiles = [ LocalFile('my_input.txt') ]
j.outputfiles = [ LocalFile('my_output.txt') ]
j.backend = Dirac()
j.submit()
Input Data Usage¶
Generally, input data for a job is quite experiment-specific. However, Ganga provides some basic input data functionality by default that can be used to process a set of remotely stored files without copying them to the worker.
This is done with the GangaDataset object, which takes a list of GangaFiles (as you would supply to the inputfiles field). Instead of copying them, a flat text file (__GangaInputData.txt__) is created on the worker that lists the paths of the given input data. This is useful for accessing files from Mass or Shared Storage using the mechanisms within the running program, e.g. opening them directly with Root.
As an example:
# Create a test script
open('my_script3.sh', 'w').write("""#!/bin/bash
echo $PATH
ls -ltr
more __GangaInputData.txt__
echo "MY TEST FILE" > output_file.txt
""")
import os
os.system('chmod +x my_script3.sh')
# Submit a job
j = Job()
j.application.exe = File('my_script3.sh')
j.inputdata = GangaDataset(files=[LocalFile('*.sh')])
j.backend = Local()
j.submit()
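If the job payload is itself a Python program, it can read the listing file and loop over the entries. The following is a minimal sketch under the assumption that __GangaInputData.txt__ holds one path per line; the exact layout is not stated above, so check a real file first. read_input_data is a hypothetical helper, and the demo listing with made-up paths only exists so the sketch runs outside a Ganga job.

```python
import os
import tempfile

def read_input_data(list_path='__GangaInputData.txt__'):
    """Return the non-empty lines of the listing file as a list of paths.

    Assumes one path per line (an assumption, not confirmed by the text above).
    """
    with open(list_path) as listing:
        return [line.strip() for line in listing if line.strip()]

# Stand-in listing so the sketch can run outside a Ganga job (paths are made up)
demo_dir = tempfile.mkdtemp()
demo_list = os.path.join(demo_dir, '__GangaInputData.txt__')
with open(demo_list, 'w') as listing:
    listing.write('/hypothetical/store/a.sh\n/hypothetical/store/b.sh\n')

paths = read_input_data(demo_list)
for path in paths:
    # A real job would open or stream each file here
    print('would process', path)
```

On a worker node the default argument would be used, since the listing is created in the job's working directory.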
File Types Available¶
Ganga provides several File types for accessing data from various sources. To find out what’s available, do:
plugins('gangafiles')
LocalFile¶
This is a basic file type that refers to a file on the submission host that Ganga runs on. As an input file, it will be picked up and sent with your job; as an output file, it will be returned with your job and put in the j.outputdir directory.
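Because a LocalFile output ends up in the j.outputdir directory, it can be read back with ordinary Python file handling once the job has completed. A minimal sketch, assuming outputdir is the path string held in j.outputdir; read_output is a hypothetical helper, and the temporary directory merely stands in for a real job output directory.

```python
import os
import tempfile

def read_output(outputdir, name):
    """Return the text content of an output file stored in outputdir."""
    with open(os.path.join(outputdir, name)) as handle:
        return handle.read()

# Stand-in for j.outputdir so the sketch runs without a Ganga session
demo_outputdir = tempfile.mkdtemp()
with open(os.path.join(demo_outputdir, 'my_output.txt'), 'w') as handle:
    handle.write('TESTING\n')

content = read_output(demo_outputdir, 'my_output.txt')
print(content)
```

Inside a Ganga session the call would presumably look like read_output(j.outputdir, 'my_output.txt').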
DiracFile¶
This will store/retrieve files from Dirac data storage. It requires a bit of configuration in ~/.gangarc to set the correct LFN paths and where you want the data to go:
config.DIRAC.DiracLFNBase
config.DIRAC.DiracOutputDataSE
To use a DiracFile, do something similar to:
j = Job()
j.application.exe = File('my_script2.sh')
j.inputfiles = [ LocalFile('my_input.txt') ]
j.outputfiles = [ DiracFile('my_output.txt') ]
j.backend = Dirac()
j.submit()
Ganga won’t retrieve the output to the submission node, so if you need it locally, you will have to do:
j.outputfiles.get()
Often it might be better to simply stream the data from its remote destination. You can get the URL for this as
GoogleFile¶
This will store files to the user’s Google Drive. This requires the user to authenticate and give restricted access to Google Drive. To use a GoogleFile, do something similar to:
j = GoogleFile("mydata.txt")
j.localDir = "~/temp"
j.put()
print(j)
GoogleFile (
namePattern = mydata.txt,
localDir = /home/dumbmachine/temp,
failureReason = ,
compressed = False,
downloadURL = https://drive.google.com/file/d/1dS_XqANroclWAqgIvLU7q5rbzen17mSf
)
The code above uploads the local file “~/temp/mydata.txt” to the user’s Google Drive. The download URLs are generated from the ID of the file. The file object also supports glob patterns, which can be supplied as j.namePattern = '*.ROOT'. Upon first usage, the user will be asked to authenticate and allow access to create new files and edit these files only.
Only files created by Ganga can be deleted (or restored after deletion).
j = GoogleFile("mydata.txt")
j.localDir = "~/temp"
j.put()
# if the file is required to be deleted
j.remove() # will send the file to trash, use permanent=True for deletion
# to restore the file from trash
j.restore()
To download files previously uploaded by Ganga, use the get method: