This page is dedicated to running FAMOUS using reconfigurations, although it might help in more general cases. Please refer to the general and introductory guides and notes first if you haven’t done so.
Much of the information on this page is based on answers to my questions from Robin Smith and Annette Osprey. Many thanks to both of them.
This page is meant to be as foolproof as possible and so inevitably lengthy. Don’t tell me to tidy it up.
You do not have to do these at all, but you will need to know that I did these to understand the rest of this page.
jobid=xxxxx (this page is based on tcmid and tcmie)
$USER (I’m ggxmy)
mkdir $DUMP2HOLD/um/dumps   # create a job-independent directory to store dump files
ln -s $DUMP2HOLD/um/dumps MY_DUMPS
(MY_DUMPS -> $DUMP2HOLD/um/dumps)
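The directory setup above could be made safe to rerun along the following lines. This is only a sketch: `setup_dump_dir` is a hypothetical helper name, and both the target directory and the link name are passed in as arguments rather than assumed.

```shell
# Create a job-independent dump directory and a symlink pointing to it.
# Hypothetical helper (not part of the UM or FAMOUS tools); safe to run twice.
setup_dump_dir() {
    sdd_target="$1"   # e.g. $DUMP2HOLD/um/dumps
    sdd_link="$2"     # e.g. MY_DUMPS
    # -p also creates missing parent directories and ignores an existing one.
    mkdir -p "$sdd_target" || return 1
    # Only create the link when nothing exists under that name yet.
    if [ ! -e "$sdd_link" ] && [ ! -L "$sdd_link" ]; then
        ln -s "$sdd_target" "$sdd_link" || return 1
    fi
}
```

For example, `setup_dump_dir "$DUMP2HOLD/um/dumps" MY_DUMPS` would reproduce the two commands above.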
1.1. UMUI: [submodel indep]-[compil+modif]-[Compile Options]
1.2. UMUI: [submodel indep]-[compil+modif]-[mod for reconfig]
1.3. UMUI: [submodel indep]-[Job submission, resources…]
1.4. UMUI: [submodel indep]-[Gen config control]
1.5. UMUI: [Atmos]-[Ancil+input]-[start dump]
1.6. UMUI: [Ocean]-[input files]-[start dump]
1.7. UMUI: press <save> and then <process>
1.8. manually post process on puma if it is necessary for you (see section [2] of NoteFamousQuest)
jobid=xxxxx   # don't forget to set jobid first!
echo $jobid   # check jobid
/home/famous/bin/he_namelist_new_phase5 $jobid
/home/famous/bin/vfdrift_pp.sh $jobid
/home/famous/bin/quest_queue.sh $jobid
1.9. Submit
UMUI: press [SUBMIT]
OR puma: ~jeff/bin/umsubmit -h quest-hpc.bris.ac.uk -u $USER -r scp $jobid
OR puma: clustersubmit -c n -s y -r quest $jobid
($MY_DUMPS/$jobid.astart = ~/um/dumps/$jobid.astart)
~/DUMP2HOLD/um/$jobid/dataw/$RUNID.recon is newly created in the first run, which crashes (LET IT CRASH AND GO THROUGH TROUBLESHOOTING (2) UNTIL WE FIND A BETTER WAY TO RUN!). It is not created again after going through troubleshooting (2); in other words, the one created manually during troubleshooting should not be overwritten.
1.10. Save the original astart file
quest:$MY_DUMPS/
cp $jobid.astart $jobid.astart.ini
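Since the saved copy is the only pristine start dump, it may be worth guarding the `cp` against accidentally overwriting an earlier backup. A minimal sketch, assuming the `.ini` suffix used above; `save_original_dump` is a hypothetical helper name.

```shell
# Keep a pristine copy of a start dump as <file>.ini.
# Hypothetical helper; refuses to overwrite an existing backup.
save_original_dump() {
    sod_file="$1"     # e.g. $MY_DUMPS/$jobid.astart
    if [ ! -f "$sod_file" ]; then
        echo "save_original_dump: $sod_file not found" >&2
        return 1
    fi
    if [ -e "$sod_file.ini" ]; then
        echo "save_original_dump: $sod_file.ini already exists, not overwriting" >&2
        return 2
    fi
    # -p keeps the timestamp, so the backup shows when the dump was made.
    cp -p "$sod_file" "$sod_file.ini"
}
```

The same helper would work for the ocean start dump, e.g. `save_original_dump $jobid.ostart`.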
(1) In my initial attempts, a $jobid.astart file was not created. The .leave file looked normal in terms of size, but looking inside it revealed some problems. The “Completion code” was 134, and I suppose anything other than 0 indicates an error. Just above that there were lines like:
Abort
/exports/gpfsbig/um/PUM64/um/vn4.5/scripts/qsprelim: Error in dump reconfiguration - see OUTPUT
So clearly there was an error. Near the end of the file, I found what was wrong:
ERROR : Reconfiguration CONTROL
No of land points in output Land-sea mask = 770
No of land points specified in namelist RECON = 836
Please reprocess the job with the correct number of land points in UMUI panel
(2) With (1) fixed, a larger .astart file is created (2064384 B; I later found that this is not quite the right size, which is 2080768 B). It looks ‘mostly’ all right, but the .leave file still isn’t quite happy:
Abort
/exports/gpfsbig/um/PUM64/um/vn4.5/scripts/qsprelim: Error in dump reconfiguration - see OUTPUT
Completion code : 134
…………<skipped>…………
No Sea Ice Temperature in input dump
Sea Ice Temperature being initialised.
Processing Field 119 Stash Code= 408 : SEA-ICE SURFACE TEMP AFTER TIMESTEP
Warning - non-constant polar row for field 119
Problem with reading T* field.
Error detected in subroutine CONTROLcontrol1.f <non-printable characters skipped> TRANSPLANTING DATA <non-printable characters skipped>
while doing I/O on unit 21
This problem is at least circumvented (it may or may not be fundamentally fixed) by the following procedure. Edit
quest:~/DUMP2HOLD/um/$jobid/code/exec_build/qxrecon_dump_dir/control1.f
(lines near 5760) as below;
Original:
      IF (ICODE.GT.0) THEN
        WRITE (6,*) ' Problem with reading T* field.'
        CALL ABORT_IO ('CONTROL',CMESSAGE,ICODE,NFTOUT)
      ENDIF
Modified (added lines are marked with !!!):
      IF (ICODE.GT.0) THEN
        if (ICODE.eq.1501) then !!!
          write(6,*) ' Polar rows not constant in T*.' !!!
     &      //' This is probably not a problem.' !!!
        else !!!
          WRITE (6,*) ' Problem with reading T* field.'
          CALL ABORT_IO ('CONTROL',CMESSAGE,ICODE,NFTOUT)
        end if !!!
      ENDIF
Run make in that directory and an executable qxrecon_dump is created. I think it is OK to let it overwrite the original one, but you may want to rename the original to keep it.
Then copy qxrecon_dump to /exports/gpfsbig/home/$USER/DUMP2HOLD/um/$jobid/dataw/ and rename it $jobid.recon. My session looked like this:
quest1:/exports/gpfsbig/home/$USER/DUMP2HOLD/um/$jobid/code/exec_build/qxrecon_dump_dir% cp qxrecon_dump qxrecon_dump.org
quest1:/exports/gpfsbig/home/$USER/DUMP2HOLD/um/$jobid/code/exec_build/qxrecon_dump_dir% make
quest1:/exports/gpfsbig/home/$USER/DUMP2HOLD/um/$jobid/code/exec_build/qxrecon_dump_dir% ls -ltr
…………………………………………
-rwxr-xr-x 1 $USER users 1380231 Feb 8 17:08 qxrecon_dump.org
-rwxr-xr-x 1 $USER users 1380423 Feb 8 17:15 qxrecon_dump
quest1:/exports/gpfsbig/home/$USER/DUMP2HOLD/um/$jobid/code/exec_build/qxrecon_dump_dir% cp qxrecon_dump /exports/gpfsbig/home/$USER/DUMP2HOLD/um/$jobid/dataw/
quest1:/exports/gpfsbig/home/$USER/DUMP2HOLD/um/$jobid/code/exec_build/qxrecon_dump_dir% cd /exports/gpfsbig/home/$USER/DUMP2HOLD/um/$jobid/dataw
quest1:/exports/gpfsbig/home/$USER/DUMP2HOLD/um/$jobid/dataw% mv $jobid.recon $jobid.recon.old
quest1:/exports/gpfsbig/home/$USER/DUMP2HOLD/um/$jobid/dataw% mv qxrecon_dump $jobid.recon
quest1:/exports/gpfsbig/home/$USER/DUMP2HOLD/um/$jobid/dataw% ls -l
…………………………………………
-rwxr-xr-x 1 $USER users 1380423 Feb 8 17:15 $jobid.recon
-rwxr-xr-x 1 $USER users 1380231 Feb 7 17:36 $jobid.recon.old
Doing this seems to fix the problem: a trouble-free .astart file is created, and the job using this .astart file has been shown to run successfully.
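The copy-and-rename part of the procedure in (2) can be sketched as a small function. This is an illustration of the cp/mv sequence shown in the session above, not an official tool; `install_recon` and its argument order are made up for this page.

```shell
# Install a rebuilt reconfiguration executable as <jobid>.recon,
# keeping any existing one as <jobid>.recon.old.
# Hypothetical helper mirroring the cp/mv sequence in troubleshooting (2).
install_recon() {
    ir_build_dir="$1"   # directory containing the rebuilt qxrecon_dump
    ir_dataw_dir="$2"   # the job's dataw directory
    ir_jobid="$3"
    ir_src="$ir_build_dir/qxrecon_dump"
    ir_dst="$ir_dataw_dir/$ir_jobid.recon"
    if [ ! -x "$ir_src" ]; then
        echo "install_recon: $ir_src missing or not executable" >&2
        return 1
    fi
    # Preserve the executable that the crashed first run left behind.
    if [ -e "$ir_dst" ]; then
        mv "$ir_dst" "$ir_dst.old" || return 1
    fi
    cp -p "$ir_src" "$ir_dst"
}
```

Note that running it a second time would replace the saved `.old` file with the patched executable, so it is meant to be run once per job.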
(3) The previous job was successful, but in a new job the .astart file is not created. This seems to be a different problem from (1) or (2). The .leave file says:
/exports/gpfsbig/um/PUM64/um/vn4.5/scripts/qsprelim[779]: /exports/gpfsbig/work/bristol/$USER/um/$jobid/dataw/ $jobid.recon: not found
/exports/gpfsbig/um/PUM64/um/vn4.5/scripts/qsprelim: Error in dump reconfiguration - see OUTPUT
*****************************************************************
Ending script : qsprelim
Completion code : 127
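All three cases above were spotted from the completion code in the .leave file, so a quick check can save some scrolling. The sketch below assumes lines of the form `Completion code : 134` and the two land-point lines exactly as quoted in (1); the function names are hypothetical, and treating any non-zero code as a failure is my working assumption rather than documented behaviour.

```shell
# Print the last "Completion code" value found in a .leave file.
# Assumes lines of the form "Completion code : 134" as quoted above.
leave_completion_code() {
    sed -n 's/^.*Completion code *: *\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1
}

# Succeed only when the last completion code is 0.
leave_ok() {
    lo_code=$(leave_completion_code "$1")
    [ "$lo_code" = "0" ]
}

# Print "<mask> <namelist>" land-point counts from a failed reconfiguration,
# or nothing if either line is absent.
recon_land_points() {
    awk -F'=' '
        /No of land points in output Land-sea mask/ { mask = $2 + 0 }
        /No of land points specified in namelist RECON/ { nml = $2 + 0 }
        END { if (mask != "" && nml != "") print mask, nml }
    ' "$1"
}
```

If `recon_land_points` prints two different numbers, the first is the value the reconfiguration expects in the UMUI panel, as the error message in (1) says.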
[If you had a different kind of trouble in atmospheric reconfiguration and resolved it, please add the information about it here (or anywhere else) and share it with other users.]
If you have just done the atmospheric reconfiguration, steps 2.1 to 2.3 should already be set correctly. In that case start from 2.4.
2.1. UMUI: [submodel indep]-[compil+modif]-[Compile Options]
2.2. UMUI: [submodel indep]-[Job submission, resources…]
2.3. UMUI: [submodel indep]-[Gen config control]
2.4. UMUI: [Atmos]-[Ancil+input]-[start dump]
2.5. UMUI: [Ocean]-[input files]-[start dump]
2.6. UMUI: press <save> and then <process>
2.7. manually post process on puma if necessary (see 1.8)
2.8. Submit
UMUI: press [SUBMIT]
OR puma: ~jeff/bin/umsubmit -h quest-hpc.bris.ac.uk -u $USER -r scp $jobid
OR puma: clustersubmit -c n -s y -r quest $jobid
2.9. Save the original ostart file
echo $jobid   # check jobid
cp $jobid.ostart $jobid.ostart.ini
(1) $jobid.ostart is only about 2.8 MB and the completion code in the .leave file is 134.
The completion code shows that the run was unsuccessful. Near the bottom of the file I found a line indicating what was wrong.
*ERROR* Stash code 103 not found on input file
If you look at STASH (UMUI: [Ocean]-[STASH]-[STASH. Specification…]) you will see what 103 is for (it is ‘OCN EXTRASER 1: CONVEN TCO2’).
This error occurred because ocean chemistry was turned on in the model but its fields were not in the ocean start dump file. If it is OK to turn off ocean chemistry, do so in UMUI: [Ocean]-[Scientific Parameters]-[Carbon Cycle]. Also disable extra tracers in UMUI: [Ocean]-[Tracers]-[User Defined Tracers]. Then go back to 2.6 and try again. I got a .ostart of 7274496 B, which is much smaller than in the case with ocean chemistry turned on (~15 MB), but this turned out to be correct.
If you do need the chemistry, you’ll need some fields compatible with the rest of the restart.
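Because file size was the first hint of trouble in both the atmosphere and ocean cases, a small size check may be handy. This is a sketch with hypothetical function names; the byte counts quoted on this page (2080768 for .astart, 7274496 for .ostart without ocean chemistry) are from my jobs and may well differ for other configurations.

```shell
# Print a file's size in bytes (easier to compare than ls -l output).
dump_size_bytes() {
    [ -f "$1" ] || return 1
    # Some wc implementations pad the count with spaces; strip them.
    wc -c < "$1" | tr -d ' '
}

# Compare a dump's size with an expected byte count.
# Returns 0 on a match and 1 on a mismatch or a missing file.
check_dump_size() {
    cds_actual=$(dump_size_bytes "$1") || return 1
    if [ "$cds_actual" -eq "$2" ]; then
        echo "$1: $cds_actual bytes (as expected)"
    else
        echo "$1: $cds_actual bytes (expected $2)"
        return 1
    fi
}
```

For example, `check_dump_size $jobid.ostart 7274496` would flag the 2.8 MB file from case (1).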
[If you had a different kind of trouble in ocean reconfiguration and resolved it, please add the information about it here (or anywhere else) and share it with other users.]
3.1. UMUI: [submodel indep]-[compil+modif]-[Compile Options]
3.2. UMUI: [submodel indep]-[Job submission, resources…]
3.3. UMUI: [submodel indep]-[Gen config control]
3.4. UMUI: [Atmos]-[Ancil+input]-[start dump]
3.5. UMUI: [Ocean]-[input files]-[start dump]
3.6. UMUI: press <save> and then <process>
3.7. manually post process on puma if necessary (see 1.8)
3.8. Submit
UMUI: press [SUBMIT]
OR puma: ~jeff/bin/umsubmit -h quest-hpc.bris.ac.uk -u $USER -r scp $jobid
OR puma: clustersubmit -c n -s y -r quest $jobid
4.1. UMUI: [submodel indep]-[compil+modif]-[Compile Options]
4.2. UMUI: [submodel indep]-[Job submission, resources…]
4.3. (same as 3.3) UMUI: [submodel indep]-[Gen config control]
4.4. (same as 3.4) UMUI: [Atmos]-[Ancil+input]-[start dump]
4.5. (same as 3.5) UMUI: [Ocean]-[input files]-[start dump]
4.6. UMUI: press <save> and then <process>
4.7. manually post process on puma if necessary (see 1.8)
4.8. Submit
UMUI: press [SUBMIT]
OR puma: ~jeff/bin/umsubmit -h quest-hpc.bris.ac.uk -u $USER -r scp $jobid
OR puma: clustersubmit -c n -s y -r quest $jobid
clustersubmit is said to be preferable. It generally gives more control, and the advantage in this particular case is that you can skip step 4.9.
qsub: invalid option ? x
qsub: invalid option ? s
usage: qsub [-a date_time] [-A account_string] [-b secs]
[-c c[=<INTERVAL>] ] [-C directive_prefix] [-d path] [-D path]
[-e path] [-h] [-I] [-j oe] [-k {oe}] [-l resource_list] [-m {abe}]
[-M user_list] [-N jobname] [-o path] [-p priority] [-q queue] [-r y|n]
[-S path] [-u user_list] [-X] [-W otherattributes=value?] [-v variable_list]
[-V ] [-z] [script]
If the UMUI [SUBMIT] button is clicked or umsubmit is used
If clustersubmit is used
4.9. Run
cd $HOME/umui_runs
ls -lrt
cd $jobid-012345678   (go to the latest directory)
ls -l
~um/bin/qsub-um qsubmit.quest# (# = 1 or 2; look in the directory)
4.10. check the status of the run on quest
qstat   # check all jobs running on quest
qstat -u $USER   # check your jobs only
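The "go to the latest directory" part of step 4.9 can be scripted. The sketch below assumes run directories are named `<jobid>-<digits>` under `$HOME/umui_runs`, as in the example, and picks the newest by modification time (the same ordering as `ls -lrt`). `latest_run_dir` is a hypothetical name, and parsing `ls` output like this would break on directory names containing newlines.

```shell
# Print the most recently modified run directory for a job.
# Assumes directories are named <jobid>-<digits> under the base directory.
latest_run_dir() {
    lrd_base="$1"     # e.g. $HOME/umui_runs
    lrd_jobid="$2"
    # ls -t sorts newest first; -d lists the directories themselves.
    lrd_dir=$(ls -dt "$lrd_base/$lrd_jobid"-*/ 2>/dev/null | head -n 1)
    [ -n "$lrd_dir" ] || return 1
    # Drop the trailing slash that the glob pattern adds.
    printf '%s\n' "${lrd_dir%/}"
}
```

Typical use would be `cd "$(latest_run_dir "$HOME/umui_runs" "$jobid")"`.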
(1) The run stopped after a few tens of seconds. Near (not quite at) the end, the output file (~/umui_runs/[jobid]-012345678/[jobid]000.o1234) said “LSEGF NOT LARGE ENOUGH.”
(2) Run stopped after tens of years of simulation time
As always, look in the output file ([jobid]000.o1234 or [jobid]***.leave). I got the following message:
Model aborted with error code - 1 Routine and message:- P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED.
This is a common error and I have had it many times. It seems the simulated climate became unstable and the simulation died.
In previous versions of FAMOUS, this problem could be overcome simply by resubmitting the job. However, this version is bit-reproducible, so simply resubmitting the job will result in exactly the same error.
So basically what you can do is use the latest atmospheric and ocean dumps as initial dumps, reconfigure, and run again.
[If you had a different kind of trouble in running the model and resolved it, please add the information about it here (or anywhere else) and share it with other users.]
I have heard of a few ways to submit a continuation run.
Do 5.1 and 5.2 if necessary.
5.1. UMUI: [submodel indep]-[start date + run length options]
5.2. repeat 4.1~4.5.
Then do one of (1)~(4).
(1) clustersubmit
puma; clustersubmit -c y -s y -r quest $jobid
-c specifies whether this is a continuation run (y) or not (n). (-s specifies whether to submit (y) or copy files over (n), and -r specifies the target machine (e.g. quest, ormen, etc.))
(2) qsubmit.quest#
5.3. modify qsubmit.quest#
cd $HOME/umui_runs
ls -lrt
cd $jobid-012345678   (go to the latest directory)
ls -l
Open qsubmit.quest# (# = 1 or 2; look in the directory) with a text editor and replace “TYPE=NRUN” with “TYPE=CRUN”.
5.4. submit to run
~um/bin/qsub-um qsubmit.quest#
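The edit in 5.3 is a single substitution, so it can also be done without opening an editor. A minimal sketch: `set_crun` is a hypothetical helper, and the `.nrun` backup suffix is an arbitrary choice of mine.

```shell
# Switch a submit script from a new run to a continuation run by
# replacing TYPE=NRUN with TYPE=CRUN, keeping the original as <file>.nrun.
# Hypothetical helper; the backup suffix is an arbitrary choice.
set_crun() {
    sc_file="$1"      # e.g. qsubmit.quest1
    if ! grep -q 'TYPE=NRUN' "$sc_file"; then
        echo "set_crun: no TYPE=NRUN in $sc_file" >&2
        return 1
    fi
    cp -p "$sc_file" "$sc_file.nrun" || return 1
    # Write back into the existing file so its permissions are kept.
    sed 's/TYPE=NRUN/TYPE=CRUN/' "$sc_file.nrun" > "$sc_file"
}
```

After `set_crun qsubmit.quest1` (or `qsubmit.quest2`), submit as in 5.4. Running it again on an already converted file does nothing and returns an error, which makes accidental double edits easy to notice.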
(3) In ~/umui_jobs/$jobid, edit SUBMIT as shown in 5.3 above and then run umsubmit to copy the new scripts across. Then submit using qsub-um on quest as in 5.4.
(4) /home/annette/famous/bin/change_crun_vn4.5
There is also some information about resubmission on the NCAS-CMS website.