There is a checklist of common problems at Easy "start of the night" Checks. If you have a problem, check this first!
The rest of this document summarises ORAC troubleshooting procedures and error messages for UKIRT scientists and telescope support staff. It lists known bugs which may cause problems. Because the software is very new, it is possible that there will be error conditions that did not arise during testing and commissioning, but this document attempts to cover all the possibilities based on what we have seen so far. The messages may be unfamiliar in style (Java is different from Fortran or C) as well as content. This document gives examples of error messages and discusses their likely causes and fixes.
The sequencer console will beep and go into a Paused state whenever an error is generated while a sequence is loaded : this error could be something going wrong with the instrument, the telescope, or the data handling system (DHS) as well as the ORAC-OM and ORAC-OS software themselves.
Most errors generate a pop-up box containing a summary of the error, and observing cannot continue until you have clicked on OK to acknowledge it. Before you click on OK, it is a good idea to go to the xterm and copy any more extensive error report from there into a file (otherwise it will scroll away when more messages appear). If you run into a problem and there are no errors on the xterm, please report this fact, along with the last item that did appear on it. Be aware that there may still be a few errors that don't generate a pop-up box, and some might not even report an error. We have tried to catch all of these, but the software is still very new, so don't be surprised if this sometimes happens.
If something goes wrong while you are taking data it is also helpful to remember that the software consists of a number of independent modules which "talk to each other" using various means. If you read the error messages carefully you can often work out which "module" the problem has happened in or if it is a communication problem : the name of the module is usually mentioned in the error message.
The ORAC system is still relatively new to most people, and so it is
easy at the start of the night to have problems which are caused by
simple mistakes or omissions.
If you have any problems such as error messages during the ORAC runup,
or during the first few observations, here are some things to check
before calling for assistance :
1. Is the TSS running TEL_NEW ?
Some problems can cause the sequencer to be stuck in a "running
state", but not actually doing anything, in which case a clean exit is
not possible and you will have to crash out. To crash out
of the ORAC-OM and the sequencer console, go back to the xterm you
started it up from and :
Type "Control-C"
(this is safe to do)
The ORAC windows will disappear and messages will appear on the
xterm. Many of these messages are just normal rundown logging messages
but if any of them look like additional error messages, scroll back
and copy them before proceeding. It is possible that in certain "hung"
conditions the error will be the first thing to appear on the xterm
after typing Control-C, so if an unexpected fault happens it is worth
keeping an eye out for this.
You are now ready to re-start the ORAC system. If the problem was
actually in the instrument or telescope itself, then of course you
should fix these before re-starting ORAC. When ORAC runs back up it
will report finding and killing processes, i.e. part of the run-up
procedure is to look for and clean up any processes "left over" from
the last time it was run.
In the unlikely event that Control-C does not get rid of the ORAC windows
then log onto Kiki, look for and kill the relevant Java processes. Then
try again. (Eventually we will provide a "nuke" script for this).
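Until such a script exists, the manual step of finding the Java processes can be sketched as below. This is an illustration only: the helper name java_pids is hypothetical, it assumes the usual "ps -ef" layout in which the process ID is the second column, and it only lists candidates. Check the full "ps" lines by eye before killing anything, since other Java programs may be running on the same machine.

```shell
# Print the PIDs of processes whose "ps -ef" line mentions java.
# Reads "ps -ef" output on stdin, so it can also be tried on saved output.
# The pattern is written [j]ava so that this awk command does not match
# its own entry in the process list.
java_pids() {
    awk '/[j]ava/ { print $2 }'
}

# Typical use on Kiki (look first, then kill by hand):
#   ps -ef | grep '[j]ava'
#   ps -ef | java_pids
#   kill <pid>
```

Nothing is killed automatically here; that choice is deliberate, because the list may include processes that do not belong to ORAC.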
The same general procedure also works for the OT - if you do something
which causes it to hang, use control C from the startup xterm and then
try running it up again.
If you do "hang the system" try to note down the things that you did
just before the "hang" e.g. which buttons you'd clicked on the
consoles as well as capturing any error messages from the x-term.
Since ORAC is a new system we may not yet have eliminated all the
areas where a "user" error causes a problem and noting what you think
you did in some detail will be very helpful.
Remember that when the ORAC-OM is crashed or run down, the
communication connection between it and the UFTI crate is lost.
Currently the only way to re-establish it is to reboot the UFTI crate
at the correct point in the startup procedure. When you run ORAC back
up you will get the reminder to reboot the UFTI crate at the right
point in the runup sequence again. Running ORAC down while leaving
the CGS4 or IRCAM SMS software running is not a problem unless you do
it a lot. A service night with a few IRCAM-CGS4 switches
should not be a problem, but if you are trying to find an instrument
problem and have run up and down a dozen times or so, then it would be
a good idea to run ORAC down and start cleanly. Improving
the connection management for UFTI is being worked on, but as it is part
of how Drama works a fix may not be quick. In the meantime be aware
that if you keep running ORAC up and down you will need to reboot UFTI
frequently.
If the ORAC-OM or the ORAC-OT cannot "talk" to the database for some
reason it will report a problem with an error that includes the phrase
"RMI Error" (this appears in a pop-up window), or "problem communicating
with ODB". "RMI" (Remote Method
Invocation) is the name of the technology used to talk to the
database. Typical causes are that the database has "stopped working" ,
or you have entered the wrong password for your userid. Simply
running the software down and up will not cure such problems, and you
will need to identify the cause first. The two most common examples
and an explanation of how to read the errors and use them to identify
the problem are given here. The OM and the OT report the same errors
when they have problems communicating with the database (because they
share code for doing this), but they may look a little different because the
OT pop-up boxes have a different style.
If the database has stopped working for some reason (e.g. perhaps the
machine it runs on has crashed, or the network is not working, or the
database server software has crashed), the message in the pop-up
window will say :
Problem in communicating with ODB:the ODBServer may not be running
The first line is the generic information that RMI is reporting a
problem in communication with the ODB (Observation DataBase). All
errors with communication to the database begin with the
phrase: "problem communicating with the ODB:"
It is the phrase:
The Java version currently in use has a problem with a memory leak,
which means that sometimes if you do a lot of retrieving and
submitting programs to it, the database will eventually run out of
memory. Procedures are in place to minimise the chances of this
happening during a night of observing, but it is worth being aware of
the potential. When this problem happens, an error message very
similar to the above will appear. Instead of saying "ConnectException"
it will include the line "java.lang.OutOfMemoryError" in the
pop-up box. Before it actually fails like this you will notice that
sending to and fetching from the database becomes slower and slower.
Check that the relevant computer and network are up and running.
If the machine and network are OK (you can log into it) then the cure
for both of the above problems is to try restarting the database server.
To do so:
Log onto Kiki as irt_archiver
Another common cause of an error from the database occurs when
the observer has given the wrong key for their program, or the wrong
password for their userid.
The first line is again the generic information that RMI is reporting
a problem in communication with the ODB (Observation DataBase) Server.
It is the line "ODBServerPackage AccessException" that tells you the user could
not access their program. As already noted
the most likely cause is that they gave the wrong password or key (all
keys are set to userid-0, so this should not be a common problem) but
it is also possible that file protections or inability to read a disk
could cause an "access" exception.
A less common error message from the ODB occurs if you try to send it a
different science program which has the same name as one you
previously sent. If the file was created by editing the original,
then even if you delete and replace every observation in the program
it will still be acceptable to the ODB. If, however, you create a brand
new science program, save it to disk with a name you have used
before, and then send it to the ODB, it will object vigorously. This
is because science programs are tagged when they have been in the
database, and the database server is in effect keeping track of what
it has already got. If you do this accidentally, the error message you get
in the pop-up window will say: Problem communicating with
ODB:java.rmi.ServerException
The last phrase "ODBException" is telling you the ODB is objecting to
something! (We are gradually improving the error messages with a
helpful hint as to the cause, starting with the most common.)
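The way to read these database messages can be summarised as a small lookup. The sketch below is not part of ORAC: the function name odb_hint and the short labels it prints are invented for illustration, and it simply looks for the key phrases discussed in this section inside whatever message text you paste in.

```shell
# Map an ODB error message to the likely cause described in this section.
# Only the key phrase is examined; the rest of the message is ignored.
odb_hint() {
    case "$1" in
        *ConnectException*) echo "server-unreachable" ;;   # machine, network or server down
        *OutOfMemoryError*) echo "server-out-of-memory" ;; # restart the database server
        *AccessException*)  echo "access-refused" ;;       # wrong password or program key
        *ODBException*)     echo "program-rejected" ;;     # e.g. a reused program name
        *)                  echo "unknown" ;;              # copy the full text from the xterm
    esac
}

odb_hint "Problem communicating with ODB:java.rmi.ServerException"
# prints: unknown (the ODBException phrase appears later in the full message)
```

Pasting the whole message rather than just its first line gives the lookup the best chance of finding the phrase that matters.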
Make sure that for each observation the numbers are up-dating on the
Data taking and Filing Status display. If the DHS is no longer writing
files to disk it will stop up-dating the "Last Saved" number.
It is a good idea to keep a regular eye on the status display
showing the writing of data to disk, because currently the DHS can die
"silently", i.e. without giving you an error message, and "for no
obvious reason". This has happened twice in total in the past 9
months, so it is not a very common problem. Indeed it may have been
fixed as a side effect of various updates and other fixes that were intended
to improve robustness, but it is
worth being aware of it, just in case.
The DHS can also crash "for a reason" in which case you always get at
least one error message. Such reasons include taking data OK, but
being unable to write it to disk; crashing the quicklook display just
at the time the DHS is trying to display a frame on it, or being
unable to obtain data or headers from either the instrument or
telescope in some circumstances. DHS crashes generate a popup error
message that looks like:
Command newObs completed with error 226394388
Note that depending on the severity/type of fault the error number may
be different, and depending on exactly when/where the DHS problem was
the command may be different (e.g. it could be "endObs" or
"getHeaders"), but the general form of this message is typical.
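When reporting a DHS crash it helps to record both the command and the error number. A minimal sketch of pulling the two apart is given below; the helper name parse_dhs_error is hypothetical, and it assumes the message has exactly the general form shown above.

```shell
# Split a pop-up message of the form
#   "Command <name> completed with error <number>"
# into the command name and the error number.
# Prints nothing if the text does not have that form.
parse_dhs_error() {
    printf '%s\n' "$1" |
        sed -n 's/^Command \([A-Za-z_]*\) completed with error \([0-9]*\)$/\1 \2/p'
}

parse_dhs_error "Command newObs completed with error 226394388"
# prints: newObs 226394388
```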
It is possible to click on OK on the popup box and attempt to continue
observing - you might be OK doing this, but it is not very wise. If
the DHS tasks disappeared when the error occurred then they will be
unable to report further errors although the sequence will keep
pausing whenever you try to take data, because it can no longer talk
to the DHS. Not all DHS crashes cause the tasks to disappear, so in
these cases you may get repeated error pop-ups if you try to continue
observing. If there has been a "partial DHS crash" and
you have attempted to continue, or it can't write data to disk for some
reason, you can appear to take up to about 20 observations before it
turns into a complete, fatal crash (when the buffers fill).
If the DHS crashes, for whatever reason, then the status display
always stops updating. If the DHS dies then you should exit ORAC
(in the normal manner, there is usually no need to "crash out") and run ORAC
up again after fixing any instrument/telescope/disk problems if they
contributed to the error.
If the DHS crashes then scrolling back through the messages on the OM
startup xterm to the
last filed observation may identify what went wrong. Please search for
an error message before running down.
It may also be useful to check the status of the DHS tasks to see if
any have disappeared. There are two easy checks you can do:
(a) Open an xterm on Kiki and type :
If UFTI is running you should see tasks with the following names :
If IRCAM3 (or CGS4) is running you should see names like :
Alternatively you can check the status of the Drama tasks. Log
onto Kiki using the observer account (which starts up Drama on
login). Then use the Drama "ditsgetinfo" command to get the status of
a specific task - the names of the tasks are given in the ps output above.
For example the command :
Using the "-full" option on "ditsgetinfo" is useful because it ensures
that you get a response if the task is there and healthy, as well as
an error message if it is not.
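If several tasks need checking, the same test can be wrapped in a small function. This is a sketch under stated assumptions: the function name task_alive and the DITS_PROBE variable are invented for illustration (DITS_PROBE exists only so the function can be tried on a machine without Drama), and a task is treated as healthy only if the reply starts with "Task <name>,", which is the form of the healthy reply quoted in this document.

```shell
# Succeed if a Drama task answers "ditsgetinfo -full" with a line
# starting "Task <name>,"; fail otherwise.
task_alive() {
    # Only the reply text is examined; the exit status is ignored because
    # its meaning is not described in this document.
    reply=$(${DITS_PROBE:-ditsgetinfo} -full "$1" 2>&1) || true
    case "$reply" in
        "Task $1,"*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Example, run from the observer account on Kiki:
#   for t in DHSPOOL_UFTI DES; do
#       if task_alive "$t"; then echo "$t ok"; else echo "$t NOT responding"; fi
#   done
```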
The other DHS related task which you could check is DES (Drama-Epics
system):
In general please be aware that the Quick Look which is available for
use with UFTI is still under development and test by the Michelle
project. It is probably the least tested software delivered with
ORAC, and in particular the interaction between the Gaia display
buttons and the ESO real time display code (which is being used to do
the rapid real time display) has not been well tested. Most of the
problems we have seen in the past have been in this general area - so
if you are the first person to use a particular Gaia button you could
well hit a bug and get strange behaviour or a nasty crash. The
buttons were quite thoroughly tested on the night of Aug 3rd and showed
no obvious problems; just be aware that it is all still quite new.
It is best to use the ORAC-DR Gaia display for serious data
manipulation since you have reduced images to work with there.
There are currently three known problems :
Do not use "Zoom in" on the Quick Look Gaia. There is
a bug whereby this will stop Gaia from being able to display any
further images that are sent to it by the DHS, i.e. the QL will no
longer display your data as it is obtained. (Zoom out might be OK.)
If the UFTI quicklook Gaia "disappears" (either due to a crash or you
kill it), then this can sometimes also cause problems with the writing
of files to disk by the DHS. We don't understand why yet, but the DHS
simply stops writing the data to disk (this is an intermittent fault,
which makes it difficult to trace). If you crash or accidentally kill the
Gaia display on the second head of Kiki, be careful to check that the
DHS is still writing files to disk. You can do this as described above
by checking that the "last saved" number is updating. Note too that
having killed the "Quick Look Gaia" you cannot then "get another" by
attempting to restart Gaia - you have to run the ORAC software down
and back up. (The communication connection that allows the DHS to
pass data to Gaia in real time dies when Gaia or the DHS dies).
If you run Quick Look on its own using the button on the sequencer
console (e.g. to check an exposure time) and then stop it, you
must check that the last exposure has finished
before starting or continuing the execution of your
sequence (watch the countdown timer and wait for the last image to
display). The user interface should not allow you to dismiss the
Quick Look control panel and continue with other things until this is
the case, but at the moment, due to a bug, you can do this on long
exposures.
Finally note that there have been problems in the past with
interaction/interference between the Quick Look and the Gaia that is
used by ORAC-DR. We believe that these have all been fixed. Anything of
this category would be apparent on startup - so ask for help then.
If you send a new observation for execution and there is
something wrong with the sequence itself it does not get loaded
into the sequencer console. If you send a new observation but when
you look at the OOS screen it still shows the old one and you get an
error message which says
It is possible to work around a translator problem by editing the
sequence by hand if you think you can work out what the offending line
is. The sequences are kept at the summit on
Some odd behaviour of the highlight on the sequencer control console
has been seen occasionally when sequences "run to completion" - the
highlight did not always go back to the start of the sequence as it
should have done. Sometimes it moved back a few rows and sometimes it
disappeared. We believe that all such problems have now been cured,
but are noting them here until further ORAC use has built confidence
that they really are fixed. In the unlikely case that they recur, such
problems can usually be worked around by clicking on the sequence
display to recover the highlight and place it where you need it to be.
You can then "run from highlight".
Remember that database connection problems, described above will
appear as an error message when you try to do something from the OM.
Be careful if you use the "on line help" item at the top. The
system is currently being run under JDK1.1.x and Java help files are
designed for JDK1.2. This means that functionality like changing
fonts can cause problems. Be aware that if you hang the help pages
you can hang everything else as well, and the only way to recover is
to crash out of everything. The help pages and web pages are the same
html files, and the help system will be tidied soon.
There is a minor bug in the "change userid" option which is offered when Exit
is selected on the ORAC-OM programme selection GUI. If you attempt to log
in again using the same user-id instead of a different one, the system
will hang. Switching userids from user1 -> user2 -> user1 is not a problem;
it is only if you try to do user1 -> user1 that the system hangs.
Sometimes "cut/copy and paste" just stops working, e.g. you can
apparently copy an observation, but then cannot paste it. There are
no error messages - the pasted observation simply does not appear. It
seems to happen after you have been using both "cut/copy and
paste" and deleting observations or items by highlighting them and
using "delete" on the keyboard. However, we have not yet been able to
tie down a set of actions that makes this reproducible. There is also
a suspicion that it might be partly a resource limitation on the
machine you are using. Unfortunately the only thing to do is save
what you have to disk, run everything down and try again.
There are also intermittent, non-reproducible problems with closing
science programs. If you use file-close on a science program window
or file-exit on the OT without having first saved changes to a science
program, then you should be asked whether or not you want to save the
program before exiting. Occasionally this does not happen! A possibly
related problem is that sometimes, if you are prompted and you select
"don't save", the OT does not close the program window or let you
exit. Almost all of the time both of these work fine. Until we find the
bug, the best solution is to try to remember to save anything you want
to keep before closing or exiting. If you don't want to save, then
Control-C on the startup xterm will always crash you out.
There are also a number of irritating minor bugs in the OT - which do
not affect functionality, and are slowly being tidied. These are
noted where appropriate in the OT userguide.
If the problem was with UFTI then the TSS will use the UFTI Epics
control system to investigate, re-datum filter wheels or the
shutter. If the problem was CGS4 or IRCAM then the TSS has the sms
menu system and all its engineering functionality (such as "kill and
reload Occam") available for troubleshooting. Many errors coming from
the instrument or telescope are prefixed by the phrase "Error in the
Drama tasks".
Note that the CGS4 and IRCAM software outputs long error
messages one line at a time (so that they look tidy on an SMS screen,
which made sense when the software was written!), and so it now
outputs errors to ORAC one line at a time. Because each
line is sent separately, each line appears as a separate error to
ORAC. This means that a CGS4 or IRCAM error will often
generate several ORAC error pop-up windows one after the other, and
you have to click "OK" on each of them before you can continue
observing. There is little that can be done about this: it is simply
a consequence of interfacing ORAC to the old instrument
software without making a lot of changes to that
software. It works, but isn't as elegant as for new
instruments.
The other kind of instrument problem that might occur is caused by any
remaining bugs in the ORAC-OT or the translator. As far as the
instrument is concerned, ORAC is responsible for generating its config
file, so such problems show up as the instrument
software complaining that it has been sent an illegal config, or as the
instrument not appearing to be set to the config you expected.
If this happens there are four possible causes:
One of these is that it is possible to directly load a sequence
and then run it. This sequence could be one that you have written for
testing purposes containing commands not normally used by astronomers
(such as datum, or wait n , etc) - so long as the commands are in the
sequencer dictionary they can be executed. The sequence could also be
one that you have written or modified for observing, to work around a
known problem. To directly load a sequence go to the "commands" menu
at the top and select "load" - it will fire up a browser in the
directory where sequences are written to allow you to find your file:
highlight the file you want and choose "open".
Another useful feature is that it is possible to look at an
expanded display of the sequence and see details of what is going on
internally to the sequence commands. If something is failing this can
help to narrow down exactly what. For example, if you can't take data,
instead of the console beeping and going into a paused state on the
"observe" commands, it will do so on one of the steps involved in
"observe" - such as "newObs" or "uftiObserve". This will help you to
know more precisely where the software is failing. To see the
expanded display go to the configure item at the top of the OOS
console and click on "hide eng exec" to turn off this function.
 
Original: 1999/10/24, Last Modification Date 2000/07/26 - Last Modification Author: Gillian Wright
2. Easy "start of the night" checks
If not, run everything down and start over.
If not, run the OM and instrument down and start over.
If not run the OM and
instrument down and start over.
If not, run the OM down
and start over.
3. You ran array tests successfully but now you can't slew the telescope ?
Check /ukirt_sw/instrument_configs/{instrument}.inst
Read the line beginning with Tel
If the second word is 'simulate' log onto kiki as irt_archiver
(usual password) and edit the file. Replace simulate with
PTEST@IRTTCS
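The check in step 3 can be sketched as a one-line helper. It is an illustration only: the function name tel_setting is made up, and it assumes the words on the Tel line are separated by whitespace, as the description above implies.

```shell
# Print the second word of the first line beginning with "Tel".
tel_setting() {
    awk '/^Tel/ { print $2; exit }' "$1"
}

# Example, with $instrument set to the instrument in use:
#   tel_setting "/ukirt_sw/instrument_configs/${instrument}.inst"
# If this prints "simulate", the file needs editing as described above.
```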
4. Are you apparently taking data but it is not being saved to disk ?
Type :
oracdr_{instrument}
cd $ORAC_DATA_IN
ls -al .lastobs *.log
If either of those files is not writeable by group, ring up their
owner or FE/NPR/HPS and ask them to make them so.
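Rather than reading the permission columns by eye, the files that fail the check can be listed directly. The sketch below assumes a "find" that supports -maxdepth (GNU and BSD versions do); the helper name not_group_writeable is hypothetical.

```shell
# List .lastobs and *.log files directly inside a directory that are
# NOT writeable by group. Prints nothing if all of them are fine.
not_group_writeable() {
    find "$1" -maxdepth 1 \( -name .lastobs -o -name '*.log' \) ! -perm -g+w -print
}

# Example, after running oracdr_{instrument}:
#   not_group_writeable "$ORAC_DATA_IN"
```

Any file it prints is one whose owner needs to be asked to add group write permission.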
5a. If you get an error message from the ODBServer when you try to
login on the first ORAC screen
Make sure you are logging on correctly, i.e. using your user-id, not
the username you use for logging into computers
5b. You are using the correct user-id, and you are unable to login, or fetch/send science programs to/from the database, with error messages containing the word ODBServer
Log on as irt_archiver to kiki and type
killODBSvr
runODBSvr
6. ORAC-DR complains it can't find the data
Make sure you are using the correct invocation for setting up
(i.e. oracdr_ircam rather than oracdr_ircam_old)
7. Telescope problems such as the star not coming in the box properly, or error messages when you slew
Call Russell !
8. Any CGS4 or IRCAM error such as BDS errors, filter wheel problems etc:
Acknowledge the error(s) in the OM. Click on "STOP ASAP" and wait
for it to stop. Go to the VMS instrument screen and do whatever you
would have normally done to fix the error. When you have fixed it
on that side, put your highlight at an appropriate place (if it is
not there already) and "Run from Highlight".
9. Remember that if you run down the instrument or telescope software then
it is best to run down ORAC first.
If you forget, and ORAC is "hung", just use Control-C on the startup xterm
3. "Cures Many": Rundown / Crash Out and Re-start
If you are observing and something goes wrong with the OM programme
selection, the sequencer console, or the DHS, the fastest fix for most
problems is to just run down and start again - if you are in the
middle of a long sequence, note where it crashed so that you can use
"run from highlight" to restart from the appropriate place. The
software runs down and up very quickly.
4. Database Connection Problems
Problem finding the Science Program Server (might be down ?):java.rmi.ConnectException:Connection refused to host:[kiki.ukirt.jach.hawaii.edu:4201];
java.net.ConnectException:Connection refused.
"java.rmi.ConnectException: Connection refused to host: [xxx];"
which tells you that the problem is that the
ORAC-OM or OT was unable to connect
to the database. Note that the message also tells you the name of the
machine the database server should be running on (Kiki). The most likely
reason for being unable to connect to the database is that the
database is not running.
Issue the command : killODBSvr (all one word, case sensitive)
Issue the command : runODBSvr (all one word, case sensitive)
You should get a message confirming that the server is now running
You will have to login to the database again before you try sending/fetching
your program again.
ODBServerPackage.ODBException
5. DHS Problems
ps -ef | grep dhs
/dhs/dhserver Ufti
/dhs/dhdriver QL_UFTI UFTI
/dhs/dhsave DHSAVE_UFTI
/dhs/dhandler DHANDLER_UFTI DHSPOOL_UFTI QL_UFTI
/dhs/dhspool DHSPOOL_UFTI DHSAVE_UFTI
/dhs/dhspool DHSPOOL_IRCAM3 DHSAVE_IRCAM3
/dhs/dhsave DHSAVE_IRCAM3
ditsgetinfo -full DHSPOOL_UFTI
will respond
Task DHSPOOL_UFTI, type 0, description ""
if the task is there and responding. If it is not there, or not responding,
you will get an error message, for example:
DITSGETINFO_8f0:exit status:%DITS-F-UNKNTASK, Task unknown to message system
ditsgetinfo -full DES
However if DES has "crashed" then (a) you will also have stopped being
able to talk to UFTI or the FP or IRPOL (it is not used for CGS4 or
IRCAM), and (b) you will have had an error message reporting the fact.
6. UFTI Quick Look Problems
7. Sequencer Problems
Error in the Drama tasks: ## Failed
to load EXEC:exec_mainload failed
then there is probably
something wrong with the sequence you just tried to load. Running
into this problem is very unlikely during normal use, because the
sequences are generated by the translator, and there are now no known
translator problems. However if there are any remaining more esoteric
translator bugs they could cause such a problem.
ukirtdata/orac_data/sequences/
Use ls -lrt to find the most recently created file
called oracnnnnnnnnnnnn.exec and look for any unusual commands or spaces
in parameters where there are not normally spaces etc. These are the
sorts of problem which could cause a sequence to fail to load.
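Picking out the newest sequence file can also be done without reading the whole listing. This is a minimal sketch: the helper name newest_sequence is hypothetical, it takes the sequence directory as its argument rather than assuming a path, and it uses "ls -t" (newest first) where the text above uses "ls -lrt" (newest last).

```shell
# Print the most recently modified orac*.exec file in a directory.
# Prints nothing if there are no such files.
newest_sequence() {
    ls -t "$1"/orac*.exec 2>/dev/null | head -n 1
}

# Example, with $seqdir set to the summit sequence directory:
#   more "$(newest_sequence "$seqdir")"
```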
8. ORAC-OM Problems
9. ORAC-OT Problems
The ORAC-OT has a few intermittent but serious problems that we have
been unable to fix. These are :
10. Instrument Problems
If there are problems with the instrument (e.g. failure of a filter
wheel to move) for which the instrument generates an error message,
then that error message is picked up by ORAC-OM and reported to the
observer in a pop-up box. This is also true of some telescope errors,
such as a source being inaccessible when you try to slew to it. Since
an error has been reported the sequencer will go into a paused
state. To continue observing you have to acknowledge the
error message by clicking on OK, and then use the "continue" button on
the console. Of course you should sort out the instrument/telescope
problem before continuing, and if that means running down their
control software you should also run down ORAC (it is better if you do
this first, then run the instrument down).
11. Useful "engineering" functions
Although ORAC is primarily a high level user interface, some
engineering functionality has been included in the OOS console to
assist in trouble-shooting and work arounds, and to enable engineering
sequences to be used for repetitive instrument tests.