Guide to Trouble Shooting and Known Bugs for the ORAC-OM and ORAC-OT

There is a checklist of common problems in section 2, Easy "start of the night" checks. If you have a problem, check this first!

The rest of this document summarises ORAC troubleshooting procedures and error messages for UKIRT scientists and telescope support staff. It lists known bugs which may cause problems. Because the software is very new, it is possible that there will be error conditions that have not arisen during testing and commissioning, but this document attempts to cover all the possibilities based on what we have seen so far. The messages may be unfamiliar in style (Java is different from Fortran or C) as well as content. This document gives examples of error messages and discusses their likely causes and fixes.

1. Error Message Locations

Both the ORAC-OT and the ORAC-OM log messages to the xterm they are started up from. If something does go wrong, look at the startup xterm for the error message. Phrases containing the words "error", "failed to find", "failed to connect", "lost", "Exception" or "Null pointer" could all indicate what the problem is. Please put these messages into a fault/bug report.
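If the xterm text has been saved to a file, the phrases listed above can be searched for in one pass. This is only a sketch: the function name and the file name are hypothetical, and ORAC itself does not write such a file (you would save the xterm text by hand).

```shell
# Sketch only: scan a saved copy of the startup xterm output for the
# phrases that usually indicate a problem. Function name and file name
# are illustrative, not part of ORAC.
scan_orac_log() {
    # $1 = path to the saved xterm text
    # -i ignore case, -n show line numbers, -E extended regular expression
    grep -inE 'error|failed to find|failed to connect|lost|exception|null pointer' "$1"
}

# Example usage:
#   scan_orac_log /tmp/orac_xterm.txt
```

The line numbers in the output make it easier to find the surrounding messages when writing the fault report.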

The sequencer console will beep and go into a Paused state whenever an error is generated while a sequence is loaded. This error could be something going wrong with the instrument, the telescope, or the data handling system (DHS) as well as the ORAC-OM and ORAC-OS software themselves.

Most errors generate a pop-up box containing a summary of the error, and observing cannot continue until you have clicked on OK to acknowledge the error. Before you click on OK, it is a good idea to go to the xterm and copy any more extensive error report from there into a file (otherwise it will scroll away when more messages appear). Also, if you run into a problem and there are no errors on the xterm, please report this fact and what the last item that did appear on it was. Be aware that there may still be a few errors that don't generate a pop-up box, and some problems might not report an error at all. We have tried to catch all of these, but it is still very new software, so don't be surprised if this sometimes happens.

If something goes wrong while you are taking data, it is also helpful to remember that the software consists of a number of independent modules which "talk to each other" using various means. If you read the error messages carefully you can often work out which "module" the problem has happened in, or whether it is a communication problem: the name of the module is usually mentioned in the error message.

2. Easy "start of the night" checks

The ORAC system is still relatively new to most people, and so it is easy at the start of the night to have problems which are caused by simple mistakes or omissions.

If you have any problems such as error messages during the ORAC runup, or during the first few observations, here are some things to check before calling for assistance :

1. Is the TSS running TEL_NEW ?
If not, run everything down and start over.

2a. Is the TSS running IRCAM3_NEW? Did you wait until it had finished running up before running up the OM?
If not, run the OM and instrument down and start over.


2b. Is the TSS running CGS4_NEW? Did you wait until the motors had finished datuming before running up the OM?
If not, run the OM and instrument down and start over.


2c. Did someone reboot UFTI when asked? Did you wait until it had finished booting before proceeding?
If not, run the OM down and start over.

3. You ran array tests successfully but now you can't slew the telescope ?
Check /ukirt_sw/instrument_configs/{instrument}.inst
Read the line beginning with Tel
If the second word is 'simulate' log onto kiki as irt_archiver (usual password) and edit the file. Replace simulate with PTEST@IRTTCS

4. Are you apparently taking data but it is not being saved to disk ?
Type :
oracdr_{instrument}
cd $ORAC_DATA_IN
ls -al .lastobs *.log
If either of those files is not writeable by group, ring up their owner or FE/NPR/HPS and ask them to make them so.

5a. If you get an error message from the ODBServer when you try to login on the first ORAC screen
Make sure you are logging on correctly - i.e. using your user-id, not your username for logging into computers

5b. You are using the correct user-id, and you are unable to login, or fetch/send science programs to/from the database, with error messages containing the word ODBServer
Log on as irt_archiver to kiki and type
killODBSvr
runODBSvr

6. ORAC-DR complains it can't find the data
Make sure you are using the correct invocation for setting up
(i.e. oracdr_ircam rather than oracdr_ircam_old)


7. Telescope problems such as the star not coming in the box properly, or error messages when you slew
Call Russell !

8. Any CGS4 or IRCAM error such as BDS errors, filter wheel problems etc:
Acknowledge the error(s) in the OM. Click on "STOP ASAP" and wait for it to stop. Go to the VMS instrument screen and do whatever you would have normally done to fix the error. When you have fixed it on that side, put your highlight at an appropriate place (if it is not there already) and "Run from Highlight".

9. Remember that if you run down the instrument or telescope software then it's best to run down ORAC first.
If you forget, and ORAC is "hung", just use Control-C on the startup xterm
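Checks 3 and 4 above can be done by eye, but a small helper makes them quicker to repeat. The sketch below assumes the config file is plain text with whitespace-separated words and that the mode string printed by "ls -l" has its usual layout; the function names are hypothetical and are not part of ORAC.

```shell
# Print the second word of the line beginning with "Tel" in an
# instrument config file (check 3). File layout is an assumption.
tel_setting() {
    # $1 = path to the {instrument}.inst file
    awk '$1 ~ /^Tel/ { print $2; exit }' "$1"
}

# Succeed if the file is writeable by its group (check 4).
group_writeable() {
    # Character 6 of the "ls -l" mode string is the group write bit.
    [ "$(ls -ld "$1" | cut -c6)" = "w" ]
}

# Example usage:
#   tel_setting /ukirt_sw/instrument_configs/ufti.inst   # "simulate" means the file needs editing
#   cd $ORAC_DATA_IN
#   for f in .lastobs *.log; do
#       group_writeable "$f" || echo "$f is not group writeable"
#   done
```

Neither helper changes anything; editing the config file or fixing the permissions is still done by hand as described in the checklist.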

3. "Cures Many": Rundown / Crash Out and Re-start

If you are observing and something goes wrong with the OM programme selection, the sequencer console, or the DHS, the fastest fix for most problems is to just run down and start again - if you are in the middle of a long sequence, note where it crashed so that you can use "run from highlight" to restart from the appropriate place. The software runs down and up very quickly.

Some problems can cause the sequencer to be stuck in a "running state", but not actually doing anything, in which case a clean exit is not possible and you will have to crash out. To crash out of the ORAC-OM and the sequencer console, go back to the xterm you started it up from and :

Type "Control-C" (this is safe to do)

The ORAC windows will disappear and messages will appear on the xterm. Many of these messages are just normal rundown logging messages but if any of them look like additional error messages, scroll back and copy them before proceeding. It's possible that in certain "hung" conditions the error will be the first thing to appear on the xterm after typing Control-C, so if an unexpected fault happens it is worth keeping an eye out for this.

You are now ready to re-start the ORAC system. If the problem was actually in the instrument or telescope itself, then of course you should fix these before re-starting ORAC. When ORAC runs back up it will report finding and killing processes - i.e. part of the run up procedure is to look for and clean up any processes "left over" from the last time it was run.

In the unlikely event that Control-C does not get rid of the ORAC windows then log onto Kiki, look for and kill the relevant Java processes. Then try again. (Eventually we will provide a "nuke" script for this).
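Until a "nuke" script exists, the Java processes can be picked out of a process listing along the following lines. This is a sketch only: the function name is hypothetical, it assumes the usual "ps -ef" layout with the PID in the second column, and it lists every Java process on the machine, so check which ones belong to ORAC before killing anything.

```shell
# Print the PIDs of Java processes found in "ps -ef" style input.
# Assumes the PID is the second column; the helper's own awk process
# is excluded from the result.
java_pids() {
    awk '/java/ && !/awk/ { print $2 }'
}

# Example usage (review the list before killing anything):
#   ps -ef | grep java      # look at the full lines first
#   ps -ef | java_pids      # then get just the PIDs
#   kill <pid>
```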

The same general procedure also works for the OT - if you do something which causes it to hang, use Control-C from the startup xterm and then try running it up again.

If you do "hang the system", try to note down the things that you did just before the "hang", e.g. which buttons you'd clicked on the consoles, as well as capturing any error messages from the xterm. Since ORAC is a new system we may not yet have eliminated all the areas where a "user" error causes a problem, and noting what you think you did in some detail will be very helpful.

Remember that when the ORAC-OM is crashed out of or run down, the communication connection between it and the UFTI crate is lost. Currently the only way to re-establish it is to reboot the UFTI crate at the correct point in the startup procedure. When you run ORAC back up you will get the reminder to reboot the UFTI crate at the right point in the runup sequence again. Running ORAC down while leaving the CGS4 or IRCAM SMS software running is not a problem unless you do it a lot. A service night with a few IRCAM-CGS4 switches should not be a problem, but if you are trying to find an instrument problem and have run up and down a dozen times or so, then it would be a good idea to run ORAC down and start cleanly. Improving the connection management for UFTI is being worked on, but as it's part of how Drama works a fix may not be quick. In the meantime be aware that if you keep running ORAC up and down you will need to reboot UFTI frequently.

4. Database Connection Problems

If the ORAC-OM or the ORAC-OT cannot "talk" to the database for some reason it will report a problem with an error that includes the phrase "RMI Error" (this appears in a pop-up window), or "problem communicating with ODB". "RMI" (Remote Method Invocation) is the name of the technology used to talk to the database. Typical causes are that the database has "stopped working", or that you have entered the wrong password for your userid. Simply running the software down and up will not cure such problems, and you will need to identify the cause first. The two most common examples, with an explanation of how to read the errors and use them to identify the problem, are given here. The OM and the OT report the same errors when they have problems communicating with the database (because they share code for doing this), but they may look a little different because the OT pop-up boxes have a different style.

If the database has stopped working for some reason (e.g. perhaps the machine it runs on has crashed, or the network is not working, or the database server software has crashed), the message in the pop-up window will say :

Problem in communicating with ODB:the ODBServer may not be running
Problem finding the Science Program Server (might be down ?):java.rmi.ConnectException:Connection refused to host:[kiki.ukirt.jach.hawaii.edu:4201];
java.net.ConnectException:Connection refused.

The first line is the generic information that RMI is reporting a problem in communication with the ODB (Observation DataBase). All errors with communication to the database begin with the phrase: "problem communicating with the ODB:" It is the phrase:
"java.rmi.ConnectException:Connection refused to host:[xxx];"
which tells you that the problem is that the ORAC-OM or OT was unable to connect to the database. Note that the message does also tell you the name of the machine the database server should be running on (Kiki). The most likely reason for being unable to connect to the database is that for some reason the database is not running.

The Java version currently in use has a problem with a memory leak, which means that if you do a lot of retrieving and submitting of programs, the database server will eventually run out of memory. Procedures are in place to minimise the chances of this happening during a night of observing, but it is worth being aware of the potential. When this problem happens, an error message very similar to the above will appear. Instead of saying "ConnectException" it will include the line "java.lang.OutOfMemoryError" in the pop-up box. Before it actually fails like this you will notice that sending to and fetching from the database becomes slower and slower.

Check that the relevant computer and network are up and running. If the machine and network are OK (you can log into it) then the cure for both of the above problems is to try restarting the database server. To do so:

Log onto Kiki as irt_archiver
Issue the command : killODBSvr (all one word, case sensitive)
Issue the command : runODBSvr (all one word, case sensitive)
You should get a message confirming that the server is now running
You will have to login to the database again before you try sending/fetching your program again.

Another common cause of an error from the database occurs when the observer has given the wrong key for their program, or the wrong password for their userid.

The first line is again the generic information that RMI is reporting a problem in communication with the ODB (Observation DataBase) Server. It's the line "ODBServerPackage AccessException" that tells you the user could not access their program. As already noted, the most likely cause is that they gave the wrong password or key (all keys are set to userid-0, so this should not be a common problem), but it is also possible that file protections or inability to read a disk could cause an "access" exception.

A less common error message from the ODB occurs if you try to send it a different science program which has the same name as one you previously sent. If the file was created by editing the original, then even if you delete and replace every observation in the program it will still be acceptable to the ODB. If however you create a brand new science program, save it to disk with a name you have used before, and then send it to the ODB, it will object vigorously. This is because science programs are tagged when they've been in the database, and the database server is in effect keeping track of what it has already got. If you do this accidentally the error message you get in the pop-up window will say :

Problem communicating with ODB:java.rmi.ServerException
ODBServerPackage.ODBException

The last phrase "ODBException" is telling you the ODB is objecting to something ! (we are gradually improving the error messages with a helpful hint as to the cause, but starting with the most common)
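The exception names discussed in this section can be turned into a quick lookup. The sketch below simply matches the names quoted above against text pasted from the pop-up or the xterm; the function name and the wording of the hints are illustrative and are not part of ORAC.

```shell
# Map the exception name in an ODB error message to the likely cause
# described in this section. Pure text matching; nothing is contacted.
odb_error_hint() {
    # $1 = text copied from the pop-up box or the startup xterm
    case "$1" in
        *OutOfMemoryError*) echo "database server out of memory: restart the ODB server" ;;
        *ConnectException*) echo "cannot connect: check machine and network, then restart the ODB server" ;;
        *AccessException*)  echo "access refused: check userid, password and program key" ;;
        *ODBException*)     echo "ODB rejected the request: possibly a new program reusing an old name" ;;
        *)                  echo "unrecognised: copy the full message into a fault report" ;;
    esac
}

# Example usage:
#   odb_error_hint "Problem communicating with ODB:java.rmi.ServerException ODBServerPackage.ODBException"
```

OutOfMemoryError is tested first because that message is described as otherwise very similar to the connection failure.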

5. DHS Problems

Make sure that for each observation the numbers are up-dating on the Data taking and Filing Status display. If the DHS is no longer writing files to disk it will stop up-dating the "Last Saved" number.

It is a good idea to keep a regular eye on the status display showing the writing of data to disk, because currently the DHS can die "silently", i.e. without giving you an error message, and "for no obvious reason". This has happened twice in total in the past 9 months, so it is not a very common problem. Indeed it may have been fixed as a side effect of various updates and other fixes that were intended to improve robustness, but it is worth being aware of it, just in case.

The DHS can also crash "for a reason" in which case you always get at least one error message. Such reasons include taking data OK, but being unable to write it to disk; crashing the quicklook display just at the time the DHS is trying to display a frame on it, or being unable to obtain data or headers from either the instrument or telescope in some circumstances. DHS crashes generate a popup error message that looks like:

Command newObs completed with error 226394388

Note that depending on the severity/type of fault the error number may be different, and depending on exactly when/where the DHS problem was the command may be different (e.g. it could be "endObs" or "getHeaders"), but the general form of this message is typical.
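When writing a fault report it helps to record the command name and the error number separately. The sketch below pulls both out of a message of the general form shown above; the function name is hypothetical, and it assumes the message is exactly of that form (anything else produces no output).

```shell
# Extract "<command> <error number>" from a DHS pop-up message such as
#   Command newObs completed with error 226394388
parse_dhs_error() {
    # $1 = the message text
    printf '%s\n' "$1" |
        sed -n 's/^Command \([^ ]*\) completed with error \([0-9][0-9]*\)$/\1 \2/p'
}

# Example usage:
#   parse_dhs_error "Command newObs completed with error 226394388"
```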

It is possible to click on OK on the popup box and attempt to continue observing - you might be OK doing this, but it is not very wise. If the DHS tasks disappeared when the error occurred then they will be unable to report further errors although the sequence will keep pausing whenever you try to take data, because it can no longer talk to the DHS. Not all DHS crashes cause the tasks to disappear, so in these cases you may get repeated error pop-ups if you try to continue observing. If there has been a "partial DHS crash" and you've attempted to continue, or it can't write data to disk for some reason, you can appear to take up to about 20 observations before it will turn into a full completely fatal crash (when buffers fill).

If the DHS crashes, for whatever reason, then the status display always stops updating. If the DHS dies then you should exit ORAC (in the normal manner, there is usually no need to "crash out") and run ORAC up again after fixing any instrument/telescope/disk problems if they contributed to the error.

If the DHS crashes then scrolling back through the messages on the OM startup xterm to the last filed observation may identify what went wrong. Please search for an error message before running down.

It may also be useful to check the status of the DHS tasks to see if any have disappeared. There are two easy checks you can do:

(a) Open an xterm on Kiki and type :
ps -ef | grep dhs

If UFTI is running you should see tasks with the following names :
/dhs/dhserver Ufti
/dhs/dhdriver QL_UFTI UFTI
/dhs/dhsave DHSAVE_UFTI
/dhs/dhandler DHANDLER_UFTI DHSPOOL_UFTI QL_UFTI
/dhs/dhspool DHSPOOL_UFTI DHSAVE_UFTI

If IRCAM3 (or CGS4) is running you should see names like :
/dhs/dhspool DHSPOOL_IRCAM3 DHSAVE_IRCAM3
/dhs/dhsave DHSAVE_IRCAM3
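Comparing the ps listing against the expected names by eye is error-prone at 3am, so the comparison can be sketched as below. The function name is hypothetical; the task names passed to it are the ones listed above, and the check only looks for "/dhs/<name>" anywhere in the ps output.

```shell
# Report which expected DHS programs are absent from "ps -ef" output.
# stdin     = output of "ps -ef"
# arguments = program names expected under /dhs/
missing_dhs_tasks() {
    ps_text=$(cat)
    for name in "$@"; do
        # -F: match the path as a fixed string, -q: no output, status only
        printf '%s\n' "$ps_text" | grep -qF "/dhs/$name" || echo "$name"
    done
}

# Example usage for UFTI (no output means everything was found):
#   ps -ef | missing_dhs_tasks dhserver dhdriver dhsave dhandler dhspool
```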

(b) Alternatively, you can check the status of the Drama tasks. Log onto Kiki using the observer account (which starts up Drama on login). Then use the Drama "ditsgetinfo" command to get the status of a specific task. The names of the tasks are given in the ps output above. For example the command :
ditsgetinfo -full DHSPOOL_UFTI
will respond with
Task DHSPOOL_UFTI, type 0, description ""
if the task is there and responding. If it is not there, or not responding, you will get an error message, for example:
DITSGETINFO_8f0:exit status:%DITS-F-UNKNTASK, Task unknown to message system

Using the "-full" option on "ditsgetinfo" is useful because it ensures that you get a response if the task is there and healthy, as well as an error message if it is not.
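To check several tasks in one go, the reply from "ditsgetinfo -full" can be tested against the healthy form shown above. This is a sketch under assumptions: the function name is hypothetical, and it relies only on the two sample replies quoted in this section (a healthy reply starts with "Task <name>,"), not on the exit status of ditsgetinfo, which is not documented here.

```shell
# Succeed if the reply text looks like the healthy response for the task.
task_responding() {
    # $1 = task name, $2 = text printed by "ditsgetinfo -full <task>"
    case "$2" in
        "Task $1,"*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Example usage on Kiki (Drama must be set up, as for the observer account):
#   for t in DHSPOOL_UFTI DHSAVE_UFTI DHANDLER_UFTI; do
#       out=$(ditsgetinfo -full "$t" 2>&1)
#       task_responding "$t" "$out" || echo "$t: $out"
#   done
```

Only tasks that fail the check are printed, together with whatever error text was returned, which is the part worth copying into a fault report.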

The other DHS related task which you could check is DES (Drama-Epics system):
ditsgetinfo -full DES
However if DES has "crashed" then (a) you will also have stopped being able to talk to UFTI or the FP or IRPOL (it is not used for CGS4 or IRCAM), and (b) you will have had an error message reporting the fact.

6. UFTI Quick Look Problems

In general please be aware that the Quick Look which is available for use with UFTI is still under development and test by the Michelle project. It is probably the least tested software delivered with ORAC, and in particular the interaction between the Gaia display buttons and the ESO real time display code (which is being used to do the rapid real time display) has not been well tested. Most of the problems we have seen in the past have been in this general area - so if you are the first person to use a particular Gaia button you could well hit a bug and get strange behaviour or a nasty crash. The buttons were quite thoroughly tested on the night of Aug 3rd, so there are no obvious problems; just be aware that it's still all quite new.

It is best to use the ORAC-DR Gaia display for serious data manipulation since you have reduced images to work with there.

There are currently three known problems :

Do not use "Zoom in" on the Quick Look Gaia. There is a bug whereby this will stop Gaia from being able to display any further images that are sent to it by the DHS - i.e. the QL will no longer display your data as it is obtained. (Zoom out might be OK).

If the UFTI quicklook Gaia "disappears" (either due to a crash or because you kill it), then this can sometimes also cause problems with the writing of files to disk by the DHS. We don't understand why yet, but the DHS simply stops writing the data to disk (this is an intermittent fault, which makes it difficult to trace). If you crash or accidentally kill the Gaia display on the second head of Kiki, be careful to check that the DHS is still writing files to disk. You can do this as described above by checking that the "last saved" number is updating. Note too that having killed the "Quick Look Gaia" you cannot then "get another" by attempting to restart Gaia - you have to run the ORAC software down and back up. (The communication connection that allows the DHS to pass data to Gaia in real time dies when Gaia or the DHS dies).

If you run Quick Look on its own using the button on the sequencer console (e.g. to check an exposure time), and you then stop it you must be sure to check that the last exposure has finished before starting or continuing the execution of your sequence. (watch the count down timer and for the last image to display). The user-interface should not allow you to dismiss the Quick Look control panel and continue with other things until this is the case, but at the moment due to a bug you can do this on long exposures.

Finally note that there have been problems in the past with interaction/interference between the Quick Look and the Gaia that is used by ORAC-DR. We believe that these have all been fixed. Anything of this category would be apparent on startup - so ask for help then.

7. Sequencer Problems

If you send a new observation for execution and there is something wrong with the sequence itself it does not get loaded into the sequencer console. If you send a new observation but when you look at the OOS screen it still shows the old one and you get an error message which says
Error in the Drama tasks: ## Failed to load EXEC:exec_mainload failed
then there is probably something wrong with the sequence you just tried to load. Running into this problem is very unlikely during normal use, because the sequences are generated by the translator, and there are now no known translator problems. However if there are any remaining more esoteric translator bugs they could cause such a problem.

It is possible to work around a translator problem by editing the sequence by hand if you think you can work out what the offending line is. The sequences are kept at the summit on
ukirtdata/orac_data/sequences/
Use ls -lrt to find the most recently created file called oracnnnnnnnnnnnn.exec and look for any unusual commands or spaces in parameters where there are not normally spaces etc. These are the sorts of problem which could cause a sequence to fail to load.
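Finding the newest sequence file can be sketched as a one-line helper. The function name is hypothetical and the directory is passed in as an argument rather than assumed; it relies on the file names containing no spaces or newlines, which is the case for names of the form oracnnnnnnnnnnnn.exec.

```shell
# Print the most recently modified orac*.exec file in a directory.
latest_sequence() {
    # $1 = directory holding the sequence files
    # ls -t sorts newest first; head keeps only the first entry
    ls -t "$1"/orac*.exec 2>/dev/null | head -n 1
}

# Example usage:
#   f=$(latest_sequence <sequence directory>)
#   cat -A "$f"      # GNU cat: makes stray tabs and trailing spaces visible
```

Note that "cat -A" is a GNU option; on other systems "cat -vet" gives a similar view.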

Some odd behaviour of the highlight on the sequencer control console has been seen occasionally when sequences "run to completion" - the highlight did not always go back to the start of the sequence as it should have done. Sometimes it moved back a few rows and sometimes it disappeared. We believe that all such problems have now been cured, but are noting them here until further ORAC use has built confidence that they really are fixed. In the unlikely case that they recur, such problems can usually be worked around by clicking on the sequence display to recover the highlight and place it where you need it to be. You can then "run from highlight".

8. ORAC-OM Problems

Remember that database connection problems, described above will appear as an error message when you try to do something from the OM.

Be careful if you use the "on line help" item in the help menu at the top. The system is currently being run under JDK1.1.x and Java help files are designed for JDK1.2. This means that functionality like changing fonts can cause problems. Be aware that if you hang the help pages you can hang everything else as well, and the only way to recover is to crash out of everything. The help pages and web pages are the same html files, and the help system will be tidied soon.

There is a minor bug in the "change userid" option which is offered when Exit is selected on the ORAC-OM programme selection gui. If you attempt to log in again using the same user-id instead of a different one the system will hang. Switching userids from user1 -> user2 -> user1 is not a problem; it is only if you try to do user1 -> user1 that the system hangs.

9. ORAC-OT Problems

The ORAC-OT has a few intermittent but serious problems that we have been unable to fix. These are :

Sometimes "cut/copy and paste" just stops working, e.g. you can apparently copy an observation, but then cannot paste it. There are no error messages - the pasted observation simply does not appear. It seems to happen after you have been both using "cut/copy and paste" and deleting observations or items by highlighting them and using "delete" on the keyboard. However we have not yet been able to tie down a set of actions that makes this reproducible. There is also a suspicion that it might be partly a resource limitation on the machine you are using. Unfortunately the only thing to do is save what you have to disk, run everything down and try again.

There are also intermittent, non-reproducible problems with closing science programmes. If you use file-close on a science program window or file-exit on the OT without having first saved changes to a science program, then you should be prompted for whether or not you want to save the program before exiting. Occasionally this does not happen! A possibly related problem is that sometimes if you are prompted and you select "don't save", then the OT does not close the program window or let you exit. 99.99% of the time both of these work fine. Until we find the bug, the best solution is to try to remember to save anything you want to keep before closing or exiting. If you don't want to save then Control-C on the startup xterm will always crash you out.

There are also a number of irritating minor bugs in the OT - which do not affect functionality, and are slowly being tidied. These are noted where appropriate in the OT userguide.

10. Instrument Problems

If there are problems with the instrument (e.g. failure of a filter wheel to move) for which the instrument generates an error message, then that error message is picked up by ORAC-OM and reported to the observer in a pop-up box. This is also true of some telescope errors, such as a source being inaccessible when you try to slew to it. Since an error has been reported the sequencer will go into a paused state. To continue observing you have to acknowledge the error message by clicking on OK, and then use the "continue" button on the console. Of course you should sort out the instrument/telescope problem before continuing, and if that means running down their control software you should also run down ORAC (it's better if you do this first, then run the instrument down).

If the problem was with UFTI then the TSS will use the UFTI Epics control system to investigate, re-datum filter wheels or the shutter. If the problem was CGS4 or IRCAM then the TSS has the sms menu system and all its engineering functionality (such as "kill and reload Occam") available for troubleshooting. Many errors coming from the instrument or telescope are prefixed by the phrase "Error in the Drama tasks".

Note that the CGS4 and IRCAM software outputs long error messages one line at a time (so that they look tidy on an SMS screen, which made sense when they were written!), and so they now output errors one line at a time to ORAC. Unfortunately, because each line is sent separately, each line appears as a separate error to ORAC. This means that you often find that a CGS4 or IRCAM error will generate several ORAC error pop-up windows one after the other, and you have to click "OK" on each of them before you can continue observing. There is little that can be done about this - it is simply a reflection of the interface between ORAC and the old instrument software being achieved without making a lot of changes to the instrument software - it works, but isn't as elegant as for new instruments.

The other kind of instrument problem that might occur is one caused by any remaining bugs in the ORAC-OT or the translator. Since, so far as the instrument is concerned, ORAC is responsible for generating its config file, such problems are indicated by the instrument software complaining that it has been sent an illegal config, or by the instrument not appearing to be set to the config you expected. If this happens there are four possible causes :

11. Useful "engineering" functions

Although ORAC is primarily a high level user interface, some engineering functionality has been included in the OOS console to assist in troubleshooting and workarounds, and to enable engineering sequences to be used for repetitive instrument tests.

One of these is that it is possible to directly load a sequence and then run it. This sequence could be one that you have written for testing purposes containing commands not normally used by astronomers (such as datum, or wait n, etc) - so long as the commands are in the sequencer dictionary they can be executed. The sequence could also be one that you have written or modified for observing, to work around a known problem. To directly load a sequence go to the "commands" menu at the top and select "load" - it will fire up a browser in the directory where sequences are written to allow you to find your file: highlight the file you want and choose "open".

Another useful feature is that it is possible to look at an expanded display of the sequence and see details of what is going on internally to the sequence commands. If something is failing this can help to narrow down exactly what. For example if you can't take data, instead of the console beeping and going into a paused state on the "observe" commands, it will do so on one of the steps involved in "observe" - such as "newObs" or "uftiObserve". This will help you to know more precisely where the software is failing. To see the expanded display go to the configure item at the top of the OOS console and click on "hide eng exec" to turn that option off.

 

Authors: Gillian Wright, Alan Bridger and Frossie Economou

Original : 1999/10/24, Last Modification Date 2000/07/26 - Last Modification Author:Gillian Wright