
Troubleshooting Techniques and Common Issues

17.3 Troubleshooting Techniques and Common Issues

Although it is not a substitute for the official Condor® documentation (available at http://www.cs.wisc.edu/condor/), this section covers methods to assess the health of your Condor® pool and the status of the jobs that SVS submits to it.

Command-Line Utilities

The following Condor® utility programs can be run from the C:/condor/bin directory through a command prompt.

  • condor_status: Use this command to list the machines that are currently in your Condor® pool. This command will also display the state of each machine, which is usually one of the following values:
    • Unclaimed: Available to run jobs.
    • Owner: Currently in use by its user. The machine is configured to run jobs only when no one is using the computer.
    • Claimed: In the process of running jobs.
  • condor_q: Use this command to list the jobs that have been submitted from your machine and the state of each job. When you use SVS to run jobs on the Condor® pool, this command shows the jobs that are currently running and the jobs that are waiting for resources to become available.
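
If you check the pool often, the machine states listed above can be tallied with a short script. The following Python sketch is illustrative only and is not part of SVS or Condor®. It assumes that your Condor® version supports the -format option of condor_status and that Condor® is installed in C:\condor; the function names count_states and pool_states are hypothetical.

```python
import subprocess
from collections import Counter

# Install location used in this section; adjust if yours differs.
CONDOR_BIN = r"C:\condor\bin"


def count_states(lines):
    """Count machines per state from lines of the form '<name> <state>'."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:          # skip blank or incomplete lines
            counts[parts[-1]] += 1   # the state is the last field
    return counts


def pool_states():
    """Run condor_status and tally machine states.

    Assumption: condor_status accepts '-format <fmt> <attribute>' pairs
    and exposes the Name and State attributes.
    """
    result = subprocess.run(
        [CONDOR_BIN + r"\condor_status.exe",
         "-format", "%s ", "Name",
         "-format", "%s\n", "State"],
        capture_output=True, text=True, check=True,
    )
    return count_states(result.stdout.splitlines())
```

Calling pool_states() on a machine in the pool returns a mapping from state names such as Unclaimed or Claimed to machine counts. A pool that reports only Owner machines while jobs sit idle is a hint to review the job-start policy on those machines.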

Condor® Issues on Windows

Because Condor® is a complex system designed for multiple platforms and network environments, it may seem like a daunting task to discover the source of problems when things go wrong. In reality, the default settings along with the recommendations made above should provide you with working Condor® configurations. The problems that arise, therefore, are usually caused by external factors that block Condor® from fully functioning.

  • Windows Firewall: On Windows XP SP2, the Windows firewall seems to block submitted Condor® jobs from running properly. This symptom may not appear until the machine is rebooted after Condor® is installed. The simplest solution is to disable the Windows firewall. Alternatively, see the Firewalls section of Condor®’s “Administrators Manual” to learn how to configure a firewall to work with Condor®.
  • Failure to Start Condor® Services: The Condor® Windows service may fail to start for many reasons. The log files in C:/condor/log are often the best place to begin troubleshooting these errors. One known cause is a third-party anti-virus or firewall program that interferes with the Windows Winsock layer. In that case Condor® fails at startup and writes a "bind failed: WSAError = 10038" error to the MasterLog file.
  • Underused Resources: The output of condor_status may show that not all of the machines on your network are being used to process idle jobs in the queue. Condor® can use various metrics to determine whether a machine is ready to receive jobs. If you chose the "Always run jobs and never suspend them" option, Condor® should not apply any of these metrics and should simply run all jobs. If a machine consistently fails to run jobs, check its logs for errors such as permission restrictions.
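
The log checks mentioned in the list above can be scripted when several machines need to be inspected. The following Python sketch is illustrative only. It assumes the default log location C:/condor/log and searches the MasterLog file for the Winsock error text quoted above; the function names find_bind_failures and scan_master_log are hypothetical.

```python
from pathlib import Path

# Default log directory named in this section; adjust if yours differs.
LOG_DIR = Path("C:/condor/log")

# Error text that Condor writes when the Winsock layer has been altered.
WINSOCK_MARKER = "WSAError = 10038"


def find_bind_failures(lines):
    """Return the log lines that contain the Winsock bind error."""
    return [line.rstrip("\n") for line in lines if WINSOCK_MARKER in line]


def scan_master_log(log_path=LOG_DIR / "MasterLog"):
    """Read the MasterLog file and return any bind-failure lines."""
    # errors="replace" keeps the scan going if the log has odd bytes.
    with open(log_path, errors="replace") as handle:
        return find_bind_failures(handle)
```

If scan_master_log() returns any lines, a third-party anti-virus or firewall program is a likely suspect. An empty result does not mean the machine is healthy; it only rules out this one error, so the remaining log files are still worth reading.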