HeavyJobs


What (summary)

Manage long-running jobs on available compute resources (servers), using database tables to keep track of work and inter-process communication to keep track of workers.
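
To make the table-driven bookkeeping concrete, here is a minimal sketch of what a work-tracking table could look like. This is illustration only: sqlite stands in for the real database, and every name here (heavy_job_chunks and its columns) is an assumption, not the actual compostus schema.

import sqlite3

# Each chunk of work is one row; workers claim rows, record their ids,
# and bump progress as they go.
db = sqlite3.connect("heavy_jobs.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS heavy_job_chunks (
        id         INTEGER PRIMARY KEY,
        job        TEXT NOT NULL,                    -- 'fetcher', 'parser', 'aggregator'
        state      TEXT NOT NULL DEFAULT 'pending',  -- pending/running/incomplete/done
        worker_ids TEXT DEFAULT '',                  -- every worker that touched this chunk
        progress   INTEGER DEFAULT 0                 -- records processed so far
    )
""")
db.commit()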

Why this is important

We will use this infrastructure to manage our algorithmic data collection. This is a strategic direction for the company.

DoneDone

We will be satisfied with this infrastructure when:

  • we can launch, balance, and diagnose all steps of our pilot whois refresh path.
    • fetchers
    • parsers
    • aggregators
  • we have startup scripts that will resume proper job processing after a machine reboot or other operational events.
  • we can monitor overall health and productivity of all heavy job processing through a web interface.

Bugs and Todos

(new items)

  • Improve the job deployment process (see below)
  • Integrate stop and terminate: stop leaves looping jobs looping
  • Heavy_job_monitor racks up a lot of CPU. Why? Sleeping jobs aren't much better.

(prioritized high, medium, and low for the week with Ethan.)

  • A worker should mark a chunk with its id (an array of ids when restarted); see the sketch after this list
    • this lets us draw a line per worker on the throughput graph
  • Workers should do partially completed chunks before starting new chunks.
    • for now we will add UI that can reset an incomplete chunk to zero.
  • A worker should sleep when the manager has no more work to hand out
  • Integrate the two controllers
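
Here is a minimal sketch of the worker loop these items imply, reusing the assumed chunks table from the summary section. The claim step is deliberately naive; a real one would have to be atomic across competing workers.

import time

def claim_chunk(db, worker_id):
    """Prefer partially completed chunks, then fresh ones; return a chunk
    id, or None when the manager has nothing to hand out."""
    for state in ("incomplete", "pending"):
        row = db.execute(
            "SELECT id FROM heavy_job_chunks WHERE state = ? LIMIT 1",
            (state,)).fetchone()
        if row:
            # Append this worker's id; a restarted chunk accumulates an
            # array of ids, one per worker that has touched it.
            db.execute(
                "UPDATE heavy_job_chunks SET state = 'running', "
                "worker_ids = worker_ids || ? || ' ' WHERE id = ?",
                (str(worker_id), row[0]))
            db.commit()
            return row[0]
    return None

def worker_loop(db, worker_id, process_chunk):
    while True:
        chunk = claim_chunk(db, worker_id)
        if chunk is None:
            time.sleep(30)       # sleep when there is no work to hand out
            continue
        process_chunk(chunk)     # the actual fetching/parsing/aggregating
        db.execute("UPDATE heavy_job_chunks SET state = 'done' WHERE id = ?",
                   (chunk,))
        db.commit()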


  • show chunk id in heavy_jobs/show
  • show ps of workers in heavy_worker/status
  • kill or restart hung workers (see the sketch after this list)
  • move fetchers into the framework, have it create parsing chunks
  • Tally throughput, good records, etc.
  • keep a log of automatic actions
  • Should HeavyJob be the source for actions? We need better requirements here.
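
For the ps and hung-worker items above, a rough sketch of what the status and kill logic might look like. The process name and launch command are assumptions about how workers appear on the machine.

import os
import signal
import subprocess

WORKER_MARK = "heavy_worker"   # assumed: how worker processes appear in ps

def worker_ps():
    """Lines of ps output for worker processes, e.g. for a status page."""
    out = subprocess.check_output(["ps", "-eo", "pid,etime,args"]).decode()
    return [line for line in out.splitlines()[1:] if WORKER_MARK in line]

def kill_worker(pid, restart=False):
    """Terminate a hung worker and optionally relaunch one in its place."""
    os.kill(pid, signal.SIGTERM)
    if restart:
        subprocess.Popen(["./script/heavy_worker"])   # assumed launch command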


  • Finer-grained progress
  • Zabbix script to count busy and idle workers; a sketch follows below. (Or count something else interesting. Ethan is not too interested in this; mostly he doesn't want "noise" alerts that distract him from real emergencies.)
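
A sketch of what such a Zabbix script might look like. A Zabbix UserParameter script just prints one number, which the agent picks up via a UserParameter line in zabbix_agentd.conf. The busy/idle approximation here (one running chunk per busy worker) and all names are assumptions carried over from the sketches above.

#!/usr/bin/env python
import sqlite3
import subprocess
import sys

db = sqlite3.connect("heavy_jobs.db")
busy = db.execute("SELECT COUNT(*) FROM heavy_job_chunks "
                  "WHERE state = 'running'").fetchone()[0]
if sys.argv[1:] == ["idle"]:
    # idle = worker processes alive in ps, minus those busy on a chunk
    ps = subprocess.check_output(["ps", "-eo", "args"]).decode()
    total = sum(1 for line in ps.splitlines() if "heavy_worker" in line)
    print(max(total - busy, 0))
else:
    print(busy)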

Deploying Heavy Jobs

We're currently running these jobs on mist, which is not one of our deployment targets. Mist is chosen by an entry in a config file. A git clone of compostus is brought over. Two processes, a heavy job manager and a heavy job monitor, are launched in a screen session. It is then possible to start new workers through the web interface.

If one wants to add or modify a job algorithm, or modify the monitor, one must log into mist, find the screen session, and update it as follows (see the automation sketch below).

  1. kill monitor (kills workers)
  2. pull code
  3. restart monitor
  4. restart interrupted chunks
  5. start new workers

Use a variation of this to update the manager, a simpler task because it has no children. When jobs are distributed across multiple machines, there will be a monitor per machine but only one manager.
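
A minimal sketch of automating steps 1 through 3 of the monitor update; the repository path, screen session name, and launch command are all assumptions. Steps 4 and 5 go through the web interface and stay manual.

import subprocess

REPO = "/home/aboutus/compostus"    # assumed clone location on mist
SESSION = "heavy_jobs"              # assumed screen session name

def sh(cmd):
    print("+ " + cmd)
    subprocess.check_call(cmd, shell=True, cwd=REPO)

# 1. kill monitor (kills workers); the bracket trick stops pkill from
#    matching this command's own shell
sh("pkill -f '[h]eavy_job_monitor' || true")
# 2. pull code
sh("git pull")
# 3. restart the monitor in a new window of the existing screen session
sh("screen -S " + SESSION + " -X screen ./script/heavy_job_monitor")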

Pilot Workflow

[Image: HeavyJobsWorkflow.png, a diagram of the pilot whois refresh workflow]


