Difference between revisions of "HeavyJobs"

 
== What (summary) ==
 
Manage long-running jobs on available compute resources (servers) using db tables to keep track of work, and inter-process communication to keep track of workers.

* http://www.aboutus.org/au_web_services/heavy_jobs
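The two halves of the summary — a db table that tracks work and process-level tracking of workers — can be sketched as below. This is a minimal illustration, not the actual HeavyJobs code: the `JobTable` class and its column names are hypothetical, and an in-memory array stands in for the real database table.

```ruby
# Sketch of db-table job tracking: each job row carries a status, and a
# worker claims the oldest pending job (recording its pid) before running it.
# An in-memory array stands in for the real db table.

Job = Struct.new(:id, :status, :pid)

class JobTable
  def initialize
    @rows = []
  end

  def insert(id)
    @rows << Job.new(id, :pending, nil)
  end

  # In a real db this would be an atomic UPDATE ... WHERE status = 'pending';
  # the worker's pid is recorded so the monitor can track the process later.
  def claim(pid)
    job = @rows.find { |j| j.status == :pending }
    return nil unless job
    job.status = :running
    job.pid = pid
    job
  end

  def finish(id)
    job = @rows.find { |j| j.id == id }
    job.status = :done if job
  end

  def counts
    @rows.group_by(&:status).transform_values(&:size)
  end
end

table = JobTable.new
table.insert(1)
table.insert(2)

job = table.claim(Process.pid)   # worker claims job 1
table.finish(job.id)             # ...and reports completion
```

Because all state lives in the table, any process on any server can claim work, which is what lets jobs spread across machines later.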
  
 
== Why this is important ==
 
  
 
We will use this infrastructure to manage our algorithmic data collection. This is a strategic direction for the company.

== DoneDone ==

We will be satisfied with this infrastructure when:

* we can launch, balance, and diagnose all steps of our pilot whois refresh path.
** fetchers
** parsers
** aggregators
* we have startup scripts that will resume proper job processing after a machine reboot or other operational events.
* we can monitor overall health and productivity of all heavy job processing through a web interface.

== Bugs and Todos ==

(new items)
* Detect when worker goes dark > 2 min. Record last status in chunk; terminate and restart.
* from feed_aggregator: :error=>"private method `log_error' called for #<HeavyWorker:0xb7e9c318>"
* heavy_jobs/show can't find pid when buried in :last
** (related) first attempt to look both places (in some other method) is coded with nil sensitivity
* Improve the job deployment process (see below)
* Add startup scripts that launch the monitor (and manager) on server reboot
* <s>Integrate stop and terminate: stop leaves looping jobs looping</s>
* <s>Heavy_job_monitor racks up lots of cpu. Why? (trying longer sleeps)</s>
** <s>Sleeping jobs aren't so good either.</s>

(prioritized high, medium and low for week with Ethan.)
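The dark-worker item above amounts to a heartbeat check. A minimal sketch, assuming each worker periodically writes a heartbeat timestamp to its current chunk row; the `Worker` struct, field names, and threshold constant are illustrative, not the actual monitor code:

```ruby
# Sketch of dark-worker detection: the monitor flags any worker whose last
# heartbeat is older than the threshold, records its last known status in
# the chunk, and would then terminate and restart it.

DARK_THRESHOLD = 120 # seconds, i.e. "> 2 min"

Worker = Struct.new(:pid, :last_heartbeat, :last_status)

def dark_workers(workers, now: Time.now)
  workers.select { |w| now - w.last_heartbeat > DARK_THRESHOLD }
end

def restart_dark_workers(workers, now: Time.now)
  dark_workers(workers, now: now).map do |w|
    w.last_status = :terminated   # record last known state in the chunk row
    # In the real monitor this would be Process.kill("TERM", w.pid)
    # followed by relaunching the job; here we only report the decision.
    w.pid
  end
end
```

Keeping the check pure (workers in, pids out) makes the 2-minute policy itself easy to unit test without spawning processes.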
  
 
Use a variation of this to update the manager, a simpler task because it has no children. When jobs are distributed across multiple machines, there will be a monitor per machine but only one manager.

The working part of any heavy job should be unit tested before deployment. It is then wrapped up as a job that can be launched within the Heavy Jobs framework. We don't yet have a functional testing strategy for this part, so be careful. Once a job is deployed and started, check the production db to make sure it is working as intended.
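One way to keep the working part unit-testable is to write it as a plain method with explicit inputs and outputs, and let the job wrapper only handle chunking and status. The example below is a hypothetical whois parser in that style, not code from the actual pilot:

```ruby
# Pure "working part" of a hypothetical whois-refresh job: parse one whois
# record into a hash of fields. Testable without launching the framework.
def parse_whois(text)
  text.lines.each_with_object({}) do |line, fields|
    key, value = line.split(":", 2)
    next unless value
    fields[key.strip.downcase] = value.strip
  end
end

# The heavy-job wrapper would then just loop over its chunk, e.g.:
#   chunk.records.each { |r| save(parse_whois(r)) }
```

Only the one-line loop remains untested by unit tests, which narrows what the production-db spot check has to verify.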
  
 
== Pilot Workflow ==
 
 
[[Image:HeavyJobsWorkflow.png|500px]]
 
[[Image:HeavyJobsWorkflow.png|500px]]
  
</noinclude>

[[Category:DevelopmentTeam]]
[[Category:OpenTask]]
 

Latest revision as of 18:34, 4 May 2008


Retrieved from "http://aboutus.com/index.php?title=HeavyJobs&oldid=15425162"