== What (summary) ==
Manage long-running jobs on available compute resources (servers) using db tables to keep track of work, and inter-process communication to keep track of workers.

* http://www.aboutus.org/au_web_services/heavy_jobs
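
Below is a minimal sketch of these two tracking mechanisms. The names (WorkChunk, whois_fetcher, worker_alive?) are illustrative assumptions, not the framework's actual schema or API.

<source lang="ruby">
# Illustrative only: a chunk of work would normally be a row in a db table,
# here reduced to a Struct so the sketch runs standalone.
WorkChunk = Struct.new(:id, :job_name, :status, :worker_pid, :updated_at)

chunk = WorkChunk.new(1, 'whois_fetcher', 'running', Process.pid, Time.now)

# Inter-process bookkeeping: signal 0 delivers nothing, but raises ESRCH if the
# recorded worker pid no longer exists, so a monitor can tell live workers from
# dead ones without any shared state beyond the pid.
def worker_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false
rescue Errno::EPERM
  true  # the process exists but belongs to another user
end

puts "chunk #{chunk.id} (#{chunk.job_name}): worker alive? #{worker_alive?(chunk.worker_pid)}"
</source>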

== Why this is important ==
We will use this infrastructure to manage our algorithmic data collection. This is a strategic direction for the company.

== DoneDone ==
We will be satisfied with this infrastructure when:
* we can launch, balance, and diagnose all steps of our pilot whois refresh path:
** fetchers
** parsers
** aggregators
* we have startup scripts that will resume proper job processing after a machine reboot or other operational events (see the sketch after this list).
* we can monitor overall health and productivity of all heavy job processing through a web interface.
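
A hedged sketch of the startup-script goal above: the pidfile locations, launch commands, and --with-manager flag are assumptions for illustration, not the real deployment layout.

<source lang="ruby">
#!/usr/bin/env ruby
# Boot-time relauncher sketch: start the per-machine monitor, and the single
# manager only on the machine that is told to run it. Paths are hypothetical.
PIDFILES = {
  'heavy_job_monitor' => '/var/run/heavy_job_monitor.pid',
  'heavy_job_manager' => '/var/run/heavy_job_manager.pid'
}

# True if the pidfile names a process that is still running.
def already_running?(pidfile)
  return false unless File.exist?(pidfile)
  Process.kill(0, File.read(pidfile).to_i)
  true
rescue Errno::ESRCH
  false
end

# Launch a daemon, detach from it, and record its pid.
def launch(name)
  pid = spawn("/path/to/app/script/#{name}")   # hypothetical launch command
  Process.detach(pid)
  File.open(PIDFILES[name], 'w') { |f| f.write(pid) }
end

launch('heavy_job_monitor') unless already_running?(PIDFILES['heavy_job_monitor'])

# Only one machine in the cluster should be started with --with-manager.
if ARGV.include?('--with-manager') && !already_running?(PIDFILES['heavy_job_manager'])
  launch('heavy_job_manager')
end
</source>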

== Bugs and Todos ==
(new items)
* Detect when a worker goes dark for more than 2 minutes: record its last status in the chunk, then terminate and restart the worker (see the sketch after this list).
* from feed_aggregator: :error=>"private method `log_error' called for #<HeavyWorker:0xb7e9c318>"
* heavy_jobs/show can't find the pid when it is buried in :last
** (related) the first attempt to look in both places (in some other method) is coded with nil sensitivity
* Improve the job deployment process (see below)
* Add startup scripts that launch the monitor (and manager) on server reboot
* <s>Integrate stop and terminate: stop leaves looping jobs looping</s>
* <s>Heavy_job_monitor racks up lots of cpu. Why? (trying longer sleeps)</s>
** <s>Sleeping jobs aren't so good either.</s>
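
For the "worker goes dark" item above, here is a hedged sketch of what the monitor's check could look like. The chunk fields (last_heartbeat_at, worker_pid) and the relaunch command are assumptions, not the monitor's actual code.

<source lang="ruby">
# Sketch only: chunks are reduced to Structs; in the framework they are db rows.
Chunk = Struct.new(:job_name, :status, :worker_pid, :last_heartbeat_at)

DARK_AFTER = 2 * 60   # seconds without a heartbeat before a worker counts as dark

# Hypothetical relaunch: start a fresh worker for the chunk and record its pid.
def restart_worker(chunk)
  pid = spawn("/path/to/app/script/heavy_worker #{chunk.job_name}")
  Process.detach(pid)
  chunk.worker_pid = pid
  chunk.status = 'restarted'
end

def reap_dark_workers(chunks)
  chunks.each do |chunk|
    next unless chunk.status == 'running'
    next if Time.now - chunk.last_heartbeat_at < DARK_AFTER

    # Record the last known state in the chunk before touching the process.
    chunk.status = "went dark at #{chunk.last_heartbeat_at}"

    begin
      Process.kill('TERM', chunk.worker_pid)   # terminate the dark worker...
    rescue Errno::ESRCH
      # ...unless it has already disappeared
    end

    restart_worker(chunk)
  end
end
</source>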

(prioritized high, medium and low for the week with Ethan.)

Use a variation of this to update the manager, a simpler task because it has no children. When jobs are distributed across multiple machines, there will be a monitor per machine but only one manager.
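
One generic pattern for updating a single long-running process such as the manager in place (shown only as a sketch of a common technique, not necessarily the procedure referred to above): finish the current pass when signalled, then re-exec so the newly deployed code is loaded.

<source lang="ruby">
# Generic restart-on-signal pattern, illustration only.
$reload_requested = false
trap('HUP') { $reload_requested = true }   # a deploy script would send: kill -HUP <manager pid>

# Stand-in for the manager's real unit of work.
def do_one_pass
  sleep 5
end

loop do
  do_one_pass
  if $reload_requested
    # Replace this process with the freshly deployed code; the pid is preserved.
    exec($0, *ARGV)
  end
end
</source>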

The working part of any heavy job should be unit tested before deployment; it is then wrapped up as a job that can be launched within the Heavy Jobs framework. We don't yet have a functional testing strategy for this part, so be careful: once a job is deployed and started, check the production db to make sure it is working as intended.
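
As an example of testing the working part on its own, here is a sketch using Test::Unit; WhoisParser and its interface are invented for illustration and are not classes from this codebase.

<source lang="ruby">
require 'test/unit'

# The 'working part': plain Ruby with no knowledge of workers, chunks, or pids.
class WhoisParser
  def registrar(raw)
    raw[/^Registrar:\s*(.+)$/, 1]
  end
end

class WhoisParserTest < Test::Unit::TestCase
  def test_extracts_registrar_line
    raw = "Domain Name: EXAMPLE.COM\nRegistrar: Example Registrar, Inc.\n"
    assert_equal 'Example Registrar, Inc.', WhoisParser.new.registrar(raw)
  end

  def test_returns_nil_when_registrar_is_missing
    assert_nil WhoisParser.new.registrar("Domain Name: EXAMPLE.COM\n")
  end
end
</source>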

== Pilot Workflow ==

[[Image:HeavyJobsWorkflow.png|500px]]

</noinclude>
[[Category:DevelopmentTeam]]
[[Category:OpenTask]]