HeavyJobs

== What (summary) ==
Manage long-running jobs on available compute resources (servers) using db tables to keep track of work, and inter-process communication to keep track of workers.

* http://www.aboutus.org/au_web_services/heavy_jobs
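The page doesn't show the actual schema, so here is a minimal sketch of the "db tables to keep track of work" idea. All names (<code>Chunk</code>, <code>WorkQueue</code>) are hypothetical, not taken from the compostus code, and an in-memory array stands in for the db table:

```ruby
# Hypothetical sketch only: the real system keeps chunk rows in a db
# table; an in-memory array stands in for it here, and all names are
# made up rather than taken from the compostus code.
Chunk = Struct.new(:id, :worker_ids, :done, :size) do
  def incomplete?
    done > 0 && done < size
  end

  def unstarted?
    done.zero?
  end
end

class WorkQueue
  def initialize(chunks)
    @chunks = chunks
  end

  # A worker prefers partially completed chunks over new ones, and
  # marks the chunk with its id (kept as an array, so a restarted
  # worker appends a second id rather than overwriting the first).
  def claim(worker_id)
    chunk = @chunks.find(&:incomplete?) || @chunks.find(&:unstarted?)
    chunk.worker_ids << worker_id if chunk
    chunk
  end
end
```

Keeping the ids as an array is what lets the web interface draw a line per worker on a throughput graph, as described under Bugs and Todos below.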
== Why this is important ==
We will use this infrastructure to manage our algorithmic data collection. This is a strategic direction for the company.

== DoneDone ==
We will be satisfied with this infrastructure when:
* we can launch, balance, and diagnose all steps of our pilot whois refresh path:
** fetchers
** parsers
** aggregators
* we have startup scripts that will resume proper job processing after a machine reboot or other operational events.
* we can monitor overall health and productivity of all heavy job processing through a web interface.
== Bugs and Todos ==
(new items)
* Detect when a worker goes dark for more than 2 minutes. Record its last status in the chunk; terminate and restart it.
* from feed_aggregator: :error=>"private method `log_error' called for #<HeavyWorker:0xb7e9c318>"
* heavy_jobs/show can't find the pid when it is buried in :last
** (related) the first attempt to look in both places (in some other method) is coded with nil sensitivity
* Improve the job deployment process (see below)
* Add startup scripts that launch the monitor (and manager) on server reboot
* <s>Integrate stop and terminate: stop leaves looping jobs looping</s>
* <s>Heavy_job_monitor racks up lots of cpu. Why? (trying longer sleeps)</s>
** <s>Sleeping jobs aren't so good either.</s>

(prioritized high, medium and low for the week with Ethan)

* <s>A worker should mark a chunk with its id</s> (array of ids when restarted)
** <s>this lets us draw a line per worker on the throughput graph</s>
* <s>Workers should do partially completed chunks before starting new chunks.</s>
** <s>for now we will add ui that can reset an incomplete chunk to zero.</s>
* <s>A worker should sleep when a manager has no more work to do</s>
* <s>Integrate the two controllers</s>
* <s>show chunk id in heavy_jobs/show</s>
* <s>show ps of workers in heavy_worker/status</s>
* <s>kill or restart hung workers</s>
* move fetchers into framework, have it create parsing chunks
* Tally throughput, good records, etc
* Finer-grained progress
* Zabbix script to count busy and idle workers. (Or count something else interesting. Ethan is not too interested in this; mostly he doesn't want "noise" alerts that distract him from real emergencies.)
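The dark-worker item at the top of the list could work roughly like this. This is a hypothetical sketch, not the actual monitor code: the real monitor would persist the last status into the chunk row and use process signals, so the restart action is injected here for clarity:

```ruby
# Hypothetical sketch of dark-worker detection for the monitor.
# Each worker heartbeats its current status; a periodic sweep
# terminates and restarts any worker silent for over two minutes,
# recording its last reported status first.
class DarkWorkerCheck
  DARK_AFTER = 120 # seconds without a heartbeat

  def initialize(restart:, clock: -> { Time.now })
    @restart = restart     # called with (worker_id, last_status)
    @clock = clock         # injectable for testing
    @heartbeats = {}
  end

  def heartbeat(worker_id, status)
    @heartbeats[worker_id] = { at: @clock.call, status: status }
  end

  # Returns the ids of workers that were terminated and restarted.
  def sweep
    now = @clock.call
    dark = @heartbeats.select { |_, h| now - h[:at] > DARK_AFTER }
    dark.each do |id, h|
      @restart.call(id, h[:status]) # record last status, then restart
      @heartbeats.delete(id)
    end
    dark.keys
  end
end
```

Injecting the clock and the restart action keeps the timeout logic testable without real processes or real two-minute waits.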
+ | |||
+ | == Deploying Heavy Jobs == | ||
+ | We're currently running these jobs on mist which is not one of our deployment targets. Mist is chosen by entry in a config file. A git clone of compostus is brought over. Two processes, a heavy job manager and heavy job monitor are launched in a screen session. It is then possible to start new workers through the web interface. | ||
+ | |||
+ | If one wants to add or modify a job algorithms, or modify the monitor, one must log into mist, find the screen session, and then update it as follows. | ||
+ | # kill monitor (kills workers) | ||
+ | # pull code | ||
+ | # restart monitor | ||
+ | # restart interrupted chunks | ||
+ | # start new workers | ||
+ | |||
+ | Use a variation of this to update the manager, a simpler task because it has no children. When jobs are distributed across multiple machines, there will be a monitor per machine but only one manager. | ||
+ | |||
+ | The working part of any heavy job should be unit tested before deployment. This is then wrapped up as a job that can be launched within the Heavy Jobs framework. We don't yet have a functional testing strategy for this part so be careful. Once deployed and started, one should look at production db to make sure that a job is working as intended. | ||
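As an illustration of keeping the working part unit-testable, the algorithm can live in a plain class with no Heavy Jobs dependencies. Everything here is made up for the example — <code>ExpiryExtractor</code> and the whois field format are assumptions, not code from the repository:

```ruby
require "minitest/autorun"

# Hypothetical example: the "working part" of a heavy job is a plain
# object, so it can be unit tested before being wrapped as a job
# that the framework launches against chunks.
class ExpiryExtractor
  # Pull the expiration date out of a raw whois record, or nil.
  def call(record)
    record[/Expiration Date:\s*(\S+)/, 1]
  end
end

class ExpiryExtractorTest < Minitest::Test
  def test_finds_expiration_date
    assert_equal "2009-05-04",
      ExpiryExtractor.new.call("Expiration Date: 2009-05-04\n")
  end

  def test_returns_nil_when_absent
    assert_nil ExpiryExtractor.new.call("no dates here")
  end
end
```

The job wrapper would then only feed chunk rows through <code>call</code>, which is the part we can't yet functionally test.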
== Pilot Workflow ==

[[Image:HeavyJobsWorkflow.png|500px]]

[[Category:DevelopmentTeam]]
[[Category:OpenTask]]
− | |||
− |
Latest revision as of 18:34, 4 May 2008