HeavyJobs
Revision as of 01:39, 23 April 2008
What (summary)
Manage long-running jobs on available compute resources (servers) using db tables to keep track of work, and inter-process communication to keep track of workers.
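As a rough illustration of tracking work in a db table, here is a minimal sketch in Python using an in-memory SQLite database. The table name `chunks`, its columns, and the helper functions are all hypothetical; the page does not describe the real schema. The claim step is not safe against two workers racing for the same row, which a real deployment would need to handle.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical chunks table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE chunks (
               id INTEGER PRIMARY KEY,
               job_name TEXT NOT NULL,
               total INTEGER NOT NULL,          -- records in the chunk
               done INTEGER NOT NULL DEFAULT 0, -- records processed so far
               worker_id TEXT                   -- NULL until a worker claims it
           )"""
    )
    return db


def add_chunk(db, job_name, total):
    """Register a unit of work and return its chunk id."""
    cur = db.execute(
        "INSERT INTO chunks (job_name, total) VALUES (?, ?)", (job_name, total)
    )
    db.commit()
    return cur.lastrowid


def claim_chunk(db, worker_id):
    """Give the lowest-id unclaimed, unfinished chunk to worker_id.

    Returns the chunk id, or None when no work is left.
    Simplification: select-then-update is not atomic across processes.
    """
    row = db.execute(
        "SELECT id FROM chunks "
        "WHERE worker_id IS NULL AND done < total ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE chunks SET worker_id = ? WHERE id = ?", (worker_id, row[0]))
    db.commit()
    return row[0]


def record_progress(db, chunk_id, n):
    """Add n processed records to a chunk, capped at its total."""
    db.execute(
        "UPDATE chunks SET done = MIN(total, done + ?) WHERE id = ?", (n, chunk_id)
    )
    db.commit()
```

A worker would call `claim_chunk` to obtain work and `record_progress` as it goes; a status page could then read progress straight from the table.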
Why this is important
We will use this infrastructure to manage our algorithmic data collection. This is a strategic direction for the company.
DoneDone
We will be satisfied with this infrastructure when:
- we can launch, balance, and diagnose all steps of our pilot whois refresh path:
  - fetchers
  - parsers
  - aggregators
- we have startup scripts that resume proper job processing after a machine reboot
- we can monitor the overall health of all heavy job processing with Zabbix, including alerts to system administrators
Bugs and Todos
(prioritized high, medium and low for this week.)
- (done) A worker should mark a chunk with its id (an array of ids when restarted); this lets us draw a line per worker on the throughput graph
- Workers should do partially completed chunks before starting new chunks.
  - for now we will add UI that can reset an incomplete chunk to zero.
- (done) A worker should sleep when a manager has no more work to do
- (done) Integrate the two controllers
- (done) show chunk id in heavy_jobs/show
- (done) show ps of workers in heavy_worker/status
- (done) kill or restart hung workers
- move fetchers into framework, have it create parsing chunks
- Tally throughput, good records, etc.
- keep a log of automatic actions
- Should HeavyJob be the source for actions? Need better requirements here.
- Finer-grained progress
- Zabbix script to count busy and idle workers. (Or count something else interesting.)
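
Several of the items above (marking a chunk with worker ids, finishing partial chunks first, sleeping when there is no work, counting busy and idle workers) can be sketched together. This is only an illustration in Python under assumed names: `Chunk`, `next_chunk`, `work_once`, and `count_workers` are hypothetical and are not taken from the HeavyJobs code. It also ignores whether a partially completed chunk is still held by a live worker.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Chunk:
    id: int
    total: int                                       # records in the chunk
    done: int = 0                                    # records processed so far
    worker_ids: list = field(default_factory=list)   # every worker that touched it


def next_chunk(chunks):
    """Pick a partially completed chunk if any, else the first untouched one."""
    unfinished = [c for c in chunks if c.done < c.total]
    if not unfinished:
        return None
    partial = [c for c in unfinished if c.done > 0]
    return min(partial or unfinished, key=lambda c: c.id)


def work_once(chunks, worker_id, process, idle_sleep=5.0, sleep=time.sleep):
    """Process one chunk to completion, or sleep if there is no work.

    Returns True if a chunk was processed, False if the worker slept.
    """
    chunk = next_chunk(chunks)
    if chunk is None:
        sleep(idle_sleep)
        return False
    # Append rather than overwrite, so a restarted chunk keeps its history.
    if worker_id not in chunk.worker_ids:
        chunk.worker_ids.append(worker_id)
    while chunk.done < chunk.total:
        process(chunk, chunk.done)   # handle record number chunk.done
        chunk.done += 1
    return True


def count_workers(busy_by_worker):
    """Summarize a {worker_id: is_busy} mapping for a monitoring script."""
    busy = sum(1 for is_busy in busy_by_worker.values() if is_busy)
    return {"busy": busy, "idle": len(busy_by_worker) - busy}
```

A monitoring script could print the two numbers from `count_workers` for Zabbix to collect; how the busy flags are gathered from the workers is left open here.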