HeavyJobs

== What (summary) ==
Manage long-running jobs on available compute resources (servers) using db tables to keep track of work, and inter-process communication to keep track of workers.

* http://www.aboutus.org/au_web_services/heavy_jobs
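The page doesn't show the actual schema, so here is a minimal sketch of the "db tables to keep track of work" idea. All names (<code>Chunk</code>, <code>WorkQueue</code>) are hypothetical, not taken from the compostus code, and an in-memory array stands in for the db table:

```ruby
# Hypothetical sketch only: the real system keeps chunk rows in a db
# table; an in-memory array stands in for it here, and all names are
# made up rather than taken from the compostus code.
Chunk = Struct.new(:id, :worker_ids, :done, :size) do
  def incomplete?
    done > 0 && done < size
  end

  def unstarted?
    done.zero?
  end
end

class WorkQueue
  def initialize(chunks)
    @chunks = chunks
  end

  # A worker prefers partially completed chunks over new ones, and
  # marks the chunk with its id (kept as an array, so a restarted
  # worker appends a second id rather than overwriting the first).
  def claim(worker_id)
    chunk = @chunks.find(&:incomplete?) || @chunks.find(&:unstarted?)
    chunk.worker_ids << worker_id if chunk
    chunk
  end
end
```

Keeping the ids as an array is what lets the web interface draw a line per worker on a throughput graph, as described under Bugs and Todos below.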
== Why this is important ==
We will use this infrastructure to manage our algorithmic data collection. This is a strategic direction for the company.

== DoneDone ==
We will be satisfied with this infrastructure when:
* we can launch, balance, and diagnose all steps of our pilot whois refresh path:
** fetchers
** parsers
** aggregators
* we have startup scripts that will resume proper job processing after a machine reboot or other operational events.
* we can monitor overall health and productivity of all heavy job processing through a web interface.
== Bugs and Todos ==
(new items)
* Detect when a worker goes dark for more than 2 minutes. Record its last status in the chunk; terminate and restart it.
* from feed_aggregator: :error=>"private method `log_error' called for #<HeavyWorker:0xb7e9c318>"
* heavy_jobs/show can't find the pid when it is buried in :last
** (related) the first attempt to look in both places (in some other method) is coded with nil sensitivity
* Improve the job deployment process (see below)
* Add startup scripts that launch the monitor (and manager) on server reboot
* <s>Integrate stop and terminate: stop leaves looping jobs looping</s>
* <s>Heavy_job_monitor racks up lots of cpu. Why? (trying longer sleeps)</s>
** <s>Sleeping jobs aren't so good either.</s>

(prioritized high, medium and low for the week with Ethan)

* <s>A worker should mark a chunk with its id</s> (array of ids when restarted)
** <s>this lets us draw a line per worker on the throughput graph</s>
* <s>Workers should do partially completed chunks before starting new chunks.</s>
** <s>for now we will add ui that can reset an incomplete chunk to zero.</s>
* <s>A worker should sleep when a manager has no more work to do</s>
* <s>Integrate the two controllers</s>
* <s>show chunk id in heavy_jobs/show</s>
* <s>show ps of workers in heavy_worker/status</s>
* <s>kill or restart hung workers</s>
* move fetchers into framework, have it create parsing chunks
* Tally throughput, good records, etc
* Finer-grained progress
* Zabbix script to count busy and idle workers. (Or count something else interesting. Ethan is not too interested in this; mostly he doesn't want "noise" alerts that distract him from real emergencies.)
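The dark-worker item at the top of the list could work roughly like this. This is a hypothetical sketch, not the actual monitor code: the real monitor would persist the last status into the chunk row and use process signals, so the restart action is injected here for clarity:

```ruby
# Hypothetical sketch of dark-worker detection for the monitor.
# Each worker heartbeats its current status; a periodic sweep
# terminates and restarts any worker silent for over two minutes,
# recording its last reported status first.
class DarkWorkerCheck
  DARK_AFTER = 120 # seconds without a heartbeat

  def initialize(restart:, clock: -> { Time.now })
    @restart = restart     # called with (worker_id, last_status)
    @clock = clock         # injectable for testing
    @heartbeats = {}
  end

  def heartbeat(worker_id, status)
    @heartbeats[worker_id] = { at: @clock.call, status: status }
  end

  # Returns the ids of workers that were terminated and restarted.
  def sweep
    now = @clock.call
    dark = @heartbeats.select { |_, h| now - h[:at] > DARK_AFTER }
    dark.each do |id, h|
      @restart.call(id, h[:status]) # record last status, then restart
      @heartbeats.delete(id)
    end
    dark.keys
  end
end
```

Injecting the clock and the restart action keeps the timeout logic testable without real processes or real two-minute waits.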
+ | |||
+ | == Deploying Heavy Jobs == | ||
+ | We're currently running these jobs on mist which is not one of our deployment targets. Mist is chosen by entry in a config file. A git clone of compostus is brought over. Two processes, a heavy job manager and heavy job monitor are launched in a screen session. It is then possible to start new workers through the web interface. | ||
+ | |||
+ | If one wants to add or modify a job algorithms, or modify the monitor, one must log into mist, find the screen session, and then update it as follows. | ||
+ | # kill monitor (kills workers) | ||
+ | # pull code | ||
+ | # restart monitor | ||
+ | # restart interrupted chunks | ||
+ | # start new workers | ||
+ | |||
+ | Use a variation of this to update the manager, a simpler task because it has no children. When jobs are distributed across multiple machines, there will be a monitor per machine but only one manager. | ||
+ | |||
+ | The working part of any heavy job should be unit tested before deployment. This is then wrapped up as a job that can be launched within the Heavy Jobs framework. We don't yet have a functional testing strategy for this part so be careful. Once deployed and started, one should look at production db to make sure that a job is working as intended. | ||
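As an illustration of keeping the working part unit-testable, the algorithm can live in a plain class with no Heavy Jobs dependencies. Everything here is made up for the example — <code>ExpiryExtractor</code> and the whois field format are assumptions, not code from the repository:

```ruby
require "minitest/autorun"

# Hypothetical example: the "working part" of a heavy job is a plain
# object, so it can be unit tested before being wrapped as a job
# that the framework launches against chunks.
class ExpiryExtractor
  # Pull the expiration date out of a raw whois record, or nil.
  def call(record)
    record[/Expiration Date:\s*(\S+)/, 1]
  end
end

class ExpiryExtractorTest < Minitest::Test
  def test_finds_expiration_date
    assert_equal "2009-05-04",
      ExpiryExtractor.new.call("Expiration Date: 2009-05-04\n")
  end

  def test_returns_nil_when_absent
    assert_nil ExpiryExtractor.new.call("no dates here")
  end
end
```

The job wrapper would then only feed chunk rows through <code>call</code>, which is the part we can't yet functionally test.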
== Pilot Workflow ==

[[Image:HeavyJobsWorkflow.png|500px]]

[[Category:DevelopmentTeam]]
[[Category:OpenTask]]
− | |||
− |
Latest revision as of 18:34, 4 May 2008