[Twisted-Python] running 1,000,000 tasks, 40 at-a-time
exarkun at twistedmatrix.com
Wed Oct 26 08:24:12 MDT 2011
On 02:02 pm, jrennie at gmail.com wrote:
>The background:
>
>I've been using DeferredSemaphore and DeferredList to manage the
>running of tasks under a resource constraint (only so many tasks can
>run at the same time). This worked great until I tried to use it to
>manage millions of tasks. Simply setting them up to run
>(DeferredSemaphore.run() calls) took approximately 2 hours and used
>~5 GB of RAM. This was less efficient than I expected. Note that these
>numbers don't include time/memory for actually running the tasks, only
>time/memory to set up the running of the tasks. I've since written a
>custom task runner that uses comparatively little setup time/memory by
>adding a "manager" callback to each task which starts additional tasks
>as appropriate.
>
>My questions:
>
> - Is the behavior I'm seeing expected? i.e. are DS/DL only
> recommended for task management if the # of tasks is not too large?
> Is there a better way to use DS/DL that I might not be thinking of?
Yes, it's expected. Queueing up millions of tasks is a lot of work.
Setting up millions more callbacks to learn about completion is a lot
more work. I would not recommend DeferredSemaphore for anything beyond
"user scale", e.g. things that correspond to a single user action, like
clicking a button in a GUI.
> - Is there a Twisted pattern for managing tasks efficiently that I
> might be missing?
I think the generator/cooperator approach works pretty well, and has
constant (instead of linear) time completion notification and
distributes setup costs across the lifetime of the queue, probably
allowing for better resource utilization.
See http://as.ynchrono.us/2006/05/limiting-parallelism_22.html for a
simple write-up.
Jean-Paul