[Twisted-web] Defers, the reactor, and idiomatic/proper usage
-- new user needs some advice?
Dave Gray
dgray at omniti.com
Thu Jun 23 10:18:06 MDT 2005
I'm not familiar with feedlib, etc, but I'll answer what I can.
Richard Meraz wrote:
> MAXTIME = 60 # Kill crawl after this time
> TIMEOUT = 20 # Kill page retrieval after this time inactive
> MAXDEPTH = 3 # Recurse this depth when crawling page.
>
> # Question: There seem to be many idioms to aggregate information from
> different defered call-back chains in twisted.. Since everything runs
> in a single thread I just stuck my stuff in a global class and everybody
> modifies the vars there as I pass it around to the call-backs that
> should see it. Seems okay for a small script like this?
That seems fine, yeah. I think I would pass around the StateVars
instance as a context if I were coding this. Probably the same effect.
> class StateVars:
>     '''Keep Global state for starting/stopping feedfinding and a record
>     of links we have checked and their status'''
>     connections = 1
>     links_checked = {} # Structure: {url: (is RSS/ATOM/RDF, page-content)}
>
> # Question: start_feed_crawl is where I set up my defers. getPage
> returns a defer and I attach my call-back process_link.
> # addCallbacks adds a callback/errback in parallel so only one or the
> other is called? so I need to add
> # the final errback to catch errors from callback process_link ?
Correct. Well, sort of. See below.
> def start_feed_crawl(uri,depth):
>     '''Harvest feeds from a uri'''
>     # Question: how to time-out this deferred chain if getPage is taking too
>     # long to finish its work.
>     # what exactly does the argument timeout to getPage do, does it timeout
>     # the socket after a no-response
>     # or does it put an upper-bound on how long getPage has to finish its work?
>
>     getPage(uri, timeout=TIMEOUT).addCallbacks(
>         callback=process_link,
>         callbackArgs=(uri, depth, StateVars),
>         errback=process_error,
>         errbackArgs=(uri, StateVars)
>     ).addErrback(process_error, uri, StateVars)
It seems clearer to me to write this as follows, but that's personal
preference:
d = getPage(...)
d.addCallbacks(...)
d.addErrback(...)
But since you're setting up the call to the same errback twice, you
could simplify this to:
d = getPage(...)
d.addCallback(process_link, uri, depth, StateVars)
d.addErrback(process_error, uri, StateVars)
<http://twistedmatrix.com/projects/core/documentation/howto/defer.html#auto4>
has a nice visual explanation of what happens when.
> # Question: since I'm starting up these defers in a callback they are
> # being created after I've called reactor.run() since we call start_feed_crawl
> # as we find new links that meet our criteria. Am I doing anything bad here?
> # All the examples I've seen (eg. p. 548-552 Python Cookbook, great eg by V. Volonghi
> # and P. Cogolo) have their data up-front and therefore set-up all the defers before calling
> # reactor.run(). Here I'm discovering my data as I go along and setting up deferrs while
> # the reactor is spinning. Here is my fundamental lack of understanding. While this script
> # seems to run okay, is it okay to do this?
Yes, that's fine. I think the case you've seen the most, being able to
set up all the Deferreds beforehand, is actually the odd one.
> # Question: Is this how I kill the reactor -- ie. using some sort of
> state condition. Is there a better way,
> # should I try better to understand deferred-list. For example. A
> top-level deferred-list that contains
> # other deferred-lists which get created to hold all the defers
> (created by start_feed_crawl) for the
> # links on a given page. Could this deferred-list be told to stop
> the reactor when the other lists have
> # fired their callback (after the component defers have finished) ?
> (Sorry for the convoluted question here
> # I'm new at this)
What you want to do is stop the reactor when everything is done
processing. Have start_feed_crawl return the Deferred that getPage gives
you, so that after you call it the first time you can add a callback
which stops the reactor. The trick is to stuff that Deferred into a
DeferredList and add the reactor-stopping callback to the list: if a
callback on the first Deferred itself returns a Deferred, the
DeferredList won't call its callbacks until that other Deferred
completes. So you'll be stacking up a whole bunch of Deferreds inside
the first one, and the callback on the DeferredList that does the
reactor.stop won't fire until a callback finally returns something that
isn't a Deferred.
There might be an easier way to do this, but this is the way I know
(example attached). Someone please let me know if there's an easier way.
To see the example, run it with 'twistd -noy fetchpage.tac' then do
'telnet localhost 9000' and send:
GET /?target=http://www.google.com/ HTTP/1.1
Host: localhost
> Final question: occasionally I get errors that come from the http.py
> code in twisted. This get printed to the console, but don't necessarily
> stop my program. Should my errbacks be catching these? How do I keep
> errors from getting logged to the console (beside redirecting stderr). I
> can post an example if necessary of the errors I'm getting.
When you create the DeferredList, pass in consumeErrors=1. That keeps
failures in the Deferreds it wraps from being logged as unhandled
errors, though it will make debugging that much more annoying...
HTH,
Dave
-------------- next part --------------
from twisted.web import server
from twisted.web.resource import Resource
from twisted.web.client import getPage
from twisted.internet import defer, reactor
from twisted.python import log
from cgi import escape

class Foo(Resource):
    counter = 0
    isLeaf = True

    def render_GET(self, request):
        self.rq = request
        target = escape(request.args['target'][0])
        d = getPage(target).addCallback(self.print_page)
        d.addErrback(log.err)
        dl = defer.DeferredList([d])
        dl.addCallback(stopNow)
        dl.addErrback(log.err)
        return server.NOT_DONE_YET

    def print_page(self, html):
        if Foo.counter < 5:
            Foo.counter += 1
            print 'request '+str(Foo.counter)
            d = defer.Deferred()
            d.addCallback(self.print_page)
            d.addErrback(log.err)
            reactor.callLater(1, d.callback, html)
            return d
        else:
            print 'now we can write stuff back'
            self.rq.write(str(len(html))+' '+str(Foo.counter))
            self.rq.finish()
            self.rq.transport.loseConnection()
            # no deferred being returned, stopNow fires

def stopNow(cbval):
    # can't add reactor.stop as a callback directly
    # because it doesn't know what to do with the extra
    # argument being returned from the callback
    print cbval
    reactor.stop()

resource = Foo()
site = server.Site(resource)

from twisted.application import service, internet
application = service.Application("Foo")
internet.TCPServer(9000, site).setServiceParent(application)

# vim: ai sts=4 sw=4 expandtab syntax=python :