[Twisted-Python] Scalability of an rss-aggregator
Andrew Bennetts
andrew-twisted at puzzling.org
Wed Mar 31 05:34:06 MST 2004
On Wed, Mar 31, 2004 at 01:27:49PM +0200, Valentino Volonghi aka Dialtone wrote:
> Andrew Bennetts wrote:
>
> >On Wed, Mar 31, 2004 at 09:33:58AM +0200, Valentino Volonghi aka Dialtone
> >wrote:
> >
> >
> >>Hi all,
> >>attached you will find my rss-aggregator made with twisted.
> >>
> >>It's really fast although when I tried with 745 feeds I got some problems.
> >>When the download reached 300 parsed feeds (more or less) it locked till
> >>I pressed Ctrl+C and then it
> >>processed the remaining 340 feeds in less than 30 seconds... I think
> >>that my design has at least an issue
> >>but I cannot find it so easily and I hope someone on this list can help
> >>me to improve it.
> >
> >By default, Twisted uses the platform name resolver, which is blocking.
> >Perhaps a non-existent domain is causing gethostbyname to block?
> >
> Uhmm... dunno, but I tried to remove the 'locking' feed-source and it
> didn't change.
Hmm, it's unlikely to be DNS lookups causing it, then.
We need some way to narrow down where it's happening. There are a few
options I can think of, but they're all a bit heavyweight...
- Use strace to get some idea what it's doing
- Use the --spew option of twistd (or manually install the spewer with
"from twisted.python.util import spewer; sys.settrace(spewer)")
- Use gdb to attach to the process, and then look at the backtrace there.
(You can apparently get the python backtrace in gdb by putting this macro in
your .gdbinit:
    define ppystack
        while $pc < Py_Main || $pc > Py_GetArgcArgv
            if $pc > eval_frame && $pc < PyEval_EvalCodeEx
                set $__fn = PyString_AsString(co->co_filename)
                set $__n = PyString_AsString(co->co_name)
                printf "%s (%d): %s\n", $__fn, f->f_lineno, $__n
            end
            up-silently 1
        end
        select-frame 0
    end
But I've never tried this...
)
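For the spewer option, a much-reduced stand-in can be sketched in plain
Python. This is not Twisted's spewer; make_call_tracer and trace_calls are
hypothetical names, and the sketch only reports function calls, using the
same "file (line): name" layout as the gdb macro above. If the program
hangs, the last line written shows the last Python function entered.

```python
import sys


def make_call_tracer(stream=sys.stderr):
    """Return a trace function that writes one line per Python function call."""
    def tracer(frame, event, arg):
        if event == "call":
            code = frame.f_code
            stream.write("%s (%d): %s\n"
                         % (code.co_filename, frame.f_lineno, code.co_name))
            stream.flush()  # flush at once so output survives a hang
        return None  # no per-line tracing inside the called frame
    return tracer


def trace_calls(func, stream=sys.stderr):
    """Run func() with the call tracer installed, then restore the old tracer."""
    previous = sys.gettrace()
    sys.settrace(make_call_tracer(stream))
    try:
        return func()
    finally:
        sys.settrace(previous)
```

Tracing every call slows the program down a lot, so this is only worth
switching on while hunting for the hang.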
Is it possible that feedparser is hanging on trying to parse that feed?
Perhaps try putting print statements before and after the
feedparser.parse call.
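Something along these lines would do. It is only a sketch:
call_with_markers is a hypothetical helper, not part of feedparser or
Twisted, and it works for any callable. A START line with no matching END
line points at the call that never returned.

```python
import sys
import time


def call_with_markers(label, func, *args, out=sys.stderr):
    """Call func(*args), writing a marker line before and after the call."""
    out.write("START %s\n" % label)
    out.flush()  # flush so the marker is visible even if func never returns
    started = time.monotonic()
    try:
        return func(*args)
    finally:
        # Also runs when func raises, so a failed parse still gets an END line.
        out.write("END %s (%.2fs)\n" % (label, time.monotonic() - started))
        out.flush()


# Hypothetical usage; "url" and "data" are placeholders for the
# aggregator's own variables:
#     parsed = call_with_markers(url, feedparser.parse, data)
```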
> >You should be able to test this theory by installing Twisted's resolver:
> >
> > from twisted.names import client
> > reactor.installResolver(client.createResolver())
> >
> >client.createResolver makes a reasonable effort to use your system's DNS
> >configuration (by looking at /etc/resolv.conf on posix systems, for
> >example), so it should work without any special arguments.
> >
> ok, it changes into a totally non-working script :)
>
> I get a lot of these:
> [Failure instance: Traceback: exceptions.TypeError, unsubscriptable object
> /usr/lib/python2.3/site-packages/twisted/internet/defer.py:313:_runCallbacks
> /usr/lib/python2.3/site-packages/twisted/names/resolve.py:44:__call__
> /usr/lib/python2.3/site-packages/twisted/names/common.py:36:query
> /usr/lib/python2.3/site-packages/twisted/names/common.py:104:lookupAllRecords
> /usr/lib/python2.3/site-packages/twisted/names/client.py:266:_lookup
> /usr/lib/python2.3/site-packages/twisted/names/client.py:214:queryUDP
> ]
Ouch. I wonder how that bug crept in? The twisted.names code expects a
sequence of timeouts (so it can re-issue the query with each one in turn
before finally giving up), but twisted.internet only gives it a single
integer. I've filed a bug report for this:
http://twistedmatrix.com/bugs/issue570, if you care :)
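To illustrate the mismatch (this is not the actual Twisted code, nor
necessarily how the bug will get fixed): indexing a plain integer raises
the TypeError seen in the traceback, and a small normalising step would
accept either form. as_timeout_sequence and the default values are
assumptions made up for this sketch.

```python
def first_timeout(timeouts):
    """Mimic code that expects a sequence and takes its first element."""
    return timeouts[0]


def as_timeout_sequence(timeout, default=(1, 3, 11, 45)):
    """Return a tuple of timeouts for a number, a sequence, or None.

    The default values are illustrative only.
    """
    if timeout is None:
        return tuple(default)
    if isinstance(timeout, (int, float)):
        return (timeout,)  # wrap a single number in a one-element tuple
    return tuple(timeout)


# first_timeout(60) raises TypeError, because an int cannot be indexed;
# first_timeout(as_timeout_sequence(60)) works for both forms.
```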
> >>BTW When it finishes (with all 740 feeds) it reports an awesome 330
> >>seconds which is an impressive time, less than half a second
> >>for each feed, and It downloads more than 50Mb of feeds from the net
> >>(with 745 feeds to download).
> >>
> >
> >Nice!
> >
> >
> Yup, was going to ask for my script to be used instead of asyncore to
> Straw developers.
> Straw has a lot of problems with 200 feeds ie resets the connection and
> such. This would be an awesome improvement.
Absolutely. I've heard similar complaints about straw, and I've been hoping
some keen person would apply Twisted to fix the problem :)
-Andrew.