[Twisted-Python] Extending t.w.spider

janko at mind-nest.com janko at mind-nest.com
Fri Jul 4 02:12:38 MDT 2003


Hi,

I am sending this at the suggestion of someone (I forgot who) on #twisted.

I extended t.w.spider. It isn't finished, but it seems to work.
You can download it, run it, and watch it crawl over a certain domain. It prints
quite a bit of information while working, so you know what is going on (mainly for testing purposes).

I am just hoping that someone more experienced will quickly scan through the code and run it, so that I know whether I did it the Python/Twisted/sane way before I go further and connect it to various other things. (Also read the NOTE at the end of this email.)

You can get this at http://www.mind-nest.com/downloads/walker.tgz or 
http://www.mind-nest.com/downloads/walker.zip

When it is finished I plan to give it back to Twisted, if they will
accept it, of course.

Note for the mailing list moderator: I first sent this mail about a week ago, when I was not yet a member, and got a response saying that you must approve it first. Since that didn't happen within a week and I am a member now, I am sending it again as a member.

Regards,
:janko
janko at mind-nest.com
www.mind-nest.com


***

This module extends the twisted.web.spider.SpiderSender class into WalkerSender.
It also extends htmllib.HTMLParser and twisted.web.client.HTTPClientFactory so
that WalkerSender can use them:

    LinkParser
    -collects the links on a page
    -also collects frame src attributes, so it can crawl over frames
    -can also collect images and linked resources (css, js...) for link/img
     validation purposes
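
A rough sketch of the kind of parser described above. It is not the code from walker.tgz: it uses html.parser from the Python 3 standard library instead of htmllib, and the class name LinkCollector, its attributes, and the example URLs are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for LinkParser, not the code from walker.tgz."""

    def __init__(self, base_url, collect_resources=False):
        super().__init__()
        self.base_url = base_url
        self.collect_resources = collect_resources
        self.links = []      # crawl targets: <a href>, <frame src>, <iframe src>
        self.resources = []  # images, stylesheets, scripts (validation only)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("frame", "iframe") and attrs.get("src"):
            # Frame sources are followed like ordinary links.
            self.links.append(urljoin(self.base_url, attrs["src"]))
        elif self.collect_resources:
            if tag in ("img", "script") and attrs.get("src"):
                self.resources.append(urljoin(self.base_url, attrs["src"]))
            elif tag == "link" and attrs.get("href"):
                self.resources.append(urljoin(self.base_url, attrs["href"]))


parser = LinkCollector("http://www.example.org/dir/", collect_resources=True)
parser.feed('<a href="a.html">a</a><frame src="/f.html">'
            '<img src="i.png"><link rel="stylesheet" href="s.css">')
parser.close()
print(parser.links)
print(parser.resources)
```

Keeping crawl targets and validation-only resources in separate lists means a crawler can ignore the second list entirely, while a validator can check both.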

    HTTPCollector
    -Stores the content of a page in a variable instead of a file
    -Can easily be set to use different link parsers or page downloaders (*)
    -Returns *self* to the callback, so that the links, content, or anything
     else it collects can be retrieved by the callback method (**)
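
The "return self to the callback" idea can be shown without Twisted at all. The sketch below is plain Python and only illustrates the shape of it; the real HTTPCollector extends HTTPClientFactory, and the names PageCollector, page_received, and parse_links are assumptions of mine, as is the regex used as a placeholder link parser.

```python
import re


class PageCollector:
    """Hypothetical sketch; the real HTTPCollector extends HTTPClientFactory."""

    def __init__(self, url, callback, parse_links=None):
        self.url = url
        self.callback = callback
        # The link parser is pluggable; this regex default is only a placeholder.
        self.parse_links = parse_links or (
            lambda body: re.findall(r'href="([^"]+)"', body))
        self.content = None
        self.links = []

    def page_received(self, body):
        # Keep the body in memory instead of writing it to a file.
        self.content = body
        self.links = self.parse_links(body)
        # Hand the whole collector to the callback, not just the body.
        self.callback(self)


seen = []
collector = PageCollector("http://www.example.org/", seen.append)
collector.page_received('<a href="about.html">About</a>')
print(seen[0].url, seen[0].links, len(seen[0].content))
```

Because the callback receives the collector itself, it can read the url, the content, and the links from one object, and new attributes can be added later without changing the callback signature.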

    WalkerSender
    -Can easily be set to use different HTTP collectors/downloaders (***)
    -Now uses a dictionary instead of a list for the queue (explained below)
    -Has 4 more plugins/events; some existing ones were changed to be more
     powerful
    -Plugin to filter links by whatever you want (extensions, domains...)
    -Plugin to fill in with some algorithm that prevents looping
    -Plugin to notify that a download failed
    -Plugin to signal that all links found while crawling have been crawled
     and there is nowhere else to go
     (likely/hopefully to occur when doing a one-site/domain crawler as I was,
     and when a timeout or some other failure happens)
    -The notifyDownloadEnd plugin has an additional argument, downloader, which
     holds anything you prepare in the downloader class (****)
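
To make the plugin list above more concrete, here is a small synchronous sketch of how such hooks could fit together. It is not WalkerSender and does not use Twisted. Apart from notifyDownloadEnd, every method name (filterLinks, isLoop, notifyDownloadFailed, notifyNowhereToGo) is invented for this example, the "site" is just an in-memory dictionary, and the list of links stands in for the downloader object.

```python
from urllib.parse import urlparse


class Walker:
    """Hypothetical synchronous sketch of the hook layout; not WalkerSender."""

    def __init__(self, fetch, max_depth=3):
        self.fetch = fetch          # callable: url -> list of links, may raise
        self.max_depth = max_depth
        self.queue = {}             # url -> depth of first occurrence
        self.done = set()

    # Hooks (all names except notifyDownloadEnd are made up here).
    def filterLinks(self, links):
        return links

    def isLoop(self, url, depth):
        return depth > self.max_depth

    def notifyDownloadFailed(self, url, error):
        pass

    def notifyDownloadEnd(self, url, downloader):
        pass

    def notifyNowhereToGo(self):
        pass

    def walk(self, start):
        self.queue[start] = 0
        while self.queue:
            url, depth = next(iter(self.queue.items()))
            del self.queue[url]
            self.done.add(url)
            try:
                links = self.fetch(url)
            except Exception as error:
                self.notifyDownloadFailed(url, error)
                continue
            # The link list stands in for the downloader object here.
            self.notifyDownloadEnd(url, links)
            for link in self.filterLinks(links):
                if link in self.done or link in self.queue:
                    continue
                if not self.isLoop(link, depth + 1):
                    self.queue[link] = depth + 1
        self.notifyNowhereToGo()


class OneSiteWalker(Walker):
    """Stays on one host and records failures and completion."""

    def __init__(self, fetch, host):
        super().__init__(fetch)
        self.host = host
        self.failed = []
        self.finished = False

    def filterLinks(self, links):
        return [l for l in links if urlparse(l).netloc == self.host]

    def notifyDownloadFailed(self, url, error):
        self.failed.append(url)

    def notifyNowhereToGo(self):
        self.finished = True


SITE = {
    "http://a.org/": ["http://a.org/x", "http://b.org/"],
    "http://a.org/x": ["http://a.org/", "http://a.org/missing"],
}
walker = OneSiteWalker(lambda url: SITE[url], "a.org")
walker.walk("http://a.org/")
print(sorted(walker.done), walker.failed, walker.finished)
```

In this sketch a one-site crawler only overrides the hooks it cares about; the base class keeps the queue handling in one place.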

    Some smaller things were done to get it working:
    -it avoids starting a downloader on a page/url that is already
     downloading
    -the queue is now a dictionary, so it can't hold the same page more than
     once (the depth of the first occurrence is stored)
    -it removes fragments from urls (www.a.org/index.html#fragment), so that
     urls pointing to the same page become identical and the duplicates are
     filtered out
    -it doesn't remove ?queries, as they often mean new content
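
The queue handling described above could look roughly like this. It is a sketch under my own assumptions, not the code from the download: the function name enqueue and the in_progress set are hypothetical, and urldefrag comes from urllib.parse in the Python 3 standard library.

```python
from urllib.parse import urldefrag


def enqueue(queue, in_progress, url, depth):
    """Add url to the queue unless it is already known; return True if added."""
    url, _fragment = urldefrag(url)   # drops "#fragment", keeps "?query"
    if url in queue or url in in_progress:
        return False
    queue[url] = depth                # depth of the first occurrence wins
    return True


queue = {}
in_progress = {"http://www.a.org/busy.html"}
enqueue(queue, in_progress, "http://www.a.org/index.html#top", 1)
enqueue(queue, in_progress, "http://www.a.org/index.html#bottom", 2)
enqueue(queue, in_progress, "http://www.a.org/index.html?page=2", 2)
enqueue(queue, in_progress, "http://www.a.org/busy.html", 1)
print(queue)
```

Both fragment variants collapse to one entry at depth 1, the ?page=2 url is kept as a separate page, and the url that is already downloading is not queued again.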

    I made OneSiteWalkerSender as an example. I intend to build a one-site
    search engine (probably with pyndex) and an anchor/link/img validating
    script. OneSiteWalkerSender has just now crawled over 525 pages of
    www.google.com. I also tested it with other sites.

    NOTE: Please don't shoot me if I did something very stupid. I am very new
    to Python and Twisted and don't understand many important issues in either
    of them. Where I marked the description above with *, I am a little
    suspicious of my way of doing it.

***
