web crawler - How can I gather all links on a site without content?


I would like to get all of the URLs that a site links to (on the same domain) with wget, without downloading all of the content. Is there a way to just list the links without downloading them?

For a little background on what I am using this for, in case someone can come up with a better solution: I am trying to create a robots.txt file in which p [4-9 ]. Not all files ending in .html should be included, but robots.txt does not support regular expressions. That's why I am trying to get all the links, run a regular expression against them, and put the result in robots.txt. Any ideas?
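Roughly, the post-processing I have in mind would look something like this (a sketch only: all-links.txt is an assumed file with one crawled URL per line, and p[4-9] is just a stand-in for the real pattern):

    # Keep only the links matching the pattern, strip the scheme and host,
    # and turn each remaining path into a robots.txt Disallow line.
    grep -E 'p[4-9]' all-links.txt \
      | sed -E 's#^https?://[^/]+##' \
      | sed 's/^/Disallow: /' >> robots.txt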

My recommendation: Combine wget and gawk into one (very) small shell script.

There is a good overview of AWK on Wikipedia.
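As a rough sketch of what that script could look like (https://example.com/ is a placeholder for your site, and the gawk pattern may need tweaking depending on your wget version's log format):

    #!/bin/sh
    # Crawl the site in spider mode so nothing is saved, logging every URL wget visits.
    wget --spider --recursive --level=inf --no-verbose \
         --output-file=crawl.log https://example.com/

    # Pull the "URL: ..." entries out of the log and de-duplicate them.
    gawk 'match($0, /URL:[[:space:]]*[^[:space:]]+/) {
            url = substr($0, RSTART, RLENGTH)
            sub(/URL:[[:space:]]*/, "", url)
            print url
          }' crawl.log | sort -u > all-links.txt

The resulting all-links.txt can then be fed to whatever regular expression you like to generate the Disallow lines for robots.txt.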

