Site Map Generator

by Vladimir Toncar

This software is a platform-independent site map generator. It crawls a web site starting from a given URL and outputs an XML sitemap file that you can use for Google (via Google Webmaster Tools) or other search engines. Site maps are useful for SEO: they let you give the search engine hints about which pages it can index on your web site. The site map generator program is published under the GNU General Public License.

To run the generator, you do not need shell access to your web server. The script is implemented as a simple crawler that can run from any computer that has Python installed on it. The crawler only follows local links and skips links to external sites. It also does not follow links marked with rel="nofollow" and does not crawl into directories that are disallowed in the robots.txt file.
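
The link-filtering rules above can be sketched roughly as follows. This is an illustration in Python 3 using only the standard library, not the script's actual code; the function name should_follow and the example.com URLs are hypothetical, and "local" is assumed here to mean "same host as the page the link was found on".

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def should_follow(page_url, href, rel, robots):
    """Return the absolute URL to crawl next, or None if the link is skipped.

    Hypothetical helper: page_url is the page the link was found on,
    href and rel are the attributes of the <a> tag, and robots is a
    RobotFileParser that has already been loaded with the site's rules.
    """
    # Links marked rel="nofollow" are never followed.
    if "nofollow" in (rel or "").lower().split():
        return None
    # Resolve relative links against the page they appear on.
    target = urljoin(page_url, href)
    # Assumption: a link is "local" when it points to the same host.
    if urlparse(target).netloc.lower() != urlparse(page_url).netloc.lower():
        return None
    # Skip anything that robots.txt disallows for a generic crawler.
    if not robots.can_fetch("*", target):
        return None
    return target


# Rules are parsed from literal lines here so the sketch needs no network access.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
```

In a real crawl the rules would come from the site's own robots.txt file (for example via RobotFileParser.set_url and read) rather than from literal lines.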

The generator writes sitemap records with "<lastmod>" dates if your web server returns web pages with the "Last-Modified" time stamp. If the crawler encounters an error when downloading a page or when parsing it, it will try to continue with another page.
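
As a rough illustration of the "<lastmod>" handling, the sketch below converts a Last-Modified header value into a date string. It is written in Python 3 and is not taken from the script; the function name lastmod_from_header is hypothetical, and the YYYY-MM-DD output is an assumption (it is one of the date forms the sitemap protocol accepts, but the exact form the script writes is not stated here).

```python
from email.utils import parsedate_to_datetime


def lastmod_from_header(last_modified):
    """Convert an HTTP Last-Modified value to a YYYY-MM-DD string.

    Returns None when the header is missing or cannot be parsed, so the
    caller can simply leave <lastmod> out of that sitemap record.
    """
    if not last_modified:
        return None
    try:
        stamp = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Unparseable header: treat it the same as a missing one.
        return None
    return stamp.strftime("%Y-%m-%d")
```

Returning None instead of raising mirrors the behaviour described above: a problem with one page should not stop the crawl.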

To run the script, you will need Python version 2.5 or higher. (You can download Python from Python's official site.) The script needs no installation; simply copy it to a suitable directory and run it from there.

The script is mainly useful for smaller and medium-sized sites. It only generates a single sitemap file, so it will max out at 50,000 URLs (this is Google's limit for sitemap files). The script's default limit is 1,000 URLs but you can change it with the -m option.

The script's command line syntax is as follows:
     python <script> <options> <starting URL>

The options are as follows:

-h  --help  Print the help and exit
-b <ext>  --block <ext>   Exclude URLs with the given extension; <ext> must be without the leading dot. The comparison is case insensitive, so for example DOC and doc are treated the same. You can use this option several times to block several extensions.
-c <value>  --changefreq <value>   Set the change frequency. The given value is used in all sitemap entries (maybe a future version of this script will change that). The allowed values are: always, hourly, daily, weekly, monthly, yearly, never.
-p <prio>  --priority <prio>   Set the priority. The value must be between 0.0 and 1.0. The same value is used in all sitemap entries.
-m <value>  --max-urls <value>   Set the maximum number of URLs to be crawled. The default value is 1000 and the largest value that you can set is 50000 (the script generates only a single sitemap file).
-o <file>  --output-file <file>   Set the name of the generated sitemap file. The default file name is sitemap.xml.
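
The constraints in the option list can be summarized in a small sketch. The code below uses Python 3's argparse to model the documented options and their limits; it is not the script's own option parser, and the names build_parser, is_blocked, start_url and the example.com URLs are hypothetical.

```python
import argparse
import posixpath
from urllib.parse import urlparse

CHANGEFREQ_VALUES = ["always", "hourly", "daily", "weekly",
                     "monthly", "yearly", "never"]
MAX_URLS_LIMIT = 50000  # single sitemap file, as documented above


def priority(text):
    """Accept a priority only if it lies between 0.0 and 1.0."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError("priority must be between 0.0 and 1.0")
    return value


def max_urls(text):
    """Accept a URL limit only up to the documented maximum."""
    value = int(text)
    if value > MAX_URLS_LIMIT:
        raise argparse.ArgumentTypeError("max-urls cannot exceed 50000")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Site map generator (sketch)")
    # -b may be repeated; extensions are lower-cased so DOC and doc match.
    parser.add_argument("-b", "--block", metavar="ext", action="append",
                        default=[], type=str.lower)
    parser.add_argument("-c", "--changefreq", metavar="value",
                        choices=CHANGEFREQ_VALUES)
    parser.add_argument("-p", "--priority", metavar="prio", type=priority)
    parser.add_argument("-m", "--max-urls", metavar="value", type=max_urls,
                        default=1000)
    parser.add_argument("-o", "--output-file", metavar="file",
                        default="sitemap.xml")
    parser.add_argument("start_url")
    return parser


def is_blocked(url, blocked_exts):
    """Case-insensitive check of the URL's file extension (no leading dot)."""
    ext = posixpath.splitext(urlparse(url).path)[1][1:].lower()
    return ext in blocked_exts
```

For example, parsing the arguments -b DOC -b bmp followed by a starting URL yields the block list ["doc", "bmp"], and is_blocked then rejects a URL ending in Report.DOC.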


Usage example:
     python <script> -b doc -b bmp -o test_sitemap.xml <starting URL>


Change log

Version 1.0.1: Added missing XML entity escaping.

Version 1.0.2: Added handling of BASE HREF tag.

Version 1.0.3 (2008-08-06): Added handling of HTTP redirects by Pavel "ShadoW" Dvořák.

Version 1.1.0 (2009-09-05): Added support for the 'nofollow' tag and for robots.txt.