Problem:

Many (malicious) bots ignore robots.txt.
Other (non-malicious) bots pick up changes to it only much later, because they do not re-fetch the file constantly.

Either way, the result is repeated heavy traffic that puts too much load on Apache or even on the database.

Solution:

We lock the bots out permanently with mod_rewrite.
Prerequisite: mod_rewrite is installed and enabled in Apache.

Basics:

In principle it does not matter where the rules are placed. Possible locations: VirtualHost, Directory or .htaccess.
(As always, a reminder: .htaccess costs performance, because Apache looks for and re-reads these files on every request!)

The skeleton consists of the following lines:

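RewriteEngine On
# the conditions listed below go in between
#...
# the last condition before the rule must not carry [OR],
# otherwise the rule can fire for every request
# and the rule closes the block:
RewriteRule ^.* - [F,L]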
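
To make the skeleton concrete, here is a minimal sketch of a complete block inside a VirtualHost. The host name example.com is a placeholder, and the two bot names are simply picked from the list below. Note that only the first condition carries [OR]: as far as mod_rewrite's evaluation goes, a trailing [OR] on the final condition lets the rule apply even when no condition matched.

```apache
<VirtualHost *:80>
    # example.com is a placeholder, not part of the original article
    ServerName example.com

    RewriteEngine On
    # every condition except the last one is chained with [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus
    # "-" leaves the URL untouched, F answers with 403 Forbidden
    RewriteRule ^.* - [F,L]
</VirtualHost>
```

The same four rewrite lines should also work unchanged inside a Directory block or a .htaccess file.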

Known 'bad' bots (list from Server-Wissen.de):

RewriteCond %{HTTP_USER_AGENT} ^Alexibot [OR]
RewriteCond %{HTTP_USER_AGENT} ^asterias [OR]
RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Black [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlowFish [OR]
RewriteCond %{HTTP_USER_AGENT} ^BotALot [OR]
RewriteCond %{HTTP_USER_AGENT} ^BuiltBotTough [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bullseye [OR]
RewriteCond %{HTTP_USER_AGENT} ^BunnySlippers [OR]
RewriteCond %{HTTP_USER_AGENT} ^Cegbfeieh [OR]
RewriteCond %{HTTP_USER_AGENT} ^CheeseBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Convera [OR]
RewriteCond %{HTTP_USER_AGENT} ^CopyRightCheck [OR]
RewriteCond %{HTTP_USER_AGENT} ^cosmos [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DataFountains [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DittoSpyder [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^Email [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "^Express WebPictures" [OR]
RewriteCond %{HTTP_USER_AGENT} ^Extractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Foobot [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} "^Global Confusion" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^hloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^httplib [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^IBM_Planetwide [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "^Indy Library" [OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} "^Internet Ninja" [OR]
RewriteCond %{HTTP_USER_AGENT} ^Jakarta [OR]
RewriteCond %{HTTP_USER_AGENT} ^JennyBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^Kenjin [OR]
RewriteCond %{HTTP_USER_AGENT} ^Keyword [OR]
RewriteCond %{HTTP_USER_AGENT} ^LexiBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^libWeb [OR]
RewriteCond %{HTTP_USER_AGENT} ^lwp [OR]
RewriteCond %{HTTP_USER_AGENT} ^Lynx [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mata [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} "^MIDown tool" [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIIxpc [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister [OR]
RewriteCond %{HTTP_USER_AGENT} ^moget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^NPBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Openfind [OR]
RewriteCond %{HTTP_USER_AGENT} "^Papa Foto" [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^ProPowerBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^ProWebWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^QueryN [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^RepoMonkey [OR]
RewriteCond %{HTTP_USER_AGENT} ^RMA [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SlySearch [OR]
RewriteCond %{HTTP_USER_AGENT} ^Snoopy [OR]
RewriteCond %{HTTP_USER_AGENT} ^SpankBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^spanner [OR]
RewriteCond %{HTTP_USER_AGENT} ^Super [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^suzuran [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Telesoft [OR]
RewriteCond %{HTTP_USER_AGENT} ^The.Intraformant [OR]
RewriteCond %{HTTP_USER_AGENT} ^TheNomad [OR]
RewriteCond %{HTTP_USER_AGENT} ^TightTwatBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Titan [OR]
RewriteCond %{HTTP_USER_AGENT} ^turingos [OR]
RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^URLy.Warning [OR]
RewriteCond %{HTTP_USER_AGENT} ^VCI [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^web [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^www [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
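
A long list like this can also be condensed with regex alternation. The following is only a sketch with five names taken from the list above, not a replacement for the full list; it stays case-sensitive like the original entries.

```apache
RewriteEngine On
# one grouped pattern stands in for five single-name conditions
RewriteCond %{HTTP_USER_AGENT} ^(Alexibot|asterias|BackDoorBot|Wget|Zeus)
# single condition, so no [OR] is needed before the rule
RewriteRule ^.* - [F,L]
```

Grouping keeps the file shorter, while one name per line is easier to diff and to comment out; which form is preferable is a matter of taste.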

Extensions:

And here are a few more aggressive rules. They match as substrings anywhere in the User-Agent (for example `crawl` or `robot`), so they may block too much. Everyone has to decide that for themselves.

RewriteCond %{HTTP_USER_AGENT} collect [NC,OR]
RewriteCond %{HTTP_USER_AGENT} crawl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} download [NC,OR]
RewriteCond %{HTTP_USER_AGENT} francis [NC,OR]
RewriteCond %{HTTP_USER_AGENT} grabb [NC,OR]
RewriteCond %{HTTP_USER_AGENT} harvest [NC,OR]
RewriteCond %{HTTP_USER_AGENT} httrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} leech [NC,OR]
RewriteCond %{HTTP_USER_AGENT} libwww [NC,OR]
RewriteCond %{HTTP_USER_AGENT} majestic [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ng-search [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} offline [NC,OR]
RewriteCond %{HTTP_USER_AGENT} omni [NC,OR]
RewriteCond %{HTTP_USER_AGENT} robot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} suck [NC,OR]
RewriteCond %{HTTP_USER_AGENT} sohu [NC,OR]
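
If these broad patterns catch a crawler that should stay allowed, one possible approach is to put a negated condition in front of the OR chain. This is a sketch under the assumption that conditions without [OR] are combined with AND, which results in "not GoodCrawler AND (crawl OR robot)"; the name GoodCrawler is hypothetical.

```apache
RewriteEngine On
# exempt a wanted crawler first ("GoodCrawler" is a hypothetical name)
RewriteCond %{HTTP_USER_AGENT} !GoodCrawler [NC]
# then the broad substring patterns, chained with [OR]
RewriteCond %{HTTP_USER_AGENT} crawl [NC,OR]
# the last condition carries no [OR]
RewriteCond %{HTTP_USER_AGENT} robot [NC]
RewriteRule ^.* - [F,L]
```

Since the User-Agent header can be forged freely, such an exemption is only as trustworthy as the client sending it.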

The following entries are forged browser identifiers meant to pass the bot off as a normal user:

RewriteCond %{HTTP_USER_AGENT} "^MSIE 6\.0" [OR]
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4\.0 \(compatible; MSIE 6\.0; Win32\)" [OR]
RewriteCond %{HTTP_USER_AGENT} "MSIE 6\.0b" [OR]

Finally, a few security rules against XSS and similar injection attempts via the query string:

RewriteCond %{QUERY_STRING} .*'.* [OR]
RewriteCond %{QUERY_STRING} .*%27.* [OR]
RewriteCond %{QUERY_STRING} .*".* [OR]
RewriteCond %{QUERY_STRING} .*%22.* [OR]
RewriteCond %{QUERY_STRING} .*`.* [OR]
RewriteCond %{QUERY_STRING} .*%60.* [OR]
RewriteCond %{QUERY_STRING} .*%25.* [OR]
RewriteCond %{QUERY_STRING} .*echr.* [OR]
RewriteCond %{QUERY_STRING} .*esystem.* [OR]
RewriteCond %{QUERY_STRING} .*passthru.* [OR]
RewriteCond %{QUERY_STRING} .*wget.*
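
The query-string checks can also live in their own block, separate from the bot list, so that either part can be switched off on its own. A minimal sketch that covers only the percent-encoded characters and the two command names from the list above:

```apache
RewriteEngine On
# %27 = ', %22 = ", %60 = `, %25 = % (all percent-encoded), plus two command names
RewriteCond %{QUERY_STRING} (%27|%22|%60|%25|passthru|wget)
RewriteRule ^.* - [F,L]
```

The leading and trailing `.*` of the original patterns is not needed here, because the pattern is not anchored and matches anywhere in the query string.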
