Ravens PHP Scripts: Forums
 

 

HauntedWebby
Involved


Joined: May 19, 2004
Posts: 363
Location: Ogden, UT

Posted: Wed Sep 29, 2004 12:57 pm

OK ... this is a stupid question ... but I have to ask :D

If I want to stop bots from visiting my site for a week, and I change my meta tag to META NAME="REVISIT-AFTER" CONTENT="7 DAYS", and change my robots.txt to Disallow: *.*

Will that stop them?

_________________
--Webby-- 
Raven
Site Admin/Owner


Joined: Aug 27, 2002
Posts: 17086

Posted: Wed Sep 29, 2004 9:20 pm

My understanding (copied from the Internet somewhere/sometime):

The robots.txt method is the best. You can also stack records in the file to create different rules for different bots, e.g.
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow: /images/

Broken down, the above means that all robots are barred from the site except Google, which can spider the lot, and Google's image indexer, which is allowed access to anything outside of the images directory. Google is very well behaved, and it even supports wildcard extensions, e.g.

Disallow: /*.pl$
will stop Google from indexing any .pl files but allow it to index anything else.
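
The stacked rules at the top of this post can be sanity-checked with Python's standard-library robots.txt parser. This is only a rough sketch: example.com and the bot name SomeOtherBot are placeholders that do not come from the thread, and the result shows how this one parser reads the file, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The stacked rules from the post, one record per user agent.
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow: /images/
"""


def allowed(agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, url)


# Placeholder URL and bot name (assumptions for illustration).
print(allowed("SomeOtherBot", "http://example.com/index.php"))  # False
print(allowed("Googlebot", "http://example.com/index.php"))     # True
```

How a more specific name such as Googlebot-Image is matched against the Googlebot record depends on the parser, so it is left out of the checks here.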

Also, there is an older method that some bots still use: a meta tag. It is now falling into obscurity, as almost every bot that recognises it recognises robots.txt as well. <meta name="robots" content="noindex, nofollow"> will stop older bots from indexing the page on which it is found, or following any links from that page. Similarly, <meta name="robots" content="index, nofollow"> will allow the bot to index that page, but not follow its links to spider the rest of the site.
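
To see which directives a page is actually sending through the robots meta tag, something like the following could be used. It is a minimal sketch built on Python's html.parser; the class name, the function name and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases tag and attribute names, but not values.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for item in attr.get("content", "").split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page for illustration.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(robots_directives(page))
```

A bot that honours the tag would skip indexing when "noindex" is in the set and skip the page's links when "nofollow" is in it.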

As to the REVISIT-AFTER meta tag, most of what I have read is that the search engines operate on their own schedule, but I am no expert in this area ;)
 
HauntedWebby
Posted: Thu Sep 30, 2004 9:31 am

I have in my robots.txt:

User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: admin.php
Disallow: /admin/
Disallow: /images/
Disallow: /includes/
Disallow: /themes/
Disallow: /blocks/
Disallow: /modules/
Disallow: /language/
Disallow: *.*

The Disallow: *.* line is new; the rest is there ALL the time. The one that seems to be hitting a lot is jeteye.com. I've never even heard of these guys :(

But my traffic has slowed a little. It wouldn't bother me but my traffic is almost 2GB a day. But the reports show that it's not the bots that are doing all the traffic, and I can't see any one file being downloaded a lot to figure out where all the bandwidth is being used.
 
Raven
Posted: Thu Sep 30, 2004 10:20 am

Have you reviewed AWStats from cPanel? If you want to disallow everything, then just use

Disallow: /

I don't know whether *.* works or not.
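
One rough way to check is to feed both variants to Python's standard-library parser, which does plain prefix matching on Disallow paths. This is a sketch under assumptions: example.com and SomeBot are placeholders, and a real crawler may interpret the line differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(rules: str, url: str, agent: str = "SomeBot") -> bool:
    """Parse `rules` as a robots.txt file and test `url` for `agent`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


URL = "http://example.com/index.php"  # placeholder URL

wildcard = "User-agent: *\nDisallow: *.*\n"
slash = "User-agent: *\nDisallow: /\n"

print(can_fetch(wildcard, URL))  # True: "*.*" is not a prefix of "/index.php"
print(can_fetch(slash, URL))     # False: every path starts with "/"
```

Under this parser the *.* line blocks nothing, while Disallow: / blocks the whole site.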
 
HauntedWebby
Posted: Thu Sep 30, 2004 10:59 am

I have reviewed the AWStats reports, and it's weird. Everything looks normal. I can't see any big usage on anything.

I've changed my setting to see if that works. :)
 
Muffin
Client


Joined: Apr 10, 2004
Posts: 649
Location: UK

Posted: Thu Sep 30, 2004 4:59 pm

Could be someone is linking to images on your site.

_________________
Classic Mini rules the bends & bends the rules!
HauntedWebby
Posted: Thu Sep 30, 2004 5:06 pm

Nope, that's not it. I watch that very closely :)

After talking with Raven we figured out what it was. I am actually using that much bandwidth. Now that I know what it is, I won't be so worried that someone is stealing bandwidth :)
 
Muffin
Posted: Fri Oct 01, 2004 2:08 pm

Glad you got it sussed
 
Raven
Posted: Sat Oct 02, 2004 5:08 pm

Just for clarification, I recently found this, which substantiates my suspicion that wildcards like *.* are not allowed/honored ;)
Quote:
Where do I find out how robots.txt files work?

For a complete overview of how robots.txt exclusion files work, visit: [link hidden by the forum for unregistered users]

The basic concept is simple. By writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:

# /robots.txt file for controlling indexing
User-agent: webcrawler
Disallow:

User-agent: googlebot
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

The first line, starting with '#', specifies a comment

The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.

The second paragraph indicates that the robot called 'googlebot' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token, meaning "any other User-agent"; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.

Two common errors:

Wildcards are not supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp/'.
You shouldn't put more than one path on a Disallow line.

Where do I find out more about robots?

There is a Web robots home page on: [link hidden by the forum for unregistered users]
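
The prefix matching described in the quoted FAQ can be sketched in a few lines. This is a simplified illustration of the rule "a Disallow value blocks every URL path that starts with it", not a full robots.txt parser; the function name and the sample paths are made up.

```python
def is_disallowed(path: str, disallow_rules: list[str]) -> bool:
    """Original-style matching: a rule blocks any path that starts with it.

    An empty rule (a bare "Disallow:") blocks nothing.
    """
    return any(rule != "" and path.startswith(rule) for rule in disallow_rules)


rules = ["/tmp", "/logs"]  # the 'User-agent: *' record from the quote

print(is_disallowed("/tmp/cache.html", rules))  # True
print(is_disallowed("/tmpfile.html", rules))    # True: "/tmp" is a bare prefix
print(is_disallowed("/index.html", rules))      # False

# With plain prefix matching the '*' is just a literal character,
# so 'Disallow: /tmp/*' does not match real paths under /tmp/.
print(is_disallowed("/tmp/cache.html", ["/tmp/*"]))  # False
```

Note that 'Disallow: /tmp' also catches /tmpfile.html, which is why the FAQ recommends 'Disallow: /tmp/' when only the directory is meant.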
 
All times are GMT - 6 Hours
 