Susann
Moderator


Joined: Dec 19, 2004
Posts: 3143
Location: Germany:Moderator German NukeSentinel Support

Posted: Wed Sep 21, 2005 4:55 pm

We all know that some search engines have a monopoly, so I think this is a great and very interesting project:
[link hidden by the forum for unregistered users]




But I wasn't very enthusiastic when I checked my log files. The crawler hits the same file several times. After watching this for weeks, I think the bot cannot handle the session IDs in the forums.

Code:
67.53.54.213 - - [20/Sep/2005:17:47:15 +0200] "GET /forum-11.html&sid=1b8043b9d6b72afe85a40aa8e8aedcce HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:20 +0200] "GET /forum-11.html&sid=1b88f66ad6a2a53de5cc20cf96cdf730 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:25 +0200] "GET /forum-11.html&sid=1b8a8ec7e7ee5117b23746551a59f242 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:30 +0200] "GET /forum-11.html&sid=1b90c226566b115aba82160c06199c48 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:35 +0200] "GET /forum-11.html&sid=1bb945a7ab5218f27bda9952a13befcf HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:40 +0200] "GET /forum-11.html&sid=1bc73a5c5e05535bf3e6f672ed040cbb HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:45 +0200] "GET /forum-11.html&sid=1bd1a7eab553b5fd522231ef679ad112 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:50 +0200] "GET /forum-11.html&sid=1be250dffaa8f3fe420cc513705d1553 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:55 +0200] "GET /forum-11.html&sid=1c084facd4dda1e9b6139147f85f7985 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"

67.53.54.213 - - [20/Sep/2005:17:48:04 +0200] "GET /forum-11.html&sid=1c44737502f47007f9ed91d176da7f27 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:48:08 +0200] "GET /forum-11.html&sid=1c4ae8a9b8cd3e2eb3fb1a47f418eef9 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:48:13 +0200] "GET /forum-11.html&sid=1c55d390f897683c30b7363d916713eb HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:48:17 +0200] "GET /forum-11.html&sid=1c61bf7c1bafb8229f24b61f8ef4222f HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:48:24 +0200] "GET /forum-11.html&sid=1ca8f8c274562ce712c37fbffcb0eccf HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:48:28 +0200] "GET /forum-11.html&sid=1cb9dde375e1be3033c0646b7d850617 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
djmaze
Subject Matter Expert


Joined: May 15, 2004
Posts: 689
Location: http://tinyurl.com/5z8dmv

Posted: Thu Sep 22, 2005 1:14 am

Next to no bots handle sessions, which is why it sometimes looks as if you have 500 visitors :)
Secondly, sids suck because a search engine relies on a list of URLs, and once URLs containing a sid are listed in the search engine your pages will be hit more and more abusively.

Get rid of the sid and everything should return to normal.
Susann
Moderator


Joined: Dec 19, 2004
Posts: 3143
Location: Germany:Moderator German NukeSentinel Support

Posted: Thu Sep 22, 2005 4:56 am

I know about our sid problem. None of the experts I asked was able to fix it.
But the bot above is an extreme example. I haven't seen other bots in our logs behave like this.
djmaze
Subject Matter Expert


Joined: May 15, 2004
Posts: 689
Location: http://tinyurl.com/5z8dmv

Posted: Thu Sep 22, 2005 5:47 am

It's d*** easy to fix: phpBB has a function that uses 'sid='. Remove that part and it works.
Susann
Moderator


Joined: Dec 19, 2004
Posts: 3143
Location: Germany:Moderator German NukeSentinel Support

Posted: Thu Sep 22, 2005 6:15 am

If you think it's so easy, you can take a look. I'll give you the address of our session.php.
djmaze
Subject Matter Expert


Joined: May 15, 2004
Posts: 689
Location: http://tinyurl.com/5z8dmv

Posted: Thu Sep 22, 2005 7:59 am

Look inside the function append_sid().
At its end it uses $SID, so you only have to figure out where $SID gets assigned a value.

At the end of the function session_begin() we see the line
Code:
$SID = 'sid=' . $session_id;

Change it to
Code:
$SID = '';

and watch the miracle happen ;)

An expert is no expert if he can't debug and backtrace what happens.
I just opened a copy of php-nuke, did a file search on 'sid=' and found the files within a second, and I'm no expert either, I'm just a code guru.
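
To illustrate why that one-line change is enough, here is a simplified stand-in for the idea behind append_sid(). It is a sketch under assumptions, not the actual phpBB or PHP-Nuke source: the name append_sid_sketch(), its explicit $sid parameter (playing the role of the global $SID) and the example URL are all hypothetical.

```php
<?php
// Simplified stand-in for the idea behind append_sid(); NOT the real phpBB code.
// $sid plays the role of the global $SID: either 'sid=<session id>' or ''.
function append_sid_sketch(string $url, string $sid): string
{
    // Nothing to add, or the URL already carries a sid: return it untouched.
    if ($sid === '' || strpos($url, 'sid=') !== false) {
        return $url;
    }
    // Start the query string with '?' unless the URL already has one.
    $separator = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $separator . $sid;
}

// Hypothetical example URL, for illustration only.
echo append_sid_sketch('viewforum.php?f=11', 'sid=abc123'), "\n"; // viewforum.php?f=11&sid=abc123
echo append_sid_sketch('viewforum.php?f=11', ''), "\n";           // viewforum.php?f=11
```

With an empty value, every generated link comes out session-free. The likely trade-off is that visitors whose browsers refuse cookies would no longer keep a session between page views.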
hitwalker
Sells PC To Pay For Divorce


Joined:
Posts: 5661

Posted: Thu Sep 22, 2005 8:26 am

lol...
majestic-12
New Member


Joined: Sep 22, 2005
Posts: 10

Posted: Thu Sep 22, 2005 5:55 pm

Hi there,

I am the creator of the bot. I found this forum just as you found my bot: from the log file :)

I am surprised the session ID was present in the URL, because a few months ago I implemented SID= filtering that removes these SIDs from URLs before deduplication. It's possible that the URLs that were crawled were loaded before that change. I could swear this code works, but I am going to double-check it. I can't guarantee that you won't have a few more of those URLs crawled, but hopefully a new batch of URLs won't have these SIDs.
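
As a rough illustration of what SID filtering before deduplication can look like, here is a minimal PHP sketch. It is not the bot's actual code: strip_sid() is a hypothetical helper, and the sample URLs are made-up, well-formed variants (with a proper ? delimiter) of the paths in the log excerpt.

```php
<?php
// Remove a 'sid' query parameter so URLs that differ only in session ID compare equal.
// Sketch only: fragments (#...) and upper-case 'SID' are not handled.
function strip_sid(string $url): string
{
    $pos = strpos($url, '?');
    if ($pos === false) {
        return $url; // no query string, nothing to filter
    }
    $base = substr($url, 0, $pos);
    parse_str(substr($url, $pos + 1), $params);
    unset($params['sid']);
    return $params ? $base . '?' . http_build_query($params) : $base;
}

// Assumed crawl queue with well-formed query strings.
$crawl_queue = [
    '/forum-11.html?sid=1b8043b9d6b72afe85a40aa8e8aedcce',
    '/forum-11.html?sid=1b88f66ad6a2a53de5cc20cf96cdf730',
    '/forum-29.html?sid=91d94aba163e7238014ed962c45d7b8d',
];

// Filter first, then deduplicate.
$unique = array_values(array_unique(array_map('strip_sid', $crawl_queue)));
print_r($unique); // two entries: /forum-11.html and /forum-29.html
```

Without the filtering step, all three entries would count as different pages and each would be fetched separately.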

Anyhow, it's a good idea to get rid of those SIDs, because even big search engines like Google get confused.

regards

alexc
Susann
Moderator


Joined: Dec 19, 2004
Posts: 3143
Location: Germany:Moderator German NukeSentinel Support

Posted: Thu Sep 22, 2005 6:17 pm

Hi,

Great project.
I also know the similar German project YACY.
But if you really implemented SID filtering, why doesn't it work? Your bot caused 180 MB of traffic on our website up to September 20. I know I waited a long time, because I thought something would change. I added the bot to my robots.txt today. I hope it doesn't ignore that.
majestic-12
New Member


Joined: Sep 22, 2005
Posts: 10

Posted: Thu Sep 22, 2005 6:20 pm

The bot does NOT ignore robots.txt, and it supports the Crawl-Delay parameter if you want a bigger delay between requests than the normal one (1 sec).
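
For illustration, a robots.txt entry along these lines might look as follows. This is a sketch under assumptions: the user-agent token is taken from the "MJ12bot" string in the log excerpt, the delay of 10 is an arbitrary example value assumed to be in seconds, and Crawl-Delay is a non-standard extension that not every crawler honours.

```
# robots.txt sketch; the delay value is only an example
User-agent: MJ12bot
Crawl-Delay: 10

# Alternatively, to keep this bot out of the site entirely:
# User-agent: MJ12bot
# Disallow: /
```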

I do have SID filtering implemented; however, I am going to recheck it and find out where the bug is. It could be a bug in the filtering code, or it could be that the URL was found before SID filtering was implemented.

I know about YACY, but I think the model they are trying to achieve, a P2P search engine, is not possible with the current level of consumer-grade hardware. I do not believe it will scale to WWW levels of billions of web pages.
Susann
Moderator


Joined: Dec 19, 2004
Posts: 3143
Location: Germany:Moderator German NukeSentinel Support

Posted: Thu Sep 22, 2005 6:32 pm

A few days ago I read about your discussion in YACY's forum. It was very interesting.
Anyway, I wish you the best for your project, and I hope I find a way to get rid of the sids.

:)
majestic-12
New Member


Joined: Sep 22, 2005
Posts: 10

Posted: Thu Sep 22, 2005 6:36 pm

Thanks, and best wishes for whatever you do in cyber life too! :)

I did have a friendly discussion with the Yacy people, but they did not agree with me, which is fine. Perhaps I am wrong and P2P is possible, but I choose to focus on something that can definitely happen :)

I will review SID removal before loading the next batch of data.
majestic-12
New Member


Joined: Sep 22, 2005
Posts: 10

Posted: Mon Sep 26, 2005 8:49 am

I am back!

As promised, I tested my code to see if there was a bug. My code was NOT removing your session ID, but there is a good reason for it: your URL is actually not correct, because you use a query parameter without first using the query delimiter ?.

I.e., you have the URL:

/forum-11.html&sid=1b8043b9d6b72afe85a40aa8e8aedcce

Note the &: you added the parameter right after the filename (which I am sure is used for an internal rewrite, but URL parsing does not know that).

A correct way to do it would have been this:

/forum-11.html?sid=1b8043b9d6b72afe85a40aa8e8aedcce

Use ? to start the query string, and then add your SID. This is the proper way to form URLs.

I just tested my code with it, and if you had used the proper query delimiter, the session IDs would have been removed :)

Now, what's the way forward here? Technically speaking it's your "fault", as you use non-standard-compliant URLs. However, I will write special parsing for the current load only to correct your URLs. You really need to fix them on your side, though, as you will confuse other search engines. Google might just check the URL and not crawl it in the first place, so you are losing out in terms of traffic anyway.
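
The difference between the two URL forms quoted above can be seen with any standard URL parser. The following sketch uses PHP's built-in parse_url() purely as an example parser; it says nothing about how the bot itself is implemented, and the variable names are illustrative.

```php
<?php
// Compare how a standard URL parser splits the two forms.
$malformed  = '/forum-11.html&sid=1b8043b9d6b72afe85a40aa8e8aedcce';
$wellformed = '/forum-11.html?sid=1b8043b9d6b72afe85a40aa8e8aedcce';

$a = parse_url($malformed);
$b = parse_url($wellformed);

var_dump(isset($a['query'])); // bool(false): the whole string is treated as the path
echo $a['path'], "\n";        // /forum-11.html&sid=1b8043b9d6b72afe85a40aa8e8aedcce
echo $b['path'], "\n";        // /forum-11.html
echo $b['query'], "\n";       // sid=1b8043b9d6b72afe85a40aa8e8aedcce
```

Because the first form has no query component at all, a filter that only inspects query parameters never sees the sid.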
hitwalker
Sells PC To Pay For Divorce


Joined:
Posts: 5661

Posted: Mon Sep 26, 2005 9:04 am

That would be bad news for Susann, because her number 1 priority is a good rank...
Well Susann, you have some work to do...
Susann
Moderator


Joined: Dec 19, 2004
Posts: 3143
Location: Germany:Moderator German NukeSentinel Support

Posted: Mon Sep 26, 2005 1:29 pm

Thanks, majestic-12, for your informative reply.
Well, I'm really not an expert, but I thought our rewritten URLs were correct. I'm using GT Next Gen. I had enough stress with this forum in the past, and I'm sure it's not a typical nuke forum. But it's working and our members like it. I don't touch the forum files anymore, but at the moment I'm looking for someone who can fix some issues.
Thanks again.

Btw: I know Bull from my other forum. Was he a beta tester?

@Hitwalker
Don't worry.

Our ranking is good enough. We get between 5 and 15 new members daily. You know which game we support. Check us out ;)


Last edited by Susann on Mon Sep 26, 2005 1:33 pm; edited 1 time in total
majestic-12
New Member


Joined: Sep 22, 2005
Posts: 10

Posted: Mon Sep 26, 2005 1:32 pm

Your rewrite is fine; it's just the SID bit that's the problem. If I were you I'd disable it completely, because even though my bot understands it (provided the URL is properly formatted), others won't.
Susann
Moderator


Joined: Dec 19, 2004
Posts: 3143
Location: Germany:Moderator German NukeSentinel Support

Posted: Mon Sep 26, 2005 1:44 pm

Quote:
just the SID disable it completely
That isn't so easy.
I know how other webmasters have optimized their phpBB forums to be search-engine friendly. They don't disable the sids completely.
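
One way to read "not disabling the sids completely" is to omit the sid only for requests that look like crawlers. The sketch below is an assumption about that general approach, not code from any particular phpBB modification: sid_for_request() is a hypothetical helper, and the marker list is just the set of bot names that appear in this thread's log excerpts, not a complete list.

```php
<?php
// Hypothetical helper: decide what $SID should contain for the current request.
// The marker list is illustrative only and far from complete.
function sid_for_request(string $session_id, string $user_agent): string
{
    $bot_markers = ['MJ12bot', 'Googlebot', 'Slurp', 'ZyBorg', 'Seekbot'];
    foreach ($bot_markers as $marker) {
        if (stripos($user_agent, $marker) !== false) {
            return ''; // looks like a crawler: emit session-free links
        }
    }
    return 'sid=' . $session_id; // ordinary visitor: keep the session parameter
}

// Possible use at the point where $SID is assigned (assumed integration point):
// $SID = sid_for_request($session_id, $_SERVER['HTTP_USER_AGENT'] ?? '');
```

User-agent matching is easy to fool and needs upkeep as new bots appear, so this is a compromise rather than a complete fix.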
majestic-12
New Member


Joined: Sep 22, 2005
Posts: 10

Posted: Mon Sep 26, 2005 2:40 pm

Susann wrote:
Quote:
just the SID disable it completely
That isn't so easy.


You are probably right about this, but you definitely need to fix your URLs by changing & to ?, because without it a URL parsing routine will think it's a fancy filename rather than a query parameter.
Susann
Moderator


Joined: Dec 19, 2004
Posts: 3143
Location: Germany:Moderator German NukeSentinel Support

Posted: Mon Sep 26, 2005 3:46 pm

Quote:
but you definitely need to fix your URLs by changing & to ?

Yes, I understand, but how do I do this exactly?

-----------------------------------------------------------------------------
Really, I ask myself daily which problem and fault I should fix first.


Last edited by Susann on Mon Sep 26, 2005 4:02 pm; edited 1 time in total
majestic-12
New Member


Joined: Sep 22, 2005
Posts: 10

Posted: Mon Sep 26, 2005 3:53 pm

Susann wrote:
Yes, I understand, but how do I do this exactly ?


Well, you will need to edit the code that appends those SIDs. From what I can see, you must be using some internal rewriting to get nice ".html" URLs, and that code incorrectly adds the SID without first starting the query with the ? character.

I changed my URL loader to fix this for you, but I doubt other search engines will do the same.
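
A crawler-side special case of the kind described above could look roughly like the sketch below. It is only a guess at the general shape of such a repair, not the bot's real URL loader: repair_sid_delimiter() is a hypothetical name, and the heuristic can misfire on paths that legitimately contain "&sid=".

```php
<?php
// Heuristic repair: if a URL has no '?' but contains '&sid=', treat that '&'
// as the start of the query string. Hypothetical sketch, not the bot's code.
function repair_sid_delimiter(string $url): string
{
    if (strpos($url, '?') !== false) {
        return $url; // already has a query string
    }
    $pos = strpos($url, '&sid=');
    if ($pos === false) {
        return $url; // nothing to repair
    }
    return substr_replace($url, '?', $pos, 1); // swap the single '&' for '?'
}

echo repair_sid_delimiter('/forum-11.html&sid=1b8043b9d6b72afe85a40aa8e8aedcce'), "\n";
// /forum-11.html?sid=1b8043b9d6b72afe85a40aa8e8aedcce
```

Once the delimiter is repaired, an ordinary query-parameter filter can recognise and drop the sid.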
Susann
Moderator


Joined: Dec 19, 2004
Posts: 3143
Location: Germany:Moderator German NukeSentinel Support

Posted: Thu Oct 27, 2005 5:04 pm

I did my homework. There aren't any sessions for bots.
But I'm surprised that your bot still gets a sid. :roll:
I swear I couldn't find any session IDs when I checked my log files, nor when I used some helpful tools to check my site. Really strange. I'm also surprised that we seem to be the only ones on earth with this Majestic bot problem, because a lot of php-nuke sites have the same sid problem. Any other ideas?
Code:
62.20.222.103 - - [26/Oct/2005:02:07:12 +0200] "GET /forum-29.html&sid=91d94aba163e7238014ed962c45d7b8d HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"
62.20.222.103 - - [26/Oct/2005:02:07:14 +0200] "GET /forum-29.html&sid=920f3fde9d8138a242ef4b537fae9a4a HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"
62.20.222.103 - - [26/Oct/2005:02:07:16 +0200] "GET /forum-29.html&sid=97668c6ea394a2ebf3674c2c82bc34f6 HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"
62.20.222.103 - - [26/Oct/2005:02:07:24 +0200] "GET /forum-29.html&sid=97cc19a6cb5bffedac687f2c5525fbe0 HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"
62.20.222.103 - - [26/Oct/2005:02:07:25 +0200] "GET /forum-29.html&sid=9956ef31690f42e7e14078ad52ea3cb7 HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"
62.20.222.103 - - [26/Oct/2005:02:07:27 +0200] "GET
------------cut--------------------


64.242.88.50 - - [26/Oct/2005:12:57:14 +0200] "GET /forum-29.html HTTP/1.1" 200 11200 "-" "Mozilla/4.0 compatible ZyBorg/1.0 Dead Link Checker (wn.dlc@looksmart.net; http://www.WISEnutbot.com)"


66.249.64.66 - - [26/Oct/2005:14:51:16 +0200] "GET /forum-29.html HTTP/1.0" 200 24772 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"


Code:
 
------------
12.207.8.100 - - [24/Oct/2005:00:25:59 +0200] "GET /forums-faq.html&sid=cfbe5819c57d46870048770cca259052 HTTP/1.1" 200 53029 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"


195.27.215.91 - - [24/Oct/2005:03:12:07 +0200] "GET /forums-faq.html HTTP/1.0" 200 52980 "-" "Seekbot/1.0 (http://www.seekbot.net/bot.html) HTTPFetcher/0.3"


66.249.71.10 - - [24/Oct/2005:09:24:53 +0200] "GET /forums-faq.html HTTP/1.0" 200 52980 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
Code:
 

212.239.212.249 - - [24/Oct/2005:10:35:42 +0200] "GET /forum-19.html&sid=07298bf4713e2e3e08cf383f57b19e43 HTTP/1.1" 200 32890 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"

68.142.251.26 - - [24/Oct/2005:22:51:20 +0200] "GET /forum-19.html HTTP/1.0" 200 6239 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
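
For checking log files like the excerpts above more systematically, a small script can tally which user agents still request URLs containing a sid. This is a minimal sketch assuming Apache combined log format; count_sid_hits() is a hypothetical helper, and the two sample lines are copied from the excerpt above.

```php
<?php
// Tally, per user agent, how many requests carried a sid in the requested path.
function count_sid_hits(array $log_lines): array
{
    $counts = [];
    foreach ($log_lines as $line) {
        // Combined log format: ... "METHOD path PROTOCOL" status bytes "referer" "user agent"
        if (!preg_match('/"(?:GET|POST|HEAD) (\S+) [^"]*" \d{3} \S+ "[^"]*" "([^"]*)"/', $line, $m)) {
            continue; // skip lines that do not match, e.g. truncated entries
        }
        [, $path, $agent] = $m;
        if (preg_match('/[?&]sid=[0-9a-f]+/i', $path)) {
            $counts[$agent] = ($counts[$agent] ?? 0) + 1;
        }
    }
    return $counts;
}

$sample = [
    '212.239.212.249 - - [24/Oct/2005:10:35:42 +0200] "GET /forum-19.html&sid=07298bf4713e2e3e08cf383f57b19e43 HTTP/1.1" 200 32890 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"',
    '68.142.251.26 - - [24/Oct/2005:22:51:20 +0200] "GET /forum-19.html HTTP/1.0" 200 6239 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
];
print_r(count_sid_hits($sample)); // one entry: the MJ12bot user agent with a count of 1
```

In practice the lines would come from the real access log, for example via file() with the FILE_IGNORE_NEW_LINES flag.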
majestic-12
New Member


Joined: Sep 22, 2005
Posts: 10

Posted: Thu Oct 27, 2005 5:17 pm

Susann, these are likely to be old URLs. The changes you make on the site do not have an immediate effect on old crawled data. Note that the URLs you referenced exhibit the same error we discussed above (no query string indicator ?), which means they were parsed from older pages, from before you had this problem fixed.
Susann
Moderator


Joined: Dec 19, 2004
Posts: 3143
Location: Germany:Moderator German NukeSentinel Support

Posted: Thu Oct 27, 2005 5:24 pm

I supposed so. But what should I do? Would it be a good idea to ban all your IPs for the next 1-3 months, or for how long?
majestic-12
New Member


Joined: Sep 22, 2005
Posts: 10

Posted: Thu Oct 27, 2005 7:27 pm

Susann wrote:
Would it be a good idea to ban all your IPs for the next 1-3 months, or for how long?


No, it would not be a good idea, because we use a distributed model and the number of IPs is very high, with new ones constantly being added. This simply won't work. If you want the bot to stop crawling, you can use robots.txt, or I can add your domain to the list of domains that should not be crawled.
Susann
Moderator


Joined: Dec 19, 2004
Posts: 3143
Location: Germany:Moderator German NukeSentinel Support

Posted: Fri Oct 28, 2005 8:17 am

Thanks, majestic-12, for your replies.
They gave me the information I needed.
All logos and trademarks in this site are property of their respective owner.
The comments are property of their posters, all the rest © 2002-2011 by Raven