Ravens PHP Scripts: Forums
 

 

mrix
Client


Joined: Dec 04, 2004
Posts: 757

Posted: Mon Jul 25, 2005 2:55 am

Hello all, I was wondering if anyone has any ideas about the new Google sitemap. How do you go about adding a sitemap in the right format for a PHP-Nuke site, etc.?
thanks for any help
Cheers
mrix


Last edited by mrix on Thu Sep 08, 2005 2:30 pm; edited 1 time in total 
mrix
Posted: Mon Jul 25, 2005 3:20 am

I managed to find this info on this site: [link hidden]

It's still a bit beyond me, I think. Are there any new options for adding a Google sitemap, or will I just have to tax the old brain a bit more? :)

Cheers
all
mrix
 
grantb
Regular


Joined: Feb 16, 2005
Posts: 67
Location: Canada

Posted: Mon Jul 25, 2005 2:53 pm

I just went through the process myself and found a few tools to get the job done. Check out my post: [link hidden]. I tried a few different programs and gave some info on each one that I tried. :)

Dauthus
Worker


Joined: Oct 07, 2003
Posts: 211

Posted: Mon Jul 25, 2005 4:40 pm

I used the one here. It worked easily and even submits your sitemap to Google for you. It was fairly easy, just upload and run. It even lets you change your settings and such.
[link hidden]

kguske
Site Admin


Joined: Jun 04, 2004
Posts: 6383

Posted: Mon Jul 25, 2005 8:34 pm

Dauthus, is it stable? I thought I downloaded that last week - and that was 2 versions ago... It does show that it's an active development effort - good news.

Dauthus
Posted: Mon Jul 25, 2005 8:48 pm

Worked fine for me. I couldn't get Google's version and instructions to work on my server. This went over without a hitch.
 
Guardian2003
Site Admin


Joined: Aug 28, 2003
Posts: 6793
Location: Ha Noi, Viet Nam

Posted: Tue Jul 26, 2005 2:04 am

Dauthus wrote:
<snip> I couldn't get google's version and instructions to work on my server. <snip>

Yeah, not everyone has Python installed, hence the news article I posted.
I'll certainly revisit some of the other offerings mentioned in this thread though as my own method is a little long winded, though it has produced some excellent results for me.
 
mrix
Posted: Tue Jul 26, 2005 2:26 am

Hello again all, I tried the various methods, and grantb's page had loads of great info which worked for me. I personally couldn't get the other method to work for some reason.
Thanks
mrix
 
Guardian2003
Posted: Tue Jul 26, 2005 3:06 am

I just tried the software from enarion.net as per the link posted above. This seems to be a vast improvement on their initial offering and is looking pretty good so far. :)
 
Guardian2003
Posted: Fri Jul 29, 2005 3:57 am

Hmm, I had a few problems with that script not overwriting the previously created file, and the server load went a bit dodgy at times, so I hunted around some more.

I found a Windows-based program that looked very promising - run it from your PC and it crawls the site for you, yadda yadda.
All the usual features, and one nice touch I particularly liked was that it fetched the robots.txt file from the site first and added that to the 'blocked content' list. This saved quite a few minutes, as it meant the majority of directories I didn't want the software to crawl got added automatically.
Another nice feature, apart from the automated file upload facility (optional) and the Google pinging (optional), is the automatic Google sitemap schema compliance checking - for example, a URL crawled with '&' in it automatically gets changed (upon sitemap generation) to '&amp;'.
If I have piqued your interest, the software is currently free from [link hidden]

As usual, I would appreciate any comments from anyone else who tries it out.

My site uses GoogleTap, so I cannot vouch for the number of URLs this software will crawl without it, but I managed to get links to every forum, every forum post, every news article plus the category links etc. I also have the Amazon module installed on my site (just for fun) and it did a fantastic job of creating lots of valid links to products - something those who run commercial sites might find of particular interest.
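
For anyone curious what the '&' handling mentioned above amounts to, here is a minimal Python sketch. It is not the program's actual code; the function name `sitemap_loc` and the example.com URL are made up for illustration. It simply entity-escapes a URL before writing it into a sitemap `<loc>` element.

```python
from xml.sax.saxutils import escape


def sitemap_loc(url: str) -> str:
    """Wrap a URL in a <loc> element, escaping XML special characters."""
    # escape() turns &, < and > into &amp;, &lt; and &gt;
    return "<loc>" + escape(url) + "</loc>"


# Hypothetical forum URL with two query parameters.
print(sitemap_loc("http://www.example.com/modules.php?name=Forums&file=viewtopic"))
```

Note that a URL which has already been escaped would be escaped a second time by this sketch, so it should only be fed raw URLs.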
 
Dauthus
Posted: Fri Jul 29, 2005 9:26 am

I'm going to give it a shot today. I'll let you know how it goes.
 
Dauthus
Posted: Sat Jul 30, 2005 8:32 am

OK, here's what I have so far using the SOFTplus GSiteCrawler that Guardian2003 spoke of at [link hidden]

So far it has been crawling for 2 days and has found 17647 links. This is taking quite a while, but it is hitting everything in the site. The xml it last uploaded was over 559.000 kb.

I don't know a whole lot about this, but isn't this kind of large??
 
mrix
Posted: Sun Jul 31, 2005 5:06 am

I have found that when I try to scan my subdomain, it tends to scan my main website files as well. Is there a way around this?
thanks
mrix
 
Guardian2003
Posted: Sun Jul 31, 2005 6:02 am

mrix wrote:
I have found when I try to scan my subdomain it tends to scan my main web site files as well is there a way around this ?
thanks
mrix

When you are creating the 'New Project' from the top menu bar, are you entering the full subdomain URL, e.g. [link hidden]?
I have not tried it on a subdomain yet but I'll try to get around to that soon.
It might be worth entering the normal site URL ( [link hidden] ) in the 'banned url's' area; this may stop it from crawling the whole site and keep it within the subdomain.

Dauthus - 559k is not large at all; Google will accept single sitemaps of up to 10 MB in size. The crawling does seem very slow though - there is an option to increase the number of bots crawling simultaneously, so you might want to check that out. I had mine set at 10 bots and it took about 3 hours to crawl my site, giving me around 15,000 valid links.
 
softplus
New Member


Joined: Jul 31, 2005
Posts: 8
Location: Switzerland

Posted: Sun Jul 31, 2005 7:17 am

Hi guys
just a short note, I'm the one behind the GSiteCrawler (looks like tracking referrers is worth it after all :)).

@Dauthus: the size doesn't matter, the program will automatically make a new sitemap file every 9-10MB (or 50'000 URLs; you can adjust this value) + it will automatically gzip the file to save bandwidth. If you have several sitemap files, it will make a sitemap-index file which contains links to your sitemap files. You will then just need to submit the sitemap-index file. (You can have up to 1000 sitemaps in a sitemap index file, i.e. max 50 Mil. URLs :))
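
To make the splitting idea above concrete, here is a rough Python sketch, not how GSiteCrawler itself is implemented. It splits only on the 50,000-URL count (the 9-10MB size check and the gzip step are left out), the example.com URLs and file names are invented, and the namespace is the current sitemaps.org one, which is an assumption and may differ from the schema in use when this thread was written.

```python
from xml.sax.saxutils import escape

# Assumption: the current sitemaps.org namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50000  # per-file URL limit mentioned in the post


def chunk(urls, size=MAX_URLS):
    """Split the URL list into consecutive groups of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap(urls):
    """Return the XML text of one sitemap file listing the given URLs."""
    entries = "".join("  <url><loc>%s</loc></url>\n" % escape(u) for u in urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="%s">\n%s</urlset>\n' % (SITEMAP_NS, entries))


def build_index(sitemap_urls):
    """Return the XML text of a sitemap index pointing at each sitemap file."""
    entries = "".join(
        "  <sitemap><loc>%s</loc></sitemap>\n" % escape(u) for u in sitemap_urls
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<sitemapindex xmlns="%s">\n%s</sitemapindex>\n' % (SITEMAP_NS, entries))


if __name__ == "__main__":
    base = "http://www.example.com/"  # hypothetical site
    urls = [base + "article%d.html" % n for n in range(120000)]
    parts = chunk(urls)
    names = [base + "sitemap%d.xml.gz" % (i + 1) for i in range(len(parts))]
    print(len(parts), "sitemap files needed")
    print(build_index(names))
```

Only the index file would then be submitted; it points at each of the individual sitemap files.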

However, it does seem a bit strange that after 2 days it still only has 17k URLs, that's only about 6 URLs/Minute if it ran the whole time (if I can calculate right). I'm guessing you're either serving large pages or your server is bogging down. On a 512k ADSL line I get up to 40 URLs/Minute sustained. If you have a copy of your site locally, you should get up to 300 URLs/Minute.

@mrix: Try the current beta version (a new one is coming today as well). I know this was an issue with one of the earlier versions. Also make sure you have your subdirectory in the Main-URL field. It will then only include pages in or "below" that directory.

Let me know if you find other issues or come up with neat new ideas!
Thanks
 
mrix
Posted: Sun Jul 31, 2005 7:29 am

Hi, the program works really well on my main site, but I still can't get the subdomain to scan right. I look forward to downloading the latest version if it's out today.
thanks for this great program
mrix
 
Dauthus
Posted: Sun Jul 31, 2005 5:23 pm

Softplus. Maybe because me no have DSL. My connection is a little slower than most, so I don't have a problem with it taking a while. Thanks for the info. When my site comes back together (Lost the hard drive on my server) I will run it again.
 
mrix
Posted: Mon Aug 01, 2005 1:25 am

Great, I have now installed the updated version and my subdomain problem is fixed; it scans OK now. :)

Thanks
mrix
 
Steptoe
Involved


Joined: Oct 09, 2004
Posts: 293

Posted: Mon Aug 01, 2005 2:44 pm

Used it last night...been looking at others over the last month...I think it worked great. Crawling without filters took ages.
The site has about 1200 posts and 200 members, with Google Tap installed.
After 3 crawls on a P3 with 512 MB RAM over the LAN (nearly 7000 hits in total) and about 5 hrs, I ended up with a sitemap of 70k with exactly what should be in it. At first it was picking up stuff like reply-to links and new posts, member profiles, gallery admin stuff etc. (maybe I shouldn't have crawled from a machine that was logged in as admin and member lol?). I ended up filtering with the following (this also took out gallery pics):

Statistics for kakariki.net2 Date: 02/08/2005 00:31:47

Main URL: [link hidden] (not case sensitive)

Number of URLs listed total: 498
Number of URLs listed to be included: 469
Number of URLs listed to be crawled: 469
Number of URLs still waiting in the crawler: 2466 (may include some already listed)
Number of URLs aborted in the crawler: 0

Ban URLs with these texts in them from being crawled:
[link hidden] admin.php
[link hidden] admin/
[link hidden] blocks/
[link hidden] Conservation_Sites/
[link hidden] db/
[link hidden] images/
[link hidden] includes/
[link hidden] language/
[link hidden] Members_Sites/
[link hidden] modules/Approve_Membership/
[link hidden] modules/Encyclopedia/
[link hidden] modules/FAQ/
[link hidden] modules/Members_List/
[link hidden] modules/MS_Analysis/
[link hidden] modules/Private_Messages/
[link hidden] modules/Statistics/
[link hidden] modules/Submit_News/
[link hidden] modules/Surveys/
[link hidden] modules/WeatherHarvest/
[link hidden] modules/Who-is-Where/
[link hidden] modules/Your_Account/
[link hidden] msaworkflag/
[link hidden] Sounds/
[link hidden] themes/

Drop these parts of URLs being crawled:
-new-
account
addlink
date
days
download-file
edit
faq
forums-cp
freind
fsearch
group
harvest
hits
kist
mark
messages-post
next
order
pass
popular
popup
previous
print
profile
quote
random
rate
redirect
reply
rss
slideshow
statis
stats
submit
title
topd
uname
unanswered
userinfo
weather
your_

Remove these parameters when specified while being crawled:
PhpSessId
PhpSessionId
Session
SessionId
SID
XTCsid

ROBOTS.TXT from 01/08/2005 22:24:14:
User-agent: Mediapartners-Google*
Disallow:


User-agent: *
Disallow: /admin.php
Disallow: /admin/
Disallow: /images/
Disallow: /includes/
Disallow: /themes/
Disallow: /blocks/
Disallow: /db/
Disallow: /msaworkflag/
Disallow: /Sounds/
Disallow: /modules/Private_Messages/
Disallow: /modules/FAQ/
Disallow: /modules/Members_List/
Disallow: /Members_Sites/
Disallow: /Conservation_Sites/
Disallow: /modules/MS_Analysis/
Disallow: /modules/Statistics/
Disallow: /modules/Submit_News/
Disallow: /modules/Surveys/
Disallow: /modules/Your_Account/
Disallow: /modules/Approve_Membership/
Disallow: /modules/Who-is-Where/
Disallow: /modules/WeatherHarvest/
Disallow: /modules/Encyclopedia/
Disallow: /language/



End of data.
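
For readers wondering how the three filter lists in the dump above might combine, here is a minimal Python sketch. It reflects one reading of the settings, not GSiteCrawler's real logic: "ban" and "drop" entries are both treated as case-insensitive substring matches that exclude a URL, and the session parameters are stripped from the query string. The shortened lists, the function names and the example.com URLs are all illustrative.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Shortened, illustrative versions of the lists in the dump above.
BANNED_TEXTS = ["/admin.php", "/admin/", "/modules/Your_Account/"]
DROP_PARTS = ["reply", "quote", "profile", "print"]
SESSION_PARAMS = ["PhpSessId", "PhpSessionId", "Session", "SessionId", "SID", "XTCsid"]


def is_excluded(url, texts):
    """True if any of the texts occurs in the URL (case-insensitive)."""
    lowered = url.lower()
    return any(t.lower() in lowered for t in texts)


def strip_params(url, names):
    """Remove the named query parameters (case-insensitive) from the URL."""
    drop = {n.lower() for n in names}
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in drop]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


def filter_urls(urls):
    """Apply the exclusion lists, strip session parameters, drop duplicates."""
    result = []
    for url in urls:
        if is_excluded(url, BANNED_TEXTS) or is_excluded(url, DROP_PARTS):
            continue
        cleaned = strip_params(url, SESSION_PARAMS)
        if cleaned not in result:
            result.append(cleaned)
    return result


if __name__ == "__main__":
    crawled = [
        "http://www.example.com/ftopict-12.html?sid=abc",
        "http://www.example.com/ftopict-12.html",
        "http://www.example.com/admin.php",
        "http://www.example.com/posting.html?mode=reply&t=12",
    ]
    for url in filter_urls(crawled):
        print(url)
```

Stripping the session parameter is what lets the first two URLs collapse into a single sitemap entry.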


I left frequency at 1 and priority at 0.50...didn't quite know what to do there.

Google previously has only hit the front page; other crawlers (MSN, AOL, Yahoo) have tended to go a little deeper, hence the site is only rated 5 to 8 on Google and 1 or 2 on the other search engines...so it's a matter of wait and see what happens now lol


Last edited by Steptoe on Tue Aug 02, 2005 2:25 pm; edited 1 time in total 
softplus
Posted: Mon Aug 01, 2005 2:57 pm

@Steptoe: sounds good. Yes it makes a BIG difference on a "complex" site (i.e. most CMS-based sites) if you play with the filter settings or not. These types of things have been known to keep Google (and other SEs) from indexing - if it can't figure out which parameters are important and which to ignore, it ends up either indexing the same content twice (will either keep one or remove both) or it ends up indexing an invalid page (+ remove it). You can watch this in your log files sometimes.

This is where Google sitemaps really shine - you're telling Google the URLs which are valid, and it thanks you by putting them in its index. One thing you can do to help other search engines get it right as well is to add an HTML sitemap to your site (linked from the front page). You can have the GSiteCrawler generate the page when you do the Google sitemap and you'll just need to update it regularly on the server (I'm going for automation next :)). The HTML sitemap doesn't even need to be "visible" to the user, as long as the SEs find it, it's ok.

Keep it up!
 
Guardian2003
Posted: Mon Aug 01, 2005 11:33 pm

Yes I agree, I am using both XML and HTML sitemaps on my site.

Steptoe - if you had used the option for it to upload the robots.txt file, you would have found most of the 'rubbish' links already excluded, especially those 'reply' and 'quote' links for the forum.
Generally speaking, it is probably better to exclude the 'modules' directory from being crawled, as you have found out, but there are instances when selective crawling of the modules directory can be beneficial, e.g. if you are using the weblinks module.
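
As a rough illustration of how robots.txt rules can pre-filter crawl results, here is a short Python sketch using the standard library's `urllib.robotparser`. This is not the crawler's own code; the rules are a few lines borrowed from the robots.txt posted earlier in the thread, and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A few Disallow rules taken from the robots.txt shown earlier in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin.php
Disallow: /admin/
Disallow: /modules/Private_Messages/
Disallow: /modules/Your_Account/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from text instead of fetching


def allowed(url, agent="*"):
    """True if the robots.txt rules permit the given agent to fetch the URL."""
    return parser.can_fetch(agent, url)


for url in ("http://www.example.com/admin.php",
            "http://www.example.com/ftopict-12.html",
            "http://www.example.com/modules/Your_Account/"):
    print(url, "->", "crawl" if allowed(url) else "skip")
```

The rules only match on the URL path prefix, which is why query-string based links still need the separate 'parts of URLs' filters.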

Softplus - one feature I would like to see: when you add additional 'parts of urls not to be crawled' during a crawl, there should be some sort of hard refresh that removes any matching links already crawled. At the present time, it is difficult to tell whether any additional 'exclusion' criteria have been acknowledged, as it still seems to crawl them for some time, and removing links manually after a crawl is finished can be tedious on big sites.
Another feature I would like to see is the ability to 'find and replace'. For example, your software does a very nice job of automatically converting '&' to '&amp;' - it would be nice to cater for future expansion by allowing additional find/replace strings.
 
Steptoe
Posted: Tue Aug 02, 2005 12:13 am

Quote:
Steptoe - if you had used the option for it to upload the robots.txt file, you would have found most of the 'rubbish' links already excluded, especially those 'reply' and 'quote' links for the forum.


Yeah I know, but if you look at my robots.txt it still includes forums anyway.
As I am on LAN @100meg, bandwidth etc. wasn't an issue, and as far as time is concerned, there was a good movie on TV... (trying out the new projector lol)
When getting software I like to see what it does...

Another couple of suggestions lol:
1/ When one puts a parameter into the search for part of a URL, it would help if there was also a button that enabled unselecting the crawl and include tick boxes.
2/ If one has 2 projects with the same URL, it scans both?? It would be good if it could just do 1, or if one could select which ones.

I think there is something wrong with my Google Tap...MSN is the only one that searches the forums...almost everything...none of the others do (not what I said above). Also, Google searched today and again only hit the front page. If I click the forums link I get [link hidden]
If I type [link hidden] that will also go to the forums...is it meant to be like that?
Also, in the Google Tap box all that's there is the home page??
I'm having a lot of trouble getting my head around all this...give me hardware or a network any day lol
 
softplus
Posted: Tue Aug 02, 2005 1:46 am

@Guardian2003: Refiltering is high on my list, thanks for the confirmation. Search + Replace in the URLs sounds like a good idea, I'll add it to my list as well.

@Steptoe: To your first suggestion - it's already possible. Just enter a text to search, click on "select" - it will select the matching URLs. Now below the search box you have a small "toolbar" where you can change all the settings, i.e. Crawl / Manual / Include: on / off / swap, the same for priority + frequency. That way you can quickly change the settings everywhere. As to your second question, I didn't quite understand what you mean. If you have 2 projects, even with the same URL, they will be treated differently in the program (i.e. different filter settings, etc.). Or do you mean something else? Perhaps I just need another coffee to understand :).

Thanks for the feedback, guys!
 
Steptoe
Posted: Tue Aug 02, 2005 2:19 pm

When I did the 1st scan (no robots) I called it project 1, and it came up with heaps.
The 2nd scan I called project 2, with robots and filters.
When the 2nd scan was finished, the 1st scan was the same as the 2nd.
Both had the same URL...
Maybe I did something while playing around...working long days and it was late at night, bad combo...story of my life at the moment...2 days off in 6 weeks, not good.
I must still compliment you on the program...
Please forgive me for asking what seem to be stupid questions. There are many other lay people who don't have a grasp of stuff coders will often take for granted.
What exactly is the effect of changing frequency/priority?
What is the difference between using the sitemap.xml.gz and the sitemap.xml?
I assume one has to submit one or the other to Google?
Once submitted, does Google still use other files, and do they override the sitemap?
 
softplus
Posted: Tue Aug 02, 2005 2:53 pm

Don't worry about "stupid" questions, it's the stupid answers you have to be afraid of :). These are all things which are really new and Google doesn't explain them very well (sometimes), so they are left for people to guess at...

Frequency is meant to specify how often the Googlebot should check your page for changes. In my program it just takes the age (in days) and shows that. For the sitemap.xml it is translated to always / hourly / daily / weekly / monthly and yearly. Seeing that the Googlebot doesn't have much free time, I guess that anything higher than weekly is (for most "normal" sites) probably just a dream.
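
To illustrate the translation from an age in days to a sitemap frequency keyword, here is a small Python sketch. The cut-off values are purely hypothetical, chosen for the example; the thread does not say which thresholds the program really uses, and the "always" keyword is left out.

```python
def changefreq_from_age(age_days: float) -> str:
    """Map a page age in days to a sitemap <changefreq> keyword.

    The thresholds are assumptions for illustration only.
    """
    if age_days < 1:
        return "hourly"
    if age_days < 7:
        return "daily"
    if age_days < 30:
        return "weekly"
    if age_days < 365:
        return "monthly"
    return "yearly"


for age in (0.5, 1, 10, 90, 400):
    print(age, "days ->", changefreq_from_age(age))
```

Under these assumed cut-offs, a frequency setting of 1 day would come out as "daily".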

Priority is relative to your sitemap, to let the Googlebot know which pages are "more important" to crawl the way you specified. I tend to give my main page a 1.0 and the others all 0.5. Some people leave the priority out of the file, but I imagine it might help Google figure out which pages are the main pages (i.e. not just for the crawling but also for the search engine results). However, I don't have any evidence for that :).

The sitemap.xml.gz is simply the sitemap.xml file compressed with gzip. If you have a larger site, the idea is that Google can get the compressed version and save some bandwidth for you. If bandwidth is no issue, you can use either one - however, in that case I suggest using the .xml as you can check it more easily (just open the link in IE). In reality you use one or the other and submit that one to Google (in your "My Sitemaps"). Using both probably doesn't have any influence.
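
A quick way to see the relationship between the two files is this Python sketch, which uses only the standard `gzip` module. The sample sitemap content is invented, and the namespace shown is the current sitemaps.org one, which is an assumption rather than what the program wrote at the time.

```python
import gzip

# Invented sample content standing in for a real sitemap.xml.
entries = "".join(
    "  <url><loc>http://www.example.com/article%d.html</loc></url>\n" % n
    for n in range(1000)
)
sitemap_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    + entries +
    "</urlset>\n"
)

raw = sitemap_xml.encode("utf-8")
packed = gzip.compress(raw)  # the bytes that would be saved as sitemap.xml.gz

# Decompressing gives back exactly the same document.
restored = gzip.decompress(packed)

print("sitemap.xml   :", len(raw), "bytes")
print("sitemap.xml.gz:", len(packed), "bytes")
print("identical after decompressing:", restored == raw)
```

Because sitemap files are very repetitive, the compressed version is usually a small fraction of the original size.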

The submitted files do not take precedence over existing files in the Google index. Using the sitemaps is not a way of getting existing links removed, but rather Google will know which links to add (if they are new). Also, using sitemaps regularly will let Google know which pages have been modified (with the last change date). This lets the Google bot get those pages explicitly, i.e. you don't have to wait for Google to check all pages to find the one you changed.
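
Putting the pieces from this post together, a single sitemap entry carrying the last change date, frequency and priority could be built roughly like this in Python. It is a sketch under assumptions: the helper name `url_entry`, the example.com URLs and the dates are invented, and the element names follow the sitemaps.org protocol as it stands today.

```python
from datetime import date
from xml.sax.saxutils import escape


def url_entry(loc, lastmod, changefreq="weekly", priority=0.5):
    """Return one <url> block with location, last change date, frequency and priority."""
    return (
        "  <url>\n"
        "    <loc>%s</loc>\n"
        "    <lastmod>%s</lastmod>\n"
        "    <changefreq>%s</changefreq>\n"
        "    <priority>%.1f</priority>\n"
        "  </url>\n"
    ) % (escape(loc), lastmod.isoformat(), changefreq, priority)


# Main page gets priority 1.0, other pages 0.5, as described above.
print(url_entry("http://www.example.com/", date(2005, 8, 2), "daily", 1.0))
print(url_entry("http://www.example.com/ftopict-12.html", date(2005, 7, 29)))
```

Keeping `lastmod` accurate is the part that lets the crawler pick out changed pages without rechecking everything.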

Hope it helped, keep the questions coming!
 
All times are GMT - 6 Hours
 