Wikinews:Google news

Outdated: See User:Brian McNeil/Google News for brief notes on current setup.

This page aims to describe how Google News currently mirrors us. This is off the top of my head, so take it with a grain of salt.

Googlebot loads the main page and looks for links that it assumes to be articles. Any link with an id number in it is assumed to be an article. If a link does not have a number in it, it is ignored. If a link has the rel=nofollow attribute on it, it is also ignored.
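The filtering described above can be sketched roughly as follows. This is only a model of the behaviour as this page describes it, not Google's actual code; in particular, treating "has an id number" as "the href contains at least one digit" is an assumption, and the names ArticleLinkFinder and find_article_links are made up for illustration.

```python
import re
from html.parser import HTMLParser


class ArticleLinkFinder(HTMLParser):
    """Collect hrefs that a crawler following the rules above would keep."""

    def __init__(self):
        super().__init__()
        self.article_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # Links marked rel="nofollow" are ignored.
        rel = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel:
            return
        # Assumption: "contains an id number" means "contains any digit".
        if re.search(r"\d", href):
            self.article_links.append(href)


def find_article_links(html):
    """Return the list of hrefs in `html` that look like articles."""
    finder = ArticleLinkFinder()
    finder.feed(html)
    return finder.article_links
```

Under this model, a plain link such as /wiki/example_article is dropped, while /wiki/example_article?curid=1234 is kept.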

Since by default we don't have id numbers in our article urls, some software modifications were made. Specifically, DPLs were modified to accept an extra parameter, showcurid. When it is enabled, the url changes from http://en.wikinews.org/wiki/example_article to http://en.wikinews.org/wiki/example_article?curid=1234. See WN:DPL for more details on DPL parameters. Changes were also made to output the url with parameters using the /wiki/title access point instead of the primary script access point (/w/index.php?title=... ), as links to the primary script access point carry rel=nofollow and thus don't work.

Please note that curid is not just a dummy parameter. It points to the internal article id in the database. In fact, it overrides the title. For example, article 1234 is actually Gusenbauer reelected as Austrian Social Democrat Party leader. You can have a link to any title, and it will still point to that article if the right curid is there. For example: http://en.wikinews.org/wiki/An_article_that_clearly_does_not_exist_but_goes_to_the_re-election_thing_anyhow?curid=1234
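To make the "curid overrides the title" point concrete, here is a small Python model of it. It is not MediaWiki's code (which is PHP); curid_url and resolved_key are hypothetical helper names, and the sketch only mirrors the behaviour described on this page.

```python
from urllib.parse import parse_qs, quote, urlsplit


def curid_url(title, page_id, base="http://en.wikinews.org/wiki/"):
    """Build a /wiki/ style url that also carries the page id."""
    return f"{base}{quote(title.replace(' ', '_'))}?curid={page_id}"


def resolved_key(url):
    """Return what identifies the page: the curid if present, else the title."""
    parts = urlsplit(url)
    curid = parse_qs(parts.query).get("curid")
    if curid:
        # The title in the path is ignored once a curid is given.
        return ("curid", int(curid[0]))
    return ("title", parts.path.split("/wiki/", 1)[-1])
```

Two urls with different titles but the same curid resolve to the same key here, which is why the nonsense title in the example above still lands on article 1234.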

Flagged revisions and the review process, which we all now know and love, were also partly the result of trying to get included in Google News. Google News does not really like the idea of publishing without peer review, which is what we did in our earlier years. The DPL was also changed so that it has the option of showing only sighted articles (see WN:DPL). The main page DPLs take advantage of this (as that's what Google uses); however, most portals still display all articles regardless of whether they are sighted, simply because no one has changed their DPLs. (This doesn't really matter much, as the publish tag is only applied after the article is sighted anyhow.)

As a result of how we are indexed, we took the developing stories list off the main page: even if we set showcurid to false, Google would still index any link that contains a number (e.g. Wikinews Shorts: April 14, 2007). Some people (as in the person who wrote this page; most people disagree with him) were concerned that this may somewhat limit new users' exposure to the wiki aspects of Wikinews.

For more information, please see the Water cooler archives, or ask someone.

- Other indexing schemes which I think are possible: using a page other than the main page as Google's starting point, or getting someone to write a special page extension that outputs the latest sighted articles in Google's XML sitemap format.
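As a rough idea of what the sitemap option would output, here is a Python sketch that writes the basic sitemaps.org urlset format. It is not an actual MediaWiki extension; build_sitemap is a hypothetical name, the list of sighted articles is assumed to come from somewhere else, and Google News may require extra fields that are not shown here.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(articles):
    """Serialize (url, last_modified_date) pairs as a sitemap urlset.

    `articles` is assumed to already contain only sighted articles.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in articles:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")
```

A crawler reading such a file would not need to guess which links are articles, so the curid trick and the main page restrictions would matter less.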

Other Google
The above applies only to Google News. Note that other parts of Google index us differently. This includes the Google News archive, as well as normal Google web search.

All search engines are prevented from indexing the following pages:
 * Some (all?) special pages. (They are often hard on the server and can contain loops and whatnot. This is the same for all MediaWiki servers.)
 * Portal:Prepared stories and its subpages (listed in robots.txt). Note that Story preparation is not in robots.txt, but is prevented from being indexed through other means.
 * Things not in the main namespace with  on them
 * This includes Story preparation and Portal:Prepared stories
 * Anything that includes the template prepared (that is not in the main namespace)
 * History pages, diffs, and more or less anything that can only be accessed via a url that starts with http://en.wikinews.org/w/index.php (as opposed to being accessed by a url starting with http://en.wikinews.org/wiki/ )
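The url-based parts of the list above can be approximated with a small Python check. This is a sketch only: likely_indexable is a hypothetical name, it treats every Special: page as blocked (the list says "some (all?)"), and it cannot see the template-based or other exclusions, which depend on page content rather than on the url.

```python
from urllib.parse import urlsplit


def likely_indexable(url):
    """Guess from the url alone whether search engines may index a page."""
    path = urlsplit(url).path
    # History pages, diffs, etc. are only reachable via the script access point.
    if path.startswith("/w/index.php"):
        return False
    # Assumption: all special pages are blocked.
    if path.startswith("/wiki/Special:"):
        return False
    # Portal:Prepared stories and its subpages are listed in robots.txt.
    if path.startswith("/wiki/Portal:Prepared_stories"):
        return False
    return path.startswith("/wiki/")
```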