Mischpoke Crawler is Mischpoke's web-crawling robot. It collects documents from the web to build a searchable index for the Mischpoke search engine. On this page, you'll find answers to the most commonly asked questions about how our web crawler works.
You can learn more about Mischpoke via our website.
Frequently Asked Questions
- How often will Mischpoke Crawler access my web pages?
- How do I request that Mischpoke not crawl parts or all of my site?
- Why is Mischpoke Crawler asking for a file called robots.txt which isn't on my server?
- Why is Mischpoke Crawler trying to download incorrect links from my server? Or from a server that doesn't exist?
- Why is Mischpoke Crawler downloading information from our "secret" web server?
- Why isn't Mischpoke Crawler obeying my robots.txt file?
- How do I register my site with Mischpoke Crawler so it will be indexed?
- Why are there hits from multiple machines at Mischpoke.com all with user-agent Mischpoke Crawler?
- How can I prevent Mischpoke Crawler from following links from a particular page or archiving a copy of a page?
- Why is Mischpoke Crawler downloading the same page on my site multiple times?
- Why don't the pages that Mischpoke Crawler crawled on my site show up in your index?
- What kinds of links does Mischpoke Crawler follow?
- My Mischpoke Crawler question is not answered here. Where do I send my question?
Answers
How often will Mischpoke Crawler access my web pages?
For most sites, Mischpoke Crawler should not access your site more than once every few seconds on average. Because network delays are involved, the rate may appear slightly higher over short periods. If you find that we are placing too high a load on your site, please let us know by e-mailing us at Mischpoke Crawler@Mischpoke.com.
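As a rough illustration of the per-site rate described above, the following Python sketch tracks when each host was last contacted and reports how long a crawler should wait before contacting it again. The `HostThrottle` class and the 3-second default are hypothetical; this page only says "once every few seconds" and does not describe how Mischpoke Crawler enforces it.

```python
from urllib.parse import urlsplit


class HostThrottle:
    """Tracks the earliest time each host may be contacted again."""

    def __init__(self, min_interval_seconds=3.0):
        # Assumed value: the FAQ only says "once every few seconds".
        self.min_interval = min_interval_seconds
        self._next_allowed = {}  # host -> earliest permitted request time

    def wait_time(self, url, now):
        """Return the seconds to wait before fetching `url` at time `now`."""
        host = urlsplit(url).netloc.lower()
        return max(0.0, self._next_allowed.get(host, now) - now)

    def record_fetch(self, url, now):
        """Note that `url` was fetched at time `now` (seconds)."""
        host = urlsplit(url).netloc.lower()
        self._next_allowed[host] = now + self.min_interval
```

With these assumptions, a fetch from www.myhost.com recorded at time 100.0 means a second request to the same host at time 101.0 should wait 2.0 seconds, while a request to a different host need not wait at all.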
How do I request that Mischpoke not crawl parts or all of my site?
robots.txt is a standard document that can tell Mischpoke Crawler not to download some or all information from your web server. The format of the robots.txt file is specified in the Robot Exclusion Standard. When deciding which pages to crawl on a particular host, Mischpoke Crawler will obey the first record in the robots.txt file with a User-Agent starting with "Mischpoke Crawler". If no such entry exists, it will obey the first entry with a User-Agent of "*".
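The record-selection rule above can be sketched in Python as follows. This is a simplified illustration, not Mischpoke's implementation: the function names are hypothetical, only `User-agent` and `Disallow` lines are handled, and matching the crawler name case-insensitively is an assumption.

```python
def parse_records(robots_txt):
    """Split robots.txt text into records of (user_agents, disallow_paths)."""
    records = []
    agents, disallows = [], []
    for raw_line in robots_txt.splitlines():
        if not raw_line.strip():
            # A blank line ends the current record.
            if agents:
                records.append((agents, disallows))
            agents, disallows = [], []
            continue
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue  # comment-only line: not a record boundary
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow":
            disallows.append(value)
    if agents:
        records.append((agents, disallows))
    return records


def select_record(records, crawler_name="Mischpoke Crawler"):
    """Return the Disallow list of the record the crawler would obey."""
    prefix = crawler_name.lower()
    # First choice: the first record naming the crawler specifically.
    for agents, disallows in records:
        if any(agent.lower().startswith(prefix) for agent in agents):
            return disallows
    # Fallback: the first record that applies to every robot.
    for agents, disallows in records:
        if "*" in agents:
            return disallows
    return []


def is_allowed(path, disallows):
    """True if `path` is not excluded; an empty Disallow excludes nothing."""
    return not any(rule and path.startswith(rule) for rule in disallows)
```

For a file whose first record is `User-agent: *` with `Disallow: /` and whose second record is `User-agent: Mischpoke Crawler` with `Disallow: /private/`, this sketch selects the second record, so `/index.html` is allowed and `/private/x.html` is not.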
The Robot Exclusion Standard is published at http://www.robotstxt.org/wc/exclusion.html#robotstxt. You can put a file called robots.txt on your server to exclude Mischpoke Crawler or other "web crawlers." Mischpoke Crawler has a user-agent of "Mischpoke Crawler".
There is another standard for telling robots not to index a web page or follow links on it, which may be more helpful in some cases, since it can be used more conveniently on a page-by-page basis. It involves placing a "META" element into a page of HTML, as covered in the answer below on preventing Mischpoke Crawler from following links or archiving a page; you can also read what the HTML standard has to say about these tags.
Remember, changing your server's robots.txt file or changing the "META" elements on its pages will not cause an immediate change in the results Mischpoke returns. It will likely take a while for any changes you make to propagate to Mischpoke's next index of the web.
Why is Mischpoke Crawler asking for a file called robots.txt which isn't on my server?
robots.txt is a standard document that can tell Mischpoke Crawler not to download some or all information from your web server. For information on how to create a robots.txt file, see The Robot Exclusion Standard. If you just want to prevent the "file not found" error messages in your webserver log, create an empty file named robots.txt.
Why is Mischpoke Crawler trying to download incorrect links from my server? Or from a server that doesn't exist?
It is a property of the web that many links will be broken or outdated at any given time. Whenever someone mistypes a link that points to your site, or fails to update their pages to reflect changes on your server, Mischpoke Crawler will try to download an incorrect link from your site. This is also why you may get hits on a machine that is not even a web server.
Why is Mischpoke Crawler downloading information from our "secret" web server?
It is almost impossible to keep a web server secret by not publishing any links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL is likely sent in the HTTP Referer header, and the other web server can store it and possibly publish it in its referer log. So, if there is a link to your "secret" web server or page anywhere on the web, it is likely that Mischpoke Crawler and other "web crawlers" will find it.
Why isn't Mischpoke Crawler obeying my robots.txt file?
To save bandwidth, Mischpoke Crawler downloads the robots.txt file only once a day or whenever we have fetched many pages from the server. So, it may take a while for Mischpoke Crawler to learn of any changes made to your robots.txt file. Also, check your syntax against the standard at http://www.robotstxt.org/wc/exclusion.html#robotstxt. A common source of problems is that the robots.txt file must be placed in the top directory of the server (e.g., www.myhost.com/robots.txt); placing the file in any subdirectory will not have any effect. If there still seems to be a problem, please let us know, and we will correct it. For more info, see the Robots FAQ.
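The two points above (where the file must live, and when it is re-read) can be sketched in Python. Both function names are hypothetical, and `page_limit=1000` is a made-up threshold; this page only says "many pages" and gives no number.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the only robots.txt location that applies to `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (with port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def robots_txt_is_stale(age_seconds, pages_fetched_since, page_limit=1000):
    """Decide whether a cached robots.txt should be downloaded again.

    Mirrors the "once a day or after many pages" policy; `page_limit`
    is an assumed stand-in for "many".
    """
    one_day = 24 * 60 * 60
    return age_seconds >= one_day or pages_fetched_since >= page_limit
```

For any page on www.myhost.com, however deep its path, `robots_txt_url` yields http://www.myhost.com/robots.txt, which is why a robots.txt file placed in a subdirectory is never consulted.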
How do I register my site with Mischpoke Crawler so it will be indexed?
See the Add URL form.
Why are there hits from multiple machines at Mischpoke.com all with user-agent Mischpoke Crawler?
Mischpoke Crawler was designed to be distributed across several machines to improve performance and to scale as the web grows. Also, to cut down on bandwidth usage, we would like to run many crawlers on machines that are close, in network terms, to the sites they index.
How can I prevent Mischpoke Crawler from following links from a particular page or archiving a copy of a page?
Mischpoke Crawler obeys the noindex, nofollow, and noarchive meta-tags. If you place these tags in the head of your HTML document, you can cause Mischpoke to not index, not follow, and/or not archive particular documents on your site. The tags to include and their effects are:
| Tag | Effect |
| --- | --- |
| <META NAME="robots" CONTENT="noindex"> | Mischpoke Crawler will retrieve the document, but it will not index the document. |
| <META NAME="robots" CONTENT="nofollow"> | Mischpoke Crawler will not follow any links that are present on the page to other documents. |
| <META NAME="robots" CONTENT="noarchive"> | Mischpoke maintains a cache of all the documents that we fetch, to permit our users to access the content that we indexed (in the event that the original host of the content is inaccessible, or the content has changed). If you do not wish us to archive a document from your site, you can place this tag in the head of the document, and Mischpoke will not provide an archive copy for the document. |
The "robots" tag is obeyed by many different web robots. If you'd like to specify some of these restrictions just for Mischpoke Crawler, you may use "Mischpoke Crawler" in place of "robots". You can also combine any or all of these tags into a single meta tag. For example:
<META NAME="robots" CONTENT="noarchive,nofollow"> -- or --
<META NAME="Mischpoke Crawler" CONTENT="noarchive,nofollow">
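To check which of these directives a page actually carries, a small Python sketch using the standard `html.parser` module is shown below. The class and function names are hypothetical, and comparing the NAME value and the directive tokens case-insensitively is an assumption rather than documented Mischpoke behavior.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from META tags addressed to the crawler."""

    def __init__(self, crawler_name="Mischpoke Crawler"):
        super().__init__()
        # Accept both the generic name and the crawler-specific one.
        self.names = {"robots", crawler_name.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {key: (value or "") for key, value in attrs}
        if attr_map.get("name", "").strip().lower() not in self.names:
            return
        # CONTENT may combine several directives separated by commas.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html_text):
    """Return the set of directives found, e.g. {"noindex", "nofollow"}."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives
```

Feeding it a page whose head contains the combined tag from the example above returns a set holding `noarchive` and `nofollow`; a META tag with an unrelated name, such as "description", contributes nothing.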
Why is Mischpoke Crawler downloading the same page on my site multiple times?
In general, Mischpoke Crawler should only download one copy of each file from your site during a given crawl. Occasionally the crawler is stopped and restarted, and it may recrawl pages that it has recently retrieved. These recrawls should happen infrequently.
Why don't the pages that Mischpoke Crawler crawled on my site show up in your index?
Don't be alarmed if you can't immediately find documents that Mischpoke Crawler has crawled from your site in the Mischpoke search engine. The documents will be indexed and entered into the search database soon after being crawled. Occasionally, documents fetched by Mischpoke Crawler will not be included in the index, for a variety of reasons (e.g., they appear to be duplicates of other pages on the web).
What kinds of links does Mischpoke Crawler follow?
Mischpoke Crawler follows HREF links and SRC links.
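As an illustration of what "HREF links and SRC links" covers, the Python sketch below collects the value of every `href` and `src` attribute in a page and resolves it against the page's own URL. The names are hypothetical, and resolving relative links with `urljoin` (ignoring any BASE element) is a simplifying assumption, not a description of the crawler's internals.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from HREF and SRC attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lowercased, so HREF and href both match.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return the links found in `html_text`, in document order."""
    extractor = LinkExtractor(base_url)
    extractor.feed(html_text)
    extractor.close()
    return extractor.links
```

Under these assumptions, an anchor, an image, and a frame on the same page all contribute links, since the first carries an HREF attribute and the other two carry SRC attributes.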
My Mischpoke Crawler question is not answered here. Where do I send my question?
Please send questions regarding our Mischpoke Crawler technology to Mischpoke Crawler@Mischpoke.com.
Mischpoke Crawler is listed in the registry of web robots.
