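A few days ago, a post on this site's forum explained what the Mediapartners-Google spider does: "What Kind of Spider Is Mediapartners-Google?". Today's post lists all of Google's crawlers and what they are for, to help you troubleshoot problems caused by Google's crawlers.
Google's crawlers fall into nine categories: API, advertising, image, news, video, web page, feed, favicon, and page transcoding. There are seventeen crawlers in total: APIs-Google, AdSense, AdsBot Mobile Web Android, AdsBot Mobile Web, AdsBot, Googlebot Image, Googlebot News, Googlebot Video, Googlebot (Desktop), Googlebot (Smartphone), Mobile AdSense, Mobile Apps Android, Feedfetcher, Google Read Aloud, Duplex on the web, Google Favicon, and Web Light.
The UA details for each of these seventeen crawlers are listed below.
UA stands for user agent, the identifier a client sends with each request. In the list below, "UA" is the short token for a crawler (the value you write on a User-agent line in robots.txt), and "User Agent" is the full user agent string the crawler sends in its HTTP requests.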
APIs-Google
- UA: APIs-Google
- User Agent: APIs-Google (+https://developers.google.com/webmasters/APIs-Google.html)
AdSense
- UA: Mediapartners-Google
- User Agent: Mediapartners-Google
AdsBot Mobile Web Android
- UA: AdsBot-Google-Mobile
- User Agent: Mozilla/5.0 (Linux; Android 5.0; SM-G920A) AppleWebKit (KHTML, like Gecko) Chrome Mobile Safari (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html)
AdsBot Mobile Web
- UA: AdsBot-Google-Mobile
- User Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html)
AdsBot
- UA: AdsBot-Google
- User Agent: AdsBot-Google (+http://www.google.com/adsbot.html)
Googlebot Image
- UA: Googlebot-Image
- UA: Googlebot
- User Agent: Googlebot-Image/1.0
Googlebot News
- UA: Googlebot-News
- UA: Googlebot
- User Agent: Googlebot-News
Googlebot Video
- UA: Googlebot-Video
- UA: Googlebot
- User Agent: Googlebot-Video/1.0
Googlebot (Desktop)
- UA: Googlebot
- User Agent:
  - Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  - Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/W.X.Y.Z‡ Safari/537.36
  - Rarely used: Googlebot/2.1 (+http://www.google.com/bot.html)
Googlebot (Smartphone)
- UA: Googlebot
- User Agent: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z‡ Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mobile AdSense
- UA: Mediapartners-Google
- User Agent: (Various mobile device types) (compatible; Mediapartners-Google/2.1; +http://www.google.com/bot.html)
Mobile Apps Android
- UA: AdsBot-Google-Mobile-Apps
- User Agent: AdsBot-Google-Mobile-Apps
Feedfetcher
- UA: FeedFetcher-Google
- Does not respect robots.txt rules
- User Agent: FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)
Google Read Aloud
- UA: Google-Read-Aloud
- Does not respect robots.txt rules
- Current agents:
  - Desktop agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36 (compatible; Google-Read-Aloud; +https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers)
  - Mobile agent: Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36 (compatible; Google-Read-Aloud; +https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers)
- Former agent (deprecated): google-speakr
Duplex on the web
- UA: DuplexWeb-Google
- May ignore the * user-agent wildcard
- User Agent: Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012; DuplexWeb-Google/1.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Mobile Safari/537.36
Google Favicon
- UA: Google Favicon
- For user-initiated requests, ignores robots.txt rules
- User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
Web Light
- UA: googleweblight
- Does not respect robots.txt rules
- User Agent: Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19
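Several of the tokens in this list are prefixes of one another (AdsBot-Google, AdsBot-Google-Mobile, and AdsBot-Google-Mobile-Apps, or Googlebot and Googlebot-Image), so a script that sorts server log entries by crawler has to test the longer tokens first. The following is a minimal Python sketch of that idea, assuming a case-insensitive substring match is good enough. The function name `google_crawler_token` and the ordering of `GOOGLE_TOKENS` are illustrative choices, not part of any Google API. A user agent string can be spoofed, so a match only tells you what the client claims to be.

```python
from typing import Optional

# Tokens taken from the list above, most specific first, so that
# "AdsBot-Google-Mobile-Apps" is never reported as "AdsBot-Google".
GOOGLE_TOKENS = [
    "AdsBot-Google-Mobile-Apps",
    "AdsBot-Google-Mobile",
    "AdsBot-Google",
    "Mediapartners-Google",
    "APIs-Google",
    "Googlebot-Image",
    "Googlebot-News",
    "Googlebot-Video",
    "Googlebot",
    "FeedFetcher-Google",
    "Google-Read-Aloud",
    "DuplexWeb-Google",
    "Google Favicon",
    "googleweblight",
]


def google_crawler_token(user_agent: str) -> Optional[str]:
    """Return the first (most specific) token found in the string, or None."""
    ua = user_agent.lower()
    for token in GOOGLE_TOKENS:
        if token.lower() in ua:
            return token
    return None


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(google_crawler_token(sample))
```

Keeping the tokens in one ordered list makes the precedence explicit; a dictionary lookup would not work here because the token has to be found inside a longer string.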
Chrome/W.X.Y.Z‡ in the user agent strings above is a placeholder for the Chrome version the crawler is running; the actual version changes over time.
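Because the Chrome version is not fixed, any code that recognizes these user agent strings should match the version as a pattern instead of comparing against a literal value. Here is a small Python sketch, assuming the version always appears as four dot-separated numbers; the helper name `chrome_version` and the version number in the sample string are made up for illustration.

```python
import re
from typing import Optional, Tuple

# W.X.Y.Z stands for four dot-separated integers.
CHROME_VERSION = re.compile(r"Chrome/(\d+)\.(\d+)\.(\d+)\.(\d+)")


def chrome_version(user_agent: str) -> Optional[Tuple[int, ...]]:
    """Extract the Chrome version as a tuple of ints, or None if absent."""
    match = CHROME_VERSION.search(user_agent)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


if __name__ == "__main__":
    # Sample string: the version 99.0.4844.84 is an invented example value.
    sample = (
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
        "Googlebot/2.1; +http://www.google.com/bot.html) "
        "Chrome/99.0.4844.84 Safari/537.36"
    )
    print(chrome_version(sample))
```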
Here are two examples of how to use these crawler tokens in robots.txt to control which pages each crawler may fetch:
User-agent: Googlebot
Disallow: /
User-agent: Mediapartners-Google
Disallow:
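The rules above block Googlebot, Google's search crawler, from the entire site, while still allowing the Google ads crawler (Mediapartners-Google) to crawl it, so the Google ads on the site keep working normally.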
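One way to sanity-check rules like these before deploying them is Python's standard `urllib.robotparser` module. The sketch below feeds it the rule set above and asks whether each crawler may fetch a page. The URL `https://example.com/page.html` is a placeholder, and the standard-library parser is only an approximation of how Google itself evaluates robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The first rule set from this article.
RULES = """\
User-agent: Googlebot
Disallow: /
User-agent: Mediapartners-Google
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# example.com is a placeholder host used only for illustration.
url = "https://example.com/page.html"
googlebot_allowed = parser.can_fetch("Googlebot", url)
adsense_allowed = parser.can_fetch("Mediapartners-Google", url)

print("Googlebot:", googlebot_allowed)               # False: Disallow: /
print("Mediapartners-Google:", adsense_allowed)      # True: empty Disallow
```

An empty `Disallow:` value means nothing is blocked, which is why the ads crawler is still allowed.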
User-agent: Googlebot
Disallow:
User-agent: Googlebot-Image
Disallow: /personal
The rules above let Googlebot crawl the whole site, but stop the image crawler from fetching anything under the /personal directory.
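This second rule set works because a Google crawler follows only the most specific group that names one of its tokens: Googlebot Image has the tokens Googlebot-Image and Googlebot, so it uses the Googlebot-Image group when one exists and falls back to the Googlebot group otherwise. The Python sketch below illustrates that selection under heavy simplifications: it handles only `User-agent` and `Disallow` lines, treats paths as plain prefixes, and ignores `Allow` and wildcards. The names `parse_groups` and `is_allowed` are hypothetical helpers, not Google's implementation.

```python
from typing import Dict, List


def parse_groups(robots_txt: str) -> Dict[str, List[str]]:
    """Map each lowercased user-agent token to its Disallow prefixes."""
    groups: Dict[str, List[str]] = {}
    current: List[str] = []   # tokens of the group being read
    seen_rule = False         # True once the current group has a rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:     # a rule line closed the previous group
                current, seen_rule = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            seen_rule = True
            for token in current:
                groups[token].append(value)
    return groups


def is_allowed(groups: Dict[str, List[str]], tokens: List[str], path: str) -> bool:
    """tokens: the crawler's UA tokens, most specific first."""
    for token in [t.lower() for t in tokens] + ["*"]:
        if token in groups:
            # An empty Disallow value blocks nothing.
            return not any(p and path.startswith(p) for p in groups[token])
    return True  # no group applies to this crawler


RULES = """\
User-agent: Googlebot
Disallow:
User-agent: Googlebot-Image
Disallow: /personal
"""

groups = parse_groups(RULES)
image_tokens = ["Googlebot-Image", "Googlebot"]
print(is_allowed(groups, image_tokens, "/personal/photo.jpg"))  # False
print(is_allowed(groups, image_tokens, "/images/logo.png"))     # True
print(is_allowed(groups, ["Googlebot"], "/personal/photo.jpg")) # True
```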
To take effect, these rules must be placed in the robots.txt file in the site's root directory.
Reposted article. Original source: Google Search Central; compiled and published by 古哥.
If you repost this article, please credit the source: https://iymark.com/articles/950.html