Robots were once rare visitors, easy to miss in server logs. Today, bots account for roughly half of all website traffic by most industry estimates. Since bandwidth and infrastructure cost money, managing this automated traffic efficiently is essential. Not all bots are equal: some improve SEO visibility and user experience, while others waste resources, scrape content, or create security risks. Smart management of robots.txt can significantly improve your website's performance, security, and SEO.
How robots.txt Works
Robots.txt is a simple text file placed in a website’s root directory, providing instructions to bots (also called crawlers) about which parts of the site they can access. Legitimate crawlers from search engines and trusted monitoring tools typically respect these rules. However, malicious or aggressive bots may ignore robots.txt entirely, so it should be viewed as an initial protective measure rather than a complete security solution. Combining robots.txt with other measures enhances overall defense, reduces server load, and improves the user experience.
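A minimal robots.txt illustrates the format: each group names one or more user agents, followed by the paths those agents may or may not crawl. The file must be served from the site root (e.g. https://example.com/robots.txt). The paths and the "ExampleBot" name below are placeholders, not real crawlers:

```
# Applies to every bot: allow everything except the /admin/ area
User-agent: *
Disallow: /admin/

# A stricter group for one hypothetical crawler
User-agent: ExampleBot
Disallow: /
```

Bots read the file top to bottom and obey the most specific group matching their user agent; anything not disallowed is allowed by default.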

Categories of Bots and Crawlers
Bots fall into several categories based on their purpose. Understanding these helps decide which ones to allow or block to optimize your site’s performance and security.
Search Engine Bots & Crawlers
Search engine bots are essential for indexing websites and driving organic traffic. Examples include Googlebot (Google), Bingbot (Bing), DuckDuckBot (DuckDuckGo), YandexBot (Yandex), Baiduspider (Baidu), and Slurp (Yahoo). These bots significantly contribute to SEO, indexing, and site visibility, so they should have unrestricted access to relevant areas.
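Because robots.txt is allow-by-default, search engine bots need no special rules at all; an explicit group is only useful when you want to spell the policy out. A sketch (note that Allow is a widely supported extension rather than part of the original 1994 standard):

```
# Search engines: full access (this is also the default behavior)
User-agent: Googlebot
User-agent: Bingbot
Allow: /
```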
SEO and Marketing Bots
SEO crawlers collect data about backlinks, keywords, and competitor sites. Examples include AhrefsBot, SEMrushBot, Rogerbot (Moz), MJ12Bot (Majestic), and SerpstatBot. While these tools can provide valuable insights for SEO analysis, they can overload servers when run excessively by competitors or third parties. Blocking overly aggressive or unnecessary SEO bots helps maintain server health and website performance.
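Blocking is not the only option: some SEO crawlers can be throttled instead. Crawl-delay is a non-standard directive that Googlebot ignores but Bingbot and several SEO crawlers honor; the 10-second value below is an arbitrary example, and whether a given bot respects it depends on that bot:

```
# Throttle rather than block (seconds between requests)
User-agent: SEMrushBot
Crawl-delay: 10

# Block an aggressive crawler entirely
User-agent: MJ12Bot
Disallow: /
```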
AI Trainer Bots
AI trainer bots, such as OpenAI’s GPTBot, Common Crawl’s CCBot, Anthropic’s ClaudeBot, and Amazonbot, collect web content to train AI models. They typically return no direct traffic or SEO benefit to the crawled site, yet can consume significant bandwidth and resources. Blocking them entirely is advisable if you want to reduce unnecessary server load and associated costs.
Research and Archival Bots
Bots used by academic or research institutions (e.g., ia_archiver from Internet Archive and archive.org_bot) typically cause no harm but offer minimal direct benefit to your site. If your content is sensitive or you prefer it remains unarchived, blocking these bots is recommended.
Monitoring and Uptime Bots
Monitoring bots, such as UptimeRobot, PingdomBot, StatusCake, and NewRelic, verify site availability and performance. These bots are beneficial if you actively use their associated services. However, monitoring bots you do not recognize, or that are not part of your own monitoring setup, only drain resources; consider blocking them to improve performance.
Example Combined robots.txt
Here is a detailed example of a robots.txt file that blocks several common categories of unwanted bots while leaving search engine crawlers, and therefore your SEO traffic, unaffected (bots not listed here are allowed by default).
# Block SEO and Marketing Bots
User-agent: AhrefsBot
User-agent: BLEXBot
User-agent: Brightbot 1.0
User-agent: DotBot
User-agent: DataForSeoBot
User-agent: domainsproject.org
User-agent: keys-so-bot
User-agent: MJ12Bot
User-agent: rogerbot
User-agent: SEMrushBot
User-agent: SerpstatBot
# Block AI Trainer Bots
User-agent: anthropic-ai
User-agent: CCBot
User-agent: ChatGLM-Spider
User-agent: ChatGPT-User
User-agent: cohere-ai
User-agent: GPTBot
User-agent: meta-externalagent
User-agent: PerplexityBot
# Block Other Unwanted Crawlers
User-agent: EzoicBot
Disallow: /
This example serves as a solid foundation, but further steps are needed when bots ignore these directives.
What to Do When robots.txt Doesn’t Work
Since not all bots respect robots.txt, additional security measures become essential. On Nginx servers, using the map directive can effectively block unwanted user agents or IP ranges:
map $http_user_agent $bad_bots {
    default      0;
    ~*SemaltBot  1;
    ~*Perplexity 1;
}

server {
    # ... existing listen/server_name/location directives ...
    if ($bad_bots) {
        return 403;
    }
}
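To decide which user agents deserve an entry in the map block, it helps to see who is actually hitting the server. Assuming the default combined log format, this pipeline ranks the most frequent user agents in an access log (the log path is an example; adjust it to your setup):

```shell
# In combined log format the user agent is the 6th double-quoted field,
# so split on '"', count distinct values, and rank them
awk -F'"' '{print $6}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head -20
```

Any unfamiliar name near the top of this list is a candidate for robots.txt first, and for a server-level block if it keeps coming back.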
Additionally, Cloudflare’s Web Application Firewall (WAF) can efficiently handle bots that ignore robots.txt. Rate limiting, CAPTCHA challenges, and regularly updated firewall rules further strengthen protection against malicious traffic.
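Rate limiting can also be done in Nginx itself, as a complement to a WAF. A sketch using the standard limit_req module; the zone name, zone size, rate, and burst values below are arbitrary examples to tune for your traffic:

```
# Track clients by IP; allow 2 requests/second per IP in a 10 MB zone
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    location / {
        # Permit short bursts of 10 requests without delay,
        # then reject the excess (503 by default)
        limit_req zone=perip burst=10 nodelay;
    }
}
```

This caps how fast any single client, bot or human, can hammer the site, regardless of whether it honors robots.txt.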
Final Thoughts
Using robots.txt is a crucial first step for managing bot traffic effectively. Always allow essential crawlers that boost your site’s SEO and functionality. Carefully manage or block aggressive SEO crawlers, AI trainer bots, and archival bots that offer minimal or no benefit. Finally, leverage advanced security tools like Nginx configurations and Cloudflare WAF to protect against persistent unwanted bots. This comprehensive, layered approach ensures optimal site performance, security, and cost-effective bandwidth management.

