robotstxt 0.7.15 package

A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker

as.list.robotstxt_text

Method as.list() for class robotstxt_text

fix_url

fix_url

get_robotstxt_http_get

storage for http request response objects

get_robotstxt

downloading robots.txt file

get_robotstxts

function to get multiple robotstxt files

guess_domain

function guessing domain from path

http_domain_changed

http_domain_changed

http_subdomain_changed

http_subdomain_changed

http_was_redirected

http_was_redirected

is_suspect_robotstxt

is_suspect_robotstxt

is_valid_robotstxt

function that checks if file is valid / parsable robots.txt file

list_merge

Merge a number of named lists in sequential order

named_list

make automatically named list

null_to_defeault

null_to_defeault

parse_robotstxt

function parsing robots.txt

parse_url

parse_url

paths_allowed_worker_spiderbar

paths_allowed_worker spiderbar flavor

paths_allowed

check if a bot has permissions to access page(s)

pipe

re-export magrittr pipe operator

print.robotstxt_text

printing robotstxt_text

print.robotstxt

printing robotstxt

remove_domain

function to remove domain from path

request_handler_handler

request_handler_handler

robotstxt

Generate a representation of a robots.txt file

rt_cache

get_robotstxt() cache

rt_get_comments

extracting comments from robots.txt

rt_get_fields_worker

extracting robotstxt fields

rt_get_fields

extracting permissions from robots.txt

rt_get_rtxt

load robots.txt files saved along with the package

rt_get_useragent

extracting HTTP useragents from robots.txt

rt_list_rtxt

list robots.txt files saved along with the package

rt_request_handler

rt_request_handler

sanitize_path

making paths uniform

Provides functions to download and parse 'robots.txt' files. Ultimately, the package makes it easy to check whether bots (spiders, crawlers, scrapers, ...) are allowed to access specific resources on a domain.

  • Maintainer: Pedro Baltazar
  • License: MIT + file LICENSE
  • Last published: 2024-08-29
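
As a rough illustration of the workflow in the package description, the sketch below parses a robots.txt text and checks path permissions without any download. The robots.txt content, the domain "example.com", and the paths are hypothetical values made up for the example. Passing `text` to `robotstxt()` and calling the `check()` method one path at a time are assumptions about typical usage, not a definitive recipe.

```r
library(robotstxt)

# Hypothetical robots.txt content, supplied inline so nothing is downloaded.
txt <- paste(
  "User-agent: *",
  "Disallow: /private/",
  "Allow: /public/",
  sep = "\n"
)

# parse_robotstxt() turns the raw text into a list of parsed components.
parsed <- parse_robotstxt(txt)
print(parsed$permissions)

# robotstxt() builds an object around the same text (assumed: `text` is
# accepted in place of a download). "example.com" is a placeholder domain.
rt <- robotstxt(domain = "example.com", text = txt)

# Check each hypothetical path separately for the wildcard bot "*".
paths <- c("/public/index.html", "/private/data.html")
allowed <- vapply(
  paths,
  function(p) rt$check(paths = p, bot = "*"),
  logical(1)
)
print(allowed)
```

To check live sites instead, `paths_allowed()` and `get_robotstxt()` from the listing above take a domain and download the file themselves, which requires network access.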