html_element function

Select elements from an HTML document

html_element() and html_elements() find HTML elements using CSS selectors or XPath expressions. CSS selectors are particularly useful in conjunction with https://selectorgadget.com/, which makes it very easy to discover the selector you need.

html_element(x, css, xpath)
html_elements(x, css, xpath)

Arguments

  • x: Either a document, a node set or a single node.

  • css, xpath: Elements to select. Supply one of css or xpath, depending on whether you want to use a CSS selector or an XPath 1.0 expression.
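
As a minimal sketch of the two ways to supply a selector, the snippet below selects the same paragraph once with css and once with xpath. It assumes rvest is installed and attached; the sample HTML and the variable names by_css and by_xpath are made up for illustration.

```r
library(rvest)

# A small document built from inline HTML (illustrative content only).
html <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

# css is the second positional argument, so it can be passed unnamed.
by_css <- html %>% html_elements("p.important")

# xpath comes third, so it is passed by name when css is omitted.
by_xpath <- html %>% html_elements(xpath = "//p[@class='important']")

# Both node sets should contain the same single paragraph.
html_text2(by_css)
html_text2(by_xpath)
```

Naming xpath explicitly avoids it being interpreted as a CSS selector.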

Returns

html_element() returns a node set the same length as the input. html_elements() flattens the output, so there is no direct way to map the output back to the input.
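
When the link between inputs and matches matters, one possible workaround is to query each input node separately instead of the whole node set at once. The following is only a sketch under that assumption, not a feature of html_elements() itself; the sample HTML and the names matches_per_item, match_counts and match_text are hypothetical.

```r
library(rvest)

html <- minimal_html("
  <ul>
    <li><b>C-3PO</b> is a <i>droid</i></li>
    <li><b>Yoda</b></li>
    <li><b>R2-D2</b> is a <i>droid</i> and a <i>hero</i></li>
  </ul>
")
li <- html %>% html_elements("li")

# Query each <li> on its own, so the result keeps one entry per input node.
matches_per_item <- lapply(seq_along(li), function(i) {
  html_elements(li[[i]], "i")
})

# Number of <i> matches found beneath each input node.
match_counts <- vapply(matches_per_item, length, integer(1))

# Text of the matches, still grouped by input node.
match_text <- lapply(matches_per_item, html_text2)
```

Each position in matches_per_item corresponds to the node at the same position in li, including inputs that have no matches.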

CSS selector support

CSS selectors are translated to XPath selectors by the selectr package, which is a port of the Python cssselect library, https://pythonhosted.org/cssselect/.

It implements the majority of CSS3 selectors, as described in https://www.w3.org/TR/2011/REC-css3-selectors-20110929/. The exceptions are listed below:

  • Pseudo selectors that require interactivity are ignored: :hover, :active, :focus, :target, :visited.
  • The following pseudo classes don't work with the wild card element, *: *:first-of-type, *:last-of-type, *:nth-of-type, *:nth-last-of-type, *:only-of-type.
  • It supports :contains(text).
  • You can use !=: [foo!=bar] is the same as :not([foo=bar]).
  • :not() accepts a sequence of simple selectors, not just a single simple selector.
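
The extensions listed above can be tried out with a short sketch like the one below. It assumes rvest and selectr are installed. The sample HTML and variable names are invented for illustration, and the call to selectr::css_to_xpath() is shown only as a way to inspect a translation, not as a description of how rvest calls it internally.

```r
library(rvest)

html <- minimal_html("
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
  <p class='note'>This is a note</p>
")

# :contains(text) keeps elements whose text includes the given string.
with_text <- html %>% html_elements("p:contains('important')")

# [class!=important] and :not([class=important]) should select the same nodes,
# including paragraphs that have no class attribute at all.
not_equal <- html %>% html_elements("p[class!=important]")
negated <- html %>% html_elements("p:not([class=important])")

# Inspect the XPath expression that selectr generates for a CSS selector.
xpath_string <- selectr::css_to_xpath("p[class!=important]")

html_text2(with_text)
html_text2(not_equal)
html_text2(negated)
xpath_string
```

Printing the translated XPath can help when a selector does not match what you expect.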

Examples

library(rvest)

html <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

html %>% html_element("h1")
html %>% html_elements("p")
html %>% html_elements(".important")
html %>% html_elements("#first")

# html_element() vs html_elements() --------------------------------------
html <- minimal_html("
  <ul>
    <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
    <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
    <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
    <li><b>R4-P17</b> is a <i>droid</i></li>
  </ul>
")
li <- html %>% html_elements("li")

# When applied to a node set, html_elements() returns all matching elements
# beneath any of the inputs, flattening results into a new node set.
li %>% html_elements("i")

# When applied to a node set, html_element() always returns a vector the
# same length as the input, using a "missing" element where needed.
li %>% html_element("i")

# and html_text() and html_attr() will return NA
li %>% html_element("i") %>% html_text2()
li %>% html_element("span") %>% html_attr("class")

  • Maintainer: Hadley Wickham
  • License: MIT + file LICENSE
  • Last published: 2024-02-12