How to avoid joining all text from Nodes when scraping
When I scrape several related nodes from HTML or XML to extract the text, all the text is joined into one long string, making it impossible to recover the individual text strings.
For instance:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
But what I want is:
["foo", "bar", "baz"]
The same happens when scraping XML:
doc = Nokogiri::XML(<<EOT)
<root>
<block>
<entries>foo</entries>
<entries>bar</entries>
<entries>baz</entries>
</block>
</root>
EOT
doc.search('entries').text # => "foobarbaz"
Why does this happen and how do I avoid it?
Solution 1:
This is an easily solved problem that comes from overlooking what the documentation says about how the text method behaves on a NodeSet versus a Node (or Element).
The NodeSet documentation says text will:
Get the inner text of all contained Node objects
That is what we're seeing with:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
because:
doc.search('p').class # => Nokogiri::XML::NodeSet
Instead, we want to get each Node and extract its text:
doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"
which can be done using map:
doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]
Ruby allows us to write that more concisely using:
doc.search('p').map(&:text) # => ["foo", "bar", "baz"]
The same applies whether we're working with HTML or XML, because Nokogiri parses both into the same Node and NodeSet classes.
A Node has several aliased methods for getting at its embedded text. From the documentation:
#content ⇒ Object
Also known as: text, inner_text
Returns the contents for this Node.