Complicated XML tags with Jsoup

Suppose you want to scrape a namespaced tag such as <dc:creator> or <media:content>. How do you do it? I am a Scala coder, so I am using jsoup from Scala.

Suppose you have an XML file like this (taken from http://www.theguardian.com/politics/rss):

<item>
<title>
Remain campaigners step up efforts to secure ethnic minority votes
</title>
<description>
<p>Leave camp also targets BAME voters in recognition that they may be crucial in determining outcome of EU referendum</p>
</description>
<pubDate>Wed, 01 Jun 2016 15:00:27 GMT</pubDate>
<dc:creator>Anushka Asthana Political editor</dc:creator>
<dc:date>2016-06-01T15:00:27Z</dc:date>
</item>

With url1 set to the feed URL, the following code extracts the data:

import org.jsoup.Jsoup
import org.jsoup.parser.Parser

val doc = Jsoup.connect(url1).parser(Parser.xmlParser()).userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36").get()

// In a jsoup selector, a namespaced tag is written with "|" in place of ":"
val element = doc.select("item").select("dc|creator").text

println(element)

For the item shown above, the output is:

Anushka Asthana Political editor
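
To try the same selector without hitting the network, here is a minimal sketch that parses an inline XML string and reads both <dc:creator> and <media:content> for each item. The sample feed, the names NamespacedTags, sampleXml and creatorsAndMedia, and the url attribute on <media:content> are my own assumptions for illustration and are not taken from the Guardian feed. It also assumes Scala 2.13 or later for scala.jdk.CollectionConverters.

```scala
import org.jsoup.Jsoup
import org.jsoup.parser.Parser
import scala.jdk.CollectionConverters._

object NamespacedTags {
  // Hypothetical sample feed (not real Guardian data); the second item has no media:content
  val sampleXml: String =
    """<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/">
      |<channel>
      |<item>
      |<title>First story</title>
      |<dc:creator>Alice Example</dc:creator>
      |<media:content url="http://example.com/a.jpg" width="140"/>
      |</item>
      |<item>
      |<title>Second story</title>
      |<dc:creator>Bob Example</dc:creator>
      |</item>
      |</channel>
      |</rss>""".stripMargin

  // Returns one (creator, media URL) pair per <item>
  def creatorsAndMedia(xml: String): List[(String, String)] = {
    // Parse as XML (empty base URI) so jsoup does not apply HTML parsing rules
    val doc = Jsoup.parse(xml, "", Parser.xmlParser())
    doc.select("item").asScala.toList.map { item =>
      // "dc|creator" matches <dc:creator>; text returns its text content
      val creator = item.select("dc|creator").text
      // attr returns an empty string when no matching element carries the attribute
      val mediaUrl = item.select("media|content").attr("url")
      (creator, mediaUrl)
    }
  }

  def main(args: Array[String]): Unit =
    creatorsAndMedia(sampleXml).foreach { case (creator, url) =>
      println(s"$creator -> $url")
    }
}
```

Selecting inside each item keeps the creators separate. Calling text on the result of doc.select("item").select("dc|creator") for a whole feed would instead join the text of every matching element into one string.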

Hope this helps! Happy Coding!
