Overview
For web/HTML scraping, Etlworks includes jsoup, a widely used Java library for parsing HTML. jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.
By calling jsoup methods from JavaScript or Python code, you can parse a web page or an HTML string into a DOM, then traverse the DOM to find the required elements.
The examples below load the web page https://www.eia.gov/analysis/projection-data.php into a string variable, parse the HTML, extract the links to the Excel documents along with the link titles, and store them in a variable (files) for further use by other Flows.
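To make the targeted markup easier to picture, here is a minimal sketch that can be run outside Etlworks. It uses Python's standard-library html.parser as a stand-in for jsoup, and the HTML fragment is a hypothetical sample shaped like the structure the selectors in the examples imply (the real page may differ). The names SAMPLE_HTML, FormatLinkParser and extract_links are illustrative only. It is also a simplification: it collects the first link after every span.formats and does not restrict the search to the section under the annualproj anchor.

```python
from html.parser import HTMLParser

# Hypothetical fragment shaped like the markup that the selectors in the
# JavaScript and Python examples expect; the real page may differ.
SAMPLE_HTML = """
<h3><a name="annualproj"></a>Annual projections</h3>
<ul class="numbered">
  <li><span class="formats">XLS</span>
      <a href="/files/one.xlsx" title="1. First report">First report</a></li>
  <li><span class="formats">XLS</span>
      <a href="/files/two.xlsx" title="2. Second report">Second report</a></li>
</ul>
"""


class FormatLinkParser(HTMLParser):
    """Collects (href, title) from the first <a> after each <span class="formats">."""

    def __init__(self):
        super().__init__()
        self.expect_link = False  # True once a span.formats has been seen
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "formats" in (attrs.get("class") or "").split():
            self.expect_link = True
        elif tag == "a" and self.expect_link:
            # Raw attribute values, before any title cleanup
            self.links.append((attrs.get("href"), attrs.get("title")))
            self.expect_link = False


def extract_links(html):
    parser = FormatLinkParser()
    parser.feed(html)
    return parser.links


print(extract_links(SAMPLE_HTML))
# [('/files/one.xlsx', '1. First report'), ('/files/two.xlsx', '2. Second report')]
```

Inside Etlworks, the jsoup selectors shown in the examples do this work directly and also narrow the search to the right section of the page.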
JavaScript example
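// Load the web page's HTML into a string
var html = com.toolsverse.etl.core.task.common.FileManagerTask.read(etlConfig, 'html', null);

// Parse the HTML string into a jsoup Document (DOM)
var doc = org.jsoup.Jsoup.parse(html);

// Collect the results in a list and expose it to other Flows as 'files'
var files = new java.util.ArrayList();
etlConfig.setValue('files', files);

// Find the element that follows the <h3> heading containing <a name="annualproj">
var root = doc.select("h3 > a[name=annualproj]").get(0).parent().nextElementSibling();
var spans = root.select("ul.numbered li > span.formats");

// "for each ... in" is a JavaScript extension supported by Nashorn and Rhino
for each (var span in spans) {
    // The link to the Excel document is the span's next sibling element
    var link = span.nextElementSibling();
    var href = link.attr("href").toString();
    var title = link.attr("title").toString();

    // Keep only the text after the first period in the title
    title = title.substring(title.indexOf('.') + 1).trim();

    files.add(new com.toolsverse.util.TypedKeyValue(href, title));

    // log
    etlConfig.log("href: " + href + ", title: " + title);
}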
Python example
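from org.jsoup import Jsoup
from com.toolsverse.etl.core.task.common import FileManagerTask
from com.toolsverse.util import TypedKeyValue
from java.util import ArrayList

# Load the web page's HTML into a string
html = FileManagerTask.read(etlConfig, 'html', None)

# Parse the HTML string into a jsoup Document (DOM)
doc = Jsoup.parse(html)

# Collect the results in a list and expose it to other Flows as 'files'
files = ArrayList()
etlConfig.setValue('files', files)

# Find the element that follows the <h3> heading containing <a name="annualproj">
root = doc.select("h3 > a[name=annualproj]").get(0).parent().nextElementSibling()
spans = root.select("ul.numbered li > span.formats")

for span in spans:
    # The link to the Excel document is the span's next sibling element
    link = span.nextElementSibling()
    href = link.attr("href")
    title = link.attr("title")

    # Keep only the text after the first period. find() returns -1 when there is
    # no period, so such titles are kept whole (same as indexOf in JavaScript).
    title = title[title.find(".") + 1:].strip()

    files.add(TypedKeyValue(href, title))
    etlConfig.log("href: " + href + ", title: " + title)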
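The title cleanup step in both examples keeps only the text after the first period. A minimal standalone sketch of that step follows. The helper name clean_title and the sample titles are hypothetical, chosen only to show the behavior. It uses str.find, which returns -1 when there is no period, so a title without one is returned whole, matching indexOf in the JavaScript example.

```python
def clean_title(title):
    """Return the text after the first period, with surrounding whitespace removed.

    If the title contains no period, find() returns -1, the slice starts at 0,
    and the whole (stripped) title is returned.
    """
    return title[title.find(".") + 1:].strip()


# Hypothetical sample titles
for sample in ["1. Annual projections", "2. Vol. 1 tables", "No period here"]:
    print(repr(clean_title(sample)))
```

Only the first period is treated as the separator, so any later periods stay in the title.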