May 3, 2018 | By Aditi Hedge
Digesting varied and vast amounts of data and synthesizing its meaning can be a complicated—but rewarding—undertaking. Extensible Markup Language (XML) is a data format popular in many industries, including the semiconductor and manufacturing sectors, where it is used to capture and record data from sensors. Processing this XML lets you derive values from the data and supports analytics and data forecasting.
XML has a set of rules that encodes data to make it self-describing: the schema is defined within the document's own structure, making it both human- and computer-readable.
The basic building blocks of XML are tags. Each element of information is enclosed between a start tag and an end tag, and the element name describes the content. An outermost root element contains all other elements in an XML document, and elements can be nested to form hierarchical structures.
XML is semi-structured. Since its structure is variable by design, there is no fixed mapping. Thus, to process XML in Hadoop, you need to know which tags contain the data to extract.
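For illustration, here is a small, hypothetical XML document in which <feed> is the root element and each <food> element nests further tags:

    <feed>
      <food>
        <name>Belgian Waffles</name>
        <price>5.95</price>
      </food>
      <food>
        <name>French Toast</name>
        <price>4.50</price>
      </food>
    </feed>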
Apache Pig is a tool that can be used to analyse XML documents, representing them as data flows. Pig Latin, its scripting language, supports Extract, Transform, Load (ETL) operations, ad hoc data analysis and iterative processing. Pig scripts are internally converted to MapReduce jobs. They are procedural and use lazy evaluation, i.e., unless an output is required, the steps aren't executed.
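A minimal sketch of this lazy behaviour (file and relation names are hypothetical): the LOAD and FILTER lines only build a logical plan, and nothing runs until DUMP requests output.

    raw      = LOAD 'data.txt' USING TextLoader() AS (line:chararray);  -- builds a plan only
    filtered = FILTER raw BY line IS NOT NULL;                          -- still nothing executed
    DUMP filtered;                                                      -- triggers the MapReduce job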
To process XML in Pig, piggybank.jar is essential. This jar contains a UDF called XMLLoader() that is used to read the XML document.
The complete flow runs from extraction (loading the XML into Pig), through transformation, to storage and analysis.
To use the piggybank jar with XML, first download the jar and register its path in Pig, then load the document with XMLLoader().
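A minimal sketch, assuming piggybank.jar and the sample file feed.xml sit in the working directory (both paths are hypothetical):

    REGISTER 'piggybank.jar';
    -- Extract every <food> element from the document, one chararray per tuple
    A = LOAD 'feed.xml' USING org.apache.pig.piggybank.storage.XMLLoader('food') AS (x:chararray);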
In the above example, feed is the root element and the tag to be extracted is food.
If all the elements sit directly under the root element with no intermediate parent tag, then load the root element itself using XMLLoader().
Use regular expressions to extract the data between tags. Regular expressions work well for simple, non-nested tags in the document (for example, a single <title> tag).
For nested tags, writing regular expressions becomes tedious and error-prone: if even a single character in the expression is wrong, the output will be null.
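For instance, Pig's built-in REGEX_EXTRACT can pull the text between <name> and </name> out of the relation A loaded above (the tag and capture-group index follow the hypothetical sample document):

    -- Pull the text between <name> and </name> out of each <food> element
    names = FOREACH A GENERATE REGEX_EXTRACT(x, '<name>(.*)</name>', 1) AS name:chararray;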
Dump the data to see the extracted data.
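Assuming the names relation from the sketch above:

    DUMP names;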
For nested structures, XPath is a better option: it uses path expressions to access a node.
The fully qualified name of the XPath UDF is a long string, org.apache.pig.piggybank.evaluation.xml.XPath. Thus, you should define a short temporary function name for simplicity and ease of use.
To access a particular element, start by loading the parent node and navigate down to the required tag.
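Assuming piggybank.jar is already registered, a DEFINE statement gives the UDF a short alias:

    DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();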
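A sketch against a hypothetical metering file usage.xml, in which <IntervalReading> is the repeating parent tag and <timePeriod>/<start>, <cost> and <value> are its children (all names are assumptions about the source file):

    -- Each <IntervalReading> element arrives as one chararray per tuple
    readings = LOAD 'usage.xml'
               USING org.apache.pig.piggybank.storage.XMLLoader('IntervalReading')
               AS (reading:chararray);
    -- Navigate from the parent node down to the required child tags
    interval_data = FOREACH readings GENERATE
        XPath(reading, 'IntervalReading/timePeriod/start') AS start_time,
        XPath(reading, 'IntervalReading/cost')             AS cost,
        XPath(reading, 'IntervalReading/value')            AS reading_value;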
Note that each repeating parent node becomes a separate row, and its child nodes become columns. In the file above, the tag <IntervalReading> repeats, so upon extraction each <IntervalReading> element becomes a new row, with the tags under it becoming attributes.
Dump the data to see the extracted data.
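Assuming the interval_data relation from the sketch above:

    DUMP interval_data;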
Various transformations can be performed in Pig on the extracted data.
Below is an example showing the conversion of a date and the calculation of per-unit cost. If multiple files are present, you will need to add a key to the data. To add a unique key, load the file separately from the XML into its own dataset and create a new dataset with the required columns.
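A sketch continuing from interval_data above; the timestamp format and the field types are assumptions about the source data:

    transformed = FOREACH interval_data GENERATE
        ToDate(start_time, 'yyyy-MM-dd HH:mm:ss') AS start_date,     -- assumed timestamp format
        (double)cost / (double)reading_value      AS per_unit_cost;  -- cost per unit consumed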
Below is an example:
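One way to generate a unique key is Pig's RANK operator, which prepends a sequential id to each row (relation names continue the sketch above):

    -- RANK prepends a unique sequential column named rank_transformed
    keyed      = RANK transformed;
    final_data = FOREACH keyed GENERATE
        rank_transformed AS id,
        start_date,
        per_unit_cost;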
Dump and view the data.
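Assuming the final_data relation from the sketch above:

    DUMP final_data;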
The datasets can be stored in a file with the required delimiter.
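For example, storing the final relation with a pipe delimiter via PigStorage (the output path is hypothetical):

    STORE final_data INTO '/user/hadoop/output/interval_data' USING PigStorage('|');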
The first time, create a table in Hive, then load the data into the table created.
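A HiveQL sketch matching the pipe-delimited output above (table and column names are hypothetical):

    CREATE TABLE interval_data (
        id            BIGINT,
        start_date    STRING,
        per_unit_cost DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

    LOAD DATA INPATH '/user/hadoop/output/interval_data' INTO TABLE interval_data;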
The structured data can then be visualized in tools such as OBIEE, BDD, Tableau or Kibana.