Research Journal of Engineering and Technology
Year : 2015, Volume : 6, Issue : 3
First page : (381) Last page : (386)
Print ISSN : 0976-2973. Online ISSN : 2321-581X.
Article DOI : 10.5958/2321-581X.2015.00060.4

Web Data Extraction and Alignment Tools: A Survey

Nikam Pranali, Gote Yogita, Ghogare Vidhya*, Rapalli Jyothi

Student, Department of I.T, DYPIET, Pune

*Corresponding Author Email: ghogare.vidhya@gmail.com

Online published on 2 December, 2015.


Data extraction from web pages is the process of analyzing data sources (usually unstructured or poorly structured) and retrieving relevant data from them in a specific pattern for further processing; it also involves adding metadata and data integration details for later stages of the data workflow. This survey gives an overview of different web data extraction and data alignment techniques. The extraction techniques covered are DeLa, DEPTA, ViPER, and ViNTs. The data alignment techniques are pairwise QRR alignment, holistic alignment, and nested structure processing. Query result pages are generated by a web database in response to a user's query. Automatically extracting the data from these query result pages is very important for many applications, such as data integration, that need to cooperate with multiple web databases. A new data extraction method is proposed that combines both tag and value similarity. It automatically extracts data from query result pages by first identifying and segmenting the query result records (QRRs) in the query result pages and then aligning the segmented QRRs into a table in which the data values from the same attribute are put into the same column. The data region identification method identifies noncontiguous QRRs that have the same parent according to their tag similarities. Specifically, we propose new techniques to handle the case when the QRRs are not contiguous, which may be due to the presence of auxiliary information, such as a comment, recommendation, or advertisement, and to handle any nested structure that may exist in the QRRs.
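
The two stages described above (identifying QRRs among sibling nodes, then aligning them into a table) can be illustrated with a minimal sketch. This is not the surveyed method itself: it assumes each sibling node has already been reduced to a pair of (tag sequence, text values), it uses only tag similarity computed with Python's standard `difflib`, and it aligns values by position rather than by value similarity. The function names, the sample data, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def tag_similarity(tags_a, tags_b):
    """Return a similarity score in [0, 1] between two tag sequences."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def identify_records(siblings, threshold=0.8):
    """Return the largest group of siblings with mutually similar tag sequences.

    Siblings do not have to be adjacent, so auxiliary nodes such as an
    advertisement sitting between two records are simply left out.
    The threshold is an assumed value, not one taken from the survey.
    """
    best = []
    for ref_tags, _ in siblings:
        group = [s for s in siblings
                 if tag_similarity(ref_tags, s[0]) >= threshold]
        if len(group) > len(best):
            best = group
    return best


def align(records):
    """Naive positional alignment: the i-th value of each record goes to column i.

    Missing trailing values are padded with None. A real aligner would
    also compare the values themselves before choosing a column.
    """
    width = max(len(values) for _, values in records)
    return [list(values) + [None] * (width - len(values))
            for _, values in records]


# Hypothetical sibling nodes under one parent: (tag sequence, text values).
siblings = [
    (["div", "a", "span", "span"], ["Book A", "Author X", "$10"]),
    (["div", "img"], ["Sponsored ad"]),  # auxiliary information
    (["div", "a", "span", "span"], ["Book B", "Author Y", "$12"]),
    (["div", "a", "span"], ["Book C", "Author Z"]),
]

records = identify_records(siblings)
table = align(records)
for row in table:
    print(row)
```

In this toy input the advertisement shares only its outer tag with the other siblings, so it falls below the threshold and the three book entries are kept even though they are not contiguous. Positional alignment is the weakest part of the sketch: it breaks as soon as an optional attribute is missing from the middle of a record, which is the situation the pairwise and holistic alignment steps are meant to handle.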



Keywords: Combining Tag and Value Similarity (CTVS), Query Result Record (QRR), Data Extraction and Label Assignment for Web Databases (DeLa), Data Extraction Based on Partial Tree Alignment (DEPTA), Visual Perception-based Extraction of Records (ViPER), Visual Information and Tag Structure based wrapper generator (ViNTs).


