DATA AUGMENTATION

The goal of the project was to take data collected by one of our previous scrapers and find additional information about companies in Spain and the people working in them. Company data was stored in MongoDB. We used each company's CIF as an identifier to find and grab missing company details (phones, annual accounts, etc.) on the infocif.es website.
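As a rough illustration of the "fill in missing details" step, here is a minimal sketch of how a stored company record might be merged with freshly scraped fields. The function name `fill_missing` and the field names (`cif`, `phones`, `annual_accounts`) are hypothetical and not taken from the actual project.

```python
def fill_missing(company, scraped):
    """Return a copy of `company` with empty or absent fields filled from `scraped`.

    Fields that already hold a value in `company` are left untouched.
    Field names are illustrative assumptions, not the real schema.
    """
    empty = (None, "", [])
    merged = dict(company)  # shallow copy so the stored record is not mutated
    for field, value in scraped.items():
        if value in empty:
            continue  # nothing useful was scraped for this field
        if merged.get(field) in empty:
            merged[field] = value
    return merged


if __name__ == "__main__":
    stored = {"cif": "B12345678", "name": "Acme SL", "phones": []}
    found = {"phones": ["+34 900 000 000"], "name": "Other", "annual_accounts": None}
    print(fill_missing(stored, found))
```

With a MongoDB driver such as pymongo, the newly filled fields could then be written back with something like `collection.update_one({"cif": cif}, {"$set": new_fields})`, again assuming the record is keyed on a `cif` field.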

To grab the NIF and website, we built a new scraper that crawled the axesor.es website and stored the required data in the main database. The final step was to define and save relationships between the companies, which was done with one more scraper for the infoempresa.com website.

In all cases we used multi-threaded scrapers to collect the necessary data as quickly as possible. In total, we processed more than ten million records. To mask scraper activity and imitate human user behavior, we used a custom proxy solution.
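The general pattern of a threaded scraper with rotating proxies can be sketched as follows. This is only an illustration under stated assumptions: `ProxyRotator`, `scrape_all`, and the `fetch(cif, proxy)` callback are hypothetical names, and simple round-robin rotation stands in for the custom proxy solution, whose details are not described here.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
import threading


class ProxyRotator:
    """Hands out proxies round-robin; safe to call from many threads."""

    def __init__(self, proxies):
        self._cycle = cycle(proxies)
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            return next(self._cycle)


def scrape_all(cifs, fetch, proxies, workers=8):
    """Run fetch(cif, proxy) for every CIF on a thread pool.

    Returns a dict mapping each CIF to its result, or to the exception
    raised by `fetch`, so one failed page does not stop the whole run.
    """
    rotator = ProxyRotator(proxies)

    def task(cif):
        try:
            return cif, fetch(cif, rotator.next())
        except Exception as exc:  # record the failure and keep going
            return cif, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, cifs))


if __name__ == "__main__":
    # Stand-in for a real HTTP request; no network access is performed.
    def fake_fetch(cif, proxy):
        return {"cif": cif, "via": proxy}

    print(scrape_all(["A1", "B2", "C3"], fake_fetch, ["proxy-1", "proxy-2"], workers=2))
```

Passing `fetch` in as a parameter keeps the threading and proxy logic independent of any particular site, so the same pool could serve each of the scrapers mentioned above.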

HIGHLIGHTS

Multi-threaded scraper

Human Simulation

MongoDB