The aim of this project is to scrape a website, extract useful information, and export it in XML format.

Extraction and the structure of the result are handled by a specific provider in ./providers/. The result is then converted to XML by the generic index.js.

Below is an example of the XML result produced by ./providers/TRI-events.js.

<?xml version="1.0" encoding="UTF-8"?>
   <event id="26500">
      <title>Lorem ipsum</title>
      <start_date>2023-11-14 00:00:00</start_date>
      <end_date>2023-11-15 00:00:00</end_date>
      <where />
      <description />
   </event>
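
As a rough illustration of the provider/index.js split described above, the sketch below shows how a provider's result might be serialized into the XML shape shown in the example. The function names (`escapeXml`, `fieldToXml`, `eventToXml`, `toXml`) and the shape of the `events` array are assumptions made for this example, not the project's actual code; the real index.js may well use an XML library instead. The example above shows no root element, so this sketch does not add one.

```javascript
// Hypothetical provider result: a plain array of event objects.
const events = [
  {
    id: 26500,
    title: 'Lorem ipsum',
    start_date: '2023-11-14 00:00:00',
    end_date: '2023-11-15 00:00:00',
    where: '',
    description: '',
  },
];

// Escape the five characters that are special in XML text and attributes.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Render one field; empty values become a self-closing tag such as <where />.
function fieldToXml(name, value) {
  if (value === undefined || value === null || value === '') {
    return `      <${name} />`;
  }
  return `      <${name}>${escapeXml(value)}</${name}>`;
}

// Render one event: the id becomes an attribute, every other key a child element.
function eventToXml(event) {
  const { id, ...fields } = event;
  const body = Object.entries(fields).map(([name, value]) => fieldToXml(name, value));
  return [`   <event id="${escapeXml(id)}">`, ...body, '   </event>'].join('\n');
}

// Prepend the XML declaration and join all events.
function toXml(list) {
  return ['<?xml version="1.0" encoding="UTF-8"?>', ...list.map(eventToXml)].join('\n');
}

console.log(toXml(events));
```

Keeping the serialization generic like this means a new provider only has to return objects with the expected keys; it never needs to know about XML.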

Getting started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.


Install packages

$ npm install

Set up env vars

$ echo "WEBPAGE_URL=https://webpage-to-scrap" >> .env

Run locally

$ node index.js

With Node.js v18.11.0+ (or v19.0.0+), you can run Node in watch mode with the node --watch option, which restarts the process whenever an imported file changes.

Running the tests

$ npm run test


The steps below show how to deploy the Node.js service on Google Cloud Run.

Install the gcloud CLI

% python3 -V
Python 3.9.10
% cd ~/Downloads
Downloads % curl | bash
// When prompted, do not install Python 3.7; it takes a very long time to install.
Downloads % exec -l $SHELL

Initialize the gcloud CLI

Select your Google Cloud account and project.

% gcloud init

Deploy on Google Cloud Run

Deploy new services and new revisions to Cloud Run directly from source code using a single gcloud CLI command, gcloud run deploy with the --source flag.

mmb-service.irmobi-scrap % gcloud config set run/region southamerica-east1 
mmb-service.irmobi-scrap % gcloud run deploy irmobi-scrap --source .
API [] not enabled on project [971061791161]. Would you like to enable and retry (this will take a few minutes)? (y/N)?  y

API [] not enabled on project [971061791161]. Would you like to enable and retry (this will take a few minutes)? (y/N)?  y

Enabling service [] on project [971061791161]...
Deploying from source requires an Artifact Registry Docker repository to store built containers. A repository named [cloud-run-source-deploy] in region [southamerica-east1] will be created.

Do you want to continue (Y/n)?  y

This command is equivalent to running `gcloud builds submit --tag [IMAGE] .` and `gcloud run deploy irmobi-scrap --image [IMAGE]`

Allow unauthenticated invocations to [irmobi-scrap] (y/N)?  y

Building using Dockerfile and deploying container to Cloud Run service [irmobi-scrap] in project [irmobi-trisul-a3126] region [southamerica-east1]
✓ Building and deploying new service... Done.
  ✓ Creating Container Repository...
  ✓ Uploading sources...
  ✓ Building Container... Logs are available at [].
  ✓ Creating Revision... Creating Service.
  ✓ Routing traffic...
  ✓ Setting IAM Policy...
Service [irmobi-scrap] revision [irmobi-scrap-00001-wal] has been deployed and is serving 100 percent of traffic.
Service URL:
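
For the deployed revision to serve traffic, the container has to answer HTTP requests on the port that Cloud Run passes in the PORT environment variable. The sketch below shows one way the service could do that with Node's built-in http module. `resolvePort`, `createXmlServer`, and the `getXml` callback are hypothetical names for this example; how the project's index.js actually serves the XML is not shown in this README.

```javascript
const http = require('node:http');

// Cloud Run passes the port to listen on in the PORT environment variable;
// 8080 is used here as a fallback for local runs.
function resolvePort(env = process.env) {
  const port = Number(env.PORT || 8080);
  if (!Number.isInteger(port) || port <= 0 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }
  return port;
}

// getXml is a hypothetical async function that scrapes the page
// and resolves to the XML string to return.
function createXmlServer(getXml) {
  return http.createServer(async (req, res) => {
    try {
      const xml = await getXml();
      res.writeHead(200, { 'Content-Type': 'application/xml; charset=utf-8' });
      res.end(xml);
    } catch (err) {
      // Log the cause for Cloud Run's log viewer and return a generic error.
      console.error(err);
      res.writeHead(502, { 'Content-Type': 'text/plain; charset=utf-8' });
      res.end('Scraping failed');
    }
  });
}
```

Under these assumptions, index.js could start the service with `createXmlServer(buildXml).listen(resolvePort())`, where `buildXml` stands for whatever function produces the XML.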

Use environment variables on Google Cloud Run

% gcloud run services update irmobi-scrap --set-env-vars "WEBPAGE_URL=..." --set-env-vars "UUID_NAMESPACE=..."
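
At runtime the service would read these values from process.env, whether they come from the local .env file or from Cloud Run. The following is a minimal sketch of that step, assuming a hypothetical `loadConfig` helper; it treats WEBPAGE_URL as required and UUID_NAMESPACE as an opaque optional string, since this README does not say how the namespace is used.

```javascript
// Hypothetical helper: collect the settings the service needs from the environment.
function loadConfig(env = process.env) {
  if (!env.WEBPAGE_URL) {
    throw new Error('WEBPAGE_URL is not set');
  }
  return {
    // new URL() throws a TypeError if the value is not a valid absolute URL.
    webpageUrl: new URL(env.WEBPAGE_URL),
    // Assumption: UUID_NAMESPACE is kept as-is; its exact use is not documented here.
    uuidNamespace: env.UUID_NAMESPACE || null,
  };
}
```

Validating the variables once at startup makes a missing or malformed value fail immediately, rather than on the first scrape.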


