# 🕷️ ScrapeWave: Web Scraping using Puppeteer in a Docker container with a UI dashboard and database.

![ScrapeWave - first run](https://cdn.hashnode.com/res/hashnode/image/upload/v1715412647828/5c764f07-7099-4e05-93cc-5e60a540f400.png align="center")

%[https://github.com/git-bhanu/scrapewave] 

I created the 🕷️ **ScrapeWave** project to help me accomplish some scraping tasks.

The requirements for the tool were:

1. Access a particular link.
    
2. Parse the content of the page and save it somewhere.
    
3. Log errors when something unexpected happens.
    
4. Start and stop the scraping.
    
5. Add more data to the tool.
    
6. Import data into and export data from the tool.
    
7. Provide a UI where the person administering the tool can see the progress.
    
8. Deploy multiple instances easily and quickly.
    

This project uses multiple technologies to meet these requirements.

1. Server
    
    1. [ViteExpress](https://github.com/szymmis/vite-express) - Vite integration module for Express.
        
    2. [SQLite](https://github.com/kriasoft/node-sqlite) - SQLite client wrapper around sqlite3 for Node.js applications, with an SQL-based migrations API, written in TypeScript.
        
    3. [Puppeteer](https://github.com/puppeteer/puppeteer) - Node.js API for Chrome.
        
    4. [Socket.IO](https://github.com/socketio/socket.io) - Realtime application framework (Node.js server).
        
    5. Other packages worth mentioning:
        
        * axios
            
        * fast-csv
            
2. UI
    
    1. [Vue.js](https://vuejs.org/) - A progressive, incrementally adoptable JavaScript framework for building UI on the web.
        
    2. [Pinia](https://pinia.vuejs.org/) - Intuitive, type-safe, light, and flexible store for Vue using the Composition API, with DevTools support.
        
    3. [Vuetify](https://vuetifyjs.com/en/getting-started/installation/) - Vue Component Framework
        
3. [Docker](https://www.docker.com/) - Accelerated Container Application Development
    

## ScrapeWave in action

### Adding data to your tool

You can add data to your scraper using CSV import.

Head over to the `upload` page and upload the data.

You can use the `sample.csv` file included in the repository.

%[https://github.com/git-bhanu/scrapewave/blob/main/sample.csv] 

After the import, the three entries from the file appear in the table on the home page.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1715413005102/2edb988d-7921-4c62-afe3-2adeaf06baca.png align="center")

### Starting the script

You can start the scraping process by clicking on the start button on the home page.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1715413221962/10aaf8ce-5db9-4285-923f-0e21a21c287f.png align="center")

As you can see from the image above, scraping is done for the first two entries. You also get readable logs in your terminal that show the current status of the software.

## Deploying ScrapeWave in the cloud

Once you have customized the project to your liking (the repo's readme explains how), you can build a Docker image.

You can name the image whatever you want.

```bash
docker build -t bhanu/scrapewave:latest .
```

Here, `bhanu` is the Docker Hub username and `scrapewave` is the name of the image.

You can push the image to Docker Hub using the following command.

*Make sure you are logged in to your Docker Hub account (run `docker login` if you are not).*

```bash
docker push bhanu/scrapewave:latest
```

Once that is done, connect to your server over SSH and install two pieces of software: Docker and Nginx.

* To install Docker, follow the steps here: [https://www.digitalocean.com/community/questions/how-to-install-and-run-docker-on-digitalocean-dorplet](https://www.digitalocean.com/community/questions/how-to-install-and-run-docker-on-digitalocean-dorplet)
    
* To install Nginx, follow the steps here: [https://www.digitalocean.com/community/tutorials/how-to-install-nginx-on-ubuntu-18-04](https://www.digitalocean.com/community/tutorials/how-to-install-nginx-on-ubuntu-18-04)
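Nginx usually sits in front of a container like this as a reverse proxy. Here is a minimal sketch of what that could look like, assuming the container is published on host port 8080 and using a placeholder domain, `scraper.example.com`. The `render_nginx_site` helper is hypothetical and not part of ScrapeWave; it only prints a server block so you can review it first.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints an Nginx server block that proxies a domain
# to a container published on a local port.
render_nginx_site() {
  local server_name="$1"  # placeholder domain, e.g. scraper.example.com
  local port="$2"         # host port the container is published on
  cat <<EOF
server {
    listen 80;
    server_name ${server_name};

    location / {
        proxy_pass http://127.0.0.1:${port};
        # Socket.IO can upgrade to WebSocket, which needs HTTP/1.1
        # and the Upgrade/Connection headers passed through.
        proxy_http_version 1.1;
        proxy_set_header Upgrade \$http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host \$host;
    }
}
EOF
}

# Print the config for review (nothing is written to disk).
render_nginx_site scraper.example.com 8080
```

On Ubuntu you would typically save the output under `/etc/nginx/sites-available/`, link it into `sites-enabled`, and then run `sudo nginx -t` before reloading Nginx.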

Once both are installed, you can create a container with `docker run -dit --name <instance-name> -p <port>:3000 bhanu/scrapewave:latest`.

For example:

```bash
docker run -dit -v instance-1:/usr/src/app/db --name instance-1 -p 8080:3000 --restart on-failure bhanu/scrapewave:latest
```

This command does quite a few things:

1. Creates a container named `instance-1` using the `latest` tag of `bhanu/scrapewave`.
    
2. Mounts a named Docker volume, `instance-1`, at `/usr/src/app/db`, where the SQLite database lives, so your data persists even if you recreate the container with the same volume.
    
3. Publishes the app's port 3000 on port 8080 of the host.
    
4. Restarts the container if it exits with an error (a non-zero exit code), so that your scraping doesn't stop.
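Since each instance only needs its own name, volume, and host port, running several of them is mostly a matter of repeating the `docker run` command above with different values. Here is a small sketch of that idea. The `print_instance_commands` helper is hypothetical, and it assumes the `instance-N` naming and consecutive ports starting at 8080. It only prints the commands so you can check them before running anything.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints one `docker run` command per instance.
# Each instance gets its own name, its own named volume, and its own host port.
print_instance_commands() {
  local count="$1"      # number of instances to start
  local base_port="$2"  # host port of the first instance
  local image="$3"      # image to run, e.g. bhanu/scrapewave:latest
  local i
  for ((i = 1; i <= count; i++)); do
    echo "docker run -dit -v instance-${i}:/usr/src/app/db" \
         "--name instance-${i} -p $((base_port + i - 1)):3000" \
         "--restart on-failure ${image}"
  done
}

# Dry run: review the commands before executing them.
print_instance_commands 3 8080 bhanu/scrapewave:latest
```

Once the output looks right, you can pipe it to `bash` to actually start the containers.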
    

I plan to keep improving this project so that it helps me and other folks use it as a base to get up and running with their scraping tasks.

Feel free to raise issues if you see anything which is broken - [https://github.com/git-bhanu/scrapewave/issues](https://github.com/git-bhanu/scrapewave/issues)

Contributions of any kind are also appreciated.
