Extracting Web Data with Node.js and Puppeteer

Automating the extraction of transcripts from YouTube videos


Web scraping applications are built these days for web indexing, web automation, data mining, and website change detection. Scraping can also be abused for hack-like activities, using these applications to view and retrieve hidden or, in the worst cases, private data from websites. But it becomes genuinely helpful for automation, such as automatically filling out forms and performing other web activities.

There are many web scraping frameworks out there, built for both non-programmers and programmers and available in different programming languages.

In this tutorial, we will extract transcripts (captions) from YouTube videos using Node.js and Puppeteer, a Node library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.

Installation and Setup

  • First, we need to have Node.js installed on our PC. Link for installation:
  • Then we initialize an npm project:
    npm init
    
  • Also, we need to run the installation command for the puppeteer npm package:
    npm install puppeteer
    
    We will also need the node-fetch npm package for making fetch requests:
    npm install node-fetch
    

Working Example

In this tutorial, we will navigate to a YouTube video by its URL and extract the transcript from the page. Sounds like a fun little hack, doesn't it? Let's get started...

const puppeteer = require('puppeteer');
const fetch = require('node-fetch');

async function getTranscripts(url) {
  let captionUrl = "";
  try {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Resolve once the player requests the caption (timedtext) endpoint
    let listener = new Promise((resolve) => {
      page.on('response', (response) => {
        if (response.url().startsWith('https://www.youtube.com/api/timedtext')) {
          captionUrl = response.url();
          resolve();
        }
      });
    });

    await page.goto(url);
    await page.click('.ytp-subtitles-button'); // toggle captions on the player
    await listener;
    await browser.close();
  }
  catch (err) {
    console.log(err);
  }
}

From the code above, we created a basic function that accepts a YouTube URL and extracts the URL of the caption data.

  • Puppeteer has a .launch() method which creates an instance of Chrome or Chromium and accepts some parameters.

  • The headless parameter determines whether the browser launches without a visible window. By default it is true; here we pass headless: false so we can watch the browser work. A list of other launch parameters can be found here

  • The .newPage() method creates a new tab in the browser.
  • .goto() accepts a URL argument and navigates to the specified URL.
  • .click() clicks the element matching a CSS selector; here it presses the player's subtitles button to trigger the caption request.
  • A listener was created to watch the responses the page receives and compare each response URL with YouTube's caption endpoint. When a match is found, the promise resolves and the URL is stored in a variable for a further API request.

The extracted caption URL is then used with node-fetch to request the video's captions.

    if (captionUrl.includes('https://www.youtube.com/api/timedtext')) {
      const response = await fetch(captionUrl);
      const data = await response.json();
      const captions = data.events;

      let captions_data = [];
      captions.forEach(caption => {
        captions_data.push({
          time: caption.tStartMs,            // caption start time in milliseconds
          text: JSON.stringify(caption.segs) // caption segments kept as a JSON string
        });
      });

      return captions_data;
    }

The captions received are processed into an array, which is returned to the caller.
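The processing step can be exercised on its own with a hand-made sample payload. The real events array comes from the timedtext response; the sample below is made up to mimic its shape:

```javascript
// Sample shaped like the `events` array from the caption response (made up).
const sampleEvents = [
  { tStartMs: 0, segs: [{ utf8: 'Hello' }] },
  { tStartMs: 1200, segs: [{ utf8: 'world' }] }
];

let captions_data = [];
sampleEvents.forEach(caption => {
  captions_data.push({
    time: caption.tStartMs,            // start time in milliseconds
    text: JSON.stringify(caption.segs) // segments kept as a JSON string
  });
});

console.log(captions_data[0].text); // [{"utf8":"Hello"}]
```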

The complete code:

const puppeteer = require('puppeteer');
const fetch = require('node-fetch');

async function getTranscripts(url) {
  let captionUrl = "";
  try {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Resolve once the player requests the caption (timedtext) endpoint
    let listener = new Promise((resolve) => {
      page.on('response', (response) => {
        if (response.url().startsWith('https://www.youtube.com/api/timedtext')) {
          captionUrl = response.url();
          resolve();
        }
      });
    });

    await page.goto(url);
    await page.click('.ytp-subtitles-button'); // toggle captions on the player
    await listener;
    await browser.close();

    if (captionUrl.includes('https://www.youtube.com/api/timedtext')) {
      const response = await fetch(captionUrl);
      const data = await response.json();
      const captions = data.events;

      let captions_data = [];
      captions.forEach(caption => {
        captions_data.push({
          time: caption.tStartMs,            // caption start time in milliseconds
          text: JSON.stringify(caption.segs) // caption segments kept as a JSON string
        });
      });

      return captions_data;
    }
  }
  catch (err) {
    console.log(err);
  }
}
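The {time, text} entries returned by getTranscripts can then be post-processed. As an example, a hypothetical helper (assuming each segment carries a utf8 field, as in the sample data below) can rebuild a plain-text transcript:

```javascript
// Hypothetical helper: rebuild a plain-text transcript from the
// { time, text } entries produced above (text holds segs as a JSON string).
function toPlainText(captions_data) {
  return captions_data
    .map(entry => JSON.parse(entry.text)  // recover the segments array
      .map(seg => seg.utf8 || '')         // pull out each segment's text
      .join(''))
    .join(' ')
    .trim();
}

const sample = [
  { time: 0, text: '[{"utf8":"Hello"}]' },
  { time: 1200, text: '[{"utf8":"world"}]' }
];
console.log(toPlainText(sample)); // Hello world
```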

We have successfully automated the extraction of caption data from YouTube by leveraging the power of Puppeteer.

Wow... a nice little hack, I would call it; one of many things that can be achieved with Puppeteer and other web scraping frameworks.

I actually built a web app using this technique for searching for text in YouTube videos; do check it out: Youtube word Search

Thanks.