Web scraping YouTube autocomplete with Nodejs

A step-by-step tutorial on creating a YouTube autocomplete web scraper in Node.js.

What will be scraped

[Image: YouTube autocomplete suggestions that will be scraped]

📌Note: For now, we don't have an API that supports extracting autocomplete data.

This blog post shows how you can do it yourself in the meantime, while we're working on releasing a proper API. We'll post an update on our Twitter once this API is released.

Full code

If you don't need an explanation, have a look at the full code example in the online IDE.

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const queries = ["javascript", "node", "web scraping"];
const URL = "https://www.youtube.com";

async function getYoutubeAutocomplete() {
  const browser = await puppeteer.launch({
    headless: false,
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);
  await page.waitForSelector("#contents");

  const autocompleteResults = [];
  for (const query of queries) {
    await page.click("#search-input");
    await page.keyboard.type(query);
    await page.waitForTimeout(5000);
    const results = {
      query,
      autocompleteResults: await page.evaluate(() => {
        return Array.from(document.querySelectorAll(".sbdd_a li"))
          .map((el) => el.querySelector(".sbqs_c")?.textContent.trim())
          .filter((el) => el);
      }),
    };
    autocompleteResults.push(results);
    await page.click("#search-clear-button");
    await page.waitForTimeout(2000);
  }

  await browser.close();

  return autocompleteResults;
}

getYoutubeAutocomplete().then(console.log);

Preparation

First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (or Chrome, or Firefox; here we work only with Chromium, which is used by default) over the DevTools Protocol in headless or non-headless mode.

To do this, open the command line in the project directory and enter npm init -y, and then npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth.

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

📌Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth to make it harder for websites to detect that you are using headless Chromium or a web driver. You can check it on the Chrome headless tests website. The screenshot below shows the difference.

[Screenshot: Chrome headless test results with and without the stealth plugin]

Process

The SelectorGadget Chrome extension was used to grab CSS selectors by clicking on the desired element in the browser. If you have trouble understanding this, we have a dedicated Web Scraping with CSS Selectors blog post at SerpApi.

The GIF below illustrates the approach of selecting different parts of the results.

[GIF: selecting parts of the autocomplete results with SelectorGadget]

Code explanation

Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to prevent websites from detecting that you are using a web driver:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

Next, we tell puppeteer to use StealthPlugin, and define the search queries and the YouTube URL:

puppeteer.use(StealthPlugin());

const queries = ["javascript", "node", "web scraping"];
const URL = "https://www.youtube.com";

Next, write a function to control the browser and get the information:

async function getYoutubeAutocomplete() {
  ...
}

In this function, we first define browser using the puppeteer.launch({options}) method with the options headless: false and args: ["--no-sandbox", "--disable-setuid-sandbox"].

headless: false means the browser runs in non-headless mode, with a visible window, and the args array contains the arguments that allow the browser process to launch in the online IDE. Then we open a new page:

  const browser = await puppeteer.launch({
    headless: false,
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();
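
If you want to switch between a visible and a hidden browser window without editing the code, the launch options can be built by a small function. The following is only a sketch: the buildLaunchOptions name and the HEADLESS environment variable are assumptions made for illustration, not part of the original code.

```javascript
// Hypothetical helper: builds the options object passed to puppeteer.launch().
// The HEADLESS environment variable name is an assumption made for this sketch.
function buildLaunchOptions(env = process.env) {
  return {
    // headless: false opens a visible browser window, true hides it.
    headless: env.HEADLESS === "true",
    // Same sandbox flags as in the tutorial code.
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  };
}

// Usage inside getYoutubeAutocomplete():
//   const browser = await puppeteer.launch(buildLaunchOptions());
console.log(buildLaunchOptions({ HEADLESS: "true" }).headless); // true
console.log(buildLaunchOptions({}).headless); // false
```

With this sketch, running the script without the variable keeps the tutorial's behavior (headless: false).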

Next, we change the default navigation timeout (30 sec) to 60000 ms (1 min) with the .setDefaultNavigationTimeout() method to account for slow internet connections, go to URL with the .goto() method, and use the .waitForSelector() method to wait until an element matching the #contents selector appears on the page:

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);
  await page.waitForSelector("#contents");

Then, we define an array for the results called autocompleteResults and start a for...of loop to iterate over all queries:

  const autocompleteResults = [];
  for (const query of queries) {
    ...
  }

Next, in the loop we click on #search-input (.click() method), type the current query with the page.keyboard.type(query) method, and wait 5 seconds using the .waitForTimeout(5000) method:

    await page.click("#search-input");
    await page.keyboard.type(query);
    await page.waitForTimeout(5000);
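
A fixed 5-second pause is simple but slow. As an alternative, the step can wait only until a suggestion item shows up. The sketch below assumes the same selectors used in this post (#search-input and .sbdd_a li), which may stop matching when YouTube changes its markup; the typeAndWaitForSuggestions name is hypothetical.

```javascript
// Hypothetical helper: types a query and waits for suggestions to render.
// `page` is a Puppeteer Page. The selectors come from this tutorial and
// may need updating if YouTube changes its markup.
async function typeAndWaitForSuggestions(page, query, timeout = 5000) {
  await page.click("#search-input");
  await page.keyboard.type(query);
  try {
    // Resolves as soon as at least one visible suggestion item is found.
    await page.waitForSelector(".sbdd_a li", { visible: true, timeout });
    return true;
  } catch (error) {
    // No suggestion appeared within the timeout (or the selector is outdated).
    return false;
  }
}

// Usage inside the loop:
//   const found = await typeAndWaitForSuggestions(page, query);
```

One caveat: items left over from a previous query could match right away, so the fixed delay remains the more conservative choice if you see stale results.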

Then, we build the results object, which has query and autocompleteResults keys. We get autocompleteResults using the page.evaluate() method, which runs the function passed to it in the browser context.

There we use the .querySelectorAll() method, which returns a static NodeList of the document's elements that match the CSS selector in the brackets, and convert the result to an array with the Array.from() method so we can iterate over it.

After that, for each of the .sbdd_a li elements we find the element with the class name sbqs_c (.querySelector() method), get its raw text (textContent property) and remove whitespace from both ends of the string with the .trim() method. Because we sometimes find empty nodes at the end of the list, for which the optional chaining operator (?.) yields undefined, we filter the array and keep only the truthy values (.filter((el) => el)):

    const results = {
      query,
      autocompleteResults: await page.evaluate(() => {
        return Array.from(document.querySelectorAll(".sbdd_a li"))
          .map((el) => el.querySelector(".sbqs_c")?.textContent.trim())
          .filter((el) => el);
      }),
    };
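
To see how the map and filter steps behave without opening a browser, here is a small sketch that runs the same chain on stand-in objects. The fakeItems array and the extractSuggestions name are made up for illustration; in the real code the items are DOM elements.

```javascript
// Stand-ins for DOM elements: each has a querySelector() that returns
// either an object with textContent or null, like the real DOM method.
const fakeItems = [
  { querySelector: () => ({ textContent: "  node js tutorial " }) },
  { querySelector: () => ({ textContent: "node red" }) },
  { querySelector: () => null }, // empty trailing node
];

function extractSuggestions(items) {
  return items
    // ?. short-circuits to undefined when querySelector() returns null
    .map((el) => el.querySelector(".sbqs_c")?.textContent.trim())
    // drop undefined values and empty strings
    .filter((text) => text);
}

console.log(extractSuggestions(fakeItems)); // [ 'node js tutorial', 'node red' ]
```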

Next, we push the results object from the current iteration to the autocompleteResults array, click #search-clear-button to clear the search input, and wait 2 seconds before the next iteration:

    autocompleteResults.push(results);
    await page.click("#search-clear-button");
    await page.waitForTimeout(2000);
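
Newer Puppeteer releases deprecated page.waitForTimeout() and, as far as I know, later removed it. If the call fails in your setup, a plain Promise-based delay can take its place. This is a minimal sketch; the sleep name is just a common convention, not a Puppeteer API.

```javascript
// Promise-based delay built on setTimeout. It does not depend on the
// Puppeteer version in use.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Usage inside the loop, in place of page.waitForTimeout(2000):
//   await sleep(2000);
```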

And finally, we close the browser and return the received data:

  await browser.close();

  return autocompleteResults;
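
If a selector is not found and an error is thrown in the middle of the loop, browser.close() is never reached and the browser process keeps running. One way to guard against that is a try...finally wrapper. The sketch below is an optional variation, not the code used in this post; withBrowser, launch and task are hypothetical names.

```javascript
// Hypothetical wrapper: runs task(browser) and always closes the browser,
// even if the task throws. `launch` is any function that returns a browser,
// e.g. () => puppeteer.launch({ headless: false }).
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}

// Usage:
//   const results = await withBrowser(
//     () => puppeteer.launch({ headless: false }),
//     async (browser) => {
//       const page = await browser.newPage();
//       // ... scraping steps from this tutorial ...
//     }
//   );
```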

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Output

[
   {
      "query":"javascript",
      "autocompleteResults":[
         "javascript",
         "javascript tutorial for beginners",
         "javascript full course",
         "javascript tutorial",
         "javascript dom",
         "javascript mastery",
         "javascript course",
         "javascript interview questions and answers",
         "javascript for beginners",
         "javascript с нуля",
         "javascript project",
         "javascript ninja",
         "javascript game",
         "javascript interview"
      ]
   },
   {
      "query":"node",
      "autocompleteResults":[
         "node js",
         "node js tutorial",
         "node",
         "node js project",
         "node js express",
         "node js interview",
         "node video tutorial",
         "node video",
         "node js interview questions",
         "node js event loop",
         "node js уроки",
         "nodemailer",
         "node red",
         "nodemcu"
      ]
   },
   {
      "query":"web scraping",
      "autocompleteResults":[
         "web scraping weather data python",
         "web scraping",
         "web scraping python",
         "web scraping javascript",
         "web scraping amazon product",
         "web scraping amazon price",
         "web scraping amazon",
         "web scraping amazon reviews",
         "web scraping amazon reviews python",
         "web scraping indeed",
         "web scraping flight prices",
         "web scraping using python",
         "web scraping tutorial"
      ]
   }
]

Extract suggestions from Google Autocomplete Client

The previous example was the "hard" way. You can also get the data from the following URL, which outputs a txt file:

"https://clients1.google.com/complete/search?client=youtube&hl=en&q=minecraft"
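
Below is a rough sketch of how that URL could be requested and parsed from Node.js. It rests on assumptions that are not stated in this post: that the response body is a callback-style wrapper around a JSON array whose second element holds the suggestions, and that a global fetch is available (Node.js 18+). The endpoint is undocumented, so treat the format as something to verify yourself. All function names here are hypothetical.

```javascript
const BASE_URL = "https://clients1.google.com/complete/search";

// Builds the same kind of URL as shown above for any query.
function buildAutocompleteUrl(query, lang = "en") {
  const params = new URLSearchParams({ client: "youtube", hl: lang, q: query });
  return `${BASE_URL}?${params}`;
}

// Assumption: the body looks like
//   someCallback(["query",[["suggestion",0,[...]],...],{...}])
// so we parse the JSON between the first "(" and the last ")".
function parseAutocompleteResponse(body) {
  const start = body.indexOf("(");
  const end = body.lastIndexOf(")");
  if (start === -1 || end <= start) {
    throw new Error("Unexpected response format");
  }
  const data = JSON.parse(body.slice(start + 1, end));
  // data[1] is assumed to be the list of suggestions; item[0] is the text.
  return data[1].map((item) => item[0]);
}

// Requires a global fetch (Node.js 18+). Not called here to avoid
// a network request when the file is loaded.
async function fetchYoutubeSuggestions(query) {
  const response = await fetch(buildAutocompleteUrl(query));
  return parseAutocompleteResponse(await response.text());
}

// Usage:
//   fetchYoutubeSuggestions("minecraft").then(console.log);
```

This approach avoids launching a browser at all, at the cost of depending on a response format that may change without notice.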

If you want to see some projects made with SerpApi, please write me a message.

