Scrape Google Carousel Results with Python

Prerequisites

Install libraries:

pip install requests parsel google-search-results

Basic knowledge scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which also makes them useful for extracting data from matching tags and attributes.

If you haven't scraped with CSS selectors before, I have a dedicated blog post on how to use CSS selectors when web scraping. It covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

What will be scraped

📌 Note: only this layout is covered in this blog post. There are at least 3 different carousel layouts.


Full Code

import requests, re, json
from parsel import Selector

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
  "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36"
  }

params = {
  "q": "dune actors",  # search query
  "gl": "us",          # country to search from
  }


def parsel_get_top_carousel():
  html = requests.get('https://www.google.com/search', headers=headers, params=params)
  selector = Selector(text=html.text)

  carousel_name = selector.css(".yKMVIe::text").get()
  all_script_tags = selector.css("script::text").getall()

  data = {f"{carousel_name}": []}

  decoded_thumbnails = []

  for _id in selector.css("img.d7ENZc::attr(id)").getall():
    # https://regex101.com/r/YGtoJn/1
    thumbnails = re.findall(r"var\s?s=\'([^']+)\'\;var\s?ii\=\['{_id}'\];".format(_id=_id), str(all_script_tags))
    thumbnail = [
      bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in thumbnails
      ]
    decoded_thumbnails.append("".join(thumbnail))

  for result, image in zip(selector.css('.QjXCXd.X8kvh'), decoded_thumbnails):

    title = result.css(".JjtOHd::text").get()
    link = f"https://www.google.com{result.css('.QjXCXd div a::attr(href)').get()}"
    extensions = result.css(".ellip.AqEFvb::text").getall()

    if title and link and extensions:
      data[carousel_name].append({
        "title": title,
        "link": link,
        "extensions": extensions,
        "thumbnail": image
        })

  print(json.dumps(data, indent=2, ensure_ascii=False))


parsel_get_top_carousel()

Output:

{
  "Dune": [
    {
      "title": "Zendaya", ... first results
      "link": "https://www.google.com/search?gl=us&q=Zendaya&stick=H4sIAAAAAAAAAONgFuLVT9c3NEzLqko2ii8xUOLSz9U3SElJM7So0BLKTrbST8vMyQUTVsmJxSWLWNmjUvNSEisTAY7G9vs7AAAA&sa=X&ved=2ahUKEwjp99fw1972AhXXXM0KHeWoAX4Q9OUBegQIAxAC",
      "extensions": [
        "Chani"
      ],
      "thumbnail": ""
    }, ... other results
    {
      "title": "Javier Bardem", ... last results
      "link": "https://www.google.com/search?gl=us&q=Javier+Bardem&stick=H4sIAAAAAAAAAONgFuLVT9c3NEzLqko2ii8xUOLUz9U3MDQ3NE7WEspOttJPy8zJBRNWyYnFJYtYeb0SyzJTixScEotSUnMBeUccjEAAAAA&sa=X&ved=2ahUKEwjp99fw1972AhXXXM0KHeWoAX4Q9OUBegQIAxAQ",
      "extensions": [
        "Stilgar"
      ],
      "thumbnail": ""
    }
  ]
}

Code Explanation

Thumbnail extraction

Selecting img.d7ENZc and grabbing its src attribute returns a 1x1 placeholder instead of the actual thumbnail.

Thumbnails are located in the <script> tags. In order to grab them, we need to:

  1. Locate the image element via Dev Tools.
  2. Copy its id value.
  3. Open the page source (CTRL+U), press CTRL+F, and paste the id value to find it.

Most likely you'll see two occurrences, and the second one will be somewhere in the <script> tags. That's what we're looking for.
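
The manual check above can be approximated in code. This is only a sketch on a made-up HTML fragment: the id dimg_1 and the PLACEHOLDER payloads are assumptions for illustration, not real Google output.

```python
import re

# Hypothetical page fragment: the id appears once on the <img> tag
# and once more inside a <script> element.
sample_html = (
    '<img id="dimg_1" class="d7ENZc" src="data:image/gif;base64,PLACEHOLDER">'
    "<script>var s='data:image/jpeg;base64,PLACEHOLDER';var ii=['dimg_1'];</script>"
)

image_id = "dimg_1"

# Total number of times the id occurs in the page source
total_occurrences = sample_html.count(image_id)

# Occurrences that sit inside <script>...</script> bodies
script_bodies = re.findall(r"<script[^>]*>(.*?)</script>", sample_html, flags=re.DOTALL)
in_scripts = sum(body.count(image_id) for body in script_bodies)

print(total_occurrences, in_scripts)
```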

Now we need to match each image id with the data:image string from the <script> elements to extract the right image:

selector = Selector(text=html.text)

# grabs every script element
all_script_tags = selector.css("script::text").getall()

# list to temporarily store thumbnail data
decoded_thumbnails = []

# iterating over each image ID
# using _id because id is a Python built-in name
for _id in selector.css("img.d7ENZc::attr(id)").getall():
  # https://regex101.com/r/YGtoJn/1
  thumbnails = re.findall(r"var\s?s=\'([^']+)\'\;var\s?ii\=\['{_id}'\];".format(_id=_id), str(all_script_tags))
  thumbnail = [
    bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in thumbnails
    ]
  decoded_thumbnails.append("".join(thumbnail))

Code | Explanation
css("img.d7ENZc::attr(id)") | grabs every image id.
getall() | returns a list of matches.
re.findall() | finds all matches via a regular expression.
r"<expression>" | a raw string holding the regular expression.
([^']+) | a regex capture group that matches everything up to the next single quote.
\['{_id}'\] | the image id inserted into the regular expression so that the correct image is matched.
format(_id=_id) | fills the {_id} placeholder. An f-string would look a bit awkward here.
bytes().decode() | decodes unicode-escaped characters into plain characters.
"".join(thumbnail) | joins the list elements into a single string.
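
The pieces explained above can be tried on a small sample without fetching a page. The sketch below is a variation, not the exact code from this post: it searches the raw script text instead of str(all_script_tags), so a single unicode-escape pass is enough (str() on a list escapes the backslashes once more, which is presumably why the code above decodes twice). The sample script body, the id dimg_1, and the helper name extract_thumbnail are all assumptions for illustration.

```python
import re

# Hypothetical, shortened <script> body shaped like the layout described above.
# The raw string keeps the literal \x3d sequences, as they appear in the page source.
sample_script = r"var s='data:image/jpeg;base64,/9j/4AAQ\x3d\x3d';var ii=['dimg_1'];"


def extract_thumbnail(script_text, image_id):
    """Return the decoded data:image string paired with image_id, or "" if absent."""
    # re.escape guards against ids that contain regex metacharacters
    pattern = r"var\s?s='([^']+)';var\s?ii=\['{}'\];".format(re.escape(image_id))
    match = re.search(pattern, script_text)
    if match is None:
        return ""
    # Turn escape sequences such as \x3d back into the characters they stand for
    return match.group(1).encode("ascii").decode("unicode-escape")


thumbnail = extract_thumbnail(sample_script, "dimg_1")
print(thumbnail)
```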

Output from decoded_thumbnails:

# data:image is shortened on purpose, 
# so the output would not cover the entire page  
[
  '', 
  "other images ..."
]
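
If the thumbnail is needed as a file rather than a string, the data:image value can be split and base64-decoded. This is a sketch under assumptions: data_uri_to_bytes is a hypothetical helper, and the payload is stand-in bytes rather than a real thumbnail.

```python
import base64


def data_uri_to_bytes(data_uri):
    """Split a 'data:<mime>;base64,<payload>' URI and decode the payload."""
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Stand-in bytes instead of a real thumbnail, so the example stays short
fake_image = b"\xff\xd8\xff\xe0 not a real JPEG"
data_uri = "data:image/jpeg;base64," + base64.b64encode(fake_image).decode("ascii")

mime_type, image_bytes = data_uri_to_bytes(data_uri)
print(mime_type, len(image_bytes))
```

The returned bytes can then be written to disk with open(path, "wb").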

The next step is to iterate over the CSS container that holds the title, link, and extensions, together with decoded_thumbnails:

for result, image in zip(selector.css('.QjXCXd.X8kvh'), decoded_thumbnails):
  title = result.css(".JjtOHd::text").get()
  link = f"https://www.google.com{result.css('.QjXCXd div a::attr(href)').get()}"
  extensions = result.css(".ellip.AqEFvb::text").getall()

Code | Explanation
zip() | iterates over multiple iterables in a single for loop.
::text | a parsel pseudo-element that extracts text node data, equivalent to XPath <node>/text()
::attr(<attribute>) | a parsel pseudo-element that grabs attribute data from the node, equivalent to XPath <node>/@<attribute>
get() | returns the first match.
getall() | returns a list of all matches.
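
One detail of zip() worth keeping in mind: it pairs items by position and stops at the shorter iterable, so a missing thumbnail would silently drop a result. The sketch below uses made-up lists to show this, with itertools.zip_longest as one possible way to keep every title.

```python
from itertools import zip_longest

# Made-up data: three titles but only two thumbnails
titles = ["Zendaya", "Javier Bardem", "Timothée Chalamet"]
thumbnails = ["data:image/jpeg;base64,AAA", "data:image/jpeg;base64,BBB"]

# zip() stops at the shorter list, so the third title is dropped
paired = list(zip(titles, thumbnails))

# zip_longest() keeps every title and pads the missing thumbnail with ""
padded = list(zip_longest(titles, thumbnails, fillvalue=""))

print(len(paired), len(padded))
```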

The next step is to check that the extracted title, link, and extensions have values, append them to the temporary list, and print the data:

data = {f"{carousel_name}": []}

if title and link and extensions:
  data[carousel_name].append({
    "title": title,
    "link": link,
    "extensions": extensions,
    "thumbnail": image
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
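
The ensure_ascii=False argument matters for names with non-ASCII characters. A small sketch with made-up data shows the difference:

```python
import json

# Made-up data shaped like the dictionary built above
data = {"Dune": [{"title": "Timothée Chalamet", "extensions": ["Paul Atreides"]}]}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(data)

# With ensure_ascii=False the characters are kept as-is
readable = json.dumps(data, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings load back to the same dictionary, so the choice only affects readability of the printed output.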

Alternatively, SerpApi is a paid API with a free plan. It lets you skip figuring out how to bypass blocks from search engines and focus on which data to extract.

from serpapi import GoogleSearch
import os, json

def serpapi_get_top_carousel():
    params = {
      # https://docs.python.org/3/library/os.html#os.getenv
      "api_key": os.getenv("API_KEY"), # your SerpApi key in the environment variable
      "engine": "google",              # search engine
      "q": "dune actors",              # search query
      "hl": "en",                      # language
      "gl": "us"                       # country
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['knowledge_graph']['cast']:
        print(json.dumps(result, indent=2))


serpapi_get_top_carousel()

Part of the output:

{
  "name": "Timothée Chalamet",
  "extensions": [
    "Paul Atreides"
  ],
  "link": "https://www.google.com/search?hl=en&gl=us&q=Timoth%C3%A9e+Chalamet&stick=H4sIAAAAAAAAAONgFuLVT9c3NEzLqko2ii8xUOLSz9U3KDDKM0wr0BLKTrbST8vMyQUTVsmJxSWPGJcycgu8_HFPWGo246Q1J68xTmHkwqJOyJCLzTWvJLOkUkhQip8L1RIjEahAtll2hpFZXqHAwmWzGJWcjUx2XZp2jk1P8FkoA0Ndb4iDkiLnFCHrhswn7-wFXd__299ywsBBgkWBQYPB8JElq8P6KYwHtBgOMDI17VtxiI2Fg1GAwYpJg6mKiYOFZxGrUEhmbn5JxuGVqQrOGYk5ibmpJRPYGAHILgFT8gAAAA&sa=X&ved=2ahUKEwiMxLi-ksXzAhUAl2oFHf88AN0Q-BZ6BAgBEDQ",
  "image": "https://serpapi.com/searches/6165a3dcfa86759a4fa42ba4/images/94afec67f82aa614bb572a123ec09cf051cf10bde8e0bc8025daf21915c49798.jpeg"
} ... other results
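
Indexing results['knowledge_graph']['cast'] directly raises a KeyError when either key is absent. A more defensive lookup could look like the sketch below. The helper name get_cast and both sample dictionaries are assumptions for illustration, trimmed down from the output shape shown above.

```python
def get_cast(results):
    """Return the cast list, or an empty list when the keys are missing."""
    return results.get("knowledge_graph", {}).get("cast", [])


# Hypothetical, trimmed-down responses
with_cast = {
    "knowledge_graph": {
        "cast": [{"name": "Timothée Chalamet", "extensions": ["Paul Atreides"]}]
    }
}
without_cast = {"organic_results": []}

names = [person["name"] for person in get_cast(with_cast)]
empty = get_cast(without_cast)

print(names, empty)
```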

Outro

If you have any questions or suggestions, or something isn't working correctly, reach out via Twitter at @dimitryzub or @serp_api.

Yours, Dimitry, and the rest of SerpApi Team.

