Scrape Google Play Search Apps with Python

What will be scraped

📌Note: this blog post shows how to scrape 50 results without using pagination. The follow-up blog post will be about scraping all available results with pagination.

Prerequisites

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which makes it possible to extract data from matching tags and attributes.

If you haven't scraped with CSS selectors before, there's a dedicated blog post of mine about how to use CSS selectors when web scraping. It covers what they are, their pros and cons, why they matter from a web-scraping perspective, and the most common approaches to using them.
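To make the idea concrete, here is a minimal sketch of selecting elements by CSS selector with BeautifulSoup. The markup and class names (`card`, `title`, `developer`) are made up for illustration and are not Google Play's real markup. The built-in `html.parser` is used so the snippet runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup written for this example (not taken from Google Play).
html = """
<div class="card">
  <a class="title" href="/store/apps/details?id=com.example.one">App One</a>
  <span class="developer">Example Dev</span>
</div>
<div class="card">
  <a class="title" href="/store/apps/details?id=com.example.two">App Two</a>
  <span class="developer">Another Dev</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, lxml is not required here

# select() returns a list with every element that matches the selector
titles = [a.text for a in soup.select(".card .title")]

# select_one() returns only the first match, or None when nothing matches
first_link = soup.select_one(".card a.title")["href"]
missing = soup.select_one(".card .rating")

print(titles)
print(first_link)
print(missing)
```

Because `select_one()` returns `None` for a selector that matches nothing, accessing `.text` or an attribute on the result without a check raises an error.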

Separate virtual environment

In short, a virtual environment creates an independent set of installed libraries, including different Python versions, that can coexist on the same system, thus preventing library or Python version conflicts.

If you haven't worked with a virtual environment before, have a look at my dedicated blog post, Python virtual environments tutorial using Virtualenv and Poetry, to get familiar.

📌Note: this is not a strict requirement for this blog post.

Install libraries:

pip install requests lxml beautifulsoup4 google-search-results

Reduce the chance of being blocked

There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web scraping; it covers eleven methods to bypass blocks from most websites.


Full Code

from bs4 import BeautifulSoup
import requests, json, lxml, re


def bs4_scrape_all_google_play_store_search_apps(
                                          query: str, 
                                          filter_by: str = "apps",
                                          country: str = "US"):
    params = {
        "q": query,     # search query
        "gl": country,  # country of the search. Different countries display different apps.
        "c": filter_by  # filter to display list of apps. Other filters: apps, books, movies
    }

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36",
    }

    html = requests.get("https://play.google.com/store/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    apps_data = []

    for app in soup.select(".mpg5gc"):
        title = app.select_one(".nnK0zc").text
        company = app.select_one(".b8cIId.KoLSrc").text
        description = app.select_one(".b8cIId.f5NCO a").text
        app_link = f'https://play.google.com{app.select_one(".b8cIId.Q9MA7b a")["href"]}'
        developer_link = f'https://play.google.com{app.select_one(".b8cIId.KoLSrc a")["href"]}'
        app_id = app.select_one(".b8cIId a")["href"].split("id=")[1]
        developer_id = app.select_one(".b8cIId.KoLSrc a")["href"].split("id=")[1]

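        try:
            # https://regex101.com/r/SZLPRp/1
            rating = re.search(r"\d{1}\.\d{1}", app.select_one(".pf5lIe div[role=img]")["aria-label"]).group()
        except (AttributeError, KeyError, TypeError):
            # no rating element, no aria-label attribute, or no number in the label
            rating = None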

        thumbnail = app.select_one(".yNWQ8e img")["data-src"]

        apps_data.append({
            "title": title,
            "company": company,
            "description": description,
            "rating": float(rating) if rating else rating, # convert to float when a rating was found, otherwise keep None
            "app_link": app_link,
            "developer_link": developer_link,
            "app_id": app_id,
            "developer_id": developer_id,
            "thumbnail": thumbnail
        })        

    print(json.dumps(apps_data, indent=2, ensure_ascii=False))


bs4_scrape_all_google_play_store_search_apps(query="maps", filter_by="apps", country="US")

Explanation

Import libraries:

from bs4 import BeautifulSoup
import requests, json, lxml, re

| Library | Purpose |
|---------|---------|
| BeautifulSoup | to parse HTML/XML documents. |
| requests | to make a request to the destination website and pass the response to BeautifulSoup. |
| json | to convert parsed data to JSON format. |
| lxml | a fast HTML/XML parser used by BeautifulSoup. |
| re | to match parts of text via regular expressions. |

Define a function:

def bs4_scrape_all_google_play_store_search_apps(
                                          query: str, 
                                          filter_by: str = "apps",
                                          country: str = "US"):
    # further code

| Function argument | Explanation |
|-------------------|-------------|
| query: str | query should be a string. |
| filter_by: str = "apps" | filter_by should be a string; the = sets its default value to "apps". |
| country: str = "US" | country should be a string; the = sets its default value to "US". |

Create request headers and URL parameters that will be passed to the request:

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": query,     # search query
    "gl": country,  # country of the search. Different countries display different apps.
    "c": filter_by  # filter to display a list of apps. Other filters: apps, books, movies
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36",
}
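To see what URL these parameters turn into, here is a small sketch that builds the query string with the standard library's `urlencode`. `requests` does this encoding for you when you pass `params=`, so this is only for inspection. `build_search_url` is a hypothetical helper name, not part of the original script.

```python
from urllib.parse import urlencode


def build_search_url(query: str, filter_by: str = "apps", country: str = "US") -> str:
    """Return the Google Play search URL for the given parameters (hypothetical helper)."""
    params = {
        "q": query,      # search query
        "gl": country,   # country of the search
        "c": filter_by,  # result filter
    }
    # urlencode() escapes special characters, e.g. a space becomes "+"
    return f"https://play.google.com/store/search?{urlencode(params)}"


print(build_search_url("maps"))
print(build_search_url("offline maps", country="DE"))
```

Printing the URL this way is handy for pasting it into a browser and comparing the page with what the scraper receives.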

Make a request with the headers and URL parameters, then pass the response to BeautifulSoup:

html = requests.get("https://play.google.com/store/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

  • timeout=30 tells requests to stop waiting for a response after 30 seconds.

Create a temporary list, iterate over the "container" with all the data, and extract it:

apps_data = []

for app in soup.select(".mpg5gc"):
    title = app.select_one(".nnK0zc").text
    company = app.select_one(".b8cIId.KoLSrc").text
    description = app.select_one(".b8cIId.f5NCO a").text
    app_link = f'https://play.google.com{app.select_one(".b8cIId.Q9MA7b a")["href"]}'
    developer_link = f'https://play.google.com{app.select_one(".b8cIId.KoLSrc a")["href"]}'
    app_id = app.select_one(".b8cIId a")["href"].split("id=")[1]
    developer_id = app.select_one(".b8cIId.KoLSrc a")["href"].split("id=")[1]

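    try:
        # https://regex101.com/r/SZLPRp/1
        # Rated 3.9 stars out of five stars -> 3.9
        rating = re.search(r"\d{1}\.\d{1}", app.select_one(".pf5lIe div[role=img]")["aria-label"]).group()
    except (AttributeError, KeyError, TypeError):
        # no rating element, no aria-label attribute, or no number in the label
        rating = None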

    thumbnail = app.select_one(".yNWQ8e img")["data-src"]

| Code | Explanation |
|------|-------------|
| select() | returns a list of matches. |
| select_one() | returns a single match. |
| ["href"] | grabs the href attribute. |
| re.search(r"\d{1}\.\d{1}", ...) | extracts just the decimal number (for example, 3.9) from the whole string. |
| group() | returns the string matched by the regular expression. |
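The rating and ID extraction steps can be tried in isolation, without a network request. The sketch below uses only the standard library. `parse_rating` and `extract_id` are hypothetical helper names, and `parse_qs` is shown as one possible alternative to `split("id=")[1]`, not as what the script above does.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Same idea as r"\d{1}\.\d{1}" in the script: one digit, a dot, one digit.
RATING_PATTERN = re.compile(r"\d\.\d")


def parse_rating(aria_label: Optional[str]) -> Optional[float]:
    """Return the rating as a float, or None when the label is missing or has no number."""
    if not aria_label:
        return None
    match = RATING_PATTERN.search(aria_label)
    return float(match.group()) if match else None


def extract_id(href: str) -> Optional[str]:
    """Return the value of the id query parameter from a link, or None if absent."""
    values = parse_qs(urlparse(href).query).get("id")
    return values[0] if values else None


print(parse_rating("Rated 3.9 stars out of five stars"))
print(parse_rating(None))
print(extract_id("/store/apps/details?id=com.waze"))
print(extract_id("/store/apps/dev?id=5700313618786177705"))
```

Returning `None` explicitly keeps the "no rating" case out of a `try`/`except` block, and parsing the query string avoids an `IndexError` when a link has no `id=` part.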

Append extracted data to the temporary list as a dictionary and print extracted data:

apps_data.append({
    "title": title,
    "company": company,
    "description": description,
    "rating": float(rating) if rating else rating, # convert to float when a rating was found, otherwise keep None
    "app_link": app_link,
    "developer_link": developer_link,
    "app_id": app_id,
    "developer_id": developer_id,
    "thumbnail": thumbnail
})

print(json.dumps(apps_data, indent=2, ensure_ascii=False))

| Code | Explanation |
|------|-------------|
| json.dumps() | serializes an object (for example, a list or dict) to a JSON-formatted string using Python's standard conversion table. |
| indent=2 | specifies the indentation width. |
| ensure_ascii=False | outputs non-ASCII characters as-is. For example, Chinese characters (漢字) are printed directly instead of as \uXXXX escape sequences. |
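The effect of `ensure_ascii` is easiest to see side by side. This is a small sketch with a made-up record (the title is invented for the example):

```python
import json

# Made-up sample record containing non-ASCII characters.
apps_data = [{"title": "漢字 Maps", "rating": 4.4}]

escaped = json.dumps(apps_data)                                 # default: ensure_ascii=True
readable = json.dumps(apps_data, indent=2, ensure_ascii=False)  # as used in the script

print(escaped)   # non-ASCII characters appear as \uXXXX escape sequences
print(readable)  # non-ASCII characters appear as-is, indented by 2 spaces
```

Both strings decode back to the same Python object with `json.loads()`, so the option only changes how the output looks.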

Part of the output:

[
  {
    "title": "Google Maps",
    "company": "Google LLC",
    "description": "Real-time GPS navigation & local suggestions for food, events, & activities",
    "rating": 3.9,
    "app_link": "https://play.google.com/store/apps/details?id=com.google.android.apps.maps",
    "developer_link": "https://play.google.com/store/apps/dev?id=5700313618786177705",
    "app_id": "com.google.android.apps.maps",
    "developer_id": "5700313618786177705",
    "thumbnail": "https://play-lh.googleusercontent.com/Kf8WTct65hFJxBUDm5E-EpYsiDoLQiGGbnuyP6HBNax43YShXti9THPon1YKB6zPYpA=s128-rw"
  },
  {
    "title": "Google Maps Go",
    "company": "Google LLC",
    "description": "Get real-time traffic, directions, search and find places",
    "rating": 4.3,
    "app_link": "https://play.google.com/store/apps/details?id=com.google.android.apps.mapslite",
    "developer_link": "https://play.google.com/store/apps/dev?id=5700313618786177705",
    "app_id": "com.google.android.apps.mapslite",
    "developer_id": "5700313618786177705",
    "thumbnail": "https://play-lh.googleusercontent.com/0uRNRSe4iS6nhvfbBcoScHcBTx1PMmxkCx8rrEsI2UQcQeZ5ByKz8fkhwRqR3vttOg=s128-rw"
  },
  {
    "title": "Waze - GPS, Maps, Traffic Alerts & Live Navigation",
    "company": "Waze",
    "description": "Save time on every drive. Waze tells you about traffic, police, crashes & more",
    "rating": 4.4,
    "app_link": "https://play.google.com/store/apps/details?id=com.waze",
    "developer_link": "https://play.google.com/store/apps/developer?id=Waze",
    "app_id": "com.waze",
    "developer_id": "Waze",
    "thumbnail": "https://play-lh.googleusercontent.com/muSOyE55_Ra26XXx2IiGYqXduq7RchMhosFlWGc7wCS4I1iQXb7BAnnjEYzqcUYa5oo=s128-rw"
  }, ... other results
]

An alternative solution is to use the Google Play Apps Store API from SerpApi.

The difference is that there's no need to build a parser from scratch, maintain it, figure out how to bypass blocks from websites, or work out how to scale it.

Code to integrate:

from serpapi import GoogleSearch
import json, os


def serpapi_scrape_all_google_play_store_apps():
    params = {
        "api_key": os.getenv("API_KEY"),  # your serpapi api key
        "engine": "google_play",          # search engine
        "hl": "en",                       # language
        "store": "apps",                  # apps search
        "gl": "us",                       # country to search from. Different countries display different apps.
        "q": "weather"                    # search query
    }

    search = GoogleSearch(params)  # where data extraction happens
    results = search.get_dict()    # JSON -> Python dictionary

    apps_data = []

    for apps in results["organic_results"]:
        for app in apps["items"]:
            apps_data.append({
                "title": app.get("title"),
                "link": app.get("link"),
                "description": app.get("description"),
                "product_id": app.get("product_id"),
                "rating": app.get("rating"),
                "thumbnail": app.get("thumbnail"),
            })

    print(json.dumps(apps_data, indent=2, ensure_ascii=False))


serpapi_scrape_all_google_play_store_apps()

Part of the output:

[
  {
    "title": "Google Maps",
    "link": "https://play.google.com/store/apps/details?id=com.google.android.apps.maps",
    "description": "Real-time GPS navigation & local suggestions for food, events, & activities",
    "product_id": "com.google.android.apps.maps",
    "rating": 3.9,
    "thumbnail": "https://play-lh.googleusercontent.com/Kf8WTct65hFJxBUDm5E-EpYsiDoLQiGGbnuyP6HBNax43YShXti9THPon1YKB6zPYpA=s128-rw"
  },
  {
    "title": "Google Maps Go",
    "link": "https://play.google.com/store/apps/details?id=com.google.android.apps.mapslite",
    "description": "Get real-time traffic, directions, search and find places",
    "product_id": "com.google.android.apps.mapslite",
    "rating": 4.3,
    "thumbnail": "https://play-lh.googleusercontent.com/0uRNRSe4iS6nhvfbBcoScHcBTx1PMmxkCx8rrEsI2UQcQeZ5ByKz8fkhwRqR3vttOg=s128-rw"
  },
  {
    "title": "Waze - GPS, Maps, Traffic Alerts & Live Navigation",
    "link": "https://play.google.com/store/apps/details?id=com.waze",
    "description": "Save time on every drive. Waze tells you about traffic, police, crashes & more",
    "product_id": "com.waze",
    "rating": 4.4,
    "thumbnail": "https://play-lh.googleusercontent.com/muSOyE55_Ra26XXx2IiGYqXduq7RchMhosFlWGc7wCS4I1iQXb7BAnnjEYzqcUYa5oo=s128-rw"
  }, ... other results
]
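The nested loop in the SerpApi snippet can be tried offline. The sketch below runs the same flattening logic on a hand-written dictionary shaped the way that loop reads it (a list of sections under "organic_results", each with an "items" list). The sample data and the `flatten_apps` name are assumptions for illustration, not a real API response.

```python
from typing import Any, Dict, List

# Hypothetical, trimmed-down structure mirroring how the loop accesses the response.
sample_results: Dict[str, Any] = {
    "organic_results": [
        {
            "items": [
                {"title": "Google Maps", "product_id": "com.google.android.apps.maps", "rating": 3.9},
                {"title": "Waze", "product_id": "com.waze"},  # no rating on purpose
            ]
        },
        {
            "items": [
                {"title": "Google Maps Go", "product_id": "com.google.android.apps.mapslite", "rating": 4.3},
            ]
        },
    ]
}


def flatten_apps(results: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the apps from every section into one flat list."""
    apps = []
    # .get() with a default keeps the loop from failing when a key is absent
    for section in results.get("organic_results", []):
        for app in section.get("items", []):
            apps.append({
                "title": app.get("title"),
                "product_id": app.get("product_id"),
                "rating": app.get("rating"),  # None when the app has no rating
            })
    return apps


apps = flatten_apps(sample_results)
print(len(apps))
print(apps[1])
```

Using `.get()` on the outer keys as well means an empty or error response yields an empty list instead of a `KeyError`.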


Outro

If you have anything to share, any questions, suggestions, or something that isn't working correctly, reach out via Twitter at @dimitryzub or @serp_api.

Yours, Dmitriy, and the rest of the SerpApi Team.
