What will be scraped
Prerequisites
Basic knowledge of scraping with CSS selectors
CSS selectors declare which part of the markup a style applies to, which makes them useful for extracting data from matching tags and attributes.
If you haven't scraped with CSS selectors before, there's a dedicated blog post of mine about how to use CSS selectors when web scraping. It covers what they are, their pros and cons, why they matter from a web-scraping perspective, and the most common approaches to using them.
A separate virtual environment, if this will be a project
In short, a virtual environment creates an independent set of installed libraries, optionally with a different Python version, that can coexist with others on the same system, which prevents library or Python version conflicts.
If you haven't worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry to get familiar.
📌Note: this is not a strict requirement for this blog post.
Install libraries:
pip install requests parsel
Reduce the chance of being blocked
There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web scraping; it describes eleven methods to bypass blocks from most websites.
How filtering works
To filter results, use the source: operator, which restricts search results to documents published by sources containing "NIPS" in their name.
This operator can be combined with the OR operator, i.e. source:NIPS OR source:"Neural Information". So the search query would become:
search terms source:NIPS OR source:"Neural Information"
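The query construction described above can be sketched as a small helper. This is a minimal sketch rather than part of the original code: the name build_query is hypothetical, and wrapping multi-word source names in double quotes is an assumption based on the source:"Neural Information" form shown above.

```python
def build_query(search_terms: str, sources: list[str]) -> str:
    """Combine search terms with one or more source: filters joined by OR."""
    filters = []
    for name in sources:
        # assumption: quote names that contain a space so the phrase stays together
        value = f'"{name}"' if " " in name else name
        filters.append(f"source:{value}")
    # with no sources, only the plain search terms are returned
    return " ".join(part for part in (search_terms, " OR ".join(filters)) if part)


print(build_query("search terms", ["NIPS", "Neural Information"]))
```

The helper returns only the search terms when the list of sources is empty, so the same function can serve filtered and unfiltered searches.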
Full Code
from parsel import Selector
import requests, json


def scrape_conference_publications(query: str, source: list[str]):
    # stays empty when no sources are passed, so the query is sent without a filter
    sources = ""
    if source:
        # source:NIPS OR source:Neural Information
        sources = " OR ".join([f'source:{item}' for item in source])

    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "q": f'{query.lower()} {sources}',  # search query
        "hl": "en",                         # language of the search
        "gl": "us"                          # country of the search
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    }

    html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
    selector = Selector(html.text)

    publications = []

    for result in selector.css(".gs_r.gs_scl"):
        title = result.css(".gs_rt").xpath("normalize-space()").get()
        link = result.css(".gs_rt a::attr(href)").get()
        result_id = result.attrib["data-cid"]
        snippet = result.css(".gs_rs::text").get()
        publication_info = result.css(".gs_a").xpath("normalize-space()").get()
        cite_by_link = f'https://scholar.google.com/scholar{result.css(".gs_or_btn.gs_nph+ a::attr(href)").get()}'
        all_versions_link = f'https://scholar.google.com/scholar{result.css("a~ a+ .gs_nph::attr(href)").get()}'
        related_articles_link = f'https://scholar.google.com/scholar{result.css("a:nth-child(4)::attr(href)").get()}'
        pdf_file_title = result.css(".gs_or_ggsm a").xpath("normalize-space()").get()
        pdf_file_link = result.css(".gs_or_ggsm a::attr(href)").get()

        publications.append({
            "result_id": result_id,
            "title": title,
            "link": link,
            "snippet": snippet,
            "publication_info": publication_info,
            "cite_by_link": cite_by_link,
            "all_versions_link": all_versions_link,
            "related_articles_link": related_articles_link,
            "pdf": {
                "title": pdf_file_title,
                "link": pdf_file_link
            }
        })

    # or return publications instead
    # return publications
    print(json.dumps(publications, indent=2, ensure_ascii=False))


scrape_conference_publications(query="anatomy", source=["NIPS", "Neural Information"])
Code explanation
Define a function:
def scrape_conference_publications(query: str, source: list[str]):
    # further code...
Code | Explanation |
query: str | is a type hint saying that the query argument should be a string. Python does not enforce it at runtime. |
source: list[str] | is a type hint saying that the source argument should be a list of strings. The list[str] syntax requires Python 3.9 or newer. |
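To illustrate that type hints document intent rather than enforce it, here is a minimal sketch. The function describe is hypothetical and exists only for this illustration; it assumes Python 3.9 or newer because of the list[str] annotation.

```python
def describe(query: str, source: list[str]) -> str:
    """Return a short description of the arguments it received."""
    return f"{query!r} filtered by {source!r}"


# the hints are stored on the function object...
print(describe.__annotations__)

# ...but they are not checked at call time: an int is accepted where str is hinted
print(describe(42, ["NIPS"]))
```

A static type checker such as mypy would flag the second call, while the interpreter itself runs it without complaint.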
Check whether a source was passed and transform the received source argument:
if source:
    # iterates over the received list of sources (strings) with a list comprehension
    # and joins them with " OR "
    sources = " OR ".join([f'source:{item}' for item in source])
    # becomes: source:NIPS OR source:Neural Information, which can be used in a search query
Create the URL parameters and the user-agent header, and pass them to the request:
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": f'{query.lower()} {sources}', # search query
"hl": "en", # language of the search
"gl": "us" # country of the search
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
}
html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
selector = Selector(html.text)
Code | Explanation |
timeout | tells requests to stop waiting for a response after 30 seconds. |
Selector | is an HTML/XML processor that parses the response text, similar to BeautifulSoup(). |
user-agent | makes the request look like a "real" user visit. The default requests user-agent is python-requests, so a website can tell that a script is sending the request and might block it. Check what's your user-agent. |
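To see roughly what the params dictionary turns into, the sketch below builds the query string with urlencode from the standard library. This only approximates what requests does when it receives params; the exact escaping requests applies may differ in detail.

```python
from urllib.parse import urlencode

params = {
    "q": "anatomy source:NIPS OR source:Neural Information",  # search query
    "hl": "en",                                               # language of the search
    "gl": "us",                                               # country of the search
}

# spaces become "+" and reserved characters such as ":" are percent-encoded
url = "https://scholar.google.com/scholar?" + urlencode(params)
print(url)
```

Printing the final URL this way is a quick check that the search query carries the source: filters you expect before any request is sent.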
Create a temporary list to store the data, and iterate over organic results:
publications = []

for result in selector.css(".gs_r.gs_scl"):
    title = result.css(".gs_rt").xpath("normalize-space()").get()
    link = result.css(".gs_rt a::attr(href)").get()
    result_id = result.attrib["data-cid"]
    snippet = result.css(".gs_rs::text").get()
    publication_info = result.css(".gs_a").xpath("normalize-space()").get()
    cite_by_link = f'https://scholar.google.com/scholar{result.css(".gs_or_btn.gs_nph+ a::attr(href)").get()}'
    all_versions_link = f'https://scholar.google.com/scholar{result.css("a~ a+ .gs_nph::attr(href)").get()}'
    related_articles_link = f'https://scholar.google.com/scholar{result.css("a:nth-child(4)::attr(href)").get()}'
    pdf_file_title = result.css(".gs_or_ggsm a").xpath("normalize-space()").get()
    pdf_file_link = result.css(".gs_or_ggsm a::attr(href)").get()
Code | Explanation |
css(<selector>) | extracts data matching a given CSS selector. In the background, parsel translates every CSS query into an XPath query using cssselect. |
xpath("normalize-space()") | returns the element's full text, including text nested in child tags, with whitespace trimmed and collapsed. With ::text alone, only the element's own text nodes are returned, so the output could be incomplete. |
::text / ::attr() | are parsel pseudo-elements that extract text or attribute data from the HTML node. |
get() | returns the first matched value as a string, or None if nothing matched. |
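The link fields above are built by prefixing https://scholar.google.com/scholar to the extracted href. When the href itself starts with /scholar, the path segment appears twice in the result, and when the selector matches nothing, get() returns None and the f-string embeds the text "None". A possible alternative, sketched here under the assumption that the hrefs are site-relative paths, is to resolve them with urljoin from the standard library. The names BASE_URL and absolute_url are hypothetical and not part of the original code.

```python
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://scholar.google.com"  # assumption: hrefs are relative to the site root


def absolute_url(href: Optional[str]) -> Optional[str]:
    """Resolve a relative href against BASE_URL; pass a missing link through as None."""
    if href is None:
        return None
    return urljoin(BASE_URL, href)


print(absolute_url("/scholar?cites=123"))
print(absolute_url(None))
```

In the loop, a call such as absolute_url(result.css(".gs_or_btn.gs_nph+ a::attr(href)").get()) could then stand in for the f-string.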
Append the results as a dictionary to a temporary list, return or print extracted data:
publications.append({
    "result_id": result_id,
    "title": title,
    "link": link,
    "snippet": snippet,
    "publication_info": publication_info,
    "cite_by_link": cite_by_link,
    "all_versions_link": all_versions_link,
    "related_articles_link": related_articles_link,
    "pdf": {
        "title": pdf_file_title,
        "link": pdf_file_link
    }
})

# return publications
print(json.dumps(publications, indent=2, ensure_ascii=False))

scrape_conference_publications(query="anatomy", source=["NIPS", "Neural Information"])
Outputs:
[
  {
    "result_id": "hjgaRkq_oOEJ",
    "title": "Differential representation of arm movement direction in relation to cortical anatomy and function",
    "link": "https://iopscience.iop.org/article/10.1088/1741-2560/6/1/016006/meta",
    "snippet": "… ",
    "publication_info": "T Ball, A Schulze-Bonhage, A Aertsen… - Journal of neural …, 2009 - iopscience.iop.org",
    "cite_by_link": "https://scholar.google.com/scholar/scholar?cites=16258204980532099206&as_sdt=2005&sciodt=0,5&hl=en",
    "all_versions_link": "https://scholar.google.com/scholar/scholar?cluster=16258204980532099206&hl=en&as_sdt=0,5",
    "related_articles_link": "https://scholar.google.com/scholar/scholar?q=related:hjgaRkq_oOEJ:scholar.google.com/&scioq=anatomy+source:NIPS+OR+source:Neural+Information&hl=en&as_sdt=0,5",
    "pdf": {
      "title": "[PDF] psu.edu",
      "link": "http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.324.1523&rep=rep1&type=pdf"
    }
  }, ... other results
]
Google Scholar Organic Results API
Alternatively, you can achieve the same result using the Google Scholar Organic Results API from SerpApi.
The biggest difference is that you don't need to create a parser from scratch, maintain it, figure out how to scale it, or, most importantly, work out how to bypass blocks from Google by setting up proxies and CAPTCHA-solving solutions.
# pip install google-search-results
import os, json
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),  # your serpapi API key
    "engine": "google_scholar",       # search engine
    "q": "AI source:NIPS",            # search query
    "hl": "en",                       # language
    # "as_ylo": "2017",               # from 2017
    # "as_yhi": "2021",               # to 2021
    "start": "0"                      # first page
}

search = GoogleSearch(params)

publications = []

publications_is_present = True
while publications_is_present:
    results = search.get_dict()

    print(f"Currently extracting page #{results.get('serpapi_pagination', {}).get('current')}..")

    # .get() with a default avoids a KeyError when a page has no organic results
    for result in results.get("organic_results", []):
        position = result["position"]
        title = result["title"]
        publication_info_summary = result["publication_info"]["summary"]
        result_id = result["result_id"]
        link = result.get("link")
        result_type = result.get("type")
        snippet = result.get("snippet")

        publications.append({
            "page_number": results.get("serpapi_pagination", {}).get("current"),
            "position": position + 1,
            "result_type": result_type,
            "title": title,
            "link": link,
            "result_id": result_id,
            "publication_info_summary": publication_info_summary,
            "snippet": snippet,
        })

    if "next" in results.get("serpapi_pagination", {}):
        # parses the query string of the "next" URL into a dict and merges it into the GoogleSearch() parameters
        search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
    else:
        # no next page: stop the loop
        publications_is_present = False

print(json.dumps(publications, indent=2, ensure_ascii=False))
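The pagination step relies on urlsplit and parse_qsl to turn the "next" URL into request parameters. The sketch below shows that transformation in isolation; the next_url value is made up for illustration and is not a real API response.

```python
from urllib.parse import urlsplit, parse_qsl

# hypothetical "next" URL, invented for this example
next_url = "https://serpapi.com/search.json?engine=google_scholar&q=AI+source%3ANIPS&hl=en&start=10"

# urlsplit() isolates the query string; parse_qsl() decodes it into (key, value) pairs
next_params = dict(parse_qsl(urlsplit(next_url).query))
print(next_params)

# merging the pairs into the current parameters moves the search to the next page
params = {"engine": "google_scholar", "q": "AI source:NIPS", "hl": "en", "start": "0"}
params.update(next_params)
print(params["start"])
```

Because update() overwrites existing keys, only values that differ in the "next" URL, such as start, actually change.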