I’m attempting to scrape links from listings using Python and Selenium. However, I’ve encountered an issue where the scraping sometimes returns None. I’ve noticed this happens for links enclosed within elements that have tabindex="0". Strangely, when the same elements don’t contain tabindex="0", the scraping works perfectly.
Here’s a snippet of the HTML structure for context that would scrape fine:
<a class="hz-Link hz-Link--block hz-Listing-coverLink" href="/v/auto-s/toyota/m2063540046-toyota-yaris-1-3-16v-5dr-2001-blauw?correlationId=51249a0b-2b07-4493-8cea-5a96d6f81cd6"></a>
Here’s a snippet of the HTML structure for context that would return None:
<a class="hz-Link hz-Link--block hz-Listing-coverLink" tabindex="0" href="/v/auto-s/toyota/m2063540046-toyota-yaris-1-3-16v-5dr-2001-blauw?correlationId=51249a0b-2b07-4493-8cea-5a96d6f81cd6"></a>
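For what it’s worth, a quick static check of those two snippets (with BeautifulSoup, which my script below already imports) shows that the href attribute is present and readable in both cases, with or without tabindex, so the attribute itself is clearly there in the markup:

```python
from bs4 import BeautifulSoup

# The two snippets from above: identical anchors, except the
# second one carries tabindex="0".
html = '''
<a class="hz-Link hz-Link--block hz-Listing-coverLink" href="/v/auto-s/toyota/m2063540046-toyota-yaris-1-3-16v-5dr-2001-blauw?correlationId=51249a0b-2b07-4493-8cea-5a96d6f81cd6"></a>
<a class="hz-Link hz-Link--block hz-Listing-coverLink" tabindex="0" href="/v/auto-s/toyota/m2063540046-toyota-yaris-1-3-16v-5dr-2001-blauw?correlationId=51249a0b-2b07-4493-8cea-5a96d6f81cd6"></a>
'''

soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    # Both anchors expose the same href, regardless of tabindex.
    print(a.get('href'))
```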
I’m using the following code to scrape the links:
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options as EdgeOptions
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from bs4 import BeautifulSoup

os.environ['WDM_LOG'] = '0'
os.environ['WDM_LOCAL'] = '1'

options = EdgeOptions()
options.add_argument("inprivate")
options.add_argument("headless")

service = Service(executable_path=r'PATH')
driver = webdriver.Edge(service=service, options=options)
try:
    driver.maximize_window()
    driver.get("https://www.marktplaats.nl/l/auto-s/mercedes-benz/#f:10882|offeredSince:Gisteren|PriceCentsTo:500000|constructionYearFrom:1990|constructionYearTo:2012|sortBy:SORT_INDEX|sortOrder:DECREASING")
    driver.implicitly_wait(10)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    scrape_coverlink = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((
            By.XPATH,
            '//*[@class="hz-Listing hz-Listing--list-item-cars hz-Listing--list-item-cars-BNL16952"]'
            '//*[@class="hz-Link hz-Link--block hz-Listing-coverLink"]',
        ))
    )
    for links in scrape_coverlink:
        link = links.get_attribute("href")
        print(link)
finally:
    driver.quit()
I’ve ensured that the elements are present in the DOM and have tried various selectors, but scraping links from elements with tabindex="0" consistently returns None.
What could be causing this behavior? Is there a workaround or an alternative approach for scraping links from listings whose elements have tabindex="0"?
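For reference, I’d expect a selector that only requires the class and an href (ignoring tabindex entirely) to match both variants. A quick sanity check of that selector against simplified versions of the two snippets (shortened placeholder hrefs, not the real listing URLs):

```python
from bs4 import BeautifulSoup

# Simplified stand-ins for the two anchor variants shown earlier;
# "/v/auto-s/example" is a placeholder href, not a real listing.
snippet_plain = '<a class="hz-Link hz-Link--block hz-Listing-coverLink" href="/v/auto-s/example"></a>'
snippet_tab = '<a class="hz-Link hz-Link--block hz-Listing-coverLink" tabindex="0" href="/v/auto-s/example"></a>'

# Requires only the class and a non-missing href attribute, so the
# presence or absence of tabindex should make no difference.
selector = 'a.hz-Listing-coverLink[href]'

for snippet in (snippet_plain, snippet_tab):
    soup = BeautifulSoup(snippet, 'html.parser')
    matches = soup.select(selector)
    print(len(matches), matches[0]['href'])
```

The same selector string could be passed to Selenium as (By.CSS_SELECTOR, selector) in the WebDriverWait call above, in place of the XPath.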
My expected output should always be links in this format:
https://www.marktplaats.nl/v/auto-s/mercedes-benz/m2062894144-mercedes-benz-a-klasse-150-avantgarde
But sometimes, when the elements contain tabindex="0", the output is:
None