
Unable to Scrape Links from Listings Using Python Selenium


I’m attempting to scrape links from listings using Python Selenium. However, the scraping sometimes returns None. I’ve noticed this happens for links enclosed within elements having tabindex="0". Strangely, when the same elements don’t contain tabindex="0", the scraping works perfectly.

Here’s a snippet of the HTML structure for context that would scrape fine:

<a class="hz-Link hz-Link--block hz-Listing-coverLink" href="/v/auto-s/toyota/m2063540046-toyota-yaris-1-3-16v-5dr-2001-blauw?correlationId=51249a0b-2b07-4493-8cea-5a96d6f81cd6"></a>

Here’s a snippet of the HTML structure for context that would return None:

<a class="hz-Link hz-Link--block hz-Listing-coverLink" tabindex="0" href="/v/auto-s/toyota/m2063540046-toyota-yaris-1-3-16v-5dr-2001-blauw?correlationId=51249a0b-2b07-4493-8cea-5a96d6f81cd6"></a>

I’m using the following code to scrape the links:

from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options as EdgeOptions
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from bs4 import BeautifulSoup

os.environ['WDM_LOG'] = '0'
os.environ['WDM_LOCAL'] = '1'

options = EdgeOptions()
options.add_argument("inprivate")
options.add_argument("headless")

service = Service(executable_path=r'PATH')
driver = webdriver.Edge(service=service, options=options)

try:
    driver.maximize_window()
    driver.get("https://www.marktplaats.nl/l/auto-s/mercedes-benz/#f:10882|offeredSince:Gisteren|PriceCentsTo:500000|constructionYearFrom:1990|constructionYearTo:2012|sortBy:SORT_INDEX|sortOrder:DECREASING")
    driver.implicitly_wait(10)

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")

    scrape_coverlink = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((
            By.XPATH,
            "//*[@class='hz-Listing hz-Listing--list-item-cars hz-Listing--list-item-cars-BNL16952']"
            "//*[@class='hz-Link hz-Link--block hz-Listing-coverLink']"
        ))
    )

    for link_element in scrape_coverlink:
        link = link_element.get_attribute("href")
        print(link)

finally:
    driver.quit()

I’ve ensured that the elements are present in the DOM and have tried various selectors, but scraping links from elements with tabindex="0" consistently returns None.

What could be causing this behavior? Is there a workaround or an alternative approach to scrape links from elements with tabindex="0" listings?
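One workaround I’ve considered, sketched below against a static snippet rather than the live page: parse the rendered HTML with BeautifulSoup (already imported in the code above) and read the href from the parsed tree instead of calling get_attribute on the WebElement. When parsing static markup this way, the tabindex attribute has no effect on attribute access. In the real script the input would be driver.page_source; the two anchors here are stand-ins modeled on the snippets above.

```python
from bs4 import BeautifulSoup

# Minimal sketch: a static snippet resembling the listing markup.
# In the real script, replace this with driver.page_source.
html = '''
<a class="hz-Link hz-Link--block hz-Listing-coverLink" href="/v/auto-s/a"></a>
<a class="hz-Link hz-Link--block hz-Listing-coverLink" tabindex="0" href="/v/auto-s/b"></a>
'''

soup = BeautifulSoup(html, "html.parser")

# Select every cover-link anchor and pull its href straight from the markup.
links = [a.get("href") for a in soup.select("a.hz-Listing-coverLink")]
print(links)  # ['/v/auto-s/a', '/v/auto-s/b'] — both hrefs, with or without tabindex
```

This isolates whether the problem lies in the HTML itself or in how Selenium resolves the attribute on the live element.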

My expected output should always be links in this format:

https://www.marktplaats.nl/v/auto-s/mercedes-benz/m2062894144-mercedes-benz-a-klasse-150-avantgarde

But sometimes when the elements contain tabindex="0" the output is:

None


