
Selenium WebDriver skipping some links while iterating through a DataFrame – Python


I’m trying to scrape data from around 1,800 webpages with a similar table format using Selenium with Python. I have a DataFrame containing the links to each of the necessary pages. However, when I use Selenium to grab data (namely, the presence of specific products) from each page, certain links are simply skipped by the program.

Here is a sample webpage with the data I’m using (https://www.xpel.com/clearbra-installers/united-states/arizona/tempe). For each of the 5 stores on this page, I’m pulling the name of the store, its address, and the class name of each of the 4 available products (whose class either contains ‘active’ or does not).

I’ve tried the following solution, which gets each of the links from df_in, navigates to each page, finds the number of stores in each city, and pulls the necessary data for each store.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# assumes driver, wait, df_in, df_out, and pd are already set up
for link in df_in['Link']:
    # navigate to the link, then wait for the page to load
    driver.get(link)
    wait.until(lambda d: d.execute_script('return document.readyState') == 'complete')

    # find the number of stores on the page
    dealership_options = driver.find_elements(By.CLASS_NAME, "dealer-list-cell")
    num_dealership_options = len(dealership_options)

    for i in range(1, num_dealership_options + 1):
        # for each store, collect the name, address, and products
        name = driver.find_element(By.XPATH, f"(//div[@class='dealer-list-cell-name'])[{i}]")
        address = driver.find_element(By.XPATH, f"(//div[@class='dealer-list-cell-address'])[{i}]")
        wait.until(EC.presence_of_element_located((By.XPATH, f"(//div[@class='dealer-list-cell-xpel-logos'])[{i}]/div[1]")))
        p1 = driver.find_element(By.XPATH, f"(//div[@class='dealer-list-cell-xpel-logos'])[{i}]/div[1]")
        p2 = driver.find_element(By.XPATH, f"(//div[@class='dealer-list-cell-xpel-logos'])[{i}]/div[2]")
        p3 = driver.find_element(By.XPATH, f"(//div[@class='dealer-list-cell-xpel-logos'])[{i}]/div[3]")
        p4 = driver.find_element(By.XPATH, f"(//div[@class='dealer-list-cell-xpel-logos'])[{i}]/div[4]")

        # add the new data to df_out
        new_row = {"Link": link,
                   "Address": address.text,
                   "Name": name.text,
                   "P1": p1.get_attribute("class"),
                   "P2": p2.get_attribute("class"),
                   "P3": p3.get_attribute("class"),
                   "P4": p4.get_attribute("class")}
        df_out = pd.concat([df_out, pd.DataFrame([new_row])], ignore_index=True)

        # print the link, just to keep track of where we are
        print(link + " is done")

Most of the time, this program works perfectly. However, every once in a while it hits a snag: it stops printing “link is done” for about 2 minutes, then continues running using links further down the list. For instance, in one run it printed “…/new-mexico/clovis is done”, stopped printing anything for a few minutes, then printed “…/united-states/south-carolina/ridgeland is done”. It must have kept running through the states New York to Rhode Island (alphabetically) in the process, but it never printed those names, collected the appropriate data, or added them to df_out.

I’ve already tried playing around with wait.until a bit. I’ve also tried adding explicit waits, and I’ve even tried working through the list in small sections at a time. But nothing seems to work for seemingly random chunks of states.
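
For reference, the explicit waits I’ve been adding look roughly like this (just a sketch; the 30-second timeout is an arbitrary value I picked):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # wait up to 30 seconds for the dealer cells to be present before counting them
    wait = WebDriverWait(driver, 30)
    dealership_options = wait.until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "dealer-list-cell"))
    )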

What could possibly be happening? Is this a wait time issue? And if so, why is it still running in the background, picking up perfectly fine after a few minutes?
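
In case it’s useful, this is roughly how I’m thinking of instrumenting the loop next to see which links load with zero stores or time out (a hypothetical sketch; set_page_load_timeout and the exception classes are standard Selenium, the rest is just print statements):

    from selenium.common.exceptions import TimeoutException, NoSuchElementException

    driver.set_page_load_timeout(60)  # fail fast instead of hanging on a slow page

    skipped = []
    for link in df_in['Link']:
        try:
            driver.get(link)
            wait.until(lambda d: d.execute_script('return document.readyState') == 'complete')
            cells = driver.find_elements(By.CLASS_NAME, "dealer-list-cell")
            if not cells:
                # a page that loads but shows no stores would be skipped silently,
                # since the inner loop (and its print) would never run
                print(link + " loaded but found 0 dealer cells")
                skipped.append(link)
        except (TimeoutException, NoSuchElementException) as exc:
            print(link + " raised " + type(exc).__name__)
            skipped.append(link)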


