
Python web scraping: finding the complete URL, or the final HTML behind JavaScript, without Selenium


I’m learning Python with the help of an AI chat.
As a first challenge, I am writing a program that fetches book prices.
But one of the bookstores is tricky, and I’ve been trying different things for days.

They use a “reverse” URL where the ISBN is at the end, for example:
https://www.akademibokhandeln.se/bok/en-varld-utan-slut/9789100172664
and I need to know the middle part of the URL.

I have tried

  1. https://www.akademibokhandeln.se/?q=9789100172664
    Then I end up on the start page.

  2. Letting Python use the search input field. This shows the book on a temporary search page, but the content is rendered by JavaScript, so I can’t scrape it. In Firefox’s inspector I see everything I need, but not in the page source view. https://www.akademibokhandeln.se/sok?sokfraga=9789100172664

  3. The AI chat had me test some code that looked on the server for easily visible URLs containing the right numbers, but this rarely worked, and I can’t get it to produce the same solution again.

The site’s robots.txt says:

User-agent: *
Allow: /
Sitemap: https://www.akademibokhandeln.se/sitemap.xml

I’m trying to search all the sitemap files (about 20 files of 2.5 MB each), but it is often a very slow process (72 seconds).

import requests
from bs4 import BeautifulSoup

# URL of the sitemap index
sitemap_index_url = "https://www.akademibokhandeln.se/sitemap.xml"

response = requests.get(sitemap_index_url)
soup_index = BeautifulSoup(response.content, "lxml-xml")

# Flag to indicate whether the URL has been found
found = False

# Look for the sitemap URLs in the sitemap index
for sitemap in soup_index.find_all("loc"):
    if found:
        break

    sitemap_url = sitemap.text
    response = requests.get(sitemap_url)
    soup = BeautifulSoup(response.content, "lxml-xml")

    # Look for the URL that ends with "/9789100172664" in each sitemap
    for url in soup.find_all("loc"):
        if url.text.endswith("/9789100172664"):
            print(f"The complete URL is: {url.text}")
            found = True
            break
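
One possible way to speed up the sitemap search above is to skip the XML parser and scan the raw text with a regular expression, while downloading several sitemap files in parallel. The following is only a sketch under assumptions: the helper names (`find_loc_ending_with`, `fetch_text`, `find_book_url`) are hypothetical, the sitemap files are assumed to be plain uncompressed XML, and it uses the standard library's `urllib` instead of `requests` so it has no extra dependencies.

```python
import re
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Matches the text inside every <loc>...</loc> element.
LOC_RE = re.compile(r"<loc>\s*([^<]+?)\s*</loc>")


def find_loc_ending_with(xml_text, suffix):
    """Return the first <loc> URL in xml_text that ends with suffix, else None."""
    for match in LOC_RE.finditer(xml_text):
        url = match.group(1)
        if url.endswith(suffix):
            return url
    return None


def fetch_text(url, timeout=30):
    """Download a URL and return its body as text (assumes UTF-8, uncompressed)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def find_book_url(sitemap_index_url, isbn, workers=8):
    """Search every sitemap listed in the index for a URL ending in /<isbn>."""
    index_xml = fetch_text(sitemap_index_url)
    sitemap_urls = LOC_RE.findall(index_xml)
    suffix = "/" + isbn
    # Downloads run in parallel; results are checked in index order.
    # Note: leaving the "with" block still waits for downloads already started.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for xml_text in pool.map(fetch_text, sitemap_urls):
            hit = find_loc_ending_with(xml_text, suffix)
            if hit:
                return hit
    return None
```

Calling `find_book_url("https://www.akademibokhandeln.se/sitemap.xml", "9789100172664")` would perform the actual network search. Most of the time is probably still spent downloading, so saving the matches to a local file and reusing them between runs may help more than any parsing change.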

  4. I tried to scrape DuckDuckGo and feed the resulting URL to my other code.
    I also tried rewriting the DuckDuckGo URL so that it becomes a redirect, without any scraping.
    https://lite.duckduckgo.com/lite/?kp=-1&kl=se-sv&q=\\site:www.akademibokhandeln.se+9789100172664

Now the process is down to 1.5 seconds plus the secondary code.
BUT DuckDuckGo is not up to date, so new books make my code fail.
Google is more up to date, but not fully.
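
The DuckDuckGo redirect idea can be sketched as a pair of small helpers. This assumes the leading backslash in the query above is meant as DuckDuckGo's "go to the first result" shortcut. The function names `build_ducky_url` and `looks_like_book_url` are hypothetical, and the `/bok/` path prefix is taken from the single example URL in this post. The second helper is there because the search index may be stale: it checks that whatever URL the redirect ends on really is the product page for the requested ISBN.

```python
from urllib.parse import urlencode


def build_ducky_url(isbn, site="www.akademibokhandeln.se"):
    """Build a DuckDuckGo Lite URL that searches one site for an ISBN.

    The leading backslash asks DuckDuckGo to jump to the first result.
    """
    query = f"\\site:{site} {isbn}"
    params = {"kp": "-1", "kl": "se-sv", "q": query}
    return "https://lite.duckduckgo.com/lite/?" + urlencode(params)


def looks_like_book_url(url, isbn, site="www.akademibokhandeln.se"):
    """Return True if url is a /bok/ page on the site that ends with the ISBN."""
    prefix = f"https://{site}/bok/"
    return url.startswith(prefix) and url.rstrip("/").endswith("/" + isbn)
```

If `looks_like_book_url` returns False for the final URL, the code can fall back to a slower method such as the sitemap search instead of failing.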

  5. The AI chat often suggests Selenium and WebDriver as a final solution, but I do not want to run more programs just to make this work.

  6. The missing middle part of the URL is the book title. I have considered looking up the book title somewhere else and building the complete URL from it, but I have not tested this since it would be slow and unreliable.
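
For the title-based idea in the last item, the URL could in principle be built like this. It is a sketch resting on a guess: the slug rule (lowercase, accents stripped, everything else turned into hyphens) is inferred from the single example URL in this post and may not match how the site really generates its slugs. The names `slugify` and `build_book_url` are hypothetical.

```python
import re
import unicodedata


def slugify(title):
    """Turn a book title into a URL slug (assumed rule, inferred from one example)."""
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, and replace every run of other characters with one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def build_book_url(title, isbn):
    """Combine the guessed slug and the ISBN into a product URL."""
    return f"https://www.akademibokhandeln.se/bok/{slug}/{isbn}".replace(
        "{slug}", slugify(title)
    ) if False else f"https://www.akademibokhandeln.se/bok/{slugify(title)}/{isbn}"


print(build_book_url("En värld utan slut", "9789100172664"))
```

For the example book this prints the same URL as shown at the top of the post, but subtitles, punctuation, or different title spellings on other sites could easily produce a slug the shop does not recognise.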

I believe there is another way. To get back to a fast and basic solution that runs in 0.5 to 1 second, I need to get the price directly from the search page or the final page.

The search page hides its content behind JavaScript or something similar; otherwise I could have read the price from there.
The start page and the final page are normal HTML, but I’m missing the full URL to get there.
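
One thing that may be worth checking is whether the search page ships its data as JSON inside a `<script>` tag, since JavaScript-rendered pages sometimes do. Whether this site does so is unknown, so the following is only a sketch of the general technique: it collects every `<script>` block typed as `application/json` or `application/ld+json` and searches the parsed data for a key. The class and function names are hypothetical, and the sample HTML is made up purely for illustration.

```python
import json
from html.parser import HTMLParser


class JsonScriptCollector(HTMLParser):
    """Collect the parsed contents of JSON-typed <script> tags."""

    JSON_TYPES = {"application/json", "application/ld+json"}

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") in self.JSON_TYPES:
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.blobs.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # Not valid JSON after all; skip this block.


def extract_json_blobs(html):
    """Return a list with the parsed JSON of every JSON-typed script tag."""
    parser = JsonScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.blobs


def find_values(obj, key):
    """Yield every value stored under key anywhere in nested dicts and lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_values(item, key)


# Made-up sample page, only to show how the helpers fit together.
sample_html = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Product", "offers": {"price": "199", "priceCurrency": "SEK"}}'
    "</script></head><body></body></html>"
)
prices = list(find_values(extract_json_blobs(sample_html), "price"))
```

If the real page source contains no such script blocks, the browser's network tab (in the same developer tools as the inspector) may show a separate request that returns the search results, which could then be fetched directly.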

Thanks


