
regex – How to get each table row result in new list? Python webscraping


Here is an example of how you can load the data from the page (table, title, …) into a pandas DataFrame and then automatically follow the link to the next result:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.afm.nl/nl-nl/sector/registers/vergunningenregisters/financiele-dienstverleners/details?id=C18B1D63-774C-E811-80D9-005056BB0C82"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
}

all_dfs = []
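while True:
    soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

    # drop the .cc-mobile-title elements so their text doesn't end up in the parsed table
    for tag in soup.select(".cc-mobile-title"):
        tag.extract()

    # parse the first table on the page and add the page title as a column
    df = pd.read_html(StringIO(str(soup)))[0]
    df["title"] = soup.h1.get_text(strip=True)

    # add each label/value pair from the detail list as a column (same value on every row)
    for label, value in zip(
        soup.select(".cc-em--detail-list__label"),
        soup.select(".cc-em--detail-list__value"),
    ):
        df[label.get_text(strip=True)] = value.get_text(strip=True)

    print(df)
    all_dfs.append(df)

    # follow the "Volgende register resultaat" (next register result) link; stop when there is none
    next_url = soup.select_one('a:-soup-contains("Volgende register resultaat")')
    if not next_url:
        break

    url = "https://www.afm.nl/" + next_url["href"]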

final_df = pd.concat(all_dfs)
print(final_df)
final_df.to_csv('data.csv', index=False)

Prints:

  Financiele Dienst                          Product   Begindatum  Einddatum         title Statutaire naam   Handelsnaam Vergunningnummer
0         Adviseren            Inkomensverzekeringen  01 jan 2006        NaN  Michal Tomeš    Michal Tomeš  Michal Tomeš         12045811
1         Adviseren  Schadeverzekeringen particulier  01 jan 2006        NaN  Michal Tomeš    Michal Tomeš  Michal Tomeš         12045811
2         Adviseren     Schadeverzekeringen zakelijk  01 jan 2006        NaN  Michal Tomeš    Michal Tomeš  Michal Tomeš         12045811
3         Adviseren                         Vermogen  01 jan 2006        NaN  Michal Tomeš    Michal Tomeš  Michal Tomeš         12045811
4        Bemiddelen            Inkomensverzekeringen  01 jan 2006        NaN  Michal Tomeš    Michal Tomeš  Michal Tomeš         12045811
5        Bemiddelen  Schadeverzekeringen particulier  01 jan 2006        NaN  Michal Tomeš    Michal Tomeš  Michal Tomeš         12045811
6        Bemiddelen     Schadeverzekeringen zakelijk  01 jan 2006        NaN  Michal Tomeš    Michal Tomeš  Michal Tomeš         12045811
7        Bemiddelen                         Vermogen  01 jan 2006        NaN  Michal Tomeš    Michal Tomeš  Michal Tomeš         12045811
  Financiele Dienst                          Product   Begindatum  Einddatum         title Statutaire naam   Handelsnaam Vergunningnummer
0         Adviseren            Inkomensverzekeringen  01 jan 2007        NaN  Michal Treml    Michal Treml  Michal Treml         12045973
1         Adviseren  Schadeverzekeringen particulier  01 jan 2007        NaN  Michal Treml    Michal Treml  Michal Treml         12045973
2         Adviseren     Schadeverzekeringen zakelijk  01 jan 2007        NaN  Michal Treml    Michal Treml  Michal Treml         12045973

...

At the end, pd.concat combines all the per-page DataFrames into one final DataFrame, final_df, which is then written to data.csv.
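
To illustrate the concatenation step without hitting the website, here is a minimal offline sketch. The two small frames are made-up stand-ins for the per-page DataFrames (only three of the columns, with values borrowed from the output above). `ignore_index=True` is an optional addition that the code above does not use: it renumbers the rows instead of repeating each page's own 0-based index. `values.tolist()` then gives one plain Python list per table row, which is roughly what the question asks for.

```python
import pandas as pd

# Hypothetical stand-ins for the per-page DataFrames collected in all_dfs.
page_1 = pd.DataFrame(
    {
        "Financiele Dienst": ["Adviseren", "Bemiddelen"],
        "Product": ["Vermogen", "Vermogen"],
        "title": ["Michal Tomeš", "Michal Tomeš"],
    }
)
page_2 = pd.DataFrame(
    {
        "Financiele Dienst": ["Adviseren"],
        "Product": ["Inkomensverzekeringen"],
        "title": ["Michal Treml"],
    }
)

# Stack the frames; ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat([page_1, page_2], ignore_index=True)

# One plain Python list per table row.
rows = combined.values.tolist()
for row in rows:
    print(row)
```

Without `ignore_index=True` the combined frame keeps the index values 0, 1, 0, which is harmless for `to_csv(..., index=False)` but can be confusing when selecting rows by label.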


