
Scrape Redfin.com Using Python & Download Data to a CSV File


For businesses, investors, and even curious individuals, real-time insights into the housing market can be invaluable. Redfin, a prominent player in the real estate sector, offers a wealth of such data, spanning more than 100 markets in the United States and Canada. With a market share of about 0.80% of U.S. units sold (per Wikipedia) and a network of roughly 2,000 dedicated lead agents, Redfin stands as a significant source of real estate intelligence.

In this blog, we will see how to scrape data from Redfin using Python. Further on, I will show you how you can scale this process.

Let’s start!

Collecting all the Ingredients for Scraping Redfin

I am assuming that you have already installed Python 3.x on your machine; if not, please install it from python.org first. Once this is done, create a folder in which we will keep our Python scripts.

mkdir redfin
cd redfin
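
Optionally, you can isolate the project’s dependencies in a virtual environment. This is a standard Python practice, not something specific to this tutorial:

python -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate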

Once you are inside your folder install these public Python libraries.

  • Requests – This will be used for making the HTTP connection with redfin.com. Using this library, we will download the raw HTML of the target page.
  • BeautifulSoup – Using this, we will parse the important data out of the raw HTML downloaded with the requests library.
pip install requests
pip install beautifulsoup4

Now create a Python file inside this folder where you can write the script. I am naming the file redfin.py.

With this our project setup is complete and now we can proceed with the scraping.

What are we going to scrape?

In this tutorial, we are going to scrape two types of pages from redfin.com.

  1. Redfin Search Page
  2. Redfin Property Page

Scraping Redfin Search Page

It is always good practice to decide in advance what data you want from the page. For this tutorial, we are going to scrape this page, extracting the price, property configuration, address, agent information, and the property link.

Download Raw HTML from the Page

Our first task would be to download the raw HTML from the target web page. For this, we are going to use the requests library.

import requests
from bs4 import BeautifulSoup


l = []  # will hold one dictionary per property
o = {}  # temporary dictionary for the property currently being parsed


# Browser-like User-Agent header so the request is less likely to be blocked
head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
target_url = "https://www.redfin.com/city/30749/NY/New-York/filter/status=active"


# Download the raw HTML of the search page
resp = requests.get(target_url, headers=head, verify=False)
print(resp.status_code)

First, we imported the libraries that we installed earlier. Then I declared one empty list and one empty dictionary.

The head variable is a dictionary containing the User-Agent header. The target_url variable contains the URL of the webpage to be scraped.

The requests.get function sends an HTTP GET request to the specified URL (target_url). The headers parameter is set to include the User-Agent header from the head dictionary. The verify=False parameter disables SSL certificate verification, which is convenient for quick experiments but not recommended for production use. The response object (resp) contains the server’s response to the request.
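
A quick note on the request itself: verify=False makes urllib3 emit an InsecureRequestWarning, and a request without a timeout can hang indefinitely. Here is a slightly more defensive variant you can optionally use instead (same behavior otherwise):

import urllib3

# Silence the warning triggered by verify=False (use with care)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# timeout avoids hanging forever; raise_for_status flags 4xx/5xx responses
resp = requests.get(target_url, headers=head, verify=False, timeout=15)
resp.raise_for_status()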

Once you run this code and see a 200 in the logs, it means you have successfully downloaded the target web page.

Now, we can parse the data using BS4.

Parsing the Raw HTML

BeautifulSoup will now help us extract all the data points from the raw HTML downloaded in the previous section. But before we start coding we have to identify the DOM location of each element.

We will use the Chrome developer tools to find the DOM location. If you inspect and analyze the design of the page, you will find that each property card is inside a div tag with the class HomeCardContainer. So, first, we will find all these elements using the find_all() method of BS4.

soup = BeautifulSoup(resp.text, 'html.parser')

# Every property card on the search page lives inside this container
allBoxes = soup.find_all("div", {"class": "HomeCardContainer"})

The BeautifulSoup constructor is used to create a BeautifulSoup object (soup). The find_all method of the BeautifulSoup object is used to find all HTML elements that match the class HomeCardContainer.

allBoxes is a list that contains all the property card elements. Using a for loop, we will visit every property container and extract its details. But before we write the for loop, let’s find the DOM location of each data point.

Let’s start with the property price.

Once you right-click on the price and inspect it, you will see that the price is stored inside a span tag with the class homecardV2Price.

Similarly, the configuration of the property can be found inside the div tag with class HomeStatsV2.

Agent/Broker name can be found inside the div tag with the class brokerageDisclaimerV2.

Individual property links can be found inside the a tag. This a tag is the only a tag inside each property container.

for box in allBoxes:
    # Each lookup is wrapped in try/except because not every card
    # contains every element; missing fields are stored as None.
    try:
        o["property-price"]=box.find("span",{"class":"homecardV2Price"}).text.strip()
    except:
        o["property-price"]=None

    try:
        o["property-config"]=box.find("div",{"class":"HomeStatsV2"}).text.strip()
    except:
        o["property-config"]=None

    try:
        o["property-address"]=box.find("div",{"class":"homeAddressV2"}).text.strip()
    except:
        o["property-address"]=None

    try:
        o["property-broker"]=box.find("div",{"class":"brokerageDisclaimerV2"}).text.strip()
    except:
        o["property-broker"]=None

    try:
        o["property-link"]="https://www.redfin.com"+box.find("a").get('href')
    except:
        o["property-link"]=None

    l.append(o)  # store the finished record
    o = {}       # reset for the next property


print(l)

For each home card container, the loop extracts specific pieces of information: property price, configuration, address, broker details, and a link to the property.

The for loop iterates through each element (box) in the list of home card containers. For each piece of information (property price, configuration, address, broker, link), a try block attempts to find the corresponding HTML element within the current home card container (box). If successful, it extracts the text content, strips leading and trailing whitespace, and assigns it to the corresponding key in the dictionary (o). If the extraction fails (because the element is missing or another issue occurs), the except block sets the value to None.

After extracting information from the current home card container, the dictionary o is appended to the list l. Then, the dictionary o is reset to an empty dictionary for the next iteration.
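
As an alternative to the try/except blocks, note that find() simply returns None when an element is missing, so a small helper can make the loop more compact. A sketch (safe_text is a name I am introducing here, not part of the original script):

def safe_text(tag):
    # Return stripped text if the tag was found, otherwise None
    return tag.text.strip() if tag else None

# Example usage inside the loop:
# o["property-price"] = safe_text(box.find("span", {"class": "homecardV2Price"}))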

Once you run this code, you will get a list of property dictionaries printed to the console.

Saving the data to a CSV file

For better visibility of this data, we are going to save it to a CSV file. For this task, we will use the pandas library (install it with pip install pandas if you haven’t already).

import pandas as pd

df = pd.DataFrame(l)
df.to_csv('properties.csv', index=False, encoding='utf-8')

The code uses the pandas library to create a DataFrame (df) from the list of dictionaries (l) that contains the scraped data. It then exports the DataFrame to a CSV file named properties.csv.

After running the code you will find a CSV file inside your working folder by the name properties.csv.

Saving the data from a list to a CSV file was super simple with Pandas.
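
If you would rather avoid the pandas dependency, the standard-library csv module can do the same job. A minimal sketch (it assumes l is non-empty so the header row can be derived from the first record):

import csv

with open("properties.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=l[0].keys())
    writer.writeheader()
    writer.writerows(l)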

Complete Code

You can scrape many more details from the page, but for now the code looks like this.

import requests
from bs4 import BeautifulSoup
import pandas as pd

l=[]
o={}



head={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
target_url="https://www.redfin.com/city/30749/NY/New-York/filter/status=active"



resp = requests.get(target_url,headers=head,verify=False)
print(resp.status_code)
soup=BeautifulSoup(resp.text,'html.parser')

allBoxes = soup.find_all("div",{"class":"HomeCardContainer"})

for box in allBoxes:
    try:
        o["property-price"]=box.find("span",{"class":"homecardV2Price"}).text.strip()
    except:
        o["property-price"]=None

    try:
        o["property-config"]=box.find("div",{"class":"HomeStatsV2"}).text.strip()
    except:
        o["property-config"]=None

    try:
        o["property-address"]=box.find("div",{"class":"homeAddressV2"}).text.strip()
    except:
        o["property-address"]=None

    try:
        o["property-broker"]=box.find("div",{"class":"brokerageDisclaimerV2"}).text.strip()
    except:
        o["property-broker"]=None

    try:
        o["property-link"]="https://www.redfin.com"+box.find("a").get('href')
    except:
        o["property-link"]=None

    l.append(o)
    o={}


print(l)
df = pd.DataFrame(l)
df.to_csv('properties.csv', index=False, encoding='utf-8')
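
Note that this script only covers the first page of results. Redfin search results are paginated; at the time of writing, subsequent pages typically use a /page-N suffix in the URL, but this is an assumption you should verify in your browser before relying on it. A hypothetical sketch:

for page in range(2, 5):  # pages 2-4, adjust as needed
    page_url = f"{target_url}/page-{page}"  # hypothetical URL pattern; verify first
    resp = requests.get(page_url, headers=head, verify=False)
    # ...parse resp.text with the same BeautifulSoup logic as above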

Scraping Redfin Property Page

From the property page, we are going to gather this information.

  • Property Price
  • Property Address
  • Is it still available (True/False)
  • About section of the property

Download Raw HTML from the Page

import requests
from bs4 import BeautifulSoup



l=[]
o={}
available=False  # flipped to True if the listing status is ACTIVE


head={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
target_url="https://www.redfin.com/NY/New-York/112-E-35th-St-10016/home/45333496"



resp = requests.get(target_url,headers=head,verify=False)
print(resp.status_code)

This Python code performs web scraping on a Redfin property page using the requests library to make an HTTP GET request and the BeautifulSoup library to parse the HTML content. The script initializes empty data structures (l and o) to store scraped information and sets a User-Agent header to simulate a Chrome browser request. The target URL is specified, and an HTTP GET request is sent with SSL certificate verification disabled.

After running the code, if you get 200 on your console, it means your code was able to download the raw HTML from the target web page.

Let’s use BS4 to parse the data.

Parsing the Raw HTML

As usual, we have to first find the location of each element inside the DOM.

Price is stored inside the div tag with class statsValue.

The address is stored inside the h1 tag with the class full-address.

Property sale status is located inside the div tag with the class ListingStatusBannerSection.

The About section of the property can be found inside the div tag with the id marketing-remarks-scroll.

soup=BeautifulSoup(resp.text,'html.parser')

try:
    o["property-price"]=soup.find("div",{"class":"statsValue"}).text.strip()
except:
    o["property-price"]=None

try:
    o["property-address"]=soup.find("h1",{"class":"full-address"}).text.strip()
except:
    o["property-address"]=None

# The status banner may be missing, so guard this lookup as well
try:
    check = soup.find("div",{"class":"ListingStatusBannerSection"}).text.strip()
    if "ACTIVE" in check:
        available=True
except:
    available=False

o["property-available"]=available

try:
    o["about-property"]=soup.find("div",{"id":"marketing-remarks-scroll"}).text.strip()
except:
    o["about-property"]=None

l.append(o)

print(l)

By default, available is set to False, and it is flipped to True if the string ACTIVE is present inside the check string. We have used the strip() function to remove unwanted spaces from the text values. Note that we also append o to the list l before printing, so the extracted record actually shows up in the output.
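
As a small side note, the if/else used for the status check can be collapsed into a single membership test, which reads a bit more cleanly and behaves identically:

available = "ACTIVE" in check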

Once you run the code, you should see the property details printed in your console.

Finally, we were able to extract all the desired information from the target page.

Complete Code

The complete code for this property page will look like this.

import requests
from bs4 import BeautifulSoup


l=[]
o={}
available=False


head={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
target_url="https://www.redfin.com/NY/New-York/112-E-35th-St-10016/home/45333496"



resp = requests.get(target_url,headers=head,verify=False)
print(resp.status_code)
soup=BeautifulSoup(resp.text,'html.parser')

try:
    o["property-price"]=soup.find("div",{"class":"statsValue"}).text.strip()
except:
    o["property-price"]=None

try:
    o["property-address"]=soup.find("h1",{"class":"full-address"}).text.strip()
except:
    o["property-address"]=None

# The status banner may be missing, so guard this lookup as well
try:
    check = soup.find("div",{"class":"ListingStatusBannerSection"}).text.strip()
    if "ACTIVE" in check:
        available=True
except:
    available=False

o["property-available"]=available

try:
    o["about-property"]=soup.find("div",{"id":"marketing-remarks-scroll"}).text.strip()
except:
    o["about-property"]=None
l.append(o)

print(l)

Bonus Section

While scrolling down the property page, you will find information regarding agents, the down payment, the calculator, and so on. This information loads through an AJAX injection.

This cannot be scraped through a normal XHR request. At this point, many of you will think that this information can be scraped easily with a headless browser, but the problem is that headless browsers consume a lot of CPU resources. Let me share an alternative.

Redfin renders this data from API calls whose results are embedded in the second-to-last script tag of any property page. Let me explain what I mean here.

The raw HTML you get after making the GET request will have a script tag in which all this data will be stored.

The script tag you see above is the second-to-last script tag of the raw HTML downloaded from the target property page. Here is how you can access the data from this tag using a regular expression.

import re

try:
    # The embedded data lives in the second-to-last script tag
    o["other-details"]=soup.find_all('script')[-2]
except:
    o["other-details"]=None

# Capture the JavaScript object assigned to reactServerState.InitialContext
config_match = re.search(r'reactServerState\.InitialContext\s*=\s*({.*?});', str(o["other-details"]))

if config_match:
    config_data = config_match.group(1)
    print(config_data)

Using a regular expression, we find the string that matches the pattern reactServerState\.InitialContext\s*=\s*({.*?});

Once you run this code you will find all the information inside this string.
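
If you want to work with this data programmatically rather than as a raw string, you can try loading it with the standard json module. This is only a sketch; depending on how Redfin serializes the object, the captured string may not be strict JSON and may need extra cleanup:

import json

if config_match:
    try:
        config_json = json.loads(config_data)
        print(list(config_json.keys()))  # inspect the top-level keys
    except json.JSONDecodeError:
        print("Extracted string is not plain JSON; extra cleanup is needed.")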

How to scrape Redfin at scale?

The above approach is fine while you are scraping a few hundred pages, but it will fall flat once your scraping demands reach millions of pages. Redfin will start serving captcha pages instead of results.
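
Before reaching for a full solution, you can at least detect when you are being blocked. A minimal, hypothetical check (the exact status codes and page markers Redfin uses may differ, so treat this as a starting point):

if resp.status_code in (403, 429) or "captcha" in resp.text.lower():
    print("Request appears to be blocked; rotate your IP or slow down.")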

To avoid this situation, you have to use a web scraping API like Scrapingdog. This API handles proxy rotation for you, which helps you keep your data pipeline running.

You can sign up for a free account here. The free account comes with a generous 1,000 credits, which is enough for testing the API.

Once you are on the dashboard you will find an API key that will be used in the below code.

For this example, I am again using the Redfin search page.

import requests
from bs4 import BeautifulSoup
import pandas as pd

l=[]
o={}


target_url="https://api.scrapingdog.com/scrape?dynamic=false&api_key=YOUR-API-KEY&url=https://www.redfin.com/city/30749/NY/New-York/filter/status=active"

resp = requests.get(target_url)
print(resp.status_code)
soup=BeautifulSoup(resp.text,'html.parser')

allBoxes = soup.find_all("div",{"class":"HomeCardContainer"})

for box in allBoxes:
    try:
        o["property-price"]=box.find("span",{"class":"homecardV2Price"}).text.strip()
    except:
        o["property-price"]=None

    try:
        o["property-config"]=box.find("div",{"class":"HomeStatsV2"}).text.strip()
    except:
        o["property-config"]=None

    try:
        o["property-address"]=box.find("div",{"class":"homeAddressV2"}).text.strip()
    except:
        o["property-address"]=None

    try:
        o["property-broker"]=box.find("div",{"class":"brokerageDisclaimerV2"}).text.strip()
    except:
        o["property-broker"]=None

    try:
        o["property-link"]="https://www.redfin.com"+box.find("a").get('href')
    except:
        o["property-link"]=None

    l.append(o)
    o={}

print(l)
df = pd.DataFrame(l)
df.to_csv('properties.csv', index=False, encoding='utf-8')

Did you notice something? The code is almost the same as above; we just replaced the target URL with the Scrapingdog API URL. Of course, you have to use your own API key to run this program successfully.
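
One practical tip: since the target URL is passed as a query parameter, it is safer to let requests handle the URL encoding instead of concatenating strings. A sketch assuming the same Scrapingdog endpoint and parameters as above:

params = {
    "dynamic": "false",
    "api_key": "YOUR-API-KEY",
    "url": "https://www.redfin.com/city/30749/NY/New-York/filter/status=active",
}
# requests percent-encodes the values in params for you
resp = requests.get("https://api.scrapingdog.com/scrape", params=params)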

It is a very economical solution for large-scale scraping. You just have to focus on data collection and the rest will be managed by Scrapingdog.

Conclusion

In this blog, I have scraped two distinct types of pages on Redfin: the search page and the property page. Moreover, I have included a bonus section that sheds light on extracting information that’s dynamically loaded through AJAX injections.

Just like Redfin, I have extracted data from other real estate giants (find the links below).

  1. Scraping Zillow Real Estate Property Data using Python
  2. Scraping Idealista.com using Python
  3. Web Scraping Realtor Property Data using Python

If this article resonates with you and you appreciate the effort put into this research, please share it with someone who might be on the lookout for scalable real estate data extraction solutions from property sites.

In the future, I will be making more such articles. If you found this article helpful, please share it. Thanks for reading!


