
Extraction of NSW Property Data

Date: 24/04/2023

Last Updated: 06/03/2024

Author: Joseph Cheng

Programming Language: Python 3.8 and above

Disclaimer

  • This is a personal, non-profit project intended to give the public access to these datasets, which can help people make decisions when analysing the property market.
  • If the owner or government custodian of this data source requires me to take this project down, I will do so immediately.

Main Objective

  • Property data is difficult to gather these days. Luckily, in New South Wales, Australia, the NSW State Government provides a public dataset of transactional property sales data (see the links below)
  • The objective is to create a clean, comprehensible dataset with historical property information for NSW, Australia, based on the raw data provided by the government
  • Please reach out to me with any feedback or improvements and I will try my best to update the dataset as soon as possible

Personal Remarks

  • I am also a first home buyer looking for opportunities in the property market. I hope that by sharing this code repository, more people can access property data and find their dream home more easily.

How to Run

  • Download the data from the "NSW data source" links listed under Resources
  • Unzip all the archives and save them in the appropriate location (TODO)
  • pip install the required Python libraries
  • Run the main file (see the sketch after this list)
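The repository does not yet document the exact folder layout (the location is marked TODO above), so the following is only a minimal sketch of the extraction step; the `downloads` and `data` folder names are assumptions.

```python
# Minimal sketch of the unzip step. The "downloads" and "data" folder names
# are assumptions; the repository leaves the exact layout as a TODO.
from pathlib import Path
import zipfile

DOWNLOAD_DIR = Path("downloads")  # where the NSW archives were saved
DATA_DIR = Path("data")           # where the extracted files should go

DATA_DIR.mkdir(exist_ok=True)
for archive in DOWNLOAD_DIR.glob("*.zip"):
    with zipfile.ZipFile(archive) as zf:
        # one sub-folder per archive keeps each download's contents separate
        zf.extractall(DATA_DIR / archive.stem)
```

After extraction, install the dependencies and run the entry point, e.g. `pip install -r requirements.txt` followed by `python main.py` (both file names are assumptions; the README only says to install the required libraries and run the main file).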

Resources

  1. Australian property heat map application
  2. NSW Property Sales Information
  3. NSW Bulk Land Value Information
  4. NSW raw data fields
  5. NSW fields description
  6. Multiprocessing with Python
  7. Python Web Scraper
  8. NSW Postcodes
  9. Headers and Cookies for Web Scraping
  10. Web Scraping Best Practices
  11. 10 tips for web scraping

Anti-scraping Techniques

  • Request volume: too many requests within a given time frame, or too many parallel requests from the same IP
  • Request patterns: repeated, regular requests (X requests every Y seconds)
  • Honeypots: link traps webmasters can add to the HTML that are hidden from human visitors (see the sketch after this list)
  • Redirecting the request to a page with a CAPTCHA
  • JavaScript checks
  • Anti-bot mechanisms can spot patterns in the number of clicks, the clicks' location, the interval between clicks, and other metrics
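Honeypot links are usually hidden from human visitors with inline CSS, which is also how a scraper can recognise them. A rough sketch of such a filter, assuming BeautifulSoup is used for parsing (illustrative only, not code from this repository):

```python
# Rough sketch: collect only the links a human visitor could actually see,
# skipping anchors hidden with inline CSS, which is how honeypot traps are
# commonly planted. Assumes BeautifulSoup; not code from this repository.
from bs4 import BeautifulSoup

HIDDEN_MARKERS = ("display:none", "display: none",
                  "visibility:hidden", "visibility: hidden")

def visible_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        style = (anchor.get("style") or "").lower()
        if any(marker in style for marker in HIDDEN_MARKERS):
            continue  # likely a honeypot trap; do not follow
        links.append(anchor["href"])
    return links
```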

Todos

  • Set Your Timeout to at Least 60 Seconds (see the sketch after this list)
  • Don’t Set Custom Headers Unless You 100% Need To
  • Always Send Your Requests to the HTTPS Version
  • Avoid Using Sessions Unless Completely Necessary
  • Manage Your Concurrency Properly
  • Verify if You Need Geotargeting Before Running Your Scraper
  • If you want to be able to interact with the page (click on a button, scroll, etc.) then you will need to use your own Selenium, Puppeteer, or Nightmare headless browser
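A minimal sketch of how a few of the simpler items above could look in Python, using `requests` and a small process pool (see also resource 6, Multiprocessing with Python). The URL and pool size are placeholders, not values used by this project:

```python
# Minimal sketch of a few items above: HTTPS-only URLs, a generous timeout,
# no custom headers, no session object, and explicit, bounded concurrency.
# The URL and pool size are placeholders, not values from this repository.
from multiprocessing import Pool

import requests

BASE_URL = "https://example.com/listings"  # placeholder; always use the HTTPS version

def fetch(page_number):
    response = requests.get(
        f"{BASE_URL}?page={page_number}",
        timeout=60,  # at least 60 seconds, as recommended above
    )
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # keep the number of parallel requests modest
        pages = pool.map(fetch, range(1, 11))
```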

Tips

  • Set Random Intervals In Between Your Requests (see the sketch after this list)
  • Set a Referrer
  • Use a Headless Browser
  • Avoid Honeypot Traps
  • Detect Website Changes
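A rough sketch of the first two tips combined: a randomised delay between requests and an explicit Referer header. The delay bounds and the referrer value are illustrative assumptions:

```python
# Rough sketch of the first two tips: a random interval between requests and
# a Referer header. Delay bounds and referrer value are illustrative only.
import random
import time

import requests

def polite_get(url):
    time.sleep(random.uniform(2.0, 8.0))  # random interval, not a fixed pattern
    return requests.get(
        url,
        headers={"Referer": "https://www.google.com/"},  # plausible referrer
        timeout=60,
    )
```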
