Date: 24/04/2023
Last Updated: 06/03/2024
Author: Joseph Cheng
Programming Language: Python 3.8 and above
- This is a personal, non-profit project intended to give the public access to datasets that may help people make decisions when analysing the property market.
- If the owner / government of this data source requires me to take down this project, I will take it down immediately.
- Property data is difficult to gather these days. Luckily, in New South Wales, Australia, the NSW State Government provides a public dataset of transactional property sales data (see link below).
- The objective is to create a clean, comprehensible dataset of historical property information in NSW, Australia, based on the raw data provided by the government.
- Please reach out to me with any feedback or suggested improvements, and I will try my best to update the dataset as soon as possible.
- I am also a first home buyer looking for ways to find opportunities in the property market. I hope that by sharing this code repository, more people can access property data and find their dream home more easily.
- Download the data from "NSW data source"
- Unzip all the folders and save them in the appropriate location (TODO)
- Install all the required Python libraries with pip
- Run the main file
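
The unzip step above could be automated along the following lines. This is only a minimal sketch using the standard library: the function name `extract_all` and any folder names passed to it are hypothetical, since the final folder layout for this project is still marked TODO.

```python
import zipfile
from pathlib import Path


def extract_all(source_dir, target_dir):
    """Extract every .zip found under source_dir into target_dir.

    Each archive gets its own subfolder that mirrors its path relative to
    source_dir, so archives with the same file name in different folders
    do not overwrite each other. Both directory arguments are hypothetical
    placeholders, not locations this project requires.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    extracted = []
    # Collect the archive list up front so newly extracted files are not rescanned.
    for archive in sorted(source.rglob("*.zip")):
        destination = target / archive.relative_to(source).with_suffix("")
        destination.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(destination)
        extracted.append(destination)
    return extracted
```

Calling `extract_all` with the download folder and an output folder returns the list of folders that were created. Archives nested inside other archives are not unpacked by this sketch; running it a second time on the output folder would handle one further level.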
- Australian property heat map application
- NSW Property Sales Information
- NSW Bulk Land Value Information
- NSW raw data fields
- NSW fields description
- Multiprocessing with Python
- Python Web Scraper
- NSW Postcodes
- Headers and Cookies for Web Scraping
- Web Scraping Best Practices
- 10 tips for web scraping
- Number of requests: too many requests within a particular time frame, or too many parallel requests from the same IP
- Request repetition: the number of repetitions and regular request patterns (X requests every Y seconds)
- Honeypots: link traps that webmasters can add to the HTML file, hidden from humans
- Redirecting the request to a page with a CAPTCHA
- JavaScript checks
- Anti-bot mechanisms can spot patterns in the number of clicks, clicks’ location, the interval between clicks, and other metrics
- Set Your Timeout to at Least 60 Seconds
- Don’t Set Custom Headers Unless You 100% Need To
- Always Send Your Requests to the HTTPS Version
- Avoid Using Sessions Unless Completely Necessary
- Manage Your Concurrency Properly
- Verify if You Need Geotargeting Before Running Your Scraper
- If you want to be able to interact with the page (click on a button, scroll, etc.) then you will need to use your own Selenium, Puppeteer, or Nightmare headless browser
- Set Random Intervals In Between Your Requests
- Set a Referrer
- Use a Headless Browser
- Avoid Honeypot Traps
- Detect Website Changes
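
Several of the tips above (a 60 second timeout, HTTPS requests, random intervals between requests, and a referrer) can be combined in a few lines. The following is a rough sketch using only the standard library, not the scraper used in this repository. All function names and the 2 to 6 second delay range are assumptions chosen for illustration, and the URLs are expected to be absolute. Note that the HTTP header itself is spelled `Referer`.

```python
import random
import time
import urllib.request
from urllib.parse import urlsplit


def to_https(url):
    """Return an absolute URL with its scheme forced to https."""
    return urlsplit(url)._replace(scheme="https").geturl()


def build_request(url, referer=None):
    """Build a request for the HTTPS version of url.

    The Referer header is only added when one is supplied, in keeping
    with the advice to avoid custom headers unless they are needed.
    """
    request = urllib.request.Request(to_https(url))
    if referer:
        request.add_header("Referer", referer)
    return request


def random_pause(min_seconds=2.0, max_seconds=6.0):
    """Pick a random delay so requests do not arrive at a fixed rhythm.

    The default range is an arbitrary assumption, not a recommended value.
    """
    return random.uniform(min_seconds, max_seconds)


def fetch_all(urls, referer=None, timeout=60):
    """Fetch URLs one at a time, sleeping a random interval between them."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(random_pause())
        with urllib.request.urlopen(build_request(url, referer), timeout=timeout) as response:
            pages.append(response.read())
    return pages
```

Fetching sequentially keeps concurrency at one request at a time, which is the most conservative choice; a parallel version would need its own limit on simultaneous requests.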