No scraping via AWS or office IPs
As mentioned in #technology earlier, AirAsia recently filed an abuse report with Amazon and blacklisted all AWS IPs (not just ours) due to our scraping load. This also put us in violation of the AWS Acceptable Use Policy, which can have very serious consequences. We’re taking a few immediate actions to mitigate this.
1. Route-off AirAsia on all staging machines [pic: @chandrawibowo] (done)
2. For “shared” staging machines (i.e., staging06, staging19, testing-app-01):
- We need to continue to scrape our providers (e.g., scraping airlines as part of normal flight search) for testing purposes.
- We must use 3rd-party proxies with random IPs for scraping.
- We shouldn’t use hotel-proxy (fb01), or connect directly from an AWS IP or an office IP.
- The 3rd-party proxies will need to whitelist these staging machines.
- Site-Infra is setting up static IPs for these staging machines (done).
- Once all is configured, we’ll route-on AirAsia again on these staging machines.
3. For the other staging machines (e.g., staging01):
- We must minimize all kinds of scraping: only do it when needed, and only for a short duration with low traffic.
- AirAsia specifically will continue to be routed-off by default.
- 3rd-party proxies are not an option here because the proxy providers can only whitelist a limited number of IPs.
4. For the development machines:
- We must minimize all kinds of scraping: only do it when needed, and only for a short duration with low traffic.
- AirAsia specifically should be routed-off by default. Please follow @felix's post for instructions on how to update the config on your local/development machine.
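The per-machine rules above can be summarized as a small lookup. This is only an illustrative sketch, not our actual routing config: the function name `scraping_route`, the `proxy_ready` flag, and the returned labels are all made up for this example, and the real config change for local/development machines is the one described in @felix's post.

```python
# Illustrative sketch only; names and labels here are hypothetical.

# Shared staging machines named in this post.
SHARED_STAGING = {"staging06", "staging19", "testing-app-01"}

# Providers that are routed-off by default.
ROUTED_OFF_BY_DEFAULT = {"airasia"}


def scraping_route(machine: str, provider: str, proxy_ready: bool = False) -> str:
    """Return how scraping `provider` from `machine` should be handled.

    "third-party-proxy": scrape only through a 3rd-party proxy
    "routed-off":        do not scrape this provider
    "minimal":           only when needed, short duration, low traffic
    """
    provider = provider.lower()
    if machine in SHARED_STAGING:
        # AirAsia is routed-on again only once the proxy setup is complete.
        if provider in ROUTED_OFF_BY_DEFAULT and not proxy_ready:
            return "routed-off"
        return "third-party-proxy"
    # Other staging machines and development machines.
    if provider in ROUTED_OFF_BY_DEFAULT:
        return "routed-off"
    return "minimal"
```

For example, `scraping_route("staging06", "AirAsia")` stays routed-off until `proxy_ready=True` is passed, while any AirAsia call from a machine outside the shared list is always routed-off.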
Aside from the immediate actions above, we’re also pursuing the following high-priority tasks:
- Document the list of connections where our IPs need to be whitelisted by 3rd-party providers. This should include any scraping or API calls to providers. Please help us crowd-source this Google Doc from all teams [pic: @shvetsov]
- Document the list of proxy servers that each product/team should deploy in the near term. These proxy servers will be used for our API calls to 3rd-party providers [pic: @kevin]
- Deploy the proxy servers [pic: each team]; open a Site-Infra request for production deployment
- Whitelist the per-product proxy servers with their respective 3rd-party providers [pic: each team]
- Each team migrates from hotel-proxy to the per-product proxy [pic: each team]
- Retire / lock-down hotel-proxy [pic: site-infra]
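Once a team has its per-product proxy, API calls to a provider would go through it rather than through hotel-proxy. Below is a minimal sketch using only the Python standard library. The proxy address is a placeholder (the real hosts are not listed in this post), and `build_proxy_opener` is a hypothetical helper name.

```python
import urllib.request

# Placeholder address; replace with your team's actual per-product proxy.
EXAMPLE_PROXY_URL = "http://flight-proxy.internal.example:3128"


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends HTTP and HTTPS requests via proxy_url."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(proxy_handler)


opener = build_proxy_opener(EXAMPLE_PROXY_URL)
# Example call (commented out because the hosts above are placeholders):
# response = opener.open("https://provider.example/api", timeout=10)
```

Keeping the proxy URL in one place per product should make the later migration off hotel-proxy a config change rather than a code change.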
Note:
If your team has any mission-critical scraping needs that currently run from our office or AWS IPs, and you are not sure what to do, please add your use case(s) to this spreadsheet so we can take a look.