2021-08-13 GCP Governance
Attendees: Site Infra, CDE, TPM
Issues
- Lack of clarity on teams' responsibility and "code of conduct" for GCP especially on aspects of governance: access control, project creation, policies, standards and conventions
- We have identified specific GCP security issues, but find it hard to move as we're afraid to break workloads
Notes
- CDE is responsible for overall GCP, but Data teams generally lack engineering resources and/or capability. In practice, CDE's remit is not only to set policies but also to execute them and enforce compliance on behalf of all Data teams
- Current proposal scope is to tackle 2 out of 5 issues highlighted in CDE <> SecOps document. The 2 issues are GCP Project Cleanup and Access and Permissions. Other plans are in place to tackle the other issues e.g. keyless service accounts for service account key rotation issue
- CDE is adopting PDAs as much as we can (GCP projects, Data Ops Platform), but the PDA may not be the correct unit for identifying who owns or manages a GCP project. For example, some projects are listed under the Accom PDA but are actually managed through our iac repo, which the Accom PDA team members do not currently have access to
- On project provisioning: the number of accounts in AWS is limited due to the overhead associated with accounts. It may be OK for PDAs to have multiple projects, as the cost of creating a GCP project may be minimal. But we should use conventions and structures (implicit or explicit) that allow us to identify the owners
- Q: How to ensure IAM enforcement of least privilege principle in AWS?
- A: Not easy! Privilege proliferation problem. Enforcement is difficult, but at a minimum clear accountability needs to be assigned across all projects, so that if there's a problem at least someone is on the hook. Need tooling e.g. for scanning IAM excess privileges and other policy enforcements such as SCPs in AWS. Fundamental understanding of the services available in the cloud provider is key to tackling various security challenges
- Q: What if our team cleans up some unowned services, and it causes critical damage e.g. site-wide crash?
- A: Then we have actually done Traveloka a great service in identifying a critical vulnerability in a service that has no owners or maintainers. In such cases a critical part of infrastructure would be unearthed and a dedicated team would be identified to take over the service. But more generally, when retiring unowned resources a rollback plan should always be in place and, if that is not possible, a more gradual de-escalation of that asset or service should be pursued
- Obtaining executive sponsorship (at Doan or Ray level) is highly recommended to ensure teams understand and accept their responsibilities over projects, and also to minimize potential political fallout in case of service disruptions
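Illustration for the notes above on using conventions to identify project owners: a minimal Python sketch, assuming ownership is recorded as a project label. The label key `owner-group` and the function name are hypothetical; nothing in this meeting decided on a specific convention. The input is assumed to be the parsed output of `gcloud projects list --format=json`, where each entry carries a `projectId` and an optional `labels` map.

```python
# Sketch only: flag GCP projects that carry no ownership label.
# ASSUMPTION: ownership is recorded under a label key such as "owner-group".
# That key is hypothetical, not an agreed convention.
import json

OWNER_LABEL = "owner-group"  # hypothetical label key


def projects_missing_owner(projects, label_key=OWNER_LABEL):
    """Return sorted project IDs whose labels lack a non-empty owner label."""
    missing = []
    for project in projects:
        # "labels" may be absent entirely when a project has no labels.
        labels = project.get("labels") or {}
        if not labels.get(label_key):
            missing.append(project["projectId"])
    return sorted(missing)


if __name__ == "__main__":
    # Made-up sample shaped like `gcloud projects list --format=json` output.
    sample = json.loads("""
    [
      {"projectId": "example-accom-prod", "labels": {"owner-group": "accom-eng"}},
      {"projectId": "example-sandbox-1"},
      {"projectId": "example-legacy", "labels": {"env": "prod"}}
    ]
    """)
    for project_id in projects_missing_owner(sample):
        print(f"no owner label: {project_id}")
```

A report like this only surfaces candidates for follow-up; it does not establish who actually manages a project, which is the gap noted above for projects managed through the iac repo.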
Next Steps
- Proceed with new Google Group structure and terraform-managed permissioning on Groups, and clean up all user: iam-bindings across iac repo
- Set and define standards, policies and governance that all teams must follow to use GCP. Sergei created this doc https://docs.google.com/document/d/1VnEAX4lRBFTjhJAkRgRzvxG3ZBQWMeqVqlhTfvLezoU/edit which we can use as a launching point to codify a "contract" that all teams that manage GCP projects should follow
- Obtain executive sponsorship for GCP governance initiative
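For the first next step above (cleaning up user: bindings), a rough Python sketch of how individual-user IAM members could be located in Terraform source. It is a plain text scan, not an HCL parser, and assumes members appear as quoted string literals such as "user:alice@example.com". The function name and the sample resource are illustrative and not taken from the iac repo.

```python
# Sketch only: find "user:" IAM members in Terraform text so they can be
# replaced with group-based bindings.
# ASSUMPTION: members are written as quoted literals; values built through
# variables or interpolation will not be detected by this scan.
import re

USER_MEMBER = re.compile(r'"(user:[^"\s]+)"')


def find_user_members(terraform_text):
    """Return (line_number, member) pairs for each quoted user: member."""
    hits = []
    for line_no, line in enumerate(terraform_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip whole-line comments; inline comments are not handled.
        if stripped.startswith("#") or stripped.startswith("//"):
            continue
        for match in USER_MEMBER.finditer(line):
            hits.append((line_no, match.group(1)))
    return hits


if __name__ == "__main__":
    # Illustrative snippet, not from the actual repo.
    sample = '''
resource "google_project_iam_member" "viewer" {
  project = "example-project"
  role    = "roles/viewer"
  member  = "user:alice@example.com"
}
resource "google_project_iam_member" "editor" {
  project = "example-project"
  role    = "roles/editor"
  member  = "group:data-eng@example.com"
}
'''
    for line_no, member in find_user_members(sample):
        print(f"line {line_no}: {member}")
```

To cover a whole repo, the same function could be applied to each `.tf` file, for example via `pathlib.Path(root).rglob("*.tf")`, with the hits reviewed by hand before any binding is removed.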