MoM 200519

A. USECASE DIAGRAM
https://meet.google.com/linkredirect?authuser=0&dest=https%3A%2F%2Fdrive.google.com%2Ffile%2Fd%2F1GNasPn3yNugJj1OvoyWOGT1zbig9LeDO%2Fview%3Fusp%3Dsharing


B. DATA SCHEMA
lookup
url         VARCHAR
category      VARCHAR
percentage     FLOAT
execution_id    INT
last_update_time  TIMESTAMP
hashed_url
hash_method -> gin, md5, checksum

metadata
execution_id   INT
script_name    VARCHAR
execution_time  TIMESTAMP
commit_id     VARCHAR


C. IMPLEMETATION
Use psycopg
1. index?
query -> experiment!
a. bulk

SELECT url, category, percentage
FROM table
WHERE
  url IN ({urls})
  AND hashed_url IN ({hashed_urls})

b. for url in urls:

SELECT url, category, percentage
FROM table
WHERE
  url IN ({url})
  AND hashed_url IN ({hashed_url})

maybe slower than a.

2. category = "ERROR"
needed? yes
don't download

3. stg? (+ prod?)
see A. USECASE DIAGRAM first
we prefer to split by env, because the backfill case is rare, we afraid it's a waste of resource
to handle not similar service in different env:
add parameter -> use_lookup (default=False)
  if True & env=prod -> error
more concern? let's discuss it in the comment


D. RESOURCE
https://www.endpoint.com/blog/2015/12/31/postgres-checksum-performance-impact
https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/
https://ankane.org/large-text-indexes


E. TASK
o add lookup capability in stg [Devina - 15/6]
 add cloudsql sidecar container in the service
 initial implementation, use for (not bulk)
 add parameter -> use_lookup (default=False)
  if True & env=prod -> error

o create tables [Mirza]

o insert data to DB [Mirza]
 based on B. DATA SCHEMA
 create a notebook for that

o experiment query [Mirza]
 which is faster:
 - bulk/for?
 - index?