A. USECASE DIAGRAM
https://drive.google.com/file/d/1GNasPn3yNugJj1OvoyWOGT1zbig9LeDO/view?usp=sharing
B. DATA SCHEMA
lookup
url VARCHAR
category VARCHAR
percentage FLOAT
execution_id INT
last_update_time TIMESTAMP
hashed_url
hash_method -> gin, md5, checksum
metadata
execution_id INT
script_name VARCHAR
execution_time TIMESTAMP
commit_id VARCHAR
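A sketch of the DDL implied by the schema above, as psycopg-friendly Python. The types for hashed_url and hash_method are not stated in the notes; VARCHAR is assumed for both.

```python
# Hypothetical DDL for section B. hashed_url / hash_method types are assumed
# (the notes leave them untyped); adjust before creating the real tables.
LOOKUP_DDL = """
CREATE TABLE IF NOT EXISTS lookup (
    url              VARCHAR,
    category         VARCHAR,
    percentage       FLOAT,
    execution_id     INT,
    last_update_time TIMESTAMP,
    hashed_url       VARCHAR,   -- type assumed, not in the notes
    hash_method      VARCHAR    -- e.g. 'gin', 'md5', 'checksum'
);
"""

METADATA_DDL = """
CREATE TABLE IF NOT EXISTS metadata (
    execution_id   INT,
    script_name    VARCHAR,
    execution_time TIMESTAMP,
    commit_id      VARCHAR
);
"""

def create_tables(conn):
    """Run both DDL statements on an open psycopg connection."""
    with conn.cursor() as cur:
        cur.execute(LOOKUP_DDL)
        cur.execute(METADATA_DDL)
    conn.commit()
```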
C. IMPLEMENTATION
Use psycopg
1. index?
query -> experiment!
a. bulk
SELECT url, category, percentage
FROM table
WHERE
url IN ({urls})
AND hashed_url IN ({hashed_urls})
b. for url in urls:
SELECT url, category, percentage
FROM table
WHERE
url = {url}
AND hashed_url = {hashed_url}
likely slower than a. (one round trip per url)
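Both variants above can be sketched with psycopg placeholders instead of string interpolation (safer against injection). This assumes the table is named lookup per section B; the notes just say FROM table.

```python
# Option (a): one round trip for the whole batch.
# = ANY(%s) lets psycopg adapt a Python list to a Postgres array.
BULK_QUERY = """
SELECT url, category, percentage
FROM lookup
WHERE url = ANY(%s)
  AND hashed_url = ANY(%s)
"""

def fetch_bulk(conn, urls, hashed_urls):
    with conn.cursor() as cur:
        cur.execute(BULK_QUERY, (list(urls), list(hashed_urls)))
        return cur.fetchall()

def fetch_one_by_one(conn, pairs):
    """Option (b): one query per (url, hashed_url) pair; likely slower
    due to per-query round-trip overhead."""
    rows = []
    with conn.cursor() as cur:
        for url, hashed_url in pairs:
            cur.execute(
                "SELECT url, category, percentage FROM lookup "
                "WHERE url = %s AND hashed_url = %s",
                (url, hashed_url),
            )
            rows.extend(cur.fetchall())
    return rows
```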
2. category = "ERROR"
needed? yes
don't download
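The ERROR-category rule in point 2 could be a small filter over the lookup rows: anything already categorized ERROR is a known failure, so skip re-downloading it. The row shape (url, category, percentage) follows the SELECT above; the function name is hypothetical.

```python
ERROR_CATEGORY = "ERROR"

def urls_to_download(rows):
    """Drop urls whose cached category is ERROR (don't re-download them).

    rows: iterable of (url, category, percentage) tuples, as returned by
    the lookup query.
    """
    return [url for url, category, _ in rows if category != ERROR_CATEGORY]
```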
3. stg? (+ prod?)
see A. USECASE DIAGRAM first
we prefer to split by env; the backfill case is rare, so we're afraid a full setup in every env would be a waste of resources
to handle the service differing between envs:
add parameter -> use_lookup (default=False)
if True & env=prod -> error
more concerns? let's discuss them in the comments
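The use_lookup guard described in point 3 is a one-liner; a sketch (function and signature are hypothetical):

```python
def resolve_use_lookup(use_lookup: bool = False, env: str = "stg") -> bool:
    """Guard for the use_lookup parameter.

    Lookup is only deployed in stg, so requesting it in prod is an
    explicit error rather than a silent no-op.
    """
    if use_lookup and env == "prod":
        raise ValueError("use_lookup=True is not supported in env='prod'")
    return use_lookup
```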
D. RESOURCE
https://www.endpoint.com/blog/2015/12/31/postgres-checksum-performance-impact
https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/
https://ankane.org/large-text-indexes
E. TASK
o add lookup capability in stg [Devina - 15/6]
add cloudsql sidecar container in the service
initial implementation: use the for-loop query (not bulk)
add parameter -> use_lookup (default=False)
if True & env=prod -> error
o create tables [Mirza]
o insert data to DB [Mirza]
based on B. DATA SCHEMA
create a notebook for that
o experiment query [Mirza]
which is faster:
- bulk/for?
- index?
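For the "experiment query" task, a crude timing helper is enough to compare bulk vs for-loop (and indexed vs unindexed): run each variant a few times and keep the best wall-clock time, which reduces noise from unrelated load. The helper below is a sketch; wire the actual psycopg calls in as the callables.

```python
import time

def time_query(fn, repeats=5):
    """Run fn() `repeats` times and return the best duration in seconds.

    fn is any zero-argument callable, e.g.
    lambda: fetch_bulk(conn, urls, hashed_urls).
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```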