NER Weekly Sync 20200211
- Weighted gazette
- Currently achieves the best performance when using P(t | w)
- Aldrian gave some feedback about how to combine the probabilities of a word that appears in several contexts.
- Because we currently use unigram, bigram, and trigram contexts, we cannot simply sum the probabilities, as they represent different features
- Two possible ways of combining the probabilities:
- Backoff: if count(c_trigram, w) is high, rely on p(t | c_trigram, w); otherwise, back off to p(t | c_bigram, w), and so on down to the unigram context
- Interpolation: lambda_1 * p(t | c_trigram, w) + lambda_2 * p(t | c_bigram, w) + (1 - lambda_1 - lambda_2) * p(t | c_unigram, w)
- The lambdas are parameters that can be tuned empirically or estimated from the data itself (will need to ask Aldrian further if we want to use the data to determine the lambdas)
- For now, we will implement the backoff approach, as it is simpler to implement.
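The two combination strategies above can be sketched as follows. This is an illustrative sketch only: the toy count/probability tables, the count threshold `K`, and the weights `L1`/`L2` are assumptions, not values from our pipeline or the gazette DB.

```python
# Sketch of the two strategies for combining context probabilities p(t | c, w).
# All data and parameters below are toy/illustrative, not from the actual system.

K = 5                # assumed count threshold for trusting a context
L1, L2 = 0.6, 0.3    # assumed interpolation weights; the third weight is 1 - L1 - L2

# Toy tables: (context_order, word) -> count, and (tag, context_order, word) -> probability
counts = {("tri", "apple"): 2, ("bi", "apple"): 10, ("uni", "apple"): 50}
probs = {
    ("PRODUCT", "tri", "apple"): 0.9,
    ("PRODUCT", "bi", "apple"): 0.7,
    ("PRODUCT", "uni", "apple"): 0.4,
}

def backoff_prob(t, w):
    """Backoff: use the highest-order context whose count is high enough."""
    for c in ("tri", "bi"):
        if counts.get((c, w), 0) >= K:
            return probs[(t, c, w)]
    return probs[(t, "uni", w)]  # final fallback: unigram context

def interpolated_prob(t, w):
    """Interpolation: fixed-weight mix of all three context orders."""
    return (L1 * probs[(t, "tri", w)]
            + L2 * probs[(t, "bi", w)]
            + (1 - L1 - L2) * probs[(t, "uni", w)])

print(backoff_prob("PRODUCT", "apple"))       # trigram count (2) < K, bigram count (10) >= K, so returns 0.7
print(interpolated_prob("PRODUCT", "apple"))  # 0.6*0.9 + 0.3*0.7 + 0.1*0.4 ≈ 0.79
```

Note the backoff version makes a hard switch between context orders, while interpolation always blends all three; that is why the lambdas need tuning while backoff only needs a count threshold.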
- Cost estimation
- Need to check the CPU usage for Postgres in Cloud SQL (run the current script in Compute Engine)
- Clarification about how frequently we should update the gazette DB (maintenance pipeline)
- The gazette helps most for rarely accessed entities or new products -> better if we can have a near-real-time or frequent batch process to update the gazette
- The cost estimation for maintenance will be elaborated later.
- More details on the cost estimation can be seen here.