NER Weekly Sync 20200211
- Weighted gazette
- Currently achieves the best performance when using P(t | w)
- Aldrian gave some feedback about how to combine the probabilities of a word that appears in several contexts.
- Because we currently use unigram, bigram, and trigram contexts, we cannot simply sum the probabilities, as they represent different features
- Two possible ways of combining the probabilities:
- Backoff: if count(c_trigram, w) is high, rely on p(t | c_trigram, w); otherwise, back off to p(t | c_bigram, w), and so on down to the unigram context
- Interpolation: lambda_1 * p(t | c_trigram, w) + lambda_2 * p(t | c_bigram, w) + (1 - lambda_1 - lambda_2) * p(t | c_unigram, w)
- The lambdas are parameters that can be tuned empirically or estimated from the data itself (will need to ask Aldrian further if we want to use the data to determine the lambdas)
- For now, we will implement the backoff approach, as it is simpler to implement.
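The two combination strategies above can be sketched as follows. This is an illustrative sketch only: the toy count/probability tables, the count threshold `K`, and the weights `L1`/`L2` are assumptions, not values from our pipeline or the gazette DB.

```python
# Sketch of the two strategies for combining context probabilities p(t | c, w).
# All data and parameters below are toy/illustrative, not from the actual system.

K = 5                # assumed count threshold for trusting a context
L1, L2 = 0.6, 0.3    # assumed interpolation weights; the third weight is 1 - L1 - L2

# Toy tables: (context_order, word) -> count, and (tag, context_order, word) -> probability
counts = {("tri", "apple"): 2, ("bi", "apple"): 10, ("uni", "apple"): 50}
probs = {
    ("PRODUCT", "tri", "apple"): 0.9,
    ("PRODUCT", "bi", "apple"): 0.7,
    ("PRODUCT", "uni", "apple"): 0.4,
}

def backoff_prob(t, w):
    """Backoff: use the highest-order context whose count is high enough."""
    for c in ("tri", "bi"):
        if counts.get((c, w), 0) >= K:
            return probs[(t, c, w)]
    return probs[(t, "uni", w)]  # final fallback: unigram context

def interpolated_prob(t, w):
    """Interpolation: fixed-weight mix of all three context orders."""
    return (L1 * probs[(t, "tri", w)]
            + L2 * probs[(t, "bi", w)]
            + (1 - L1 - L2) * probs[(t, "uni", w)])

print(backoff_prob("PRODUCT", "apple"))       # trigram count (2) < K, bigram count (10) >= K, so returns 0.7
print(interpolated_prob("PRODUCT", "apple"))  # 0.6*0.9 + 0.3*0.7 + 0.1*0.4 ≈ 0.79
```

Note the backoff version makes a hard switch between context orders, while interpolation always blends all three; that is why the lambdas need tuning while backoff only needs a count threshold.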
- Cost estimation
- Need to check the CPU usage for Postgres in Cloud SQL (run the current script in Compute Engine)
- Clarification about how frequently we should update the gazette DB (maintenance pipeline)
- The gazette helps most for rarely accessed entities or new products -> better if we can have a near-real-time or frequent batch process to update the gazette
- The cost estimation for maintenance will be elaborated later.
- More details on the cost estimation can be seen here.