
add Action Dropout on minerva WN18RR #12

Open
David-Lee-1990 opened this issue Aug 19, 2019 · 14 comments
David-Lee-1990 commented Aug 19, 2019

Hi, I tried the action dropout trick on the original MINERVA code on WN18RR. However, Hits@10 decreased from 0.47 to 0.37 when the action dropout rate changed from 1.0 to 0.9. Are there any other auxiliary tricks needed for action dropout?
The action dropout code is below, where self.params['flat_epsilon'] = float(np.finfo(float).eps):

pre_distribution = tf.nn.softmax(scores)
if mode == "train":
    # keep actions where dropout_mask is 1; squash dropped actions to epsilon
    pre_distribution = pre_distribution * dropout_mask + self.params['flat_epsilon'] * (1 - dropout_mask)
# zero out padded (unavailable) actions
dummy_scores_1 = tf.zeros_like(prelim_scores)
pre_distribution = tf.where(mask, dummy_scores_1, pre_distribution)

dist = tf.distributions.Categorical(probs=pre_distribution)
action = tf.to_int32(dist.sample())

And the dropout mask is given as follows:
rans = np.random.random(size=[self.batch_size * self.num_rollouts, self.max_num_actions])
dropout_mask = np.greater(rans, 1.0 - self.score_keepRate)
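As a sanity aid, here is a minimal NumPy sketch (standalone, with hypothetical values) of what this mask does: with score_keepRate = p, each action's probability survives with probability p, and dropped entries collapse to flat_epsilon.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_rate = 0.9
flat_epsilon = float(np.finfo(float).eps)

# hypothetical scores: [batch * num_rollouts, max_num_actions]
scores = rng.normal(size=(4, 5))
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax

rans = rng.random(size=scores.shape)
# 1.0 with probability keep_rate, else 0.0 (cast so 1 - mask is valid)
dropout_mask = np.greater(rans, 1.0 - keep_rate).astype(float)

# kept entries are untouched; dropped entries become ~machine epsilon
perturbed = probs * dropout_mask + flat_epsilon * (1 - dropout_mask)
```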

The other mask (mask) in the code above zeroes out padded, unavailable actions.

I also compute the final softmax loss from the original (un-dropped) scores:
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=scores, labels=label_action)
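For reference, tf.nn.sparse_softmax_cross_entropy_with_logits over the original scores is just the negative log of the original softmax probability of the sampled action, so the gradient ignores the dropout perturbation entirely. A small NumPy check with hypothetical numbers:

```python
import numpy as np

scores = np.array([[2.0, 1.0, 0.1]])   # original logits for one decision
label_action = 1                       # the action that was actually sampled

# softmax over the original (un-dropped) scores
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# sparse softmax cross-entropy == -log p(label_action) under the original dist
loss = -np.log(probs[0, label_action])
```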

When I change self.score_keepRate from 1.0 to 0.9, training 100 batches with batch size 128, the Hits@k values on the dev set are as follows:

Score for iteration 100 (keep rate 1.0)
Hits@1: 0.3771
Hits@3: 0.4370
Hits@5: 0.4591
Hits@10: 0.4796
Hits@20: 0.4852
MRR: 0.4113

Score for iteration 100 (keep rate 0.9)
Hits@1: 0.3250
Hits@3: 0.3464
Hits@5: 0.3616
Hits@10: 0.3711
Hits@20: 0.3744
MRR: 0.3395

For 1000 batches of training, the MRR on the dev set varies as follows:

[plot: dev MRR over training, keep rate 1.0 vs 0.9]

The Hits@1 on training batches varies as follows:

[plot: training Hits@1 over training, keep rate 1.0 vs 0.9]


todpole3 commented Aug 19, 2019

You mean changing the action dropout rate from 0.0 to 0.1? 0.9 is very aggressive dropout, and 1.0 implies dropping everything and randomly sampling an edge.

If so, a 0.1 dropout rate shouldn't make such a huge difference. Would you mind posting the action dropout code you added to the original MINERVA code? And how many iterations did you train to observe this difference? It would be great if you could plot the training curve before and after adding action dropout for comparison.

David-Lee-1990 (Author)

@todpole3 I re-edited my issue to include more detailed information about the training results.

todpole3 (Collaborator)

@David-Lee-1990 In MINERVA code, is pre_distribution used in gradient computation?

On our side, action dropout is only used to encourage diverse sampling; the policy gradient is still computed from the original probability vector.
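A minimal NumPy sketch of this separation (hypothetical shapes and rates): the perturbed distribution is used only to draw the action, while the log-probability that enters the REINFORCE objective comes from the original policy.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = rng.normal(size=8)        # logits for one step, 8 candidate actions
probs = softmax(scores)            # original policy distribution

# action dropout perturbs only the *sampling* distribution
eps = float(np.finfo(float).eps)
keep = (rng.random(8) > 0.1).astype(float)   # ~0.1 dropout rate
sample_probs = probs * keep + eps * (1.0 - keep)
sample_probs /= sample_probs.sum()           # renormalize before sampling

action = rng.choice(8, p=sample_probs)       # diverse exploration

# ...but the policy gradient uses the ORIGINAL probability of that action
log_prob_for_gradient = np.log(probs[action])
```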

David-Lee-1990 (Author)

In MINERVA there is no dropout, and I tried to add this improvement on top of it. Following your idea, I use dropout to encourage diverse sampling while the policy gradient is still computed using the original distribution. I tested this with two versions: one relation-only and the other not. Both show similar results to those I stated in the issue.

todpole3 (Collaborator)

@David-Lee-1990 My question is, after adding "action dropout", did you use the updated probability vector pre_distribution to compute the policy gradient?

David-Lee-1990 (Author)

No, I use the original one.


todpole3 commented Aug 20, 2019

@David-Lee-1990 I cannot spot anything wrong with the code snippet you posted; thanks for sharing. It might have something to do with the integration with the rest of the MINERVA code.

Technically you only disturbed the sampling probability by a small factor (and your policy gradient computation still follows the traditional formula), so the result shouldn't change this significantly no matter what.

Would you mind running a sanity-check experiment with the dropout rate set to 0.01 and seeing how the result turns out? The change should be very small. Then maybe try 0.02 and 0.05 and see if the results change gradually?


David-Lee-1990 commented Aug 20, 2019

Hi, I ran sanity-check experiments with the keep rate in [1.0, 0.99, 0.98, 0.97, 0.95, 0.93, 0.90]. The Hits@1 results on training batches are as follows:

[plots: training Hits@1, keep rate 1.0 vs each of 0.99, 0.98, 0.97, 0.95, 0.93, 0.90]

todpole3 (Collaborator)

@David-Lee-1990 Very interesting. I want to look deeper into this issue.

The most noticeable difference is that the dev result you reported without action dropout is close to what we get with 0.1 action dropout, and significantly higher than what we get without action dropout.

Besides action dropout rate, did you use the same set of hyperparameters as we did in the configuration files?
If not, would you mind sharing your set of hyperparameters? I want to see if I can reproduce the same results on our code repo.

And one more question: did you observe a similar trend on other datasets using the MINERVA code + action dropout?


todpole3 commented Aug 20, 2019

I tested this with two versions: one relation-only and the other not. Both show similar results to those I stated in the issue.

@David-Lee-1990 Are the plots shown above generated with relation-only or not?


David-Lee-1990 commented Aug 21, 2019

The most noticeable difference is that the dev result you reported without action dropout is close to what we get with 0.1 action dropout, and significantly higher than what we get without action dropout.

@todpole3 About the dev result, I need to clarify that I used the "sum" method, which differs from the "max" method you used, when calculating Hits@k and MRR. The "sum" method ranks a predicted entity by summing up the probabilities of all paths that end at that entity. The following code is from MINERVA, where lse computes the log-sum-exp:

[code screenshot from MINERVA]

I also tested the "max" method on WN18RR; the MRR on the dev set is as follows.

[plot: dev MRR with the "max" method]

For comparison, here are the "sum" and "max" method results together:

[plot: dev MRR, "sum" vs "max"]
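To make the difference concrete, here is a small NumPy sketch (hypothetical beam output) of the two ranking methods: "sum" aggregates the probability mass of all paths reaching an entity via log-sum-exp, while "max" keeps only the best single path.

```python
import numpy as np

# hypothetical beam output: log-probs of 4 paths and the entity each ends at
path_log_probs = np.log(np.array([0.4, 0.3, 0.2, 0.1]))
end_entities = np.array([7, 7, 3, 3])

def lse(x):  # numerically stable log-sum-exp
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

entity_scores = {}
for e in np.unique(end_entities):
    lp = path_log_probs[end_entities == e]
    entity_scores[int(e)] = {
        "sum": lse(lp),   # total probability mass: log(0.4 + 0.3) for entity 7
        "max": lp.max(),  # best single path: log(0.4) for entity 7
    }
```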

David-Lee-1990 (Author)

I tested this with two versions: one relation-only and the other not. Both show similar results to those I stated in the issue.

@David-Lee-1990 Are the plots shown above generated with relation-only or not?

Relation-only.

David-Lee-1990 (Author)

Besides action dropout rate, did you use the same set of hyperparameters as we did in the configuration files?
If not, would you mind sharing your set of hyperparameters? I want to see if I can reproduce the same results on our code repo.

Here are my hyperparameters, written in your notation:

group_examples_by_query="False"
use_action_space_bucketing="False"
bandwidth=200
entity_dim=100
relation_dim=100
history_dim=100
history_num_layers=1
train_num_rollouts=20
dev_num_rollouts=40
num_epochs=1000 # following MINERVA, each batch randomly samples batch_size training examples
train_batch_size=128
dev_batch_size=128
learning_rate=0.001
grad_norm=5
emb_dropout_rate=0
ff_dropout_rate=0
action_dropout_rate=1.0
beta=0.05
relation_only="True"
beam_size=100


David-Lee-1990 commented Aug 21, 2019

And one more question: did you observe a similar trend on other datasets using the MINERVA code + action dropout?

@todpole3 Following your advice, I tested action dropout on NELL-995 today; its performance is as follows.

[plot: NELL-995 results with and without action dropout]

[plot: NELL-995 MRR with and without action dropout]
