
Vector Space Proximity Based Document Retrieval For Document Embeddings Built By Transformers


Khloponin, Pavel (2022) Vector Space Proximity Based Document Retrieval For Document Embeddings Built By Transformers. Masters thesis, Concordia University.

Khloponin_MCompSc_F2022.pdf - Accepted Version
Available under License Spectrum Terms of Access.
5MB

Abstract

Internet publications stay atop local and international events, generating hundreds, sometimes
thousands, of news articles per day and making it difficult for readers to navigate this stream
of information without assistance. Competition for the reader’s attention has never been greater.
One strategy to keep readers’ attention on a specific article and help them better understand its
content is news recommendation, which automatically provides readers with references to relevant
complementary articles. However, to be effective, news recommendation needs to select from a
large collection of candidate articles only a handful of articles that are relevant yet provide diverse
information.
In this thesis, we propose and experiment with three methods for news recommendation and
evaluate them in the context of the NIST News Track. Our first approach is based on the classic
BM25 information retrieval approach and assumes that relevant articles will share common key-
words with the current article. Our second approach is based on novel document embedding repre-
sentations and uses various proximity measures to retrieve the closest documents. For this approach,
we experimented with a substantial number of models, proximity measures, and hyperparameters,
yielding a total of 47,332 distinct models. Finally, our third approach combines the BM25 and the
embedding models to increase the diversity of the results.
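The retrieval step of the second approach, ranking candidate articles by vector-space proximity to the current article's embedding, can be sketched as follows. This is an illustrative sketch only, using cosine similarity as the proximity measure (the thesis experiments with many measures); the function name `top_k_by_proximity` and its parameters are hypothetical, not taken from the thesis code.

```python
import numpy as np

def top_k_by_proximity(query_vec, doc_vecs, k=5):
    """Rank candidate documents by cosine similarity to the query embedding
    and return the indices and scores of the k closest documents."""
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                # cosine similarity per candidate document
    order = np.argsort(-sims)   # most similar first
    return order[:k], sims[order[:k]]
```

Other proximity measures (e.g. Euclidean or Manhattan distance) would slot into the same ranking loop, which is what makes sweeping over measures and models tractable.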
The results on the 2020 TREC News Track show that the performance of the BM25 model
(nDCG@5 of 0.5924) greatly exceeds the TREC median performance (nDCG@5 of 0.5250) and
achieves the highest score at the shared task. The performance of the embedding model alone
(nDCG@5 of 0.4541) is lower than the TREC median and BM25. The performance of the combined
model (nDCG@5 of 0.5873) is rather close to that of the BM25 model; however, an analysis of the
results shows that the recommended articles are different from those proposed by BM25, hence may
constitute a promising approach to reach diversity without much loss in relevance.
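For reference, nDCG@5, the metric quoted above, discounts each result's relevance gain by the log of its rank position and normalizes by the score of the ideal ordering. A minimal sketch of the standard formula (helper names are hypothetical, not from the thesis):

```python
import math

def dcg(gains):
    # Rank 1 is discounted by log2(2), rank 2 by log2(3), and so on.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(gains, k=5):
    """nDCG@k: DCG of the first k ranked gains, divided by the DCG
    of the ideal (descending) ordering of all gains cut at k."""
    actual = dcg(gains[:k])
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

A perfectly ordered ranking scores 1.0; swapping a relevant result below a less relevant one lowers the score, which is why nDCG@5 rewards placing the best background articles in the top five positions.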

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Khloponin, Pavel
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: June 2022
Thesis Supervisor(s): Kosseim, Leila
Keywords: Background linking, Document embedding, Proximity measures
ID Code: 990826
Deposited By: Pavel Khloponin
Deposited On: 27 Oct 2022 14:38
Last Modified: 27 Oct 2022 14:38

