Awesome PLM for Scientific Text (SciLM)

Resources for the paper: A Survey of Pre-trained Language Models for Processing Scientific Text

The Evolutionary tree of SciLMs

[Figure: the evolutionary tree of SciLMs]

Table of Contents

Related survey papers

Existing SciLMs

Biomedical Domain

| No. | Year | Name | Base-model | Objective | #Parameters | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2019/01 | BioBERT | BERT | MLM, NSP | 110M | GitHub |
| 2 | 2019/02 | BERT-MIMIC | BERT | MLM, NSP | 110M, 340M | N/A |
| 3 | 2019/04 | BioELMo | ELMo | Bi-LM | 93.6M | GitHub |
| 4 | 2019/04 | Clinical BERT (Emily) | BERT | MLM, NSP | 110M | GitHub |
| 5 | 2019/04 | ClinicalBERT (Kexin) | BERT | MLM, NSP | 110M | GitHub |
| 6 | 2019/06 | BlueBERT | BERT | MLM, NSP | 110M, 340M | GitHub |
| 7 | 2019/06 | G-BERT | GNN + BERT | Self-Prediction, Dual-Prediction | 3M | GitHub |
| 8 | 2019/07 | BEHRT | BERT | MLM, NSP | N/A | GitHub |
| 9 | 2019/08 | BioFLAIR | FLAIR | Bi-LM | N/A | GitHub |
| 10 | 2019/09 | EhrBERT | BERT | MLM, NSP | 110M | GitHub |
| 11 | 2019/12 | Clinical XLNet | XLNet | Generalized Autoregressive Pretraining | 110M | GitHub |
| 12 | 2020/04 | GreenBioBERT | BERT | CBOW Word2Vec, Word Vector Space Alignment | 110M | GitHub |
| 13 | 2020/05 | BERT-XML | BERT | MLM, NSP | N/A | N/A |
| 14 | 2020/05 | Bio-ELECTRA | ELECTRA | Replaced Token Prediction | 14M | GitHub |
| 15 | 2020/05 | Med-BERT | BERT | MLM, Prolonged LOS Prediction | 110M | GitHub |
| 16 | 2020/05 | ouBioBERT | BERT | MLM, NSP | 110M | GitHub |
| 17 | 2020/07 | PubMedBERT | BERT | MLM, NSP, Whole-Word Masking | 110M | HuggingFace |
| 18 | 2020/08 | MCBERT | BERT | MLM, NSP | 110M, 340M | GitHub |
| 19 | 2020/09 | BioALBERT | ALBERT | MLM, SOP | 12M, 18M | GitHub |
| 20 | 2020/09 | BRLTM | BERT | MLM | N/A | GitHub |
| 21 | 2020/10 | BioMegatron | Megatron | MLM, NSP | 345M, 800M, 1.2B | GitHub |
| 22 | 2020/10 | CharacterBERT | BERT + Character-CNN | MLM, NSP | 105M | GitHub |
| 23 | 2020/10 | ClinicalTransformer | BERT - ALBERT - RoBERTa - ELECTRA | MLM, NSP - MLM, SOP - MLM - Replaced Token Prediction | 110M - 12M - 125M - 110M | GitHub |
| 24 | 2020/10 | SapBERT | BERT | Multi-Similarity Loss | 110M | GitHub |
| 25 | 2020/10 | UmlsBERT | BERT | MLM | 110M | GitHub |
| 26 | 2020/11 | bert-for-radiology | BERT | MLM, NSP | 110M | GitHub |
| 27 | 2020/11 | Bio-LM | RoBERTa | MLM | 125M, 355M | GitHub |
| 28 | 2020/11 | CODER | PubMedBERT - mBERT | Contrastive Learning | 110M - 110M | GitHub |
| 29 | 2020/11 | exBERT | BERT | MLM, NSP | N/A | GitHub |
| 30 | 2020/12 | BioMedBERT | BERT | MLM, NSP | 340M | GitHub |
| 31 | 2020/12 | LBERT | BERT | MLM, NSP | 110M | GitHub |
| 32 | 2021/04 | CovidBERT | BioBERT | MLM, NSP | 110M | N/A |
| 33 | 2021/04 | ELECTRAMed | ELECTRA | Replaced Token Prediction | N/A | GitHub |
| 34 | 2021/04 | KeBioLM | PubMedBERT | MLM, Entity Detection, Entity Linking | 110M | GitHub |
| 35 | 2021/04 | SINA-BERT | BERT | MLM | 110M | N/A |
| 36 | 2021/05 | ProteinBERT | BERT | Corrupted Token, Annotation Prediction | 16M | GitHub |
| 37 | 2021/05 | SciFive | T5 | Span Corruption Prediction | 220M, 770M | GitHub |
| 38 | 2021/06 | BioELECTRA | ELECTRA | Replaced Token Prediction | 110M | GitHub |
| 39 | 2021/06 | EntityBERT | BERT | Entity-centric MLM | 110M | N/A |
| 40 | 2021/07 | MedGPT | GPT-2 + GLU + RotaryEmbed | LM | N/A | N/A |
| 41 | 2021/08 | SMedBERT | SMedBERT | Masked Neighbor Modeling, Masked Mention Modeling, SOP, MLM | N/A | GitHub |
| 42 | 2021/09 | Bio-cli | RoBERTa | MLM, Subword Masking or Whole Word Masking | 125M | GitHub |
| 43 | 2021/11 | UTH-BERT | BERT | MLM, NSP | 110M | GitHub |
| 44 | 2021/12 | ChestXRayBERT | BERT | MLM, NSP | 110M | N/A |
| 45 | 2021/12 | MedRoBERTa.nl | RoBERTa | MLM | 123M | GitHub |
| 46 | 2021/12 | PubMedELECTRA | ELECTRA | Replaced Token Prediction | 110M, 335M | HuggingFace |
| 47 | 2022/01 | Clinical-BigBird | BigBird | MLM | 166M | GitHub |
| 48 | 2022/01 | Clinical-Longformer | Longformer | MLM | 149M | GitHub |
| 49 | 2022/03 | BioLinkBERT | BERT | MLM, Document Relation Prediction | 110M, 340M | GitHub |
| 50 | 2022/04 | BioBART | BART | Text Infilling, Sentence Permutation | 140M, 400M | GitHub |
| 51 | 2022/05 | bsc-bio-ehr-es | RoBERTa | MLM | 125M | GitHub |
| 52 | 2022/05 | PathologyBERT | BERT | MLM, NSP | 110M | HuggingFace |
| 53 | 2022/06 | RadBERT | RoBERTa | MLM | 110M | GitHub |
| 54 | 2022/06 | ViHealthBERT | BERT | MLM, NSP, Capitalized Prediction | 110M | GitHub |
| 55 | 2022/07 | Clinical Flair | Flair | Character-level Bi-LM | N/A | GitHub |
| 56 | 2022/08 | KM-BERT | BERT | MLM, NSP | 99M | GitHub |
| 57 | 2022/09 | BioGPT | GPT | Autoregressive Language Model | 347M, 1.5B | GitHub |
| 58 | 2022/10 | Bioberturk | BERT | MLM, NSP | N/A | GitHub |
| 59 | 2022/10 | DRAGON | GreaseLM | MLM, KG Link Prediction | 360M | GitHub |
| 60 | 2022/10 | UCSF-BERT | BERT | MLM, NSP | 135M | N/A |
| 61 | 2022/10 | ViPubmedT5 | ViT5 | Spans-masking learning | 220M | GitHub |
| 62 | 2022/12 | ALIBERT | BERT | MLM | 110M | N/A |
| 63 | 2022/12 | BioMedLM | GPT2 | Autoregressive Language Model | 2.7B | GitHub |
| 64 | 2022/12 | BioReader | T5 & RETRO | MLM | 229.5M | GitHub |
| 65 | 2022/12 | clinicalT5 | T5 | Span-mask Denoising Objective | 220M, 770M | N/A |
| 66 | 2022/12 | Gatortron | BERT | MLM | 8.9B | GitHub |
| 67 | 2022/12 | Med-PaLM | Flan-PaLM | Instruction Prompt Tuning | 540B | Official Site |
| 68 | 2023/01 | clinical-T5 | T5 | Fill-in-the-blank-style denoising objective | 220M, 770M | PhysioNet |
| 69 | 2023/01 | CPT-BigBird | BigBird | MLM | 166M | N/A |
| 70 | 2023/01 | CPT-Longformer | Longformer | MLM | 149M | N/A |
| 71 | 2023/02 | Bioformer | Bioformer | MLM, NSP | 43M | GitHub |
| 72 | 2023/02 | Lightweight | DistilBERT | MLM, Knowledge Distillation | 65M, 25M, 18M, 15M | GitHub |
| 73 | 2023/03 | RAMM | PubMedBERT | MLM, Contrastive Learning, Image-Text Matching | N/A | GitHub |
| 74 | 2023/04 | DrBERT | RoBERTa | MLM | 110M | GitHub |
| 75 | 2023/04 | MOTOR | BLIP | MLM, Contrastive Learning, Image-Text Matching | N/A | GitHub |
| 76 | 2023/05 | BiomedGPT | BART backbone + BERT-encoder + GPT-decoder | MLM | 33M, 93M, 182M | GitHub |
| 77 | 2023/05 | TurkRadBERT | BERT | MLM, NSP | 110M | N/A |
| 78 | 2023/06 | CamemBERT-bio | BERT | Whole Word MLM | 111M | HuggingFace |
| 79 | 2023/06 | ClinicalGPT | T5 | Supervised Fine Tuning, Rank-based Training | N/A | N/A |
| 80 | 2023/06 | EriBERTa | RoBERTa | MLM | 125M | N/A |
| 81 | 2023/06 | PharmBERT | BERT | MLM | 110M | GitHub |
| 82 | 2023/07 | BioNART | BERT | Non-AutoRegressive Model | 110M | GitHub |
| 83 | 2023/07 | BIOptimus | BERT | MLM | 110M | GitHub |
| 84 | 2023/07 | KEBLM | BERT | MLM, Contrastive Learning, Ranking Objective | N/A | N/A |
| 85 | 2023/09 | CPLLM | Llama2 | Autoregressive Language Model, Supervised Fine Tuning | 13B | GitHub |
| 86 | 2023/11 | MedCPT | BERT | Contrastive Learning, Ranking Objective | 110M | GitHub |

Chemical Domain

| No. | Year | Name | Base-model | Objective | #Parameters | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2020/03 | NukeBERT | BERT | MLM, NSP | 110M | GitHub |
| 2 | 2020/10 | ChemBERTa | RoBERTa | MLM | 125M | GitHub |
| 3 | 2021/05 | NukeLM | SciBERT, RoBERTa | MLM | 125M, 355M, 110M | GitHub |
| 4 | 2021/06 | ChemBERT | RoBERTa | MLM | 110M | GitHub |
| 5 | 2021/09 | MatSciBERT | BERT | MLM | 110M | GitHub |
| 6 | 2021/10 | MatBERT | BERT | MLM | 110M | GitHub |
| 7 | 2022/05 | BatteryBERT | BERT, SciBERT | MLM | 110M | GitHub |
| 8 | 2022/05 | ChemGPT | GPT | Autoregressive Language Model | 1B | GitHub |
| 9 | 2022/08 | MaterialsBERT (Shetty) | PubMedBERT | MLM, NSP, Whole-Word Masking | 110M | GitHub |
| 10 | 2022/08 | ProcessBERT | BERT | MLM, NSP | 110M | N/A |
| 11 | 2022/09 | ChemBERTa-2 | RoBERTa | MLM, Multi-task Regression | 125M | GitHub |
| 12 | 2022/09 | MaterialBERT (Yoshitake) | BERT | MLM, NSP | 110M | MDR |
| 13 | 2023/01 | MolGen | BART | Seq2Seq MLM, Autoregressive Language Modeling | 460M, 7B | GitHub |
| 14 | 2023/08 | GIT-Mol | GIT-Former | Xmodal-Text Matching, Xmodal-Text Contrastive Learning | 700M | N/A |
| 15 | 2023/10 | MolCA | Galactica | Molecule Captioning, Molecule-Text Contrastive Learning | 1.4B | GitHub |

Multi-domain

| No. | Year | Name | Base-model | Objective | #Parameters | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2019/03 | SciBERT (CS + Bio) | BERT | MLM, NSP | 110M | GitHub |
| 2 | 2019/11 | S2ORC-SciBERT | BERT | MLM, NSP | 110M | GitHub |
| 3 | 2020/04 | SPECTER | BERT | Triple-loss | 110M | GitHub |
| 4 | 2021/03 | OAG-BERT | BERT | MLM | 110M | GitHub |
| 5 | 2022/05 | ScholarBERT | BERT | MLM | 770M | HuggingFace |
| 6 | 2022/06 | SciDEBERTa | DeBERTa | MLM | N/A | GitHub |
| 7 | 2022/09 | CSL-T5 | T5 | Fill-in-the-blank-style denoising objective | 220M | GitHub |
| 8 | 2022/10 | AcademicRoBERTa | RoBERTa | MLM | 125M | GitHub |
| 9 | 2022/11 | Galactica | GPT | Autoregressive Language Model | 125M, 1.3B, 6.7B, 30B, 120B | GitHub |
| 10 | 2022/11 | VarMAE | RoBERTa | MLM | 110M | N/A |
| 11 | 2022/12 | SciBART | BART | MLM | 124M, 386M | GitHub |
| 12 | 2023/05 | Patton | GNN + BERT | Network-contextualized MLM, Masked Node Prediction | N/A | GitHub |

Other Scientific Domains

Sorted by Domain-name

| No. | Year | Name | Base-model | Objective | #Parameters | Code | Domain |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2022/04 | SecureBERT | RoBERTa | MLM | 125M | GitHub | Cybersecurity |
| 2 | 2022/12 | CySecBERT | BERT | MLM, NSP | 110M | N/A | Cybersecurity |
| 3 | 2021/05 | MathBERT (Peng) | BERT | MLM, Masked Substructure Prediction, Context Correspondence Prediction | 110M | N/A | Math |
| 4 | 2021/06 | MathBERT (Shen) | RoBERTa | MLM | 110M | GitHub | Math |
| 5 | 2021/10 | ClimateBert | DistilRoBERTa | MLM | 66M | GitHub | Climate |
| 6 | 2020/02 | SciGPT2 | GPT2 | LM | 124M | GitHub | CS |
| 7 | 2023/06 | K2 | LLaMA | Cosine Loss | 7B | GitHub | Geoscience |
| 8 | 2023/03 | ManuBERT | BERT | MLM | 110M, 126M | HuggingFace | Manufacturing |
| 9 | 2023/01 | ProtST | BERT | Masked Protein Modeling, Contrastive Learning, Multi-modal Masked Prediction | N/A | GitHub | Protein |
| 10 | 2023/01 | SciEdBERT | BERT | MLM | 110M | N/A | Science Education |
| 11 | 2022/06 | SsciBERT | BERT | MLM, NSP | 110M | GitHub | Social Science |