\chapter{Introduction}
\label{chapter:introduction}
The main goal of this work is to design and implement a document-level text sentiment classification system with performance comparable to the state of the art on a practical, real-world domain of movie reviews written in Polish. For this purpose, novel strategies for regularising, pre-training and fine-tuning recurrent neural networks, known as \emph{ULMFiT}, are used, combined with inductive transfer learning and subword-based tokenization methods.
\section{Contents}
This thesis is divided into eight chapters, providing a theoretical background on sentiment classification and related machine learning problems, as well as details on the implementation of the proposed sentiment classification system.
\textbf{Chapter \ref{chapter:introduction}} introduces the main goal of this work, the methods used to realise that goal, and the key measure of the proposed system's performance.
\textbf{Chapter \ref{chapter:sentiment}} defines the problem of document-level sentiment classification, discussing previous work done in the field for the Polish language.
\textbf{Chapter \ref{chapter:mlbackground}} offers a theoretical background on core machine learning and deep learning concepts related to the implemented system.
\textbf{Chapter \ref{chapter:rnnnlp}} builds on the provided theoretical background by exploring key concepts behind recurrent neural networks-based natural language processing, including the most notable techniques used in this field.
\textbf{Chapter \ref{chapter:projectandimplementation}} presents the reasoning behind the design of the sentiment classification system, as well as details of its implementation.
\textbf{Chapter \ref{chapter:usermanual}} includes a user manual for the proposed sentiment classification system, as well as the system's platform requirements and dependencies.
\textbf{Chapter \ref{chapter:experiments}} discusses the results of experiments performed on the implemented system, validating the impact of hyperparameters on the overall accuracy of the underlying machine learning model.
\textbf{Chapter \ref{chapter:finalremarks}} summarises the accomplishments of this work, recapitulating the key insights and performance indicators of the implemented sentiment classification system.
\cleardoublepage
\chapter{Document-level sentiment classification}
\label{chapter:sentiment}
The world-wide-web revolution of the early 2000s has brought new ways for people to share their opinions on a variety of subjects, at a scale never witnessed before.
With the ability to \emph{mine} immense amounts of opinionated reviews, articles, blog entries or social media posts from an extensive set of sources, it becomes crucial to automatically process these texts and determine their underlying sentiment. This problem is known as \emph{document-level sentiment classification}.
This chapter presents a background on the fundamental concepts behind \emph{sentiment analysis} and \emph{document classification}, providing the reader with a perspective on the document-level sentiment classification problem for documents written in the Polish language.
\section{Sentiment analysis}
Sentiment analysis is the computational study of opinions and underlying sentiments expressed in text \cite{handbook_nlp}. The goal of sentiment analysis is to derive information from text about the entity discussed in the opinion, its assessed features (or \emph{aspects}), and the sentiment towards individual aspects of an opinion holder given in an opinion. Formally stated:
\begin{definition}
\label{def:opinion}
An opinion is a quintuple
\begin{equation}
(e, a, s, h, t),
\end{equation}
where $e$ is the target entity of an opinion, $a$ is the target aspect of the entity $e$ on which the opinion has been given, $s$ is the sentiment of the opinion on the aspect $a$ of entity $e$, $h$ is the opinion holder and $t$ is the opinion posting time \cite{liu}.
\end{definition}
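The quintuple from definition \ref{def:opinion} can be sketched as a simple data structure. The field values below are invented purely for illustration:

```python
from collections import namedtuple

# An opinion as the quintuple (e, a, s, h, t) from the definition above.
Opinion = namedtuple("Opinion", ["entity", "aspect", "sentiment", "holder", "time"])

# A hypothetical example: an opinion holder praising the battery of a phone.
op = Opinion(entity="phone", aspect="battery", sentiment="positive",
             holder="user123", time="2019-05-01")
print(op.aspect, op.sentiment)
```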
A sentiment is the underlying feeling, attitude, evaluation or emotion associated with an opinion. Consumer research classifies sentiment into two distinct types:
\begin{enumerate}
\item \emph{Rational sentiments}, which come from rational reasoning, tangible beliefs, and utilitarian attitudes,
\item \emph{Emotional sentiments}, based on non-tangible and emotional responses to entities that are deeply connected with a person's psychological state of mind \cite{chaudhuri:2006}.
\end{enumerate}
Rational sentiments do not express any emotions, and an opinion containing a rational sentiment is also referred to as a \emph{rational opinion}. For example, the sentence \textit{"This car is worth the price"} is a rational opinion. On the other hand, opinions expressing emotional sentiment, such as \textit{"I love my iPhone"} or \textit{"I am bewildered about their service personnel"}, are called \emph{emotional opinions} \cite{liu}.
Sentiments are also classified based on their positive, neutral or negative \emph{orientation} and their \emph{intensity}. A neutral orientation means that a given statement does not contain any sentiment towards the discussed entity (e.g. \textit{"This is my car"}) \cite{liu}.
Each of the sentiment types and orientations can have different strengths or intensities based on the chosen words or phrases (called \emph{sentiment expressions}) and on the \emph{intensifiers} or \emph{diminishers} used in an opinion. For instance, using the word \textit{good} implies a lesser intensity of a positive statement than the word \textit{excellent}, and placing an intensifier such as \textit{really} before the word \textit{terrible} increases the negativity of the expression, while a diminisher such as \textit{slightly} weakens it \cite{liu}. Sentiment expressions, intensifiers and diminishers apply mainly to opinions with a positive or negative sentiment orientation.
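How sentiment expressions combine with intensifiers and diminishers into an overall intensity can be sketched with a toy scoring function. The lexicons and multiplier values below are invented for illustration; real systems curate or learn such resources:

```python
# Toy lexicons with invented scores; positive values mean positive sentiment.
EXPRESSIONS = {"good": 1.0, "excellent": 2.0, "terrible": -2.0}
INTENSIFIERS = {"really": 1.5, "very": 1.3}
DIMINISHERS = {"slightly": 0.5, "somewhat": 0.7}

def intensity(tokens):
    """Sum expression scores, scaling each one by a directly preceding modifier."""
    score, modifier = 0.0, 1.0
    for tok in tokens:
        if tok in INTENSIFIERS:
            modifier = INTENSIFIERS[tok]
        elif tok in DIMINISHERS:
            modifier = DIMINISHERS[tok]
        elif tok in EXPRESSIONS:
            score += modifier * EXPRESSIONS[tok]
            modifier = 1.0  # a modifier affects only the next expression
    return score

print(intensity("really terrible".split()))  # more negative than "terrible" alone
```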
Unlike factual information, the majority of opinions are subjective. This is because every opinion holder comes from a different background, has lived through different experiences, and is concerned with different interests, ideologies and beliefs \cite{liu}. For instance, two customers might have bought the same smartphone model, with the first customer receiving a defective unit and the other a properly working phone. Upon discovering the fault, the first customer might hold a negative sentiment towards the phone (or the phone brand), or still a positive one, depending on the nature of the fault, the customer's technical knowledge and the overall satisfaction with the unit. This ambiguity poses one of the biggest challenges for the field of sentiment analysis.
\section{Natural language processing}
Natural language processing (or \emph{NLP}) is a subfield of \emph{computer science} and \emph{linguistics} that uses computer-based methods to analyse language in natural text and speech \cite{nlp_def}. The goal of NLP is to develop techniques allowing for near-human-level accuracy in tasks such as sentiment analysis, \emph{natural language understanding}, \emph{speech recognition} or \emph{natural language generation}.
Natural Language Processing has relied on the developments in the field of \emph{artificial intelligence} (\emph{AI}), evolving from \emph{symbolic} approaches (based on manually defined sets of rules) to more modern, statistics-based methods for problem-solving \cite{handbook_nlp}.
Many recent breakthroughs in NLP brought by \emph{deep learning} have made it possible for automatic language translators and voice assistants, for instance, to become ubiquitous and indispensable parts of our everyday life.
\subsection{Document definition}
In NLP, a \emph{document} is an entity containing a distinct text. For instance, a single document might contain the entire contents of a book, or just an individual article.
In this work, the term \emph{document} is used interchangeably with \emph{text document} to emphasise the nature of a document's contents.
\subsection{Text classification}
\emph{Text classification} is a fundamental natural language processing task of assigning a label $l$ to an input text document $t$ from a defined set of labels $C$, known as \emph{classes}, based on predefined criteria. It has broad applications in the fields of sentiment analysis, topic labelling and spam detection, among others \cite{wiki:doc_classification}.
For example, given a set of labels $C = \{\mathrm{e-mail}, \mathrm{spam}\}$ and input e-mail subjects $t_1 =$ "WIN LOTTERY, REWARD!" and $t_2 = $ "Report from Research, May 2019", the goal of classifying e-mail subject texts is to produce pairs $(t_1, l_1)$ and $(t_2, l_2)$, where $l_1 = $ "spam" and $l_2 = $ "e-mail".
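The e-mail example above can be sketched as follows. The keyword rule is deliberately naive and purely illustrative; practical classifiers are trained from labelled examples rather than hand-written rules:

```python
# A hand-written keyword rule, invented for illustration only.
SPAM_KEYWORDS = {"win", "lottery", "reward"}

def classify(subject):
    """Map an e-mail subject text t to a label l from C = {e-mail, spam}."""
    tokens = {tok.strip(",!").lower() for tok in subject.split()}
    return "spam" if tokens & SPAM_KEYWORDS else "e-mail"

subjects = ["WIN LOTTERY, REWARD!", "Report from Research, May 2019"]
pairs = [(t, classify(t)) for t in subjects]  # the (t, l) pairs from the text
print(pairs)
```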
\section{Document-level sentiment classification}
\label{document-level-sentiment-classification}
Document-level sentiment classification, also referred to as \emph{document sentiment analysis} or \emph{document sentiment classification}, is the task of classifying an \emph{opinionated document} as expressing a positive, neutral or negative opinion (or sentiment) \cite{handbook_nlp}.
It is assumed in document sentiment analysis that the opinionated text document expresses opinions on a single entity $e$ and contains opinions from a single opinion holder $h$ \cite{liu}. Thus, in order for a document to be properly classified using document-level sentiment classification, it should not contain opinions expressed on different entities, or by multiple authors on a single entity.
Considering an example opinionated document containing a review of a book from Amazon.com:
\begin{remark}{}
\label{ex:1}
"Thrilling read. If you're having trouble sleeping this is perfect for reading just before bed. If you're wide awake and want to learn how to write and publish a scientific paper, it works for that too. I'm a grad student and this book was recommended to me by a prof from my undergrad university and it was worth every penny. It helped me get better at analysing the quality of writing by others \cite{amazon:review}."
\end{remark}
an accurate document sentiment classification determines the example document to be positive in terms of the opinion holder's $h$ sentiment towards a book entity $e$.
Document-level sentiment classification treats sentiment classification as a traditional text classification problem with sentiment orientations as classes \cite{liu}. As such, supervised learning-based solutions are used as the basis for modern document sentiment analysis systems.
\section{From symbolic to neural networks-based approaches to NLP and document sentiment analysis}
Classical approaches to NLP have their roots in \emph{symbolic artificial intelligence}, based on the knowledge of the field of linguistics and involving meticulous analysis of texts using many predefined sets of rules \cite{handbook_nlp}.
Modern natural language processing is, contrary to symbolic approaches, heavily based on the latest developments in the area of \emph{deep learning}. Combining classical methods, such as word \emph{tokenization}, with \emph{deep neural networks}' ability to learn accurate representations of data has allowed state-of-the-art results to be achieved in a variety of NLP tasks, including text classification, sentiment analysis and machine translation \cite{ulmfit,languagemodels:vstranslation,goodfellow,pan:sentimentanalysis,zhou:sentimentcnn}.
After an introduction of core machine learning and deep learning concepts in chapter \ref{chapter:mlbackground}, \emph{recurrent neural networks-based} approaches to natural language processing and, specifically, the problem of document sentiment classification are discussed in chapter \ref{chapter:rnnnlp} of this work.
\section{Polish document sentiment classification}
\label{sentiment:polish}
The Polish language belongs to the family of highly inflected languages, in which the form or ending of a word depends on its use in a sentence, reflecting grammatical tense, case, voice, aspect, person, number, gender, mood, and others \cite{wiki:inflection}. Combining a complex grammar with an extensive set of exceptions, Polish has been perceived as difficult for automated natural language processing \cite{polish:difficult,polishdiff,polishdiff_grammar}.
This section introduces a neural network-based approach to the problem of the Polish document-level sentiment classification, as well as provides an outlook on the previous work done on this subject for Polish opinions.
\subsection{This work}
As the aim of this work is to produce a document-level text sentiment classification system with performance comparable to the state of the art, the resulting system is based upon \emph{deep recurrent neural networks}.
The ULMFiT method has been chosen as the foundation of the system, used to produce the final \emph{target classifier} model. Based on promising results from previous works \cite{czaplakardas:ulmfit,poleval2019}, a \emph{sentencepiece}-based tokenization method is used on two separate, real-world corpora, \emph{plwiki} and \emph{Filmweb+}, specifically prepared for use in this work.
Compared to previous work (see \ref{previouswork}), our document sentiment classification system is trained on a relatively small labelled dataset, Filmweb+, consisting of just 3085 positive and 3085 negative example documents, with an average length of 514 words per document.
Moreover, the results of this work can be reproduced using commercially available hardware, with the process of pre-training and fine-tuning the language models and training the target classifier taking less than 10 hours of computing time on an \lstinline{Nvidia GeForce 2080Ti} graphics card.
The motivation behind the design and implementation choices is provided in chapter \ref{chapter:projectandimplementation} of this work, and a theoretical background in chapters \ref{chapter:mlbackground} and \ref{chapter:rnnnlp}.
\subsection{Previous work}
\label{previouswork}
Previous work on Polish text document sentiment classification has been based on classical NLP approaches \cite{buczynski:shallowclassicalnlp,wojcik:comparisonclassicnlp}, \emph{Bayesian} \cite{wiki:bayesian,chlasta:sentimenttwitter,polish:sentimentbayesian} and \emph{support-vector machines}-based \cite{wiki:svms,polish:sentimentsvms} classification methods, as well as deep neural networks \cite{czaplakardas:ulmfit,korzeniowski:poleval,wawer:polishsentimentshorttexts,poleval2019}, with the latter producing state-of-the-art results for the Polish language at the time of writing this thesis.
However, the vast majority of the work on document-level sentiment classification has been performed on \emph{corpora} consisting of relatively short texts, such as posts from Twitter \cite{poleval2019,korzeniowski:poleval,chlasta:sentimenttwitter}, short user rating comments from Filmweb \cite{wawer:polishsentimentshorttexts}, and Opineo store and product reviews \cite{polish:sentimentsvms}. In fact, no recent work on Polish document sentiment analysis has involved the use of a dataset consisting of long (over 1000 characters on average), real-world documents.
Some of the neural networks-based approaches to document-level sentiment classification have leveraged the \emph{pre-training} and \emph{fine-tuning} of \emph{language models}, as well as \emph{inductive transfer learning} \cite{czaplakardas:ulmfit,poleval2019,korzeniowski:poleval,wawer:polishsentimentshorttexts}, using methods known as \emph{Universal Language Model Fine-Tuning} (\emph{ULMFiT}) \cite{ulmfit} and \emph{Bidirectional Encoder Representations from Transformers} (\emph{BERT}) \cite{bert}. Both of these methods have produced satisfactory results.
\chapter{Background on machine learning}
\label{chapter:mlbackground}
The following chapter introduces the core problems of machine learning and deep learning involved in recurrent neural networks-based natural language processing systems used to perform document-level sentiment classification.
It defines core concepts, such as \emph{transfer learning}, \emph{regularisation}, \emph{backpropagation} and \emph{recurrent neural networks} referred to in the latter parts of this work.
\section{Machine learning}
\emph{Machine learning} (or \emph{ML}) is a subfield of \emph{artificial intelligence} that gives computers the ability to learn without being explicitly programmed \cite{samuel_def_ml}.
\begin{figure}[]
\centering
\includegraphics[scale=0.35]{figures/ai_venn.png}
\caption{A diagram illustrating the relationships between the fields of artificial intelligence, machine learning, representation learning and deep learning \cite{goodfellow}.}
\label{ai:diagram}
\end{figure}
A formal definition of \textit{learning} is provided by Tom Mitchell \cite{mitchell_def_learn}:
\theoremstyle{definition}
\begin{definition}{}
\label{ml:def}
A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.
\end{definition}
To put definition \ref{ml:def} in layman's terms, a computer program learns when it is able to interpret new data (experiences $E$) in such a way that it can solve a given problem (task $T$) with a better outcome (as measured by $P$).
Machine learning-based classification systems, for instance, learn how to accurately label a provided input by being presented with a set of example inputs for each category, called the \emph{training set}. For such systems:
\begin{enumerate}
\item The task $T$ is to classify an input based on its distinctive features,
\item The experience $E$, from which a computer program learns, is a \emph{sample} from a training set,
\item The performance measure $P$ can be defined in numerous ways deemed appropriate; commonly, the ratio of correctly classified inputs, a metric called \textit{accuracy}, is used.
\end{enumerate}
A sentiment classification system is an example of such a system: the task $T$ is to classify an opinionated document based on its contents, and an experience $E$ is an individual, already labelled document. The performance $P$ can be measured by the accuracy, \emph{precision} or \emph{F1 score} metrics, among others.
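These metrics can be computed directly from lists of true and predicted labels; the toy labels below are invented for illustration:

```python
def metrics(y_true, y_pred, positive="pos"):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Invented toy labels: 3 of 5 predictions are correct.
y_true = ["pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos"]
print(metrics(y_true, y_pred))
```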
% TODO: ADD SOMETHING MORE ON METRICS
A core part of the machine learning system is called a machine learning \emph{model}, which is an entity that has obtained the knowledge from all the experiences $E_i \in \{E_1, \dots, E_n\}$ belonging to a training set. The process of presenting a machine learning system with new examples from a training set is called \emph{training}.
Machine learning is a broad field in its own right, with many subfields, the most notable of which include \emph{representation learning} and \emph{deep learning}. The relation between artificial intelligence, machine learning and their subfields is shown in figure \ref{ai:diagram}.
\subsection{Types of machine learning systems}
Machine learning systems are classified based on the human input needed during the training process into:
\begin{enumerate}
\item \emph{Supervised learning} systems, which learn how to interpret new data given some input data with associated, desired outputs (or \emph{labels}),
\item \emph{Semi-supervised learning} systems, which learn from input data for which only some labels are available,
\item \emph{Unsupervised learning} systems, which learn the correlations between given \textit{unlabelled} input data.
\end{enumerate}
% add reinforcement learning
% up up up
\subsection{Inductive and transductive learning}
\emph{Transduction}, in the context of statistical learning, refers to reasoning directly from observed training examples to specific test cases, without deriving a general predictive rule. A classical example of a transductive algorithm is \emph{K-Nearest Neighbours}, which uses the training data directly to make a prediction.
% Read more on transduction in NLP, whether it's used in ulmfit or not
Contrary to transduction, \emph{induction} in machine learning is a process of deducing general rules from observed training cases.
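The transductive nature of K-Nearest Neighbours can be seen in a minimal 1-nearest-neighbour sketch, which predicts by consulting the stored training examples directly instead of first deriving a general rule. The points and labels are invented for illustration:

```python
def nearest_neighbour(train, query):
    """1-NN prediction: return the label of the closest training example.

    train: list of (feature_vector, label) pairs; query: a feature vector.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(train, key=lambda pair: sq_dist(pair[0], query))
    return label

# Two invented training points; the query lies closer to the "positive" one.
train = [((0.0, 0.0), "negative"), ((1.0, 1.0), "positive")]
print(nearest_neighbour(train, (0.9, 0.8)))
```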
\subsection{Representation learning}
The performance of machine learning systems depends heavily on a \emph{representation of data} used \cite{bengio2012representation}.
An example ML system based on a \emph{logistic regression} algorithm \cite{logreg_caesarian}, used to recommend a caesarean delivery, produces its output based on an input vector of features representing, among others, neonatal birth weight or gestational age at birth. Such a system requires its operator to manually \emph{pre-process} input data, not only for the purposes of training but also, in this case, in order to produce a valid classification.
\begin{figure}[]
\centering
\includegraphics[scale=0.3]{figures/representation.png}
\caption{Using a different representation of data may result in a much simpler solution for a target machine learning task. In this example, a change of coordinate system from Cartesian to polar makes it possible to separate the two classes of objects (dots and triangles) with just a single line \cite{goodfellow}.}
\label{representation:cartesianvpolar}
\end{figure}
In general, much of the actual effort in deploying a machine learning algorithm has gone into designing task-oriented pre-processing pipelines and data transformations, resulting in a representation of data that allows for accurate outputs, as illustrated in figure \ref{representation:cartesianvpolar} \cite{bengio2012representation}.
However, many problems cannot be solved through a manual engineering of data representations. Automated face detection, for example, was an unsolvable problem until as recently as the year 2001 \cite{face2001, introductiontoml}, when an integral image representation was first proposed.
This difficulty has led to the emergence of \emph{representation learning}, a set of machine learning techniques which allow a system to automatically discover the representations needed for a given task, replacing the tedious, aforementioned process called \emph{feature engineering} \cite{wiki:representation_learning}.
The current proliferation of object detection, speech recognition and NLP systems \cite{bengio2012representation}, among others, is attributed to the emergence of effective representation learning methods, such as \emph{deep learning}.
\subsection{Transfer learning}
\emph{Transfer learning} refers to a machine learning algorithm's ability to exploit commonalities between given learning tasks \cite{bengio2012representation} through the use of knowledge obtained in one learning task to improve the system's performance in another task.
%A more mathematical transfer learning definition: http://ict.usc.edu/pubs/Inductive%20Transfer%20Learning%20for%20Handling%20Individual%20Differences%20in%20Affective%20Computing.pdf
\begin{figure}[]
\centering
\includegraphics[scale=0.15]{figures/transfer.png}
\caption{Transfer learning allows for the use of knowledge obtained by model $A$ when solving a source task (from a source domain) in order to improve model $B$'s accuracy on a target task (from a target domain) \cite{state_of_nlp_2019}.}
\end{figure}
\subsubsection{Inductive transfer learning}
\label{transferlearning:inductive}
\emph{Inductive transfer learning} is a method of improving generalisation accuracy on a target task's domain with limited labelled data available, through the use of labelled data from a related source domain \cite{inductivetl_def}.
The assumption behind inductive transfer learning is that it might be easy to obtain a large quantity of quality data from a broader domain, which helps the machine learning system "understand" complex relations between certain features of the data; this knowledge is then used to produce a model capable of solving a subproblem for which a smaller amount of data is available. This process is also referred to as \emph{pre-training}.
Inductive transfer learning has helped to improve accuracy on many computer vision tasks, such as image segmentation, classification, or object detection, where it is common to fine-tune models pre-trained on datasets such as ImageNet or MS-COCO \cite{ulmfit}.
In NLP, inductive transfer learning had been thought to be applicable only to semantically similar tasks \cite{mou2016transferable} until 2018, when the \emph{Universal Language Model Fine-tuning} (\emph{ULMFiT}) method \cite{ulmfit}, using \emph{sequential transfer learning}, was proposed; it is perceived today as the idea that has led to the most improvements in the field so far \cite{state_of_nlp_2019}.
% VERY VERY IMPORTANT ARTICLE: https://ruder.io/state-of-transfer-learning-in-nlp/
\subsubsection{Sequential transfer learning}
Sequential transfer learning is a method that has allowed accuracy to be significantly increased when inductive transfer learning is applied in the field of natural language processing.
\begin{figure}[]
\centering
\includegraphics[scale=0.3]{figures/sequential.png}
\caption{In sequential transfer learning a machine learning model is first trained on a large set of unlabelled training data and then used to improve an accuracy on the target task, such as the document classification \cite{state_of_nlp_2019}.}
\end{figure}
It is applied by pre-training data representations on a large, unlabelled text \emph{corpus} (set of documents) and then adapting these representations for a target, supervised task, such as classification, Q\&A or sequence labelling \cite{state_of_nlp_2019}.
\subsection{Evaluating machine learning models}
To evaluate the accuracy of a machine learning model, it is required to define:
\begin{enumerate}
\item A performance metric $P$, which quantifies the accuracy between predicted and true values,
\item A set of observations outside the training data, whose predictions are tested by the metric \cite{howard:mechanicsofmachinelearning}.
\end{enumerate}
In order for a model to be applicable in a real-world scenario, it is required to \emph{generalise} well, meaning that it is able to perform well (according to $P$) on input data not available during the process of training. Evaluation of a model's accuracy and generalisation ability can be performed through the \emph{hold-out} method.
\subsubsection{Hold-out method}
In the hold-out method, a dataset is randomly divided into three sets:
\begin{enumerate}
\item A training set, used to train a given model, which contains 70\% of the available data,
\item A \emph{validation} set, used to measure the impact of \emph{hyperparameter} changes on the performance of the model, which contains 10--15\% of the remaining data,
\item A \emph{test} set, used to estimate the generalisation of a model, which also contains 10--15\% of the remaining data \cite{howard:mechanicsofmachinelearning}.
\end{enumerate}
Using the hold-out method enables us to verify the performance of a model on two separate sets, the validation set and the test set, which reduces the risk of \emph{overfitting} the model to the validation set \cite{handson_geron}.
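The hold-out split described above can be sketched as follows. The 70/15/15 proportions fall within the ranges given in the text, and the seeded shuffle keeps the split reproducible:

```python
import random

def hold_out_split(data, train_frac=0.7, valid_frac=0.15, seed=42):
    """Randomly split a dataset into training, validation and test sets."""
    shuffled = data[:]                       # leave the caller's list intact
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]      # everything that remains
    return train, valid, test

data = list(range(100))
train, valid, test = hold_out_split(data)
print(len(train), len(valid), len(test))  # 70 15 15
```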
\subsubsection{Bias-variance trade-off}
Assuming that a machine learning model can be trained on different training sets and evaluated more than once, we define \emph{bias} as the difference between the expected (or average) predictions of the resulting realisations of a model and the target output. Models with a high bias tend to be overly simplistic, leading to low accuracy on training, validation and test sets.
Similarly, a \emph{variance} of a model is a measurement of how the predictions for a fixed input vary between different realisations of a machine learning model. High variance results in a complex model, which is able to correctly describe relations between data from a training set, but unable to generalise well to new inputs \cite{fortmann-roe:biasvariance}.
A \emph{bias-variance trade-off} is a central problem in a machine learning of finding a balance between capturing the regularities in the training data and a model's ability to generalise well to unseen data \cite{wiki:biasvariance}.
In order to formally define the impact of bias and variance on the predictions of a model, let us denote by $Y$ the variable the model is trying to predict and by $X$ the model's input. We may then assume that the relation between $X$ and $Y$ is:
\begin{equation}
Y = f(X) + \epsilon,
\end{equation}
where the error term $\epsilon$ is normally distributed with a mean of zero, i.e. $\epsilon \thicksim \mathcal{N}(0, \sigma_\epsilon)$.
Since a model $\hat{f}(X)$ of $f(X)$ can be estimated using a machine learning model of choice, the expected squared prediction error at a point $x$ is:
\begin{equation}
Err(x) = \mathbb{E}[(Y - \hat{f}(x))^2],
\end{equation}
which, decomposed, has bias and variance components:
\begin{equation}
Err(x) = \underbrace{(\mathbb{E}[\hat{f}(x)] - f(x))^2}_\text{Squared bias} + \underbrace{\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]}_\text{Variance} + \underbrace{\sigma^2_\epsilon}_\text{$E_{irr}$},
\label{biasvariance:1}
\end{equation}
where $E_{irr}$ is the irreducible error, resulting from noise in the defined relation, which cannot be reduced by any model \cite{hastie:statisticallearning,fortmann-roe:biasvariance}.
In conclusion, finding a good balance between bias and variance improves the accuracy of a machine learning model.
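The decomposition above can also be checked empirically: repeatedly drawing a fresh training set, fitting a model, and recording its prediction at a fixed point allows the squared bias and variance to be estimated by Monte Carlo. The sketch below (all names and settings are illustrative) does this for polynomial regression on noisy samples of $\sin(x)$.

```python
import numpy as np

def bias_variance_estimate(degree, n_trials=500, n_train=30, seed=0):
    """Monte Carlo estimate of the squared bias and variance of polynomial
    regression at a single test point x0, for data y = sin(x) + eps."""
    rng = np.random.default_rng(seed)
    x0, f = 1.0, np.sin
    preds = np.empty(n_trials)
    for t in range(n_trials):
        # Each trial draws a fresh training set -> a fresh model realisation.
        x = rng.uniform(0.0, np.pi, n_train)
        y = f(x) + rng.normal(0.0, 0.3, n_train)
        coef = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coef, x0)
    sq_bias = (preds.mean() - f(x0)) ** 2   # (E[f_hat(x0)] - f(x0))^2
    variance = preds.var()                  # E[(f_hat(x0) - E[f_hat(x0)])^2]
    return sq_bias, variance
```

A degree-0 model (a constant) is too simple and shows high bias, while a high-degree polynomial starts matching the noise and shows higher variance.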
\begin{figure}[]
\centering
\includegraphics[scale=0.4]{figures/biasvariance.png}
\caption{An illustration of the bias-variance trade-off. A reduction of bias results in an increase of the variance of the model, and vice versa \cite{fortmann-roe:biasvariance}.}
\end{figure}
\subsubsection{Overfitting and underfitting}
Overfitting occurs when a machine learning model performs well on the training data, but fails to generalise to new, previously unseen data. An overfitting model shows low \emph{bias} and high \emph{variance}.
An \emph{underfitting} model, on the other hand, is unable to produce accurate results even on the training data (let alone new data), most often due to insufficient complexity, reflected in low variance and high bias \cite{kai:overfitting}.
\subsubsection{Convergence}
In mathematics, a sequence $S_n$ is said to converge to a limit $L$, written $\lim_{n \to \infty} S_n = L$, if for all $\epsilon > 0$ there exists $N$ such that $|S_n - L| < \epsilon$ for all $n > N$. In machine learning, a model is often said to \emph{converge} when the error rate of the algorithm approaches a global (or local) minimum as the training iterations progress \cite{convergence}.
\subsection{Gradient descent}
\label{gradientdescent}
\emph{Gradient descent} is an optimisation algorithm used to minimise a value of a given objective function $f(\theta)$, where $\theta \in \mathbb{R}^d$ are the machine learning model's parameters, by iteratively moving in a direction of the steepest descent of $f(\theta)$ as defined by the negative of the gradient \cite{gradient_descent_cheatsheet}.
\subsubsection{Learning rate}
A parameter $\eta$, called the \emph{learning rate}, is defined for the purposes of the gradient descent algorithm; it determines the size of the step taken towards a local minimum of the objective function $f(\theta)$ at each iteration.
\subsubsection{Optimisation}
Optimisation refers to a task of either minimising or maximising a value of a given function $f(x)$ by altering $x$. Most optimisation problems involve the minimisation of $f(x)$, since the maximisation may be accomplished via a minimisation algorithm by minimising $-f(x)$ \cite{goodfellow}.
\subsubsection{Gradient vector}
\emph{Gradient} of a scalar-valued multi-variable function $f(x_1, x_2, \dots)$, denoted $\nabla f$, packages all its partial derivative information into a vector \cite{gradient_khan}:
\begin{equation}
\nabla f = \begin{bmatrix}
\dfrac{\partial f}{\partial x_1} \\[3ex]
\dfrac{\partial f}{\partial x_2} \\[3ex]
\vdots\\[1ex]
\end{bmatrix}.
\end{equation}
If the gradient of a function $f(x_1, x_2, \dots)$ at a given point $p$ in its domain is not the zero vector, the direction of $\nabla f$ is the direction of the fastest increase of $f$ at $p$, and the magnitude of $\nabla f$ is the rate of increase in that direction \cite{gradient_calculus, wiki:gradient}.
\subsubsection{Batch gradient descent}
The \emph{batch gradient descent} method computes the gradient of the cost function $C$ with respect to the parameters $\theta$ over the entire training set \cite{gradient_descent_ruder}:
\begin{equation}
\theta = \theta - \eta \cdot \nabla_\theta C(\theta).
\end{equation}
However, computing the gradient over the entire training set for a single update is often impractical, due to the extensive computation and memory required for larger datasets.
\subsubsection{Stochastic gradient descent}
The \emph{stochastic gradient descent} method performs an update of $\theta$ for each training example $x_i$ and output $y_i$ \cite{gradient_descent_ruder}:
\begin{equation}
\theta = \theta - \eta \cdot \nabla_\theta C(\theta; x_i, y_i).
\end{equation}
Unlike the batch gradient descent method, stochastic gradient descent computes the gradient on a single member of the training set at a time.
\subsubsection{Mini-batch gradient descent}
\label{gradient:mini-batch}
\emph{Mini-batch gradient descent} builds on the ideas of stochastic gradient descent and batch gradient descent, performing an update for every mini-batch of $n$ training examples:
\begin{equation}
\theta = \theta - \eta \cdot \nabla_\theta C(\theta; \boldsymbol{x^{(i:i+n)}}, \boldsymbol{y^{(i:i+n)}}),
\end{equation}
where $\boldsymbol{x^{(i:i+n)}}, \boldsymbol{y^{(i:i+n)}}$ denote vectors of elements from index $i$ to index $i+n$ taken from vector $\boldsymbol{x}$ of all training examples and vector $\boldsymbol{y}$ of all outputs respectively.
An application of mini-batch gradient descent reduces the variance of the parameter updates, which is introduced by stochastic gradient descent, leading to a more stable convergence. It can also benefit greatly from matrix-based calculation optimisations, making it very efficient \cite{gradient_descent_ruder}.
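The three variants can be contrasted in code. The sketch below (a minimal illustration for linear regression with a mean squared error cost; all names are ours) implements the mini-batch update, which reduces to stochastic gradient descent for a batch size of 1 and to batch gradient descent when the batch covers the whole training set.

```python
import numpy as np

def minibatch_gd(X, y, eta=0.1, n_epochs=200, batch_size=2, seed=0):
    """Mini-batch gradient descent for linear regression with an MSE cost:
    theta is updated once per mini-batch, using the gradient of the cost
    evaluated on that batch only."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        idx = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Gradient of the MSE cost on the mini-batch only.
            grad = 2 / len(batch) * Xb.T @ (Xb @ theta - yb)
            theta -= eta * grad
    return theta
```

Setting `batch_size=1` gives stochastic gradient descent, while `batch_size=len(X)` gives batch gradient descent.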
% on AdaGrad here https://ruder.io/optimizing-gradient-descent/
\section{Neural networks}
\emph{Neural networks}, or artificial neural networks (\emph{ANN}) are machine learning models consisting of interconnected \emph{artificial neurons} gradually processing input data through pre-learned transformations in order to produce an accurate output for a given task.
ANNs have been inspired by the biological neural networks which constitute animal brains, in recognition of their superior ability to perform certain computations, such as pattern recognition or perception, almost instantaneously \cite{haykin_nn}, as opposed to traditional, rule-based computing systems.
Today, neural network models are at the heart of cutting-edge developments in the field of artificial intelligence, exceeding human performance in tasks such as the game of \emph{Go} \cite{mastering_go}.
% See the Haykin's book for a list of ANN's advantages
\subsection{Artificial neurons}
An \emph{artificial neuron} (also referred to as neuron or perceptron) is an information-processing unit fundamental to the operation of a neural network \cite{haykin_nn}.
A neuron $N_k$ is defined by a set of inputs $x_1, x_2, \dots, x_n$, a set of weights $w_1, w_2, \dots, w_n$, where each weight $w_i$ corresponds to an input $x_i$, a bias value $b_k$, an activation function $\phi( \cdot )$ and an output $y_k$.
\begin{figure}[]
\centering
\includegraphics[scale=0.45]{figures/neuron.png}
\caption{A diagram illustrating the input data flow in an artificial neuron \cite{haykin_nn}.}
\end{figure}
The neuron's output $y_k$ is calculated by applying the activation function $\phi$ to a weighted sum of the inputs, to which the bias value $b_k$ is added:
\begin{equation}
\label{neuron:1}
y_k = \phi\left(\sum_{i=1}^n x_iw_i + b_k\right).
\end{equation}
The bias value $b_k$ shifts the value output by the activation function of the neuron $N_k$. In another frequently used notation, $b_k$ is treated as the weight of an additional input $x_0$ with a fixed value of 1, which simplifies equation \ref{neuron:1} to:
\begin{equation}
\label{neuron:2}
y_k = \phi\left(\sum_{i=0}^n x_iw_i\right).
\end{equation}
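Equation \ref{neuron:1} translates directly into code; the sketch below is purely illustrative and assumes a sigmoid activation by default.

```python
import math

def neuron_output(x, w, b, phi=lambda z: 1 / (1 + math.exp(-z))):
    """Output y_k = phi(sum_i x_i * w_i + b_k) of a single artificial
    neuron; phi defaults to the sigmoid activation."""
    activation = sum(xi * wi for xi, wi in zip(x, w)) + b
    return phi(activation)
```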
\subsection{Multi-layer perceptron}
A \emph{multi-layer perceptron} (or \emph{MLP}) is the quintessential neural network model, in which multiple neurons are grouped into \emph{layers} \cite{goodfellow}. It is a \emph{feedforward neural network}, which means that the connections between its neurons do not form a cycle.
\begin{figure}[]
\centering
\includegraphics[scale=0.35]{figures/mlp.png}
\caption{An example of a multi-layer perceptron neural network, consisting of an input layer, two hidden layers and an output layer \cite{haykin_nn}.}
\end{figure}
An MLP consists of an \emph{input layer}, followed by one or more \emph{hidden layers} and a final layer called the \emph{output layer}, which produces the output of the network. The overall structure of a neural network is also referred to as the \emph{architecture} of the network.
The goal of a multi-layer perceptron is to approximate a function $f^\star$. For instance, in classification tasks, where $y = f^\star(x)$ maps an input $x$ to a category $y$, an MLP defines a mapping $y = f(x, \theta)$ and learns $\theta$, the values of the parameters resulting in the best approximation of $f^\star$ \cite{goodfellow}.
\subsection{Hidden layers}
Hidden layers are all the layers in a neural network except the input layer and the output layer of the network.
While a multi-layer perceptron with just one hidden layer containing a large enough set of neurons can in principle model even the most complex mathematical functions \cite{handson_geron}, deep neural networks have a much higher \emph{parameter efficiency} than shallow ones. This means that neural networks with multiple hidden layers are able to model complex functions using exponentially fewer neurons, which results in a network that is faster to train.
Deploying multiple hidden layers also results in a network which learns appropriate data representations gradually: first modelling low-level structures, then combining them into increasingly complex structures to produce the final, high-level representation used to compute the network's output.
Such an approach can be thought of as an application of the \emph{divide-and-conquer} strategy in the context of machine learning. It also yields a notable advantage: through \emph{transfer learning}, the lower layers of a deep neural network can be reused when training a similar network on new input and output data, instead of randomly initialising the weights and biases of these layers \cite{handson_geron}.
\subsection{Activation functions}
\emph{Activation functions} are functions applied to the activation value of an artificial neuron in a neural network in order to produce the neuron's final output. Their application produces a uniform mapping from activation values to the outputs of a neural network's layer.
An activation function should be monotonic, so that its outputs preserve the ordering relationships between pairs of inputs, and differentiable \cite{deep_learning_from_scratch}.
\subsubsection{Sigmoidal activation functions}
\begin{figure}
\centering
\begin{tikzpicture}[declare function={sigma(\x)=1/(1+exp(-\x));
sigmap(\x)=sigma(\x)*(1-sigma(\x));}]
\begin{axis}%
[
grid=major,
xmin=-6,
xmax=6,
axis x line=bottom,
ytick={0,.5,1},
ymax=1,
axis y line=middle,
samples=100,
domain=-6:6,
legend style={at={(1,0.9)}}
]
\addplot[blue,mark=none] (x,{sigma(x)});
\addplot[red,dotted,mark=none] (x,{sigmap(x)});
\legend{$\sigma(x)$,$\sigma'(x)$}
\end{axis}
\end{tikzpicture}
\caption{A plot of a sigmoid activation function $\sigma(x)$ and its derivative $\sigma '(x)$.} \label{sigmoid:chart}
\end{figure}
A \emph{sigmoid function}, also referred to as the sigmoid curve or logistic function (see figure \ref{sigmoid:chart}), is a function defined by the formula:
\begin{equation}
\sigma(x) = \frac{1}{1+e^{-x}}.
\end{equation}
While the sigmoid function was one of the first activation functions used in neural networks, its modern applications are limited due to the \emph{vanishing gradient} problem: for very large or very small inputs, the values of its derivative $\sigma'(x)$ are close to zero, causing equally marginal changes of the weights in the first layers of a model. Computing the sigmoid function also requires significantly more processing power than the \emph{ReLU} function.
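The vanishing of the gradient can be quantified using the fact that the derivative of the sigmoid can be expressed in terms of the function itself:
\begin{equation}
\sigma'(x) = \sigma(x)\left(1 - \sigma(x)\right) \leq \frac{1}{4},
\end{equation}
with the maximum attained at $x = 0$. Since backpropagation multiplies one such factor per layer, the gradient reaching the first layers of a deep sigmoid network shrinks, all else being equal, by a factor of at least four per layer.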
\subsubsection{Rectified Linear Activation Unit}
\emph{Rectified Linear Activation Unit} (or ReLU) is an activation function defined as:
\begin{equation}
f(x) = \max(0, x) = \begin{cases}
x & \text{if } x > 0, \\
0 & \text{otherwise}.
\end{cases}
\end{equation}
It is one of the most commonly used activation functions in modern deep learning models \cite{relu_def}, owing to the fact that its value can be computed very cheaply. For instance, an \emph{ImageNet} image classification neural network has been shown to converge several times faster when ReLU was used instead of the $\tanh$ function \cite{relu_speed}.
Being differentiable everywhere except at zero, ReLU still allows the \emph{backpropagation} algorithm to be used during the process of training a neural network.
\begin{figure}
\centering
\begin{tikzpicture}
\begin{axis}[
grid=major,
axis lines = middle,
xlabel = $x$,
ylabel = {$f(x)$},
xmax = 3,
xmin= -3,
ymax = 3,
ymin = -1,
axis y line=middle,
samples=100,
domain=-3:3,
]
%Below the red parabola is defined
\addplot [
domain=0:3,
samples=100,
color=blue,
]
{x};
\addplot [
domain=-3:0,
samples=100,
color=blue,
]
{0};
\end{axis}
\end{tikzpicture}
\caption{A plot of the Rectified Linear Activation Unit, one of the most commonly used activation functions in modern neural networks.}
\end{figure}
\subsubsection{Softmax}
\emph{Softmax} is a function that takes as an input a vector of $K$ real numbers and normalises it into a probability distribution consisting of $K$ probabilities proportional to the exponential of the input numbers \cite{wiki:softmax}. It is defined as:
\begin{equation}
\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \text{ for } i = 1, \dotsc , K \text{ and } \mathbf z=(z_1,\dotsc,z_K) \in\mathbb{R} ^K,
\end{equation}
where a standard exponential function is applied to each element $z_i$ of the input vector $\mathbf{z}$ and the calculated value is normalised by dividing by the sum of all the exponentials, ensuring that the components of the output vector sum to 1, $\sum_{i=1}^K\sigma(\mathbf{z})_i = 1$, and each lies in the range $(0, 1)$. Moreover, the ordering of the inputs is preserved: larger input components correspond to larger output probabilities.
Softmax is used as an activation function in neural networks to convert the outputs of neurons from a given layer into probability values, corresponding for instance to model's certainty that the input belongs to a given class in a classification problem.
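A direct implementation of the formula can overflow for large inputs, since $e^{z_i}$ grows quickly; subtracting $\max(\mathbf{z})$ from every component before exponentiating leaves the result unchanged (the common factor cancels in the quotient) while keeping every exponent non-positive. A minimal sketch:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting max(z) does not change
    the result, because the common factor e^{-max(z)} cancels in the
    quotient, but it keeps every exponent non-positive."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```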
\subsection{Hyperparameters}
\emph{Hyperparameters}, in the context of neural networks, are all the parameters of a network whose values are set before the learning process begins.
They can be classified as model hyperparameters, which include the number of hidden layers, the number of neurons per hidden layer and the activation functions used in the neurons, or algorithm hyperparameters, which affect the speed and outcome of the learning process, such as the learning rate or the mini-batch size \cite{wiki:hyperparams}.
\subsection{Backpropagation}
The \emph{backpropagation} algorithm is used during the training of a neural network to calculate the changes to neuron weights and biases that need to be applied in order to minimise a given cost function, and thus improve the accuracy of the model.
The method's efficiency, together with its ability to produce internal representations of data on par with manually engineered ones, or even those found in biological neural networks, has allowed neural networks to be applied to a much wider range of problems that were previously off-limits due to time and cost constraints \cite{backpropagation_brilliant}.
\subsubsection{Notation}
We use $w_{j, k}^l$ to denote the weight for the connection from the $k^{th}$ neuron in the $(l-1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer.
For the network biases, we denote by $b_j^l$ the bias of the $j^{th}$ neuron in the $l^{th}$ layer of the network, and for activations, we use $a_j^l$ for the activation of the $j^{th}$ neuron in the $l^{th}$ layer.
The activation $a_j^l$ of the $j^{th}$ neuron in the network is related to the activations in the $(l-1)^{th}$ layer by the equation
\begin{equation}
\label{notation:1}
a_j^l = \sigma(\sum_k w^l_{j, k}a^{l-1}_k + b_j^l),
\end{equation}
where the sum is over all neurons $k$ in the $(l-1)^{th}$ layer.
In order to rewrite the expression \ref{notation:1} in matrix form, the \emph{weight matrix} $\boldsymbol{w^l}$ is defined for each layer $l$, populated by the weights connecting into the $l^{th}$ layer of neurons: the entry in the $j^{th}$ row and $k^{th}$ column is $w_{j, k}^l$. Similarly, for each layer $l$ a bias vector $\boldsymbol{b^l}$ is defined, whose components are the values $b^l_j$.
Having these matrices defined, the equation \ref{notation:1} can be rewritten as
\begin{equation}
\label{notation:2}
a^l = \sigma(w^la^{l-1}+b^l),
\end{equation}
which means that, in order to calculate the activations of the neurons in one layer, all we have to do is apply the weight matrix to the activations from the previous layer, add the bias vector, and, finally, apply the activation function $\sigma$. We also define
\begin{equation}
\label{notation:3}
\boldsymbol{z^l} \equiv w^la^{l-1} + b^l
\end{equation}
where $\boldsymbol{z^l}$ is called the \emph{weighted input} to the neurons in layer $l$ \cite{backpropagation_nielsen}.
\subsubsection{Cost function assumptions}
In order for the backpropagation method to be applied, we need to assume that our cost function $C$ can be expressed as an average $C = \frac{1}{n}\sum_x C_x$ over cost functions $C_x$, where $x$ is an individual training example. This is because backpropagation actually computes the partial derivatives $\frac{\partial C_x}{\partial w}$ and $\frac{\partial C_x}{\partial b}$ for a single example, from which $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ are recovered by averaging over the training examples. We will also assume for now that the training example $x$ is fixed, and refer to the cost $C_x$ as $C$, until otherwise noted.
We also need to make the assumption that the cost function can be expressed as a function of the outputs of the neural network, $C = C(\boldsymbol{a^L})$. For example, the quadratic cost function $C = \frac{1}{2n} \sum_x ||y(x) - \boldsymbol{a^L}(x)||^2$ may, for a single training example $x$, be written as:
\begin{equation}
\label{assumptions:1}
C = \frac{1}{2}||y-\boldsymbol{a^L}||^2 = \frac{1}{2} \sum_j (y_j - a_j^L)^2,
\end{equation}
and thus is a function of the output activations. While the function also depends on the desired output $y$, it's important to remember that both the training example $x$ and $y$ are fixed, and as such it is sound to regard $C$ as a function of $\boldsymbol{a^L}$, where $y$ is merely a parameter helping to define $C$ \cite{backpropagation_nielsen}.
\subsubsection{Hadamard product}
Let's suppose that $\boldsymbol{s}$ and $\boldsymbol{t}$ are two vectors of the same dimension. An \emph{element-wise} product of two vectors, also known as the \emph{Hadamard product}, is denoted as $\boldsymbol{s} \odot \boldsymbol{t}$, where:
\begin{equation}
\label{hadamard:1}
(\boldsymbol{s} \odot \boldsymbol{t})_j = s_jt_j.
\end{equation}
\subsubsection{Backpropagation algorithm}
The backpropagation algorithm provides a way to compute the gradient of the cost function:
\begin{enumerate}
\item \textbf{Input} $x$: set the corresponding activation $\boldsymbol{a^1}$ for the input layer,
\item \textbf{Feedforward}: for each $l = 2, 3, \dots, L$ compute $\boldsymbol{z^l} = \boldsymbol{w^l}\boldsymbol{a^{l-1}} + \boldsymbol{b^l}$ and $\boldsymbol{a^l} = \sigma(\boldsymbol{z^l})$,
\item \textbf{Output error} $\boldsymbol{\delta^L}$: compute the vector $\boldsymbol{\delta^L} = \nabla_a C \odot \sigma'(\boldsymbol{z^L})$,
\item \textbf{Backpropagation of the error}: for each $l = L - 1, L - 2, \dots, 2$ compute $\boldsymbol{\delta^l} = ((\boldsymbol{w^{l+1}})^T \boldsymbol{\delta^{l+1}}) \odot \sigma'(\boldsymbol{z^l})$,
\item \textbf{Output}: the gradient of the cost function, given by $\frac{\partial C}{\partial w^l_{j, k}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b_j^l} = \delta^l_j$.
\end{enumerate}
The error vectors $\boldsymbol{\delta^l}$ are computed starting from the final layer, which is a consequence of the fact that the cost is a function of the outputs of the network. The chain rule is applied repeatedly to evaluate how the cost varies with the weights and biases of earlier layers in the network \cite{backpropagation_nielsen}.
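The five steps above can be sketched for a single training example, assuming sigmoid activations and the quadratic cost of equation \ref{assumptions:1}; the function and variable names are ours, and \texttt{weights}/\texttt{biases} are lists with one entry per layer $l = 2, \dots, L$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Backpropagation for one training example: returns the gradients
    (dC/dw^l, dC/db^l) for every layer, for a sigmoid network with the
    quadratic cost C = 1/2 * ||y - a^L||^2."""
    # Steps 1-2, input and feedforward: store every z^l and a^l.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Step 3, output error: delta^L = (a^L - y) * sigma'(z^L),
    # since grad_a C = a^L - y for the quadratic cost.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_w = [np.outer(delta, activations[-2])]
    grads_b = [delta]
    # Step 4, backpropagate: delta^l = ((w^{l+1})^T delta^{l+1}) * sigma'(z^l),
    # where * is the element-wise (Hadamard) product.
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grads_w.insert(0, np.outer(delta, activations[-l - 1]))
        grads_b.insert(0, delta)
    # Step 5, output: dC/dw^l_{jk} = a^{l-1}_k * delta^l_j, dC/db^l_j = delta^l_j.
    return grads_w, grads_b
```

The returned gradients can be validated against finite differences of the cost, which is a standard sanity check for a backpropagation implementation.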
\subsection{Training neural networks}
% TODO: Verify and add that mini-batch gradient descent is a batch normalization
In order to train a neural network, the backpropagation method is combined with a learning algorithm, such as mini-batch gradient descent, to compute the gradient of the cost function over many training examples.
\subsubsection{Mini-batch gradient descent-based backpropagation algorithm}
Given a mini-batch of $m$ training examples:
\begin{enumerate}
\item For each training example $x$: set the corresponding input activation $\boldsymbol{a^{x, 1}}$ and perform: \begin{itemize}
\item \textbf{Feedforward}: for each $l = 2, 3, \dots, L$ compute $\boldsymbol{z^{x, l}} = \boldsymbol{w^l}\boldsymbol{a^{x,l-1}} + \boldsymbol{b^l}$ and $\boldsymbol{a^{x,l}} = \sigma(\boldsymbol{z^{x,l}})$,
\item \textbf{Output error} $\boldsymbol{\delta^{x, L}}$: compute the vector $\boldsymbol{\delta^{x, L}} = \nabla_a C_x \odot \sigma'(\boldsymbol{z^{x, L}})$,
\item \textbf{Backpropagation of the error}: \\ for each $l = L - 1, L - 2, \dots, 2$ compute $\boldsymbol{\delta^{x, l}} = ((\boldsymbol{w^{l+1}})^T \boldsymbol{\delta^{x, l+1}}) \odot \sigma'(\boldsymbol{z^{x, l}})$
\end{itemize}
\item \textbf{Gradient descent}: for each $l = L, L - 1, \dots, 2$ update
\begin{itemize}
\item weights, according to the rule $w^l \rightarrow w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x, l - 1})^T$
\item biases, according to $b^l \rightarrow b^l - \frac{\eta}{m} \sum_x \delta^{x, l}$.
\end{itemize}
\end{enumerate}
A full implementation of mini-batch gradient descent also requires two outer loops, one generating the mini-batches of $m$ training examples and another iterating over multiple epochs of training; both were omitted for simplicity \cite{backpropagation_nielsen}.
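The gradient descent step at the end of the algorithm can be sketched in isolation (the names below are ours): given the per-example gradients produced by the backpropagation pass, it averages them over the mini-batch and applies the two update rules above.

```python
import numpy as np

def minibatch_update(weights, biases, per_example_grads, eta):
    """One mini-batch gradient descent step.

    per_example_grads is a list of (grads_w, grads_b) pairs, one per
    training example x, as produced by backpropagation.  The update is
    w^l -> w^l - (eta/m) * sum_x dC_x/dw^l, and likewise for b^l."""
    m = len(per_example_grads)
    for l in range(len(weights)):
        weights[l] -= eta / m * sum(gw[l] for gw, _ in per_example_grads)
        biases[l] -= eta / m * sum(gb[l] for _, gb in per_example_grads)
    return weights, biases
```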
\subsection{Regularisation}
In order to reduce overfitting in neural networks, a set of methods known as \emph{regularisation} techniques has been developed to improve the generalisation from the training data to the test data. An example of such a technique is the \emph{L2 regularisation} method \cite{backpropagation_nielsen}.
\subsubsection{L2 regularisation}
\emph{Weight decay}, also known as L2 regularisation, is one of the most commonly used regularisation techniques. It works by adding an extra term, the sum of all squared weights in the neural network, to the cost function $C$:
\begin{equation}
C = C_0 + \frac{\lambda}{2n} \sum_w w^2,
\end{equation}
where $C_0$ is the unmodified cost function, $\lambda > 0$ is a \emph{regularisation parameter}, and $n$ is the size of the training set. A quadratic cost function with L2 regularisation applied is then defined as follows:
\begin{equation}
C = \frac{1}{2n} \sum_x ||y - a^L||^2 + \frac{\lambda}{2n} \sum_w w^2.
\end{equation}
For small values of $\lambda$, minimising the regularised cost is dominated by minimising the original cost function, while for large values of $\lambda$ the network is pushed towards learning small weights \cite{backpropagation_nielsen}.
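The effect of the regularisation term on learning becomes clear after differentiating the regularised cost. Since $\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w$, the gradient descent update rule for the weights becomes
\begin{equation}
w \rightarrow w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w = \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w},
\end{equation}
i.e. each weight is first rescaled by the factor $1 - \frac{\eta \lambda}{n} < 1$ before the usual gradient step is applied, which is why the technique is called weight decay; the update rule for the biases is unchanged.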
\begin{figure}[]
\centering
\includegraphics[scale=0.48]{figures/regularisation.png}
\caption{Regularisation helps trained models to generalise better to new data by limiting the complexity of the model caused by an overfitting to the noise in the training data. An example of underfitting, a good bias-variance trade-off and overfitting is shown here accordingly in plots depicting a degree 1-, 4- and 15- polynomial attempting to match the training samples given \cite{regularisation_scikit}.}
\end{figure}
\subsubsection{Dropout}
Arguably the most popular regularisation technique for neural networks is known as \emph{dropout}. Introduced in 2012, it has proven to improve the accuracy of state-of-the-art models by 1-2\%, decreasing their error rate by almost 40\% \cite{handson_geron}.
Dropout works by assigning to each neuron in the neural network, excluding the output layer, a new hyperparameter: the probability $p$ (most often set to $p = \frac{1}{2}$) of the neuron being omitted during a single training step. After training, neurons are no longer excluded from the computations.
Applying such an omission during training has been shown to make neurons learn data representations which contribute towards the solution of the given task on their own, instead of representations helpful only in the context of several other feature detectors. This has the effect of training multiple networks with different architectures and then averaging their outputs to produce a final output, but is much more computationally efficient \cite{dropout}.
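A minimal sketch of a dropout layer's forward pass; this uses the commonly applied `inverted' formulation, which rescales the surviving activations by $\frac{1}{1-p}$ during training (instead of rescaling the weights after training, as in the original paper), so that no change is needed at inference time.

```python
import numpy as np

def dropout_forward(a, p=0.5, training=True, rng=None):
    """Inverted dropout: during training each activation is zeroed with
    probability p and the survivors are scaled by 1/(1-p), so that the
    expected activation matches inference, where nothing is dropped."""
    if not training:
        return a                      # after training, no neurons are omitted
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(a.shape) >= p   # keep a unit with probability 1 - p
    return a * mask / (1 - p)
```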
\section{Deep learning}
\emph{Deep learning} is a subset of machine learning methods focusing on learning representations of data, often expressed in terms of other, simpler representations, with the aim of solving a given target task \cite{goodfellow}. Its approach to the problem of representation learning has proven effective in solving problems perceived as difficult in the fields of computer vision and natural language processing, among others.
% Some more background on deep learning needs to go there
Most modern deep learning solutions are based upon \emph{deep neural networks}.
\subsection{Deep neural networks}
Deep neural networks are machine learning models which utilise multiple hidden layers in their architecture in order to produce highly accurate representations of data.
\subsection{Recurrent neural networks}
\label{rnns}
\emph{Recurrent Neural Networks} (or \emph{RNNs}) are neural networks which operate on arbitrary-length sequences, as opposed to fixed-size inputs. The input sequences are processed by iterating through their elements while maintaining a \emph{hidden state} containing information about what has already been processed \cite{chollet}.
RNNs are used in natural language processing tasks such as text translation or sentiment analysis, \emph{time series} analysis and content generation \cite{rnn_generation}. Unlike feedforward neural networks, RNNs have connections pointing backwards.
\subsubsection{Memory cells}
\emph{Memory cells} are the basic building blocks of recurrent neural networks. At each time step $t$, a single memory cell outputs its hidden state $h_{(t)}$, which is a function of the input at that time step and the cell's state at the previous time step:
\begin{equation}
h_{(t)} = f(h_{(t-1)}, x_{(t)}).
\end{equation}
The cell's output at time step $t$, denoted $y_{(t)}$, is likewise a function of the previous state and the current input.
\begin{figure}[]
\centering
\includegraphics[scale=0.6]{figures/memory_cells.png}
\caption{A memory cell's hidden state $h_{(t-1)}$ is passed as an input to the cell at time step $t$, allowing the cell to take into account its previous inputs when producing an output $y_{(t)}$. A recurrent cell can be \emph{unrolled}, meaning that it is represented against a time axis, as shown in this example \cite{handson_geron}.}
\end{figure}
\subsubsection{Long Short-Term Memory Cell}
\label{lstm}
\emph{Long Short Term Memory Cell} (or \emph{LSTM}) is a memory cell whose internal state is split into two vectors: $h_{(t)}$, representing a short-term state, and $c_{(t)}$, which represents a long-term state \cite{lstm_cell}.
\begin{figure}[]
\centering
\includegraphics[scale=0.9]{figures/lstm_cell.png}
\caption{An LSTM cell introduces both a short-term and a long-term state, allowing the neural network to take into account information from different time steps $t_i$ depending on their determined importance \cite{handson_geron}.}
\end{figure}
% ADD UNDER THE IMAGE AN EXPLAINATION OF INPUTS, OUTPUTS; maybe use another graphic which will depict the cell better
An RNN learns what information should be stored in, removed from and read from $c_{(t)}$ in an LSTM cell. At each time step $t$, the long-term state from the previous time step, $c_{(t-1)}$:
\begin{enumerate}
\item passes through a \emph{forget gate}, which controls what information should be removed from $c_{(t)}$,
\item has an addition operation performed on it based on the output of an \emph{input gate},
\item is returned as a long term state $c_{(t)}$ of the cell at a current time step.
\end{enumerate}
After the addition operation, the state $c_{(t)}$ is copied and passed through a $\tanh$ function, whose result is then modified by an \emph{output gate}. These operations produce a short-term state $h_{(t)}$, equal to the cell's output for the current time step, denoted $y_{(t)}$.
Also, at each time step $t$ an input vector $x_{(t)}$ is passed along with $h_{(t-1)}$ through four fully-connected layers:
\begin{enumerate}
\item A layer outputting $g_{(t)}$, which is computed based on $x_{(t)}$ and $h_{(t-1)}$'s values,
\item A forget gate controller layer ($f_{(t)}$) controlling which parts of the long-term state to erase,
\item An input gate controller layer ($i_{(t)}$), which determines what parts of $g_{(t)}$'s output are added to $c_{(t)}$,
\item An output gate controller layer ($o_{(t)}$), determining which parts of $c_{(t)}$ should be output at $t$.
\end{enumerate}
The LSTM cell can be thus defined mathematically as:
\begin{equation}
\begin{split}
&i_t = \sigma(W^ix_t + U^ih_{t-1})\\
&f_t = \sigma(W^fx_t + U^fh_{t-1})\\
&o_t = \sigma(W^ox_t + U^oh_{t-1})\\
&g_t = \tanh(W^cx_t + U^ch_{t-1})\\
&c_t = i_t \odot g_t + f_t \odot c_{t-1}\\
&h_t = o_t \odot \tanh(c_t),\\
\end{split}
\end{equation}
where $W^i, W^f, W^o, W^c, U^i, U^f, U^o, U^c$ are weight matrices and $\odot$ denotes element-wise multiplication.
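The six equations above map one-to-one onto a single time step in code; the sketch below (the parameter naming is ours, biases omitted as in the equations) returns the new short-term and long-term states.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps the names 'Wi', 'Ui', 'Wf', 'Uf',
    'Wo', 'Uo', 'Wc', 'Uc' to weight matrices (biases omitted)."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    i = sigmoid(params['Wi'] @ x + params['Ui'] @ h_prev)  # input gate
    f = sigmoid(params['Wf'] @ x + params['Uf'] @ h_prev)  # forget gate
    o = sigmoid(params['Wo'] @ x + params['Uo'] @ h_prev)  # output gate
    g = np.tanh(params['Wc'] @ x + params['Uc'] @ h_prev)  # candidate state
    c = i * g + f * c_prev       # keep part of c_{t-1}, add part of g_t
    h = o * np.tanh(c)           # short-term state, equal to the output y_t
    return h, c
```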
\subsubsection{Backpropagation through time}
A recurrent neural network is trained using a strategy called \emph{backpropagation through time} (or \emph{BPTT}), which is the regular backpropagation algorithm applied to an RNN unfolded through time.
\begin{figure}[]
\centering
\includegraphics[scale=0.9]{figures/bptt.png}
\caption{An illustration of the backpropagation through time strategy applied to an unfolded RNN. Note that the gradient is computed for all outputs used by the cost function (in this instance $Y_{2}$, $Y_{3}$ and $Y_{4}$), not just the final output of a network \cite{handson_geron}.}
\label{bptt:1}
\end{figure}
Similarly to the previously discussed algorithm, a forward pass is performed through the unfolded RNN, represented in figure \ref{bptt:1} by the dashed arrows. Then, the output sequence is evaluated using a cost function:
\begin{equation}
C(Y_{(t_{min})}, Y_{(t_{min} + 1)}, \dots, Y_{(t_{max})}),
\end{equation}
where $t_{min}$ and $t_{max}$ represent the first and final time steps, excluding ignored outputs, and the gradients of $C$ are propagated backwards through the unrolled network (the step marked in figure \ref{bptt:1} with solid arrows). The computed gradients are then used to update the model's parameters $\theta$.
\chapter{Recurrent neural networks-based natural language processing and document sentiment classification}
\label{chapter:rnnnlp}
Recurrent neural networks' ability to process arbitrary-length inputs and interpret information based on a given context makes them an auspicious base for modern natural language processing systems. In fact, most state-of-the-art NLP applications, including document sentiment classification, are based on RNNs \cite{handson_geron}.
However, this has not always been the case. Until 2013, recurrent neural networks had been thought to be difficult to train \cite{ruder:neuralnlp,sutskever:rnntraining}, and most approaches required large datasets and prolonged computation time in order to converge \cite{ulmfit}.
In this chapter we discuss some of the key milestones in the field of neural network-based natural language processing, which have addressed the aforementioned shortcomings and led to a proliferation of cutting-edge NLP solutions. We also explain core concepts behind modern NLP frameworks and tool-kits, such as \emph{tokenization}.
Finally, this chapter provides a perspective on the significance of the introduced developments in the context of the task of document-level sentiment classification and the entire field of natural language processing. In the process, some established methods are introduced, including ULMFiT.
% Maybe write here that Google was able to cut down on their translation systems complexity, reducing the amount of code from 50,000 to 500 lines thanks to neural network-based NLP?
\section{Document representation}
\label{document:representation}
In order for a text document to be processed by a neural network it has to be converted into a numerical representation. Most often such a document is represented by a vector obtained through a process called \emph{text vectorization} \cite{chollet}.
\subsection{Tokenization}
Before a given text document can be vectorized, it first has to be segmented into individual subsequences, such as words, characters, or groups of multiple consecutive words or characters known as \emph{n-grams} \cite{chollet}. This task is referred to as \emph{tokenization}.
Individual subsequences produced by a \emph{tokenizer}, the system performing the tokenization, are called \emph{tokens}. An example of a tokenization process for an English sentence is shown in figure \ref{tokenization:1}. The set of all tokens generated by the tokenizer is called a \emph{token vocabulary}, denoted as $V^*$.
\begin{figure}[]
\centering
\includegraphics[scale=0.5]{figures/tokenization.png}
\caption{An illustration of the tokenization process performed by a \emph{spaCy} tokenizer \cite{spacy:101}. A raw text (\textit{"Let's go to N.Y.!"}) is first split into subsequences on whitespace characters and for each subsequence a check is performed against a tokenizer's predefined set of rules and exceptions. Here, \textit{"Let's} is first split into \textit{"} and \textit{Let's} based on a prefix splitting rule. Then \textit{Let's} is again split into tokens \textit{Let} and \textit{'s} according to a defined exception (in English, \textit{Let's} is short for \textit{Let us}). Similarly, \textit{N.Y.!"} produces tokens \textit{N.Y.}, \textit{!} and \textit{"}. }
\label{tokenization:1}
\end{figure}
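The prefix / suffix / exception scheme from figure \ref{tokenization:1} can be sketched with a toy rule-based tokenizer. The \texttt{PUNCT} and \texttt{EXCEPTIONS} sets below are illustrative stand-ins, not spaCy's actual rule sets; the full stop is deliberately excluded from \texttt{PUNCT} so that abbreviations such as \textit{N.Y.} survive intact.

```python
# Illustrative rule sets (not spaCy's); '.' is omitted to keep "N.Y." whole.
PUNCT = "\"'!?,;:()"
EXCEPTIONS = {"Let's": ["Let", "'s"], "don't": ["do", "n't"]}

def tokenize(text):
    tokens = []
    for chunk in text.split():                 # first split on whitespace
        while chunk and chunk[0] in PUNCT:     # peel punctuation prefixes
            tokens.append(chunk[0])
            chunk = chunk[1:]
        tail = []
        while chunk and chunk[-1] in PUNCT:    # peel punctuation suffixes
            tail.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk in EXCEPTIONS:                # apply the exception list
            tokens.extend(EXCEPTIONS[chunk])
        elif chunk:
            tokens.append(chunk)
        tokens.extend(reversed(tail))          # suffixes in original order
    return tokens
```

Run on the sentence from the figure, this sketch reproduces the same token sequence: \texttt{"}, \texttt{Let}, \texttt{'s}, \texttt{go}, \texttt{to}, \texttt{N.Y.}, \texttt{!}, \texttt{"}.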
More advanced tokenization methods, such as \emph{Byte-Pair-Encoding segmentation}, have been shown to increase performance on many tasks, such as machine translation \cite{kudo:subword}.
In highly inflected languages\footnote{Latin, Polish or Finnish are examples of highly inflected languages.} (see section \ref{sentiment:polish}), a properly chosen tokenization method helps to reduce the number of tokens generated, increasing the model's ability to interpret their meaning without increasing the size of the vocabulary, which has a positive impact on the model's complexity, training duration and efficiency.
\subsection{One-hot encoding}
Tokens are discrete, categorical features, which can be represented as vectors through a method referred to as \emph{one-hot encoding} \cite{hirst:neuralnetworkmethodsinnlp}.
Using one-hot encoding, each token $t_i$ in a vocabulary of size $n$ is assigned a unique index $i \in \{0, 1, \dots, n-1\}$. Then, for each $t_i$, a vector $v$ of size $n$ is created in which all elements are $0$ except for the $i$-th entry, which has a value of $1$.
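As a minimal illustration of this scheme (the vocabulary below is an arbitrary toy example):

```python
vocab = ["let", "go", "to", "n.y."]          # toy vocabulary, n = 4
index = {t: i for i, t in enumerate(vocab)}  # unique index per token

def one_hot(token):
    v = [0] * len(vocab)   # all entries 0 ...
    v[index[token]] = 1    # ... except the i-th, which is 1
    return v
```

For example, \texttt{one\_hot("go")} yields the vector $(0, 1, 0, 0)$.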
Vectors obtained through one-hot encoding are binary, sparse and high-dimensional, making them computationally inefficient for larger vocabularies. Such an encoding also does not maintain any semantic relationship between input tokens and output vectors, resulting in a loss of valuable information which could be taken advantage of during the training of a neural network \cite{tds:embeddings}. Thus, another vectorization method, called \emph{word embeddings}, is more commonly used.
\subsection{Token embeddings}
\label{embeddings}
Token embeddings are an efficient, dense representation of tokens in which similar tokens have a similar encoding. An embedding is a dense vector of floating-point values, which are weights learned during the training of a neural network.
The dimensionality of token embeddings varies depending on the size of the dataset. The higher-dimensional the embeddings are, the more likely they are to capture relationships between tokens, at the cost of requiring more training data to learn \cite{tf:embeddings}.
% insert here a visual comparison between word embeddings and one-hot encoded vectors
\subsection{Token embeddings layer}
A \emph{token embeddings layer} is an input layer of a neural network that maps integer indices representing individual tokens to the associated vector embeddings \cite{tf:embeddings,ulmfit}.
Depending on the implementation, an embeddings layer may take as input an integer index $i \in \{0, 1, \dots, n-1\}$ assigned to each token $t$ in a vocabulary $V^*$ of size $n$ \cite{tf:embeddings}, or a one-hot-encoded vector representation of $t$ \cite{ulmfit:berlin}.
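Both input conventions are equivalent, because looking up row $i$ of the embedding matrix gives the same result as multiplying the matrix by the one-hot vector of $i$. A minimal sketch with a random (i.e. untrained, purely illustrative) embedding matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 5, 3  # illustrative sizes

# In a real network E is a trainable weight matrix; here it is just random.
E = rng.normal(size=(vocab_size, emb_dim))

def embed(indices):
    """Map integer token indices to their embedding vectors (row lookup)."""
    return E[indices]
```

The row lookup is what makes the embedding layer cheap: no $n$-dimensional sparse vectors are ever materialised.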
% cite word2vec
% Compared to vectors obtained through a one-hot encoding method, token embeddings are usually learned during a supervised learning task. A choice of such a task is one of the key factors affecting their effectiveness \cite{tds:embeddings}.
\subsection{Notation and assumptions on words and tokens}
We will denote an individual word as $w$ and a word vocabulary as $V$. Similarly, an individual token will be denoted as $t$, and a token vocabulary as $V^*$. Words always convey a particular meaning, while tokens are the output of a tokenizer \cite{hirst:neuralnetworkmethodsinnlp}. The goal of tokens is to simplify the word- and syntax-based representation of meaning through the generation of a token-based representation.
For simplicity, we will use the term words almost exclusively in the remainder of this paper, unless it is important to make a distinction between the two.
\subsection{Corpora transformations}
In natural language processing, a set of text documents used in a given learning task is referred to as a \emph{corpus}. A corpus can be either labelled or unlabelled, depending on the machine learning task it is applied to.
In order to use a text corpus for training a recurrent neural network, its contents, after tokenization, need to be concatenated into a single entity. Such a process requires the addition of two tokens: a token signifying the beginning of a concatenated document, $T_{\mathrm{BEG}}$, and a token put at the end of each concatenated document, $T_{\mathrm{END}}$.
Defining $C$ as a corpus and $d_i$ as an individual member of the corpus, the resulting entity $e$ is created:
\begin{equation}
e = (T_{\mathrm{BEG}}\oplus d_1 \oplus T_{\mathrm{END}}) \oplus (T_{\mathrm{BEG}}\oplus d_2 \oplus T_{\mathrm{END}}) \oplus \dots \oplus (T_{\mathrm{BEG}} \oplus d_n \oplus T_{\mathrm{END}}),
\end{equation}
where $n$ is the length of the corpus (the total number of text documents in the set) and $\oplus$ denotes concatenation.
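The construction of $e$ amounts to a simple loop. In the sketch below, the marker strings \texttt{"xbos"} and \texttt{"xeos"} are hypothetical names chosen for illustration, standing in for $T_{\mathrm{BEG}}$ and $T_{\mathrm{END}}$:

```python
T_BEG, T_END = "xbos", "xeos"  # hypothetical marker token names

def concatenate(corpus):
    """Wrap each tokenized document with T_BEG/T_END and join into one list."""
    e = []
    for d in corpus:       # d is one tokenized document d_i
        e.extend([T_BEG, *d, T_END])
    return e
```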
\begin{figure}[]
\centering
\includegraphics[scale=0.45]{figures/minibatch.png}
\caption{In a mini-batch-based process of training an RNN, the concatenated entity $e$ of text documents is first split into a number of batches based on a defined batch size $bs$. As a result, batches consisting of $n_t$ tokens each are created, illustrated as a $bs \times n_t$ matrix. Then, mini-batches are created by truncating the batches based on the value of the $bptt$ parameter, which indicates how many tokens will be backpropagated through when training the model. For instance, with a mini-batch consisting of 64 sequences of 70 tokens, one mini-batch is fed to the model during training and, as a result, all 64 sequences are processed in parallel. For each mini-batch the model performs its predictions and calculates the loss before the backpropagation process begins \cite{ulmfit:berlin}.}
\label{corpus:minibatch}
\end{figure}
However, due to computational and memory constraints, training an RNN on a large corpus is performed using a mini-batch-based backpropagation through time algorithm (see figure \ref{corpus:minibatch}) \cite{ulmfit:berlin}.
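The batching scheme of figure \ref{corpus:minibatch} can be sketched as follows. This is a simplification under the usual language-modelling convention that the target of each mini-batch is the input shifted by one token; it is not the exact fastai implementation.

```python
import numpy as np

def batchify(stream, bs):
    """Reshape a concatenated token stream into a bs x n_t matrix,
    dropping the few trailing tokens that do not fill a full row."""
    n_t = len(stream) // bs
    return np.asarray(stream[: n_t * bs]).reshape(bs, n_t)

def minibatches(data, bptt):
    """Truncate the batch matrix into (input, target) mini-batches of
    at most bptt tokens; the target is the input shifted by one token."""
    for i in range(0, data.shape[1] - 1, bptt):
        yield data[:, i : i + bptt], data[:, i + 1 : i + 1 + bptt]
```

All $bs$ rows of a mini-batch are processed in parallel, while successive mini-batches continue each row's sequence, so the hidden state can be carried over between them.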
% \section{Language modelling}
% \label{languagemodelling}
% \emph{Language modelling} is the task of learning a probability distribution over words in a given vocabulary $V$. As a result, a language model is created, which assigns probability for the likelihood of a given word (or sequence of words) to follow an arbitrary sequence of words \cite{hirst:neuralnetworkmethodsinnlp}.
% Most often the aim of the language models is to compute the probability of the occurrence of word $w_t$ given $n-1$ previous words in a sequence, denoted as $P(w_t | w_{t-1}, w_{t-2}, \dots, w_1)$, where \cite{ruder:onwordembeddings}:
% \begin{equation}
% P(w_t | w_{t-1}, w_{t-2}, \dots, w_1) = P(w_1) P(w_2 | w_1)P(w_3|w_1w_2)\dots P(w_t|w_1w_2\dots w_t-1).
% \label{lm_1}
% \end{equation}
% Due to the fact that the last term of equation \ref{lm_1} requires conditioning on $n - 1$ words, which is a task almost equally difficult as modelling an entire sentence, language models rely on the \emph{Markov assumption}, assuming that the next word in a sequence depends only on the last $k$ words (\textit{"future is independent of past given the present"}) \cite{hirst:neuralnetworkmethodsinnlp}:
% \begin{equation}
% P(w_i |w_{i-1} w_{i-2}\dots w_{1}) \approx P(w_i |w_{i} w_{i-1}\dots w_{i-k}).
% \end{equation}
% The probability of a whole sentence or a document can be as such approximated by the product of probabilities of each word given $n$ previous words \cite{ruder:onwordembeddings}:
% \begin{equation}
% P(w_1, w_2, \dots, w_t) = \prod_i P(w_i | w_{i-1} w_{i-2} \dots w_{i-n+1}).
% \end{equation}
% % should I add that in neural networks softmax is used for this? https://ruder.io/word-embeddings-1/index.html#classicneurallanguagemodel
\section{Transfer learning: from fine-tuning of pre-trained word embeddings to fine-tuning of pre-trained language models}
Most of the earlier research on transfer learning applications in neural network-based natural language processing focused on transductive transfer learning, where, for instance, researchers measured a model's ability to perform accurate classification when documents from a different domain were used as input \cite{transductive:blitzer,ulmfit}.
For inductive transfer learning, two noteworthy techniques have been proposed: first, fine-tuning of pre-trained word embeddings \cite{word2vec}, and later, fine-tuning of entire pre-trained models \cite{ulmfit}.
\subsection{Fine-tuning pre-trained word embeddings}
\label{finetuning:embeddings}
In order to improve the ability of word embeddings to represent semantic relationships between words in a given vocabulary $V$, highly specialised machine learning models, such as \emph{word2vec} and \emph{GloVe}, have been used. Thanks to their substantially better performance, these models can be trained on much larger datasets, yielding relation-preserving word embeddings \cite{embeddings:relations,word2vec,glove}.
Applied in an embedding layer of a neural network, pre-trained word embeddings have been shown to increase the model's performance on various NLP tasks \cite{embeddings:accuracy,embeddings:accuracy2}.
However, as the usage of pre-trained word embeddings affects only the first layer of a model, the remaining parameters are still randomly initialised. Taking into account the benefits of pre-training an entire model, which include improved generalisation \cite{pretraining:benefits}, the first successful language model pre-training and fine-tuning methods have been proposed. Their impact on neural network-based natural language processing cannot be overstated \cite{state_of_nlp_2019}.
%TODO: Add word vectors illustration
\subsection{Fine-tuning pre-trained language models}
\label{finetuning:languagemodels}
Language models had historically been thought to be prone to a phenomenon known as \emph{catastrophic forgetting} \cite{goodfellow:forgetting} when fine-tuned with a classifier, as well as to require large datasets in order not to overfit \cite{dai:sequencelearning}. Nowadays, they are ubiquitous in natural language processing, allowing state-of-the-art results to be achieved on a wide range of tasks \cite{state_of_nlp_2019}.
This is due to the fact that novel methods, such as the ULMFiT \cite{ulmfit}, \emph{BERT} \cite{bert} or \emph{GPT2} \cite{gpt2} have made it possible to perform a successful pre-training and latter fine-tuning of a deep neural networks-based language model.
Language modelling as a pre-training source task has also been shown to lead to better performance on target tasks than machine translation or auto-encoding \cite{languagemodels:vstranslation}, especially when syntactic information is of relevance.
\begin{figure}[]
\centering
\includegraphics[scale=0.33]{figures/ulmfit_results.png}
\caption{A comparison of validation error rates for supervised and semi-supervised ULMFiT method applications as opposed to training an AWD-LSTM model from scratch, with varying amount of training examples. Tests were performed on \emph{IMDB} (left), \emph{TREC-6} (center) and \emph{AG} (right) datasets \cite{ulmfit}.}
\label{ulmfit:results}
\end{figure}
Arguably, the most important benefit of using a pre-trained language model is that it substantially reduces the number of labelled training examples needed for a target model to converge, as shown in figure \ref{ulmfit:results}. It is, however, computationally expensive to pre-train a language model which reaches a low \emph{perplexity}\footnote{Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. The perplexity of a language model can be understood as the level of perplexity when predicting the following word - the total number of choices a language model can make \cite{perplexity}.} \cite{state_of_nlp_2019}.
\section{AWD-LSTM}
\label{awdlstm}
Due to the over-parametrisation of neural networks, their generalisation ability depends heavily on the regularisation techniques used. However, naive applications of strategies such as dropout and batch normalisation in recurrent neural networks have not been as successful as in feed-forward and convolutional neural networks.
The proposal of the \emph{ASGD Weight-Dropped LSTM} (or \emph{AWD-LSTM}), a set of regularisation strategies developed for RNNs \cite{merity:awdlstm}, has helped to produce state-of-the-art results in language modelling \cite{ulmfit:berlin}.
\begin{table}[ht]
\centering
\begin{tabular}{ lcc }
\toprule
Model & Validation & Test \\ \midrule
AWD-LSTM & 68.6 & 65.8 \\ \midrule
- variable sequence lengths & 69.3 & 66.2 \\
- embedding dropout & 71.1 & 68.1 \\
- weight decay & 71.9 & 68.7 \\
- AR/TAR & 73.2 & 70.1 \\
- weight-dropping & 78.4 & 74.9 \\
\bottomrule
\end{tabular}
\caption{The impact of removing regularisation techniques used in AWD-LSTM on the model's validation and test perplexity, measured on the \emph{WikiText-2} dataset. Removing the hidden-to-hidden weight regularisation performed by weight dropping alone leads to an increase of over 10 points of perplexity compared to a model employing all the regularisation strategies, showing that regularisation has a significant, positive impact on the recurrent neural network's performance \cite{merity:awdlstm}.}
\end{table}
\subsection{Weight-dropped LSTM}
\label{weightdropped}
AWD-LSTM makes use of \emph{DropConnect} on the recurrent, hidden-to-hidden weight matrices ($U^i, U^f, U^o, U^c$), as opposed to previous techniques, which introduced dropout between time steps, acting on the hidden state vector $h_{t-1}$ or on the update of the memory state $c_t$.
\begin{figure}[]
\centering
\includegraphics[scale=0.6]{figures/dropconnect.png}
\caption{DropConnect sets a randomly selected subset of weight values in a neural network to zero, as opposed to regular dropout, which is applied as a mask, denoted here as $m(\cdot)$, on a random subset of the network's activations $a(\cdot)$ \cite{dropconnect}.}
\label{dropconnect}
\end{figure}
This enables the use of external, highly optimised LSTM implementations, such as \emph{cuDNN LSTM} \cite{merity:awdlstm}, while retaining the LSTM's ability to learn long-term dependencies in the input data \cite{seth:awdlstmgreat}.
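The mechanism reduces to masking the weight matrix itself rather than the activations. A minimal sketch (function name and the inverse-keep-probability rescaling are illustrative choices, not taken from the AWD-LSTM code):

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_connect(W, p):
    """Zero a random subset of WEIGHTS (not activations), rescaling the
    survivors by 1/(1-p) to keep the expected magnitude unchanged."""
    mask = rng.random(W.shape) >= p
    return W * mask / (1 - p)
```

Because the mask is applied once to the $U$ matrices before the (possibly fused) LSTM kernel runs, the kernel itself needs no modification, which is exactly what makes black-box implementations such as cuDNN usable.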
\subsection{Variable-length backpropagation through time}
\label{vl:bptt}
Having a fixed sequence length in the mini-batches used to train the network with the backpropagation through time algorithm results in an inefficient use of the available training data. Merity et al. illustrate this with an example: given 100 elements and a fixed $bptt$ window of 10, any element divisible by 10 will never have any elements to backpropagate into \cite{merity:awdlstm}, meaning that for the final element of every mini-batch a language model's prediction is never performed. To prevent this inefficiency, the use of \emph{variable-length backpropagation sequences} was proposed.
Using the variable-length backpropagation sequences approach, a base sequence length is set to $bptt$ with probability $p$ and to $\frac{bptt}{2}$ with probability $1-p$. Then, the actual sequence length $l$ is drawn from $\mathcal{N}(\mathit{base}, s)$, where $s$ is the standard deviation and $\mathcal{N}$ a normal distribution \cite{seth:awdlstmgreat}. This results in a more efficient usage of the available training data.
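The sampling step can be sketched as below. The default parameter values and the minimum-length clamp are illustrative choices for this sketch, not necessarily the exact values used in the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_seq_len(bptt=70, p=0.95, s=5, min_len=5):
    """Pick a base length (bptt w.p. p, bptt/2 w.p. 1-p), then draw the
    actual sequence length from a normal distribution around that base."""
    base = bptt if rng.random() < p else bptt / 2
    return max(min_len, int(rng.normal(base, s)))
```

Note that the learning rate is typically rescaled in proportion to the sampled length, so that short sequences are not implicitly favoured.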
\subsection{Variational dropout}
\label{variationaldropout}
\emph{Variational dropout} addresses the inefficiency of applying regular dropout in long short-term memory networks by generating a dropout mask once and re-using it within a single forward and backward pass, as opposed to re-generating it on every dropout function call. This ensures that in an RNN an identical dropout mask is applied to the input $x_t$ at each time step $t$.
In AWD-LSTM, variational dropout is used in all dropout operations except those related to the hidden-to-hidden weight matrices described in section \ref{weightdropped}. Specifically, a variational dropout mask is applied to all inputs and outputs of the LSTM \cite{merity:awdlstm}.
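The idea can be sketched under the assumption that a sequence is stored as a $(T, n)$ array; the function name and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def variational_dropout(x, p):
    """x: a (T, n) sequence. ONE mask of shape (n,) is sampled and then
    reused at every time step t, instead of resampling per step."""
    mask = (rng.random(x.shape[1]) >= p) / (1 - p)  # sampled once
    return x * mask                                 # broadcast over all T rows
```

Because the mask is broadcast across the time dimension, exactly the same units are dropped at every step of the sequence.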
\subsection{Embeddings dropout}
\label{embeddingsdropout}
Another regularisation technique, called \emph{embeddings dropout}, is used in AWD-LSTM to perform dropout on a randomly chosen set of rows of the embedding matrix of a neural network, resulting in some of the input tokens being omitted during a single forward and backward pass of training the network.
Intuitively, the goal of embeddings dropout is to make a model less prone to producing output based on the presence of individual tokens. Used with variational dropout, the embeddings dropout method has been shown to improve the performance of LSTM-based natural language processing models performing language modelling \cite{merity:awdlstm} and sentiment analysis \cite{gal:embeddingdropout}, among others.
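Operationally, embeddings dropout zeroes entire rows of the embedding matrix, so every occurrence of a dropped token vanishes for the duration of one forward and backward pass. A minimal sketch (function name and rescaling convention are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def embedding_dropout(E, p):
    """Drop whole ROWS of the (vocab_size, emb_dim) embedding matrix and
    rescale the kept rows; the dropout is broadcast across each row."""
    keep = (rng.random(E.shape[0]) >= p) / (1 - p)
    return E * keep[:, None]
```

Dropping at the row (token) level, rather than at the level of individual embedding values, is what forces the model not to rely on the presence of any single token.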