Chapter 1: Introduction to Flair
There are few Natural Language Processing (NLP) frameworks out there as easy to learn and as easy to work with as Flair. Packed with pre-trained models, excellent documentation, and readable syntax, it provides a gentle learning curve for NLP researchers who are not necessarily skilled in coding; software engineers with poor theoretical foundations; students and graduates; as well as individuals with no prior knowledge simply interested in the topic. But before diving straight into coding, some background about the motivation behind Flair, the basic NLP concepts, and the different approaches to how you can set up your local environment may help you on your journey toward becoming a Flair NLP expert.
In Flair's official GitHub README, the framework is described as:
"A very simple framework for state-of-the-art Natural Language Processing"
This description will raise a few eyebrows. NLP researchers will immediately be interested in knowing what specific tasks the framework achieves its state-of-the-art results in. Engineers will be intrigued by the very simple label, but will wonder what steps are required to get up and running and what environments it can be used in. And those who are not knowledgeable in NLP will wonder whether they will be able to grasp the knowledge required to understand the problems Flair is trying to solve.
In this chapter, we will be answering all of these questions by covering the basic NLP concepts and terminology, providing an overview of Flair, and setting up our development environment with the help of the following sections:
- A brief introduction to NLP
- What is Flair?
- Getting ready
Technical requirements
To get started, you will need a development environment with Python 3.6+. Platform-specific instructions for installing Python can be found at https://docs.python-guide.org/starting/installation/.
You will not require a GPU-equipped development machine, though having one will significantly speed up some of the training-related exercises described later in the book.
You will require access to a command line. On Linux and macOS, simply start the Terminal application. On Windows, press Windows + R to open the Run box, type cmd, and then click OK.
Flair's official GitHub repository is available via the following link: https://github.com/flairNLP/flair. In this chapter we will install Flair version 0.11.
The code examples covered in this chapter are found in this book's official GitHub repository in the following Jupyter notebook: https://github.com/PacktPublishing/Natural-Language-Processing-with-Flair/tree/main/Chapter01.
A brief introduction to NLP
Before diving straight into what Flair is capable of and how to leverage its features, we will be going through a brief introduction to NLP to provide some context for readers who are not familiar with all the NLP techniques and tasks solved by Flair. NLP is a branch of artificial intelligence, linguistics, and software engineering that helps machines understand human language. When we humans read a sentence, our brains immediately make sense of many seemingly trivial problems such as the following:
- Is the sentence written in a language I understand?
- How can the sentence be split into words?
- What is the relationship between the words?
- What are the meanings of the individual words?
- Is this a question or an answer?
- Which part-of-speech categories are the words assigned to?
- What is the abstract meaning of the sentence?
The human brain is excellent at solving these problems conjointly and often seamlessly, leaving us unaware that we made sense of all of these things simply by reading a sentence.
Even now, machines are still not as good as humans at solving all these problems at once. Therefore, to teach machines to understand human language, we have to split understanding of natural language into a set of smaller, machine-intelligible tasks that allow us to get answers to these questions one by one.
In this section, you will find a list of some important NLP tasks with emphasis on the tasks supported by Flair.
Tokenization is the process of breaking down a sentence or a document into meaningful units called tokens. A token can be a paragraph, a sentence, a collocation, or just a word.
For example, a word tokenizer would split the sentence Learning to use Flair into the following list of tokens: ["Learning", "to", "use", "Flair"].
Tokenization has to adhere to language-specific rules and is rarely a trivial task to solve. For example, with unspaced languages where word boundaries aren't defined with spaces, it's very difficult to determine where one word ends and the next one starts. Well-defined token boundaries are a prerequisite for most NLP tasks that aim to process words, collocations, or sentences including the following tasks explained in this chapter.
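As a rough illustration of word tokenization, the following sketch splits English text on word characters and punctuation using a regular expression. The function name simple_word_tokenize is hypothetical, and this is not the tokenizer Flair uses; it only works for languages that separate words with spaces.

```python
import re


def simple_word_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores; [^\w\s] matches a
    # single punctuation character, so "Flair." becomes ["Flair", "."].
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_word_tokenize("Learning to use Flair")
print(tokens)  # ['Learning', 'to', 'use', 'Flair']
```

A rule this simple breaks down quickly, for example with contractions, URLs, or unspaced languages, which is why practical tokenizers rely on language-specific rules.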
Text vectorization is a process of transforming words, sentences, or documents in their written form into a numerical representation understandable to machines.
One of the simplest forms of text vectorization is one-hot encoding. It maps words to binary vectors of length equal to the number of words in the dictionary. All elements of the vector are 0 apart from the element that represents the word, which is set to 1, hence the name one-hot.
For example, take the following dictionary: cat, dog, goat.
The word cat would be the first word in our dictionary and its one-hot encoding would be [1, 0, 0]. The word dog would be the second word in our dictionary and its one-hot encoding would be [0, 1, 0]. And the word goat would be the third and last word in our dictionary and its one-hot encoding would be [0, 0, 1].
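The encoding of cat, dog, and goat described above can be sketched in a few lines of plain Python. The helper name one_hot is our own choice for illustration and is not part of Flair.

```python
def one_hot(word, dictionary):
    """Return the one-hot vector of `word` for an ordered list of words."""
    # dictionary.index raises ValueError for a word that is not in the dictionary.
    index = dictionary.index(word)
    # Every position is 0 except the one that corresponds to the word.
    return [1 if i == index else 0 for i in range(len(dictionary))]


dictionary = ["cat", "dog", "goat"]
for word in dictionary:
    print(word, one_hot(word, dictionary))
```

Note that the vector length always equals the dictionary size, so a realistic vocabulary of tens of thousands of words would produce equally long, almost entirely empty vectors.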
This approach, however, suffers from the problem of high dimensionality, as the length of the vector grows linearly with the number of words in the dictionary. It also doesn't capture any semantic meaning of the word. To counter these problems, most modern state-of-the-art approaches use representations called word or document embeddings. Each embedding is usually a fixed-length vector of real numbers. While the numbers will at first seem unintelligible to a human, in some cases a vector dimension may represent an abstract property of the word; for example, a dimension of a word-embedding vector could represent the general (positive or negative) sentiment of the word. Given two embeddings, we can compute how similar they are using a measure such as cosine similarity. With many modern NLP solutions, including Flair, embeddings are used as the underlying input representation for higher-level NLP tasks such as named entity recognition.
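To make cosine similarity more concrete, here is a minimal sketch in plain Python. The three-dimensional vectors for cat, dog, and car are made-up numbers chosen purely for illustration; real embeddings are learned from data and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example only.
cat = [0.9, 0.1, 0.3]
dog = [0.8, 0.2, 0.35]
car = [0.1, 0.9, 0.0]

# Vectors pointing in similar directions score closer to 1.
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))  # True

# Two different one-hot vectors are always orthogonal, so their similarity
# is 0 no matter how related the words are.
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0
```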
One of the main problems with early word embedding approaches was that words with multiple meanings (polysemic words) were limited to a single and constant embedding representation. One of the solutions to this problem in Flair is the use of contextual string embeddings where words are contextualized by their surrounding text, meaning that they will have a different representation given a different surrounding text.
Named entity recognition
Named entity recognition (NER) is an NLP task or technique that identifies named entities in a text and tags them with their corresponding categories. Named entity categories include, but aren't limited to, places, person names, brands, time expressions, and monetary values.
The following figure illustrates NER using colored backgrounds and tags associated with the words:
In the previous example, we can see that three entities were identified and tagged. The first and third tags are particularly interesting because they both mark the same word, Berkeley, yet the first one clearly refers to an organization whereas the third one refers to a geographic location. The human brain is excellent at distinguishing between different entity types based on context and is able to do so almost seamlessly, whereas machines have struggled with it for decades. Recent advancements in contextual string embeddings, an essential part of Flair, made a huge leap forward in solving that.
Word-Sense Disambiguation (WSD) is an NLP technique concerned with identifying the intended sense of a given word with multiple meanings.
For example, take the given sentence:
George tried to return to Berlin to return his hat.
WSD would aim to identify the sense of the first use of the word return, referring to the act of going back to the same place, and the sense of the second return, referring to the act of giving something back.
Part-of-Speech (POS) tagging is a technique closely related to both WSD and NER that aims to tag the words as corresponding to a particular part of speech such as nouns, verbs, adjectives, adverbs, and so on.
Actual POS taggers provide a lot more information with the tags than simply associating the words with noun/verb/adjective categories. For example, the Penn Treebank Project corpus, one of the most widely used POS corpora, distinguishes between 36 different types of POS tags.
Another NLP technique closely related to POS tagging is chunking. Unlike POS tagging, where we tag individual words, in chunking we identify complete short phrases such as noun phrases. In Figure 1.2, the phrase A lovely day can be considered a chunk as it is a noun phrase, and in its relationship to other words it works the same way as a noun.
Stemming and lemmatization
Stemming and lemmatization are two closely related text normalization techniques used in NLP to reduce words to their common base forms. For example, the word play is the base word of the words playing, played, and plays.
The simpler of the two techniques, stemming, simply accomplishes this by cutting off the ends or beginnings of words. This simple solution often works, but is not foolproof. For example, the word ladies can never be transformed into the word lady by stemming only. We therefore need a technique that understands the POS category of a word and takes into account its context. This technique is called lemmatization. The process of lemmatization can be demonstrated using the following example.
Take the following sentence:
this meeting was exhausting
Lemmatization reduces the previous sentence to the following:
this meeting be exhaust
It reduces the word was to be and the word exhausting to exhaust. Also note that the word meeting is used as a noun and it is therefore mapped to the same word meeting, whereas if the word meeting was used as a verb, it would be reduced to meet.
A popular and easy-to-use library for performing lemmatization with Python is spaCy. Its models are trained on large corpora and are able to distinguish between different POS, yielding impressive results.
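To see why stemming alone falls short, consider the following toy suffix-stripping stemmer. It is a deliberately naive sketch with a hypothetical function name and a hand-picked suffix list; it is not the algorithm used by any particular library.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix from `word`; a toy stemmer."""
    for suffix in suffixes:
        # Keep at least three characters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["playing", "played", "plays", "ladies"]:
    print(word, "->", naive_stem(word))
```

The first three words are all reduced to play, but ladies becomes ladi rather than lady, because cutting characters off the end of a word cannot replace one ending with another. Lemmatization handles such cases by mapping words to dictionary forms.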
Text classification is an NLP technique used to assign a text or a document to one or more classes or document types. Practical uses for text classification include spam filtering, language identification, sentiment analysis, and programming language identification from syntax.
Having covered the basic NLP concepts and terminology, we can now move on to understanding what Flair is and how it manages to solve NLP tasks with state-of-the-art results.
What is Flair?
Flair is a powerful NLP framework published as a Python package. It provides a simple interface that is friendly, easy to use, and caters to people from various backgrounds including those with little prior knowledge in programming. It is published under the MIT License, which is one of the most permissive free software licenses.
Flair as an NLP framework comes with a variety of tools and uses. It can be defined in the following ways:
- It is an NLP framework used in NLP research for producing models that achieve state-of-the-art results across many NLP tasks such as POS tagging, NER, and chunking across several languages and datasets. In Flair's GitHub repository, you will find step-by-step instructions on how to reproduce these results.
- It is a tool for training, validating, and distributing NER, POS tagging, chunking, word sense disambiguation, and text classification models. It features tools that help ease the training and validation processes such as the automatic corpora downloading tool, and tools that facilitate model tuning such as the hyperparameter optimization tool. It supports a growing number of languages.
- It is a tool for downloading and using state-of-the-art pre-trained models. The models are downloaded seamlessly, meaning that they will be automatically downloaded the first time you use them and will remain stored for future use.
- It is a platform for the proposed state-of-the-art Flair embeddings. The state-of-the-art results Flair achieves in many NLP tasks can by and large be attributed to its proposed Flair contextual string embeddings described in more detail in the paper Contextual String Embeddings for Sequence Labeling. The author refers to them as "the secret sauce" of Flair.
- It is an NLP framework for working with biomedical data. A special section of Flair is dedicated solely to working with biomedical data and features a set of pretrained models that achieve state-of-the-art results, as well as a number of corpora and comprehensive documentation on how to train custom biomedical tagging models.
- It is a great practical introduction to NLP. Flair's extensive online documentation, simple interface, inclusive support for a large number of languages, and its ability to perform a lot of the tasks on non-GPU-equipped machines all make it an excellent entry point for someone aiming to learn about NLP through practical hands-on experimentation.
Setting up the development environment
Now that you have a basic understanding of the features offered by the framework, as well as of the basic NLP concepts, you are ready to move to the next step of setting up your development environment for Flair.
To be able to follow the instructions in this section, first make sure you have Python 3.6+ installed on your device as described in the Technical requirements section.
Creating the virtual environment
In Python, it's generally good practice to install packages in virtual environments so that the project dependencies you are currently working on will not affect your global Python dependencies or other projects you may work on in the future.
We will use the venv tool that is part of the Python Standard Library and requires no installation. To create a virtual environment, simply create a new directory, move into it, then run the following command:
$ python3 -m venv learning-flair
Then, to activate the virtual environment on Linux or macOS, run the following:
$ source learning-flair/bin/activate
If you are running Windows, run the following:
> learning-flair\Scripts\activate.bat
Your command line should become prefixed with (learning-flair) $ and your virtual environment is now active.
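If you are ever unsure whether the virtual environment is really active, you can also check from within Python. The sketch below relies on the fact that, inside an environment created with venv, sys.prefix differs from sys.base_prefix; the function name is our own.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(in_virtual_environment())
```

Running this with the interpreter of an activated learning-flair environment should print True, whereas the system-wide interpreter should print False.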
Installing a published version of Flair in a virtual environment
You should now be ready to install Flair version 0.11 with this single command:
(learning-flair) $ pip install flair==0.11
The installation should now commence and finish within a minute or so depending on the speed of your internet connection.
You can verify the installation by running the following command, which will display a list of package properties including its version:
(learning-flair) $ pip show flair
Name: flair
Version: 0.11
Summary: A very simple framework for state-of-the-art NLP
Home-page: https://github.com/flairNLP/flair
…
A command output like the preceding indicates the package has been successfully installed in your virtual environment.
Installing directly from the GitHub repository (optional)
In some cases, the features we aim to make use of in Flair may already be implemented in a branch on GitHub, but those changes may not yet be released as part of a Python package published on PyPI. We can install Flair with those features directly from the Git repository branch.
For example, here is how you can install Flair from the master branch:
(learning-flair) $ git clone https://github.com/flairNLP/flair.git
(learning-flair) $ cd flair
(learning-flair) $ git checkout master
(learning-flair) $ pip install .
Installing code from non-reviewed branches can introduce unreliable or unsafe code. When installing Flair from development branches, make sure the code you are installing comes from a trusted source. Also note that future versions of Flair (versions later than 0.11) may not be compatible with the code snippets found in this book.
Replace the term master with any other branch name to install the package from a branch of your choice.
Running code that uses Flair
Running code that makes use of the Flair Python package is no different from running any other type of Python code.
The recommended way for you to run the code snippets in this book is to execute them as code cells in a Jupyter notebook, which you can install and run as follows:
(learning-flair) $ pip install notebook
(learning-flair) $ jupyter notebook
You can then create a new Python 3 notebook and run your first Flair script to verify the package is imported successfully:
import flair
print(flair.__version__)
After executing, the preceding code should print out the version of Flair you are currently using, indicating that the Flair package has been imported successfully and you are ready to start.
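Importing Flair can take a moment because it loads its dependencies. As an optional alternative, the following sketch looks up the installed version from the package metadata without importing Flair itself. It assumes Python 3.8 or later, where importlib.metadata is part of the standard library, and the helper name installed_version is hypothetical.

```python
from importlib import metadata  # part of the standard library since Python 3.8


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("flair")
if version is None:
    print("Flair is not installed in this environment")
elif version.startswith("0.11"):
    print("Flair", version, "matches the version used in this chapter")
else:
    print("Flair", version, "is installed; this chapter uses 0.11")
```

If the first message appears, the notebook is most likely running outside the learning-flair virtual environment.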
Summary
In this chapter, you became familiar with the basic NLP terminology and tasks. As you learn about Flair, you will often come across terms such as tokenization, NER, and POS, and the knowledge gained in this chapter will help you understand what they mean.
You also now understand where Flair sits in the NLP space, what problems it's solving and which fields it excels in. Finally, you've learned how to install Flair inside your virtual environment either from a PyPI package or a Git branch. You are now ready to start coding with Flair!
In the upcoming chapter, we will be covering basic syntax and the basic objects in Flair, known as base types.