Introduction
In this article we will learn how to use transfer learning to update an existing spaCy model so that it starts recognizing entities it currently fails to identify, while we develop our NER model.
If you want to go deeper into training a spaCy model on custom data for NER, coding exercises with live examples can be accessed at Code Demo.
Prepare Training Data
First, import the following packages:
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin
Let's load the structure for our base spaCy model for custom training. Use one of the following lines, depending on whether you are creating a new model or updating an existing one:
nlp = spacy.blank("en")  # for new model creation
nlp = spacy.load("en_core_web_lg")  # to train the existing model with new parameters
Now let's define a DocBin object, which efficiently stores the Doc objects and entity annotations built from our training dataset.
db = DocBin()
Now let's prepare our training data in the below format:
training_data = [
    ("John Water ", {"entities": [(0, 4, "PERSON")]}),
    ("Taj Mahal ", {"entities": [(0, 3, "NAME")]}),
]
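Counting character offsets by hand is error-prone. A small helper (hypothetical, not part of spaCy) can compute the (start, end) indices for you, assuming the labeled phrase occurs in the text:

```python
def make_example(text, phrase, label):
    """Build a (text, annotations) pair by locating `phrase` inside `text`.

    Returns None when the phrase is not found.
    """
    start = text.find(phrase)
    if start == -1:
        return None
    return (text, {"entities": [(start, start + len(phrase), label)]})

# Reproduces the hand-written examples above without manual index counting.
training_data = [
    make_example("John Water ", "John", "PERSON"),
    make_example("Taj Mahal ", "Taj", "NAME"),
]
```

This only handles a single entity per example; for multiple entities you would collect one (start, end, label) triple per phrase into the same "entities" list.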
Next we will write code to read the above training data and store it in the DocBin object, together with the defined labels:
for text, annot in tqdm(training_data):
    doc = nlp.make_doc(text)  # create a Doc object from the text
    ents = []
    for start, end, label in annot["entities"]:  # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)
As the last step of data preparation, we will save the DocBin object to disk:
db.to_disk("./train.spacy")
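To sanity-check the serialization, the file can be loaded back and inspected. Below is a self-contained sketch (it rebuilds a one-example DocBin rather than reusing the variables above, so it runs on its own with spaCy installed):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Rebuild a minimal DocBin, save it, and load it back to verify the round trip.
doc = nlp.make_doc("John Water ")
doc.ents = [doc.char_span(0, 4, label="PERSON")]
db = DocBin()
db.add(doc)
db.to_disk("./train.spacy")

loaded = list(DocBin().from_disk("./train.spacy").get_docs(nlp.vocab))
print(len(loaded), [(ent.text, ent.label_) for ent in loaded[0].ents])
```

If the entities survive the round trip, the training file was written correctly.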
Preparing the config file for training the spaCy model on custom data
Copy default_config.cfg from https://spacy.io/usage/training#config and save it as base_config.cfg. Now open this file and edit the below parameters:

[nlp]
lang = "en"
pipeline = ["ner"]

[components]

[components.ner]
source = "en_core_web_sm"
Now go to cmd or terminal and execute the following command:
python3 -m spacy init fill-config base_config.cfg config.cfg
This will generate the config.cfg file in the same folder.
Training the spaCy Model
In order to train the existing spaCy model with the new parameters, we will execute the below command from cmd or terminal:
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy
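Note that the command above passes the same file as both the training and the dev set, so the evaluation scores are computed on data the model has already seen. A better practice is to split the annotated examples before writing the DocBin files; a minimal pure-Python sketch (the dev.spacy file name is just a suggestion):

```python
import random

def split_examples(examples, dev_fraction=0.2, seed=42):
    """Shuffle and split annotated examples into train and dev portions."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_dev = max(1, int(len(examples) * dev_fraction))
    return examples[n_dev:], examples[:n_dev]

# Example with placeholder annotations; use your real training_data here.
data = [("text %d" % i, {"entities": []}) for i in range(10)]
train_set, dev_set = split_examples(data)
# Write each portion to its own DocBin (train.spacy and dev.spacy) and pass
# --paths.dev ./dev.spacy to the training command instead of ./train.spacy.
```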
This will train the model, and output similar to the following will be displayed on cmd:

/usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.26.4) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
ℹ Using CPU
=========================== Initializing pipeline ===========================
Set up nlp object from config
Pipeline: ['ner']
Resuming training for: ['ner']
Created vocabulary
Finished initializing nlp object
Initialized pipeline components: []
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['ner']
ℹ Initial learn rate: 0.001
E       #       LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  --------  ------  ------  ------  ------
  0       0      4.57    0.00    0.00    0.00    0.00
200     200     29.14  100.00  100.00  100.00    1.00
.....
19800  19800     0.00  100.00  100.00  100.00    1.00
20000  20000     0.00  100.00  100.00  100.00    1.00
✔ Saved pipeline to output directory
output/model-last

The updated trained model will be saved in the output/model-last folder; the best-scoring checkpoint is saved to output/model-best.
Re-validating our input with the trained model
Let's execute the below code to validate whether our trained model is able to give us the new values:

import spacy
nlp_new = spacy.load(R"./output/model-best")
tokens = nlp_new("Did you see John Water nearby ?")
print([(X, X.ent_iob_, X.ent_type_) for X in tokens])
Output:
[(Did, 'O', ''), (you, 'O', ''), (see, 'O', ''), (John, 'B', 'PERSON'), (Water, 'O', ''), (nearby, 'O', ''), (?, 'O', '')]
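The token-level output above can be collapsed into entity spans with a small helper (hypothetical, not part of spaCy) that walks the (text, iob, type) tuples:

```python
def iob_to_entities(token_tuples):
    """Collapse (text, iob, type) tuples into (entity_text, label) pairs."""
    entities, current, label = [], [], None
    for text, iob, etype in token_tuples:
        if iob == "B":  # a new entity starts here
            if current:
                entities.append((" ".join(current), label))
            current, label = [text], etype
        elif iob == "I" and current:  # continuation of the current entity
            current.append(text)
        else:  # "O": close any open entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = [("Did", "O", ""), ("you", "O", ""), ("see", "O", ""),
          ("John", "B", "PERSON"), ("Water", "O", ""),
          ("nearby", "O", ""), ("?", "O", "")]
print(iob_to_entities(tokens))  # -> [('John', 'PERSON')]
```

Alternatively, with a real Doc object you can read doc.ents directly instead of reassembling entities from per-token IOB tags.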