Skcript Technologies Private Limited

How to easily extract Text from anything using spaCy

Here’s a new framework that our AI Developer just unearthed - with this framework you can now extract text in a jiffy and also do a load of other cool stuff. Read on and find out how!

Hey guys, I’d like to tell you about a super amazing NLP framework called spaCy. Most of us go for NLTK when it comes to any NLP application, because its documentation is simple and it is the first library most of us are exposed to when we start our NLP journey.

Luckily, I stumbled upon spaCy and started using it because it is faster than NLTK. Fair warning, though: I am not here to compare spaCy with NLTK!

So, let’s understand a bit more about how spaCy works before going deep into it.

Follow the installation instructions in the spaCy documentation and install spaCy. Make sure you do it in a virtualenv; it’s always good practice to use a virtual environment.

The following are the core features that spaCy provides.

NAME | DESCRIPTION
Tokenization | Segmenting text into words, punctuation marks etc.
Part-of-speech (POS) Tagging | Assigning word types to tokens, like verb or noun.
Dependency Parsing | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
Lemmatization | Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.
Sentence Boundary Detection (SBD) | Finding and segmenting individual sentences.
Named Entity Recognition (NER) | Labelling named “real-world” objects, like persons, companies or locations.
Similarity | Comparing words, text spans and documents and how similar they are to each other.
Text Classification | Assigning categories or labels to a whole document, or parts of a document.
Rule-based Matching | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
Training | Updating and improving a statistical model’s predictions.
Serialization | Saving objects to files or byte strings.

Here, we are going to look at Rule-based Matching, which helps us with text/entity extraction.

Here is one simple example to get started with:

import spacy
from spacy.matcher import Matcher

# Load the English model (this also loads the vocabulary)
nlp = spacy.load('en')
# Create a matcher object that shares the model's vocabulary
matcher = Matcher(nlp.vocab)
sentence = u"Completed my Engineering in 1876"
doc = nlp(sentence)

# Each pattern is a list of token descriptions; these two match one token each
patterns = {
    "year": [{'IS_DIGIT': True}],
    "is_engineering": [{"LOWER": "engineering"}]
}

for label, pattern in patterns.items():
    # spaCy 2 signature: add(key, on_match, *patterns); None means no callback
    matcher.add(label, None, pattern)

matches = matcher(doc)

for match_id, start, end in matches:
    # each match is a tuple of (id, start position, end position)
    print(doc[start:end])
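
The patterns in the example above each describe a single token, but a pattern can also list several token descriptions in a row to match a sequence. Here is a minimal sketch of that idea. It makes a few assumptions that are not part of the example above: it uses a blank English pipeline (spacy.blank) so that no model download is needed, it uses the list-of-patterns form of matcher.add(), which needs spaCy 2.2.2 or later, and the label "engineering_year" is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer and lexical attributes only)
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Three token descriptions in a row: "engineering", then "in", then a number
pattern = [{"LOWER": "engineering"}, {"LOWER": "in"}, {"IS_DIGIT": True}]

# Assumption: list-of-patterns form of add(), available from spaCy 2.2.2
matcher.add("engineering_year", [pattern])

doc = nlp("Completed my Engineering in 1876")

# Turn each (id, start, end) triple into the text of the matched span
spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)
```

The whole sequence has to line up for a match, so a sentence that mentions "engineering" without a following "in" and a number would produce no span here.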

What else can you really do with this Matching? That was my first question too when I was trying to understand what spaCy could do!

The one thing I admire about spaCy is the documentation and the code. Both are beautifully written, and any noob can understand them just by reading. No complicated adapters or exceptions.

P.S.: For beginners, there was a big leap from spaCy 1.x to spaCy 2, and you might need to get to grips with new functions and changed function names. But it’s worth investing the time.

There are a few attrs that make it easier to extract text from a sentence. They help us build custom patterns which are very stable.

This is the attrs file. You can see that there are very simple and helpful attrs like LIKE_URL, LIKE_EMAIL etc., and the best part is that you can define your own flags and attrs for special cases.
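
To make attrs like LIKE_URL and LIKE_EMAIL a little more concrete, here is a small sketch that uses them as pattern keys. Treat it as an illustration under some assumptions: the sample sentence and the labels "email" and "url" are made up, it runs on a blank English pipeline (spacy.blank) because these attrs are computed from the token text alone, and it uses the list-of-patterns form of matcher.add(), which needs spaCy 2.2.2 or later.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough, since LIKE_EMAIL and
# LIKE_URL are lexical attributes that do not need a trained model
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Assumption: list-of-patterns form of add(), available from spaCy 2.2.2
matcher.add("email", [[{"LIKE_EMAIL": True}]])
matcher.add("url", [[{"LIKE_URL": True}]])

# Made-up sample text for illustration
doc = nlp("Mail hello@example.com or visit https://example.com for details")

# Pair each match's label (looked up from its id) with the matched text
found = [(nlp.vocab.strings[match_id], doc[start:end].text)
         for match_id, start, end in matcher(doc)]
print(found)
```

The match id is a hash, so nlp.vocab.strings is used to turn it back into the label that was passed to add().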

The matcher.add() function also accepts an on_match callback function as its second parameter. Whenever a pattern matches, spaCy calls the callback with the matcher, the doc, the index of the current match and the list of all matches, where each match is an (id, start, end) triple.

A sample of the working:

def on_match(matcher, doc, i, matches):
  # matches[i] is the (id, start, end) triple of the match that fired
  print("Matched")
  # the remaining workflow.

matcher.add("Checking", on_match, [{"LOWER": "checking"}])
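
For a slightly fuller picture of the callback, the sketch below records what it receives instead of just printing. Some assumptions to flag: the collected list and the sample sentence are made up for illustration, the pipeline is a blank English one (spacy.blank), and the callback is passed with the on_match keyword in the list-of-patterns form of matcher.add(), which needs spaCy 2.2.2 or later, rather than as the second positional parameter shown in the sample above.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough for a LOWER pattern
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical list, used only to record what the callback saw
collected = []

def on_match(matcher, doc, i, matches):
    # matches[i] is the (id, start, end) triple that triggered this call
    match_id, start, end = matches[i]
    collected.append((doc.vocab.strings[match_id], doc[start:end].text))

# Assumption: keyword form of the callback, available from spaCy 2.2.2
matcher.add("Checking", [[{"LOWER": "checking"}]], on_match=on_match)

doc = nlp("Just checking whether the callback fires")
matcher(doc)  # the callback runs once for every match found
print(collected)
```

Because the callback gets the doc as well as the match, it is a convenient place to put the remaining workflow, such as storing the span or setting entities.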

I hope you now understand the basic operations that can be done using spaCy. spaCy 2 is the bleeding-edge version and it’s getting loaded with lots and lots of features that every NLP enthusiast has dreamt of - and there are even other libraries, like textacy, which are built on top of spaCy.

Okay guys, until we meet next time, I wish you a good time with spaCy’s magic!
