Exhibit 25.31 demonstrates the Python implementation of Named Entity Recognition (NER) using Stanford NER to identify and classify entities into three categories — organization, person, location. The entites are sourced from tweets associated with the English Premier League.
The process involves the following steps:
tweets-statuses.json
.
Note the ambiguity arrising from words such as Manchester and Madrid which fall under multiple categories — location (Manchester, Madrid) and organization (Manchester United, Real Madrid). Manual cleaning is required in such cases to ensure that entities are correctly tagged.
import os
java_path = "C:/Program Files/Java/jdk-16.0.1/bin/java.exe"
os.environ['JAVAHOME'] = java_path
import urllib.request
import zipfile
from nltk.tag.stanford import StanfordNERTagger
# Set the direct path to the NER Tagger.
# Use english.all.3class.distsim (three class classifier) to find three classes of named entities.
_model_filename = r'C:\Delphi\Web\Marketing-Analytics\py\data\stanford-ner-2015-04-20\classifiers\english.all.3class.distsim.crf.ser.gz'
_path_to_jar = r'C:\Delphi\Web\Marketing-Analytics\py\data\stanford-ner-2015-04-20\stanford-ner.jar'
# Initialize the NLTK's Stanford NER Tagger API with the DIRECT PATH to the model and .jar file.
st = StanfordNERTagger(model_filename=_model_filename, path_to_jar=_path_to_jar)
entities = []
# read the tweets
import pandas as pd
df = pd.read_json('data/tweets-statuses.json')
df[:5]
'''
Iterate through tweets list and
(1) tag the named entities, and
(2) extract and store only entities related to three classes – organization, person, location.
'''
for tweet in df['text']:
# split the tweet by whitespace, into words, and tag them (see o/p lst_tags)
lst_tags = st.tag(tweet.split())
for tup in lst_tags: # for each (tag, word) tuple in lst_tags
if(tup[1] != 'O'): # exclude 'O' (Outside - do not belong to named entity class)
entities.append(tup)
lst_tags # list of tags for the last tweet
[('Liverpool', 'ORGANIZATION'), ('are', 'O'), ('on', 'O'), ('the', 'O'), ('verge', 'O'), ('of', 'O'), ('selling', 'O'), ('Rhian', 'PERSON'), ('Brewster,', 'O'), ('a', 'O'), ('player', 'O'), ('who', 'O'), ('has', 'O'), ('yet', 'O'), ('to', 'O'), ('play', 'O'), ('a', 'O'), ('Premier', 'O'), ('League', 'O'), ('game,', 'O'), ('for', 'O'), ('£23.5…', 'O'), ('https:t.co6yuDSdxbjW', 'O')]
# Print the first 20 entities
entities[:20]
[('Premier', 'ORGANIZATION'), ('League', 'ORGANIZATION'), ('La', 'ORGANIZATION'), ('Liga', 'ORGANIZATION'), ('Brewster', 'PERSON'), ('Swansea', 'LOCATION'), ('Forbes', 'PERSON'), ('Premier', 'ORGANIZATION'), ('League', 'ORGANIZATION'), ('Chennai', 'ORGANIZATION'), ('Super', 'ORGANIZATION'), ('Kings', 'ORGANIZATION'), ('Premier', 'ORGANIZATION'), ('League', 'ORGANIZATION'), ('Europa', 'ORGANIZATION'), ('League', 'ORGANIZATION'), ('Burnley', 'ORGANIZATION'), ('Southampton', 'ORGANIZATION'), ('Everton', 'ORGANIZATION'), ('@MrAncelotti', 'ORGANIZATION')]
# Load entity tuples to dataframe df_entities and name the columns “word” and “ner”
df_entities = pd.DataFrame(entities)
df_entities.columns = ["word","ner"]
df_entities
from collections import Counter
'''
Counter is a class from the collections module that counts the frequency of elements
in a collection, like a list. It returns a dictionary-like object where keys are the
elements, and values are their counts.
'''
# Filter df_entities to extract rows with NER = Organisations.
organizations = df_entities[df_entities['ner'].str.contains("ORGANIZATION")]
# top 10 Organisations: Get the top 10 most mentioned organizations.
cnt = Counter(organizations['word'])
cnt.most_common(10)
[('League', 18), ('Premier', 15), ('Burnley', 6), ('Southampton', 5), ('Liverpool', 5), ('Tottenham', 4), ('United', 3), ('La', 2), ('Europa', 2), ('Sheffield', 2)]
# Filter df_entities to extract rows with NER = PERSON.
people = df_entities[df_entities['ner'].str.contains("PERSON")]
# top 10 Persons: Get the top 10 most mentioned persons.
cnt = Counter(people['word'])
cnt.most_common(10)
[('Brewster', 5), ('Harry', 4), ('Kane', 3), ('Saliba', 2), ('Thiago', 2), ('Rhian', 2), ('Jose', 2), ('Mourinho', 2), ('Carlos', 2), ('Vinicius', 2)]
# Filter df_entities to extract rows with NER = LOCATION.
locations = df_entities[df_entities['ner'].str.contains("LOCATION")]
# top 5 Locations: Get the top 10 most mentioned locations.
cnt = Counter(locations['word'])
cnt.most_common(5)
[('Swansea', 1), ('Accra', 1), ('Liverpool', 1), ('West', 1), ('Ham', 1)]
# Extract the tweets containing the organization ‘Liverpool’.
liverpool_tweets = df[df['text'].str.contains('Liverpool')]
print(liverpool_tweets['text'])
16 Liverpool have had over 60 million combined fo... 18 Liverpool won the champions league and premier... 21 The extravagance of England's Premier League: ... 44 Really wanted this kid to make it at Liverpool... 69 This is a great deal. He’s never kicked a ball... 92 Liverpool career: Premier League appearances: ... 99 Liverpool are on the verge of selling Rhian Br... Name: text, dtype: object
Use the Search Bar to find content on MarketingMind.
Contact | Privacy Statement | Disclaimer: Opinions and views expressed on www.ashokcharan.com are the author’s personal views, and do not represent the official views of the National University of Singapore (NUS) or the NUS Business School | © Copyright 2013-2025 www.ashokcharan.com. All Rights Reserved.