The Network and Trajectories of Transitions

among Sentential Co-Occurrences of Characters of

Arthur Conan Doyle's A Study in Scarlet

By Moses Boudourides & Sergios Lenis

IMPORTANT: To use this notebook, you'll need to

  1. Install IPython Notebook (easiest way: use Anaconda)
  2. Download this notebook and all other Python scripts used here from https://github.com/mboudour/WordNets/blob/master/ArthurConanDoyle_AStudyInScarlet_Network&Trajectories.ipynb
  3. Run ipython notebook in the same directory where notebook and scripts were put

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Importing Python modules

In [1]:
import random
import nltk
import codecs
from textblob import TextBlob
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import os
import imp

# utilsdir='/home/sergios-len/Dropbox/Python Projects (1)/utils/tools.py'
utilsdir='/home/mab/Dropbox/Python Projects/utils/'#tools.py'

%matplotlib inline 
%load_ext autoreload
/usr/local/lib/python2.7/dist-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

I. Importing the Text of Arthur Conan Doyle's A Study in Scarlet

In [2]:
filename = 'Texts/AStudyInScarlet.txt'
titlename = "Arthur Conan Doyle's A Study in Scarlet"

f = codecs.open(filename, "r", encoding="utf-8").read()

num_lines = 0
num_words = 0
num_chars = 0
for line in f:
    words = line.split()
    num_lines += 1
    num_words += len(words)
    num_chars += len(line)
print "%s has number of words = %i and number of characters = %i" %(titlename,num_words,num_chars)

blob = TextBlob(f)
Arthur Conan Doyle's A Study in Scarlet has number of words = 195313 and number of characters = 260087

II. Counting Sentential Co-Occurrences of Characters of Arthur Conan Doyle's A Study in Scarlet and Measuring Sentential Sentiment Polarities and Subjectivities

In [3]:
dici={'Sherlock Holmes':'Sherlock Holmes', 'Mr. Sherlock Holmes':'Sherlock Holmes', 'Sherlock':'Sherlock Holmes', 
      'Holmes':'Sherlock Holmes',
      'Dr. Watson':'Dr. Watson', 'Watson':'Dr. Watson', 
      'Lestrade':'Lestrade',
      'Lucy Ferrier':'Lucy Ferrier', 'Lucy':'Lucy Ferrier',  
      'John Ferrier':'John Ferrier',
      'John Rance':'John Rance', 'Rance':'John Rance', 
      'Arthur Charpentier':'Arthur Charpentier', 'Lieutenant Charpentier':'Arthur Charpentier', 
      'Mrs. Charpentier':'Mrs. Charpentier', 'Madame Charpentier':'Mrs. Charpentier',
      'Enoch Drebber':'Enoch Drebber', 'Enoch':'Enoch Drebber', 'Drebber': 'Enoch Drebber',  
      'Jefferson Hope':'Jefferson Hope', 'Jefferson':'Jefferson Hope', 'Hope':'Jefferson Hope',
      'Brigham Young':'Brigham Young', 'Brigham':'Brigham Young', 'Young': 'Brigham Young',  
      'Joseph Stangerson':'Joseph Stangerson', 'Joseph':'Joseph Stangerson', 'Stangerson': 'Joseph Stangerson', 
      'Tobias Gregson':'Tobias Gregson', 'Gregson':'Tobias Gregson',
      'Stamford':'Stamford'
     }

ndici={i.lower():k for i,k in dici.items()}
dnici=[(i.split()[0],i.split()[1]) for i in ndici.keys() if len(i.split())>1]

selectedTerms=ndici.keys()
In [4]:
%autoreload 2

tool= imp.load_source('tools', utilsdir+'tools.py')

create_pandas_dataframe_from_text=tool.create_pandas_dataframe_from_text
create_coo_graph=tool.create_coo_graph


dfst,sec_prot,coccurlist,occurlist,dflines=create_pandas_dataframe_from_text(blob,selectedTerms,ndici,titlename)
co_graph=create_coo_graph(coccurlist)

dfst.rename(columns={"Arthur Conan Doyle's A Study in Scarlet selected terms":"Arthur Conan Doyle's A Study in Scarlet Characters"},inplace=True)
dfst.sort_values(by='Frequencies').sort(["Frequencies"], ascending=[0])
/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:13: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
Out[4]:
Arthur Conan Doyle's A Study in Scarlet Characters Frequencies
12 Sherlock Holmes 98
8 Enoch Drebber 47
5 Lestrade 43
3 Jefferson Hope 42
13 Tobias Gregson 39
4 Joseph Stangerson 36
2 John Ferrier 27
0 Lucy Ferrier 18
7 Stamford 11
1 John Rance 10
10 Dr. Watson 6
11 Mrs. Charpentier 6
6 Arthur Charpentier 2
9 Brigham Young 2
In [5]:
prot_pol_sub=dflines[['protagonists','#_of_protagonists','polarity','subjectivity']].reset_index()
prot_pol_sub['sentence_id']=prot_pol_sub.index
prot_pol_sub=prot_pol_sub[['sentence_id','protagonists','#_of_protagonists','polarity','subjectivity']]

cuts = 1
prot_pol_sub = prot_pol_sub[prot_pol_sub['#_of_protagonists']>cuts]
lp = prot_pol_sub['protagonists'].tolist()
lpn = []
for i in lp:
    for j in i:
        lpn.append(j)
# len(set(lpn))
print "The total number of sentences in %s with at least %i characters in each one of them is %i." %(titlename,cuts+1,len(prot_pol_sub))
prot_pol_sub.rename(columns={'protagonists':'Lists_of_Characters','#_of_protagonists':'#_of_Characters','polarity':'Polarity','subjectivity':'Subjectivity'},inplace=True)
prot_pol_sub.sort(["#_of_Characters"], ascending=[0]) 
ddff = prot_pol_sub.drop('sentence_id', 1)
ddff.index.name = 'Sentence_ID'
ddff
The total number of sentences in Arthur Conan Doyle's A Study in Scarlet with at least 2 characters in each one of them is 42.
/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:15: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
Out[5]:
Lists_of_Characters #_of_Characters Polarity Subjectivity
Sentence_ID
0 [John Rance, Sherlock Holmes] 2 0.328571 0.602381
97 [Dr. Watson, Stamford, Sherlock Holmes] 3 0.000000 0.000000
468 [Lestrade, Tobias Gregson] 2 -0.700000 0.666667
479 [Lestrade, Tobias Gregson] 2 0.350000 0.644444
624 [Lestrade, Sherlock Holmes] 2 0.000000 0.000000
666 [Lestrade, Tobias Gregson] 2 -0.075000 0.350000
688 [Lestrade, Tobias Gregson] 2 0.087500 0.237500
729 [Lestrade, Tobias Gregson] 2 0.250000 0.500000
794 [John Rance, Sherlock Holmes] 2 -0.800000 1.000000
1074 [Lestrade, Tobias Gregson] 2 0.500000 0.888889
1118 [Arthur Charpentier, Tobias Gregson] 2 0.000000 0.000000
1311 [Sherlock Holmes, Tobias Gregson] 2 0.000000 0.000000
1314 [Joseph Stangerson, Enoch Drebber] 2 0.400000 0.800000
1366 [Joseph Stangerson, Enoch Drebber] 2 -0.250000 0.250000
1382 [Joseph Stangerson, Enoch Drebber] 2 0.276190 0.560952
1414 [Joseph Stangerson, Enoch Drebber] 2 0.100000 0.200000
1444 [Lestrade, Joseph Stangerson] 2 -0.500000 0.900000
1447 [Lestrade, Tobias Gregson] 2 0.285714 0.535714
1465 [Lestrade, Tobias Gregson] 2 0.300000 1.000000
1484 [Joseph Stangerson, Enoch Drebber] 2 0.000000 0.000000
1489 [Lestrade, Sherlock Holmes, Tobias Gregson] 3 0.500000 0.500000
1745 [Joseph Stangerson, Enoch Drebber] 2 0.150216 0.427706
1825 [Lucy Ferrier, Jefferson Hope] 2 0.375000 0.562500
1917 [Joseph Stangerson, Enoch Drebber] 2 0.800000 0.900000
1965 [Jefferson Hope, John Ferrier] 2 0.000000 0.000000
1974 [Joseph Stangerson, Enoch Drebber] 2 0.150000 0.325000
2090 [Lucy Ferrier, Jefferson Hope] 2 -0.166667 0.066667
2115 [Lucy Ferrier, Jefferson Hope] 2 0.016667 0.333333
2175 [Lucy Ferrier, Jefferson Hope] 2 0.000000 0.000000
2273 [Joseph Stangerson, Enoch Drebber] 2 0.100000 0.400000
2274 [Joseph Stangerson, Enoch Drebber] 2 1.000000 0.300000
2311 [Joseph Stangerson, Enoch Drebber] 2 0.000000 0.000000
2312 [Joseph Stangerson, Enoch Drebber] 2 0.104762 0.676190
2336 [Lestrade, Sherlock Holmes, Tobias Gregson] 3 0.066667 0.533333
2420 [Joseph Stangerson, Enoch Drebber] 2 -0.333333 0.583333
2424 [Joseph Stangerson, Enoch Drebber] 2 0.000000 0.000000
2428 [Joseph Stangerson, Enoch Drebber] 2 0.500000 1.000000
2485 [Lucy Ferrier, John Ferrier] 2 0.078571 0.402381
2545 [Joseph Stangerson, John Ferrier] 2 0.200000 0.200000
2580 [Lestrade, Sherlock Holmes, Tobias Gregson] 3 0.050000 0.200000
2655 [Jefferson Hope, Enoch Drebber] 2 0.150000 0.231250
2686 [Lestrade, Tobias Gregson] 2 -0.046429 0.616964

Basic Univarate Statistics of Selected Sentences in Arthur Conan Doyle's A Study in Scarlet

In [6]:
prot_pol_sub[['#_of_Characters','Polarity','Subjectivity']].describe()
Out[6]:
#_of_Characters Polarity Subjectivity
count 42.000000 42.000000 42.000000
mean 2.095238 0.101153 0.414172
std 0.297102 0.333149 0.319287
min 2.000000 -0.800000 0.000000
25% 2.000000 0.000000 0.200000
50% 2.000000 0.083036 0.401190
75% 2.000000 0.283333 0.613318
max 3.000000 1.000000 1.000000

The Histogram of the Number of Characters in the Selected Sentences in Arthur Conan Doyle's A Study in Scarlet

In [7]:
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes
from mpl_toolkits.axes_grid1.inset_locator import mark_inset

ndfl=dflines[dflines['#_of_protagonists']>0  ]

fig, ax = plt.subplots(figsize=[12, 10])
axes2 = zoomed_inset_axes(ax, 16, loc=7)  # zoom = 6

dflines['#_of_protagonists'].plot.hist(ax=ax)

ax.set_xlabel('#_of_Characters')
ax.set_ylabel('Frequency')
ax.set_title('Histogram of # of characters')

x1, x2, y1, y2 = 2.95, 3., 0, 30
axes2.set_xlim(x1, x2)
axes2.set_ylim(y1, y2)
ndfl['#_of_protagonists'].plot.hist(ax=axes2)
axes2.set_ylabel('Frequency')

mark_inset(ax, axes2, loc1=2, loc2=4, fc="none", ec="0.5")
axes3 = zoomed_inset_axes(ax, 10, loc=10)

x1, x2, y1, y2 = 2, 2.1, 0, 60
axes3.set_xlim(x1, x2)
axes3.set_ylim(y1, y2)
ndfl['#_of_protagonists'].plot.hist(ax=axes3)
axes3.set_ylabel('Frequency')

mark_inset(ax, axes3, loc1=2, loc2=4, fc="none", ec="0.5")
plt.show()

III. The Two-Mode Network of Characters and Sentences in Arthur Conan Doyle's A Study in Scarlet

In [8]:
%autoreload 2

from tools import draw_network_node_color

sstt="%s Two-Mode Network of Sentences and Characters" %titlename
pos=nx.spring_layout(sec_prot)
nds=[nd for nd in sec_prot.nodes() if isinstance(nd,int)]
prot=[nd for nd in sec_prot.nodes() if nd not in nds]

for en,nd in enumerate(nds):
    if en<len(nds)/2.:
        pos[nd][0]=-1
        pos[nd][1]=en*2./len(nds)
    else:
        pos[nd][0]=1
        pos[nd][1]=(en-len(nds)/2.)*2./len(nds)
for en ,nd in enumerate(prot):
    pos[nd][0]=0
    pos[nd][1]=en*1./len(prot)

possit=draw_network_node_color(sec_prot,sstt,pos=pos,with_edgewidth=False,withLabels=True,labfs=12,valpha=0.2,
                               ealpha=0.4,labelfont=15,with_node_weight=False,node_size_fixer=300.,node_col='polarity')
In [9]:
possit=draw_network_node_color(sec_prot,sstt,pos=pos,with_edgewidth=False,withLabels=True,labfs=12,valpha=0.2,
                               ealpha=0.4,labelfont=15,with_node_weight=False,node_size_fixer=300.,
                               node_col='subjectivity',colormat='Greens')

IV. The Network of Sententially Co-Occurring Characters in Arthur Conan Doyle's A Study in Scarlet

In [10]:
%autoreload 2

from tools import draw_network, make_graph_from_lists

plist = prot_pol_sub['Lists_of_Characters'].tolist()
pplist=prot_pol_sub['Polarity'].tolist()
nplist=prot_pol_sub['#_of_Characters'].tolist()
splist=prot_pol_sub['Subjectivity'].tolist()

G = make_graph_from_lists(plist,pplist,nplist,splist)
posg=nx.spring_layout(G,scale=50,k=0.55,iterations=20)
# posg=nx.spring_layout(G,scale=50)#,k=0.55)#,iterations=20)

sstt="%s Network of Selected Characters \n(Sentences colored in polarity)" %titlename
possit=draw_network(G,sstt,pos=posg,with_edgewidth=True,withLabels=True,labfs=15,valpha=0.2,ealpha=0.7,labelfont=15,
                   with_edgecolor=True,edgecolor='polarity',colormat='Blues')
In [11]:
sstt="%s Network of Selected Characters \n(Sentences Colored in Subjectivity)" %titlename
possit=draw_network(G,sstt,pos=posg,with_edgewidth=True,withLabels=True,labfs=15,valpha=0.2,ealpha=0.7,labelfont=15,
                   with_edgecolor=True,edgecolor='subjectivity',colormat='Greys')

V. Centralities of Nodes in the Network of Sententially Co-Occurring Characters in Arthur Conan Doyle's A Study in Scarlet

In [12]:
from tools import draw_centralities_subplots

centrali=draw_centralities_subplots(G,pos=posg,withLabels=False,labfs=5,figsi=(15,22),ealpha=1,vals=True)