Python Character Analysis

Introduction

I have separated my work into sections for ease of flow. All Python code is included in this article. Observations about the data accompany the histogram and heatmap below.

The header of my Python file gives general information:

Title: Python Character Analysis
Author: Eric Pena
Date: Oct. 2019

Text Source:
Academic Sample
http://www.thegrammarlab.com/?nor-portfolio=1000000-word-sample-corpora

Packages

Below are the packages that the program imports in order to work properly.

import pandas as pd
import fileinput as fi
import matplotlib.pyplot as plt
import seaborn as sns
import string

User Defined Functions

I have defined several functions used by the main() function:

def read(file):
	"""Reads given file and parses characters

	Args:
		file: the text file to be parsed
	Returns:
		charArr: parsed character array
	"""
	return [i for line in fi.input(file) for i in line]

# -------------------------------------------------------------------

def count(array):
	"""Counts characters and creates freq table

	Args:
		array: character array of text
	Returns:
		freq: dictionary that represents freq table
	"""
	return {c: array.count(c) for c in array}


# -------------------------------------------------------------------


def partition2(array):
	"""Works similarly to Mathematica's Partition function,
		but each overlapping pair of adjacent characters is
		joined into a single string so that the pairs can be
		tallied by the count function.

	Args:
		array: character array of text
	Returns:
		list of two-character strings, one per adjacent pair
	"""
	return [str(array[i]) + str(array[i + 1]) for i in range(len(array) - 1)]
# -------------------------------------------------------------------


def dict_print(d):
	"""Print function specifically for dictionary

	Args:
		d: dictionary
	Returns:
		None: only prints out the contents of the dictionary
	"""
	for key, val in d.items():
		print(key[0] + ' --- ' + key[1] + ' :\t' + str(val))

# -------------------------------------------------------------------


def to_dataframe(d):
	"""Converts the dictionary of transitions to a dataframe
		that can be turned into a heatmap

	Args:
		d: dictionary mapping two-character strings to counts
	Returns:
		df: dataframe with first characters as rows and
			second characters as columns
	"""
	# :: Row/column labels: space first, then a-z (the order the
	# :: original pivot produced, since ' ' sorts before letters)
	alpha = [' '] + list(string.ascii_lowercase)

	# :: Initialize matrix of zeros (replaces the DataFrame.append
	# :: loop, which pandas 2.0 removed; no pivot is needed)
	df = pd.DataFrame(0, index=alpha, columns=alpha)
	df.index.name = 'First'
	df.columns.name = 'Second'

	# :: Add relevant frequencies to the matrix
	for k in d:
		# :: Skip pairs containing characters outside the matrix
		if k[0] in alpha and k[1] in alpha:
			df.loc[k[0], k[1]] = d[k]

	return df.astype(int)
# -------------------------------------------------------------------


def show_heatmap(df, filename):
	"""Create, save and plot heatmap of data

	Args:
		df: dataframe of frequencies
		filename: path where the heatmap image is saved
	Returns:
		None: Instead will save and plot a heatmap of the data
	"""
	# :: Create heatmap and customize
	sns.set()
	ax = sns.heatmap(df, cmap="binary", robust=True, xticklabels=True, yticklabels=True)
	ax.xaxis.set_label_position('top')
	ax.xaxis.set_ticks_position('top')
	ax.spines['top'].set_visible(False)
	ax.tick_params(top=False, left=False)
	ax.xaxis.label.set_color('darkgray')
	ax.yaxis.label.set_color('darkgray')
	ax.tick_params(axis='x', colors='darkgray')
	ax.tick_params(axis='y', colors='darkgray')
	plt.xlabel('Second Letter', fontsize=18)
	plt.ylabel('First Letter', fontsize=18)

	# :: Save before showing: once the window opened by plt.show()
	# :: is closed, the figure may no longer render correctly
	figure = ax.get_figure()
	figure.savefig(filename, dpi=400)
	plt.show()
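
The count function above calls array.count(c) once for every element of the array, so its running time grows quadratically with the length of the text. The following is a minimal sketch of a single-pass alternative, assuming the standard library's collections.Counter is acceptable here. The name count_fast is hypothetical and not part of the original program.

```python
from collections import Counter


def count_fast(array):
    """Count items in one pass; keys keep first-appearance order."""
    # Counter is a dict subclass, so converting it to a plain dict
    # preserves the order in which items were first seen
    return dict(Counter(array))


if __name__ == '__main__':
    print(count_fast(list('hello')))
```

Because the keys keep their first-appearance order, the result should be interchangeable with the dictionary returned by count.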

Main Program

This shows the code for the main program which utilizes the functions above.

def main():
	# ---------------------------MAIN PROGRAM---------------------------
	# :: Reads in text file (once; the characters are reused below)
	# :: Counts the frequencies
	# :: Data stored in dictionary
	# :: Plots histogram of results

	chars = read('text.txt')
	freq_dict = count(chars)
	plt.bar(freq_dict.keys(), freq_dict.values(), color='gray')
	plt.title('Character Histogram')
	plt.xlabel('Characters')
	plt.ylabel('Frequency')
	plt.show()

	# :: Partitions in 2-tuples for transitions
	# :: Data stored in dictionary
	# :: Frequencies are printed to console/terminal

	pair_dict = count(partition2(chars))
	dict_print(pair_dict)

	df = to_dataframe(pair_dict)
	print(df)

	filename = '/Users/ericpena/iCloud/Binghamton_Courses/500_Computational_Tools/HW2/heatmap.png'
	show_heatmap(df, filename)

if __name__ == '__main__':
	main()
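
The matrix built by to_dataframe only has rows and columns for the letters a-z and the space character, so the program expects text.txt to contain only those characters. The article does not show how the file was prepared. The sketch below is one hypothetical cleaning step; the function name normalize and the choice to treat every other character as a word separator are assumptions made for illustration.

```python
import string

# Characters that have a row and a column in the heatmap matrix
ALLOWED = set(string.ascii_lowercase + ' ')


def normalize(text):
    """Lowercase the text and keep only a-z and single spaces."""
    lowered = text.lower()
    # Treat every character outside a-z as a word separator
    # (an illustrative choice, not taken from the article)
    cleaned = ''.join(ch if ch in ALLOWED else ' ' for ch in lowered)
    # Collapse runs of spaces so that separators are not over-counted
    return ' '.join(cleaned.split())


if __name__ == '__main__':
    print(normalize('Hello, World!\nBye'))
```

The cleaned string could then be written to text.txt, or passed as list(normalize(raw_text)) in place of the output of read.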

Plot of Histogram

Figure 1 — Histogram that shows frequencies of characters appearing in the text

Histogram Observations

Here are a few observations about the histogram above:

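  • $space\ character$: The space character is by far the most frequent (1370 occurrences). This makes sense, since a space follows nearly every word
  • $\{j, z, x, k\}$: Characters such as $j$, $z$, $x$, and $k$ have low frequencies (9, 11, 24, and 15 occurrences), since they are rare in common words
  • $vowels$: The vowels $e$, $i$, $a$, and $o$ are among the most frequent letters, as expected for English text. The exceptions are $t$ (653), which outranks every vowel except $e$ (958), and $u$ (206), which is less frequent than several consonants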
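
The observations above can be checked directly against the counts listed in the appendix. The sketch below ranks the characters and converts each count into a share of all characters. The helper name relative_frequencies is hypothetical, and the dictionary is copied from the appendix output.

```python
# Character counts copied from the appendix output
APPENDIX_COUNTS = {
    'd': 234, 'i': 574, 'f': 233, 'e': 958, 'r': 428, 'n': 492, 'c': 255,
    ' ': 1370, 'w': 111, 'h': 344, 'm': 184, 's': 455, 't': 653, 'o': 475,
    'u': 206, 'a': 561, 'p': 146, 'l': 336, 'y': 77, 'x': 24, 'b': 111,
    'k': 15, 'g': 103, 'v': 60, 'q': 20, 'j': 9, 'z': 11,
}


def relative_frequencies(freq):
    """Return (char, count, share) tuples sorted by descending count."""
    total = sum(freq.values())
    ranked = sorted(freq.items(), key=lambda item: item[1], reverse=True)
    return [(char, n, n / total) for char, n in ranked]


if __name__ == '__main__':
    # Print the five most frequent characters with their shares
    for char, n, share in relative_frequencies(APPENDIX_COUNTS)[:5]:
        print(repr(char), n, f'{share:.1%}')
```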

Heatmap of Character Transitions

The heatmap below visually represents the frequencies of the transitions $c_i \rightarrow c_{i+1}$, where $c_i$ is the $i$-th character in the supplied text file.

Figure 2 — Heatmap that shows the frequencies of character transitions
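
To make the transitions $c_i \rightarrow c_{i+1}$ concrete, the sketch below pairs every character with its successor on a short example string. It should produce the same pairs as partition2, but uses zip instead of index arithmetic. The name transitions is hypothetical.

```python
def transitions(chars):
    """Join each character with its successor: c_i followed by c_{i+1}."""
    # zip stops at the shorter input, so the last character starts no pair
    return [a + b for a, b in zip(chars, chars[1:])]


if __name__ == '__main__':
    print(transitions(list('the cat')))
    # prints ['th', 'he', 'e ', ' c', 'ca', 'at']
```

A text of n characters yields n - 1 transitions, and each one contributes a count of 1 to the cell in the row of its first character and the column of its second.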

Heatmap Observations

Here are a few observations about the heatmap above:

  • $Common\ Occurrences$: Frequent letter pairs include $t \rightarrow h$, $i \rightarrow n$, $n \rightarrow t$, $r \rightarrow e$, and $t \rightarrow i$
  • $Spaces$: As expected, the row and column of the $space$ are quite active. This makes sense, since nearly every word is preceded and followed by a $space$
  • $Bare$: It is interesting but not unexpected that the bottom right corner is quite bare, with very low frequencies for letters later in the alphabet
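
The most frequent transitions overall involve the space character, so a ranking of letter-to-letter pairs has to filter those out first. The sketch below shows one way to do this. The helper top_transitions is hypothetical, and the sample dictionary contains only a few of the pair counts from the appendix.

```python
def top_transitions(pair_freq, n=5, skip=' '):
    """Return the n most frequent pairs that do not contain `skip`."""
    letter_pairs = {
        pair: k for pair, k in pair_freq.items() if skip not in pair
    }
    ranked = sorted(letter_pairs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


if __name__ == '__main__':
    # A few of the pair counts listed in the appendix
    sample = {'e ': 339, ' t': 252, 'th': 212, 'he': 210,
              'in': 128, 'er': 114, 're': 113}
    print(top_transitions(sample, n=3))
    # prints [('th', 212), ('he', 210), ('in', 128)]
```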

Robustness Parameter

The heatmap above uses the robust=True parameter. With this setting, seaborn computes the colormap range from robust quantiles of the data instead of the minimum and maximum, so a few very large counts no longer stretch the color scale and wash out everything else. This is an improvement over the heatmap drawn with the default color range. See below for the difference between the raw heatmap and the robust heatmap. More visual information can be obtained with the robust parameter, since the 'interesting' events are much more pronounced.

Figure 3 — Shows the difference between the Raw and Robust frequencies for the heatmap
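
The effect of the robust parameter can be illustrated without plotting. The sketch below compares a color range taken from the minimum and maximum with one taken from percentiles. It assumes that the robust range spans the 2nd to the 98th percentile; the helpers percentile and color_limits are hypothetical stand-ins written for this illustration, not seaborn functions.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def color_limits(values, robust=False):
    """Return (vmin, vmax) for a color scale over `values`."""
    if robust:
        # Assumed quantiles: 2nd and 98th percentiles
        return percentile(values, 2), percentile(values, 98)
    return min(values), max(values)


if __name__ == '__main__':
    # Mostly small counts plus a single very large one
    counts = [0] * 90 + [5] * 9 + [339]
    print(color_limits(counts))               # prints (0, 339)
    print(color_limits(counts, robust=True))  # prints (0.0, 5.0)
```

With the raw limits, the single count of 339 sets the top of the scale and the cells with a count of 5 are nearly indistinguishable from zero. With the robust limits, those cells reach the top of the scale and the outlier is simply drawn in the darkest color.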

Appendix — Output Data

Histogram Frequencies

{'d': 234, 'i': 574, 'f': 233, 'e': 958, 'r': 428, 'n': 492, 'c': 255, ' ': 1370, 'w': 111, 'h': 344, 'm': 184, 's': 455, 't': 653, 'o': 475, 'u': 206, 'a': 561, 'p': 146, 'l': 336, 'y': 77, 'x': 24, 'b': 111, 'k': 15, 'g': 103, 'v': 60, 'q': 20, 'j': 9, 'z': 11}

Heatmap Frequencies

d — i : 30 i — f : 11 f — f : 15 f — e : 10 e — r : 114 r — e : 113 e — n : 105 n — c : 22 c — e : 49 e — : 339 — w : 79 w — h : 23 h — e : 210 — m : 53 m — c : 1 c — : 8 — i : 72 i — s : 61 s — : 199 — t : 252 t — h : 212 m — o : 13 o — i : 5 s — t : 41 t — u : 11 u — r : 46 — c : 90 c — o : 54 o — n : 104 n — t : 88 t — e : 94 t — : 107 m — a : 18 a — : 40 a — s : 55 s — s : 18 — o : 93 o — f : 70 f — : 73 — s : 69 s — a : 14 a — m : 23 m — p : 25 p — l : 14 l — e : 47 — a : 168 a — f : 6 f — t : 4 r — : 64 — h : 40 h — u : 12 u — m : 9 m — i : 37 i — d : 28 i — t : 46 t — y : 11 y — : 60 — e : 40 e — x : 11 x — p : 3 p — o : 41 o — s : 28 s — u : 28 a — n : 92 n — d : 65 d — : 140 m — d : 1 — d : 39 d — r : 2 r — y : 12 — r : 37 e — s : 71 u — l : 21 l — t : 4 t — s : 17 s — c : 3 c — u : 3 u — s : 31 s — i : 53 i — o : 60 n — : 131 c — h : 34 e — m : 26 i — c : 47 c — a : 45 a — l : 81 l — : 50 o — m : 24 t — i : 92 — f : 103 f — i : 59 i — b : 38 b — e : 59 r — s : 37 w — e : 33 e — l : 40 l — l : 49 — k : 1 k — n : 1 n — o : 21 o — w : 12 w — n : 3 h — a : 34 a — t : 55 — l : 37 l — i : 42 i — g : 23 g — n : 4 o — c : 10 l — u : 30 l — o : 18 i — n : 128 n — v : 2 v — e : 34 g — a : 8 e — d : 68 o — u : 27 u — n : 17 n — e : 37 — q : 2 q — u : 20 u — a : 8 i — e : 24 d — o : 5 o — e : 2 — n : 19 o — t : 30 a — d : 10 d — d : 1 — u : 12 u — p : 9 p — : 6 t — o : 47 o — : 48 i — m : 21 l — y : 27 — b : 46 e — c : 25 a — u : 9 s — e : 44 n — l : 4 a — j : 1 j — o : 1 o — r : 66 a — r : 64 e — p : 6 r — t : 15 d — e : 24 e — t : 32 r — m : 17 — p : 66 p — e : 20 c — t : 35 p — r : 27 r — o : 42 e — i : 11 n — s : 24 x — t : 4 t — r : 17 r — a : 32 a — c : 43 t — a : 24 a — b : 18 b — l : 12 r — g : 6 n — i : 28 t — t : 15 u — c : 7 h — : 31 w — a : 19 a — x : 12 x — e : 2 f — a : 20 l — c : 1 o — h : 1 h — o : 18 o — l : 26 l — s : 11 c — i : 8 d — s : 16 i — l : 28 l — a : 44 r — l : 5 e — q : 7 u — e : 20 a — p : 11 p — p : 4 o — x : 1 x — : 9 w — t : 3 h — i : 
26 — g : 19 g — o : 2 o — o : 2 o — d : 2 a — g : 7 g — r : 16 e — e : 18 m — e : 46 — v : 22 v — a : 20 b — y : 10 e — z : 2 z — : 2 n — z : 1 z — a : 1 f — l : 8 a — v : 12 g — h : 10 b — a : 11 r — n : 11 n — h : 6 s — k : 6 k — : 7 e — a : 39 r — c : 2 g — e : 13 u — g : 5 g — u : 10 u — i : 13 e — y : 5 n — g : 33 g — : 29 f — r : 9 m — : 19 r — i : 36 e — o : 6 o — g : 3 p — h : 4 e — g : 6 g — i : 7 o — p : 10 r — f : 13 s — h : 13 w — s : 3 h — t : 7 a — i : 13 w — i : 15 s — w : 3 x — i : 4 m — u : 7 d — u : 7 i — q : 9 p — i : 7 i — i : 1 i — : 1 e — f : 7 p — a : 9 c — k : 3 k — e : 5 e — v : 10 f — u : 9 b — s : 1 s — o : 13 r — p : 4 p — t : 6 m — n : 8 f — o : 26 n — f : 6 d — a : 3 i — a : 23 h — l : 1 i — k : 3 n — y : 1 n — a : 19 r — v : 1 l — w : 1 a — y : 3 y — s : 2 v — i : 5 r — r : 8 s — p : 5 i — z : 3 z — e : 4 o — b : 1 b — t : 1 i — p : 1 y — i : 2 i — v : 10 c — r : 9 c — c : 2 g — y : 1 — z : 3 z — i : 4 s — m : 7 c — l : 4 p — u : 4 t — w : 6 m — s : 5 b — o : 5 l — d : 11 b — i : 4 p — s : 4 b — u : 7 u — t : 12 h — y : 2 y — d : 1 i — r : 8 c — y : 1 g — g : 1 a — z : 1 n — k : 1 y — z : 1 l — m : 1 — y : 3 y — p : 2 x — c : 2 r — u : 4 u — f : 1 d — l : 1 o — a : 1 s — y : 1 y — m : 1 o — v : 1 d — v : 2 u — : 1 — j : 5 j — u : 2 y — t : 3 a — q : 1 y — r : 1 g — l : 1 w — o : 6 r — d : 5 u — d : 2 u — b : 3 y — e : 4 u — o : 1 m — m : 1 e — w : 6 w — : 5 s — b : 1 g — f : 1 m — b : 3 a — w : 1 a — k : 1 b — : 1 n — u : 2 k — s : 2 n — j : 1 j — a : 1 s — r : 1 a — e : 1 j — e : 5 a — h : 1 r — b : 1 o — j : 1 e — u : 2 v — o : 1 s — l : 4 h — m : 1 h — r : 2 d — w : 3 w — r : 1 e — j : 1 s — q : 1