Start by importing the necessary libraries.

In [1]:
```
%matplotlib inline
import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
from bigbang.archive import Archive
import bigbang.utils as utils
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os
import csv
import re
import scipy
import scipy.cluster.hierarchy as sch
import email.utils

```
In [2]:
```
#pd.options.display.mpl_style = 'default' # pandas has a set of preferred graph formatting options
plt.rcParams['axes.facecolor'] = 'white'
import seaborn as sns
sns.set()
sns.set_style("white")

```
In [3]:
```
list_url = '6lo' # an example IETF working group mailing list
ietf_archives_dir = '../archives' # relative location of the ietf-archives directory/repo
list_archive = mailman.open_list_archives(list_url, ietf_archives_dir)
activity = Archive(list_archive).get_activity()

```
In [4]:
```
people = None
people = pd.DataFrame(activity.sum(0), columns=['6lo']) # sum the message count, rather than by date

```
In [5]:
```
people.describe()

```
The `-activity.csv` files contain the activity summary for the full list archive. These files are created with `bin/mail_to_activity.py` or might be included in the mailing list archive repository.
In [6]:
```
with open('../examples/mm.ietf.org.txt', 'r') as f:
    ietf_lists = set(f.readlines()) # remove duplicates, which result from a bug in list maintenance

```
In [7]:
```
list_activities = []

for list_url in ietf_lists:
    try:
        activity_summary = mailman.open_activity_summary(list_url, ietf_archives_dir)
        if activity_summary is not None:
            list_activities.append((list_url, activity_summary))
    except Exception as e:
        print(str(e))

```
In [8]:
```
len(list_activities)

```
**This operation is a little slow.** Don't repeat this operation without recreating `people` from the cells above.
In [9]:
```
list_columns = []
for (list_url, activity_summary) in list_activities:
    list_name = mailman.get_list_name(list_url)
    # name the message count column for the list
    activity_summary.rename(columns={'Message Count': list_name}, inplace=True)
    people = pd.merge(people, activity_summary, how='outer', left_index=True, right_index=True)
    # keep a list of the columns that specifically represent mailing list message counts
    list_columns.append(list_name)

```
In [10]:
```
# the original message column was duplicated during the merge process, so we remove it here
people = people.drop(columns=['6lo_y'])
people = people.rename(columns={'6lo_x':'6lo'})

```
In [12]:
```
people.describe()

```
In [46]:
```
# not sure how the index ended up with NaN values, but need to change them to strings here so additional steps will work
new_index = people.index.fillna('missing')
people.index = new_index
```

Split out the email address and header name from the From header we started with.
In [47]:
```
froms = pd.Series(people.index)
emails = froms.apply(lambda x: email.utils.parseaddr(x)[1])
emails.index = people.index
names = froms.apply(lambda x: email.utils.parseaddr(x)[0])
names.index = people.index
people['email'] = emails
people['name'] = names
```

Let's create some summary statistical columns.
In [48]:
```
people['Total Messages'] = people[list_columns].sum(axis=1)
people['Number of Groups'] = people[list_columns].count(axis=1)
people['Median Messages per Group'] = people[list_columns].median(axis=1)

```
In [49]:
```
people['Total Messages'].sum()

```
**101,510** "people" sent a combined total of **1.2 million messages**. Most people sent only one message.
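To sanity-check the claim that most people sent only one message, here is a quick look at the distribution (assuming `people` as constructed above):

```
# median messages per sender, and the share of senders with exactly one message
print(people['Total Messages'].median())
print((people['Total Messages'] == 1).mean())
```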
In [22]:
```
people[['Total Messages']].plot(kind='hist', bins=100, logy=True, logx=False)
people[['Number of Groups']].plot(kind='hist', bins=100, logy=True, logx=False)

```
In [23]:
```
working = people[people['Total Messages'] > 5].copy() # copy to avoid a pandas SettingWithCopyWarning
working['Total Messages (log)'] = np.log10(working['Total Messages'])
working['Number of Groups (log)'] = np.log10(working['Number of Groups'])

```
In [24]:
```
working[['Median Messages per Group']].plot(kind='hist', bins=100, logy=True)

```
In [25]:
```
working.plot.scatter('Number of Groups', 'Total Messages', xlim=(1,300), ylim=(1,20000), logx=False, logy=True)

```
In [26]:
```
sns.jointplot(x='Number of Groups', y='Total Messages (log)', data=working, kind="kde", xlim=(0,50), ylim=(0,3));

```
Can we learn implicit relationships between groups based on the messaging patterns of participants?

We want to work with just the data of people and how many messages they sent to each group.
In [27]:
```
df = people[people['Total Messages'] > 5]
df = df.drop(columns=['email','name','Total Messages','Number of Groups','Median Messages per Group'])
df = df.fillna(0)

```
In [28]:
```
from sklearn.decomposition import PCA
from sklearn.preprocessing import maxabs_scale

# scale each column by its maximum absolute value, so that large lists don't dominate
scaled = maxabs_scale(df)
pca = PCA(n_components=2, whiten=True)
pca.fit(scaled)

```
In [29]:
```
components_frame = pd.DataFrame(pca.components_)
components_frame.columns = df.columns
components_frame

```
In [30]:
```
for i, row in components_frame.iterrows():
    print('\nComponent %d' % i)
    r = row.sort_values(ascending=False)
    print('Most positive correlation:\n %s' % r[:5].index.values)
    print('Most negative correlation:\n %s' % r[-5:].index.values)

```
Component 0 is mostly routing (Layer 3 and Layer 2 VPNs, the routing area working group, interdomain routing). (IP Performance/Measurement seems different -- is it related?)

Component 1 is all Internet area groups, mostly related to IPv6, and specifically different groups working on mobility-related extensions to IPv6.

When the data was unscaled, the PCA components seemed to connect to ops and IPv6, a significantly different result.

For our two components, we can see which features are most positively correlated and which are most negatively correlated. Looking up the positively correlated groups, there seems to be some meaningful coherence here. On one component, we see groups in the "ops" area: groups related to the management, configuration and measurement of networks. On the other component, we see groups in the Internet and transport areas: groups related to IPv6, the transport area and PSTN transport.

That we see such different results when the data is first scaled by each feature perhaps suggests that the initial analysis was just picking up on the largest groups.
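For comparison, here is a minimal sketch (not part of the original analysis) that re-runs the PCA on the unscaled counts in `df`, to reproduce the "largest groups" effect described above:

```
# fit PCA on the raw, unscaled message counts for comparison with the scaled run
pca_raw = PCA(n_components=2, whiten=True)
pca_raw.fit(df)
raw_components = pd.DataFrame(pca_raw.components_, columns=df.columns)
for i, row in raw_components.iterrows():
    r = row.sort_values(ascending=False)
    print('Unscaled component %d: %s' % (i, r[:5].index.values))
```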
In [31]:
```
pca.explained_variance_

```
The explained variance of our components seems extremely tiny.
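A more interpretable view of the same quantity is `explained_variance_ratio_`, which expresses each component's variance as a fraction of the total:

```
pca.explained_variance_ratio_ # fraction of total variance captured by each component
```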
In [32]:
```
# transform the same scaled data that the PCA was fit on
component_df = pd.DataFrame(pca.transform(scaled), columns=['PCA%i' % i for i in range(2)], index=df.index)
component_df.plot.scatter(x='PCA0', y='PCA1')

```
And with a larger number of components?
In [33]:
```
pca = PCA(n_components=10, whiten=True)
pca.fit(scaled)
components_frame = pd.DataFrame(pca.components_)
components_frame.columns = df.columns
for i, row in components_frame.iterrows():
    print('\nComponent %d' % i)
    r = row.sort_values(ascending=False)
    print('Most positive correlation:\n %s' % r[:5].index.values)
    print('Most negative correlation:\n %s' % r[-5:].index.values)

```
Some of the additional components seem to pick out more specific topics, like `mtgvenue` or `policy` or `iasa20` (an IETF governance topic).

Because we have people and the groups they send to, we can construct a *bipartite graph*.

We'll use just the top 5000 people, in order to make complicated calculations run faster.
In [34]:
```
df = people.sort_values(by="Total Messages", ascending=False)[:5000]
df = df.drop(columns=['email','name','Total Messages','Number of Groups','Median Messages per Group'])
df = df.fillna(0)

```
In [35]:
```
import networkx as nx

G = nx.Graph()
for group in df.columns:
    G.add_node(group, type="group")
for name, data in df.iterrows():
    G.add_node(name, type="person")
    for group, weight in data.items():
        if weight > 0:
            G.add_edge(name, group, weight=weight)

```
In [36]:
```
nx.is_bipartite(G)

```
Out[36]:
```
True
```

Yep, it is bipartite! Now we can export a graph file for use in the visualization software Gephi.
In [37]:
```
nx.write_gexf(G, 'ietf-participation-bipartite.gexf')

```
In [38]:
```
people_nodes, group_nodes = nx.algorithms.bipartite.sets(G)

```
In [39]:
```
pr = nx.pagerank(G, weight="weight")

```
In [40]:
```
nx.set_node_attributes(G, pr, "pagerank") # networkx 2.x argument order: (graph, values, name)

```
In [ ]:
```
sorted([node for node in list(G.nodes(data=True))
        if node[1]['type'] == 'group'],
       key=lambda x: x[1]['pagerank'],
       reverse=True)[:10]

```
In [41]:
```
sorted([node for node in list(G.nodes(data=True))
        if node[1]['type'] == 'person'],
       key=lambda x: x[1]['pagerank'],
       reverse=True)[:10]

```
In [42]:
```
person_nodes = [node[0] for node in G.nodes(data=True) if node[1]['type'] == 'person']
```

**NB: Slow operation for large graphs.**
In [43]:
```
cc = nx.algorithms.bipartite.centrality.closeness_centrality(G, person_nodes, normalized=True)

```
In [44]:
```
# print any index values that are not strings; these indicate spurious nodes
for node, value in list(cc.items()):
    if not isinstance(node, str):
        print(node)
        print(value)

```
In [45]:
```
del cc[14350.0] # remove a spurious node value

```
In [ ]:
```
nx.set_node_attributes(G, cc, "closeness") # networkx 2.x argument order: (graph, values, name)

```
In [ ]:
```
sorted([node for node in list(G.nodes(data=True))
        if node[1]['type'] == 'person'],
       key=lambda x: x[1]['closeness'],
       reverse=True)[:25]
```

*TODO: calculating bi-cliques (the people who are all connected to the same group) and then measuring correlation in bi-cliques (people who belong to many of the same groups) could allow for analysis of cohesive subgroups and a different network analysis/visualization.* See Borgatti, S.P. and Halgin, D. In press. "Analyzing Affiliation Networks". In Carrington, P. and Scott, J. (eds) The Sage Handbook of Social Network Analysis. Sage Publications. http://www.steveborgatti.com/papers/bhaffiliations.pdf
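As one possible starting point for that affiliation-network analysis, here is a sketch (not part of the original notebook) using networkx's weighted bipartite projection, in which an edge between two people is weighted by the number of groups they share:

```
from networkx.algorithms import bipartite

# project the bipartite graph onto the person nodes; edge weights count shared groups
person_projection = bipartite.weighted_projected_graph(G, person_nodes)
nx.write_gexf(person_projection, 'ietf-participation-person-projection.gexf') # filename is illustrative
```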