Tutorial

Create a fake dataset

from nsbm import nsbm
import pandas as pd
import numpy as np

# word-document count matrix: 1000 words x 250 documents
df = pd.DataFrame(
    index=["w{}".format(w) for w in range(1000)],
    columns=["doc{}".format(d) for d in range(250)],
    data=np.random.randint(1, 100, (1000, 250)))
df_key_list = []
# an additional feature
df_key_list.append(
    pd.DataFrame(
        index=["keyword{}".format(w) for w in range(100)],
        columns=["doc{}".format(d) for d in range(250)],
        data=np.random.randint(1, 10, (100, 250)))
)

# another additional feature
df_key_list.append(
    pd.DataFrame(
        index=["author{}".format(w) for w in range(10)],
        columns=["doc{}".format(d) for d in range(250)],
        data=np.random.randint(1, 5, (10, 250)))
)

# other features
df_key_list.append(
    pd.DataFrame(
        index=["feature{}".format(w) for w in range(25)],
        columns=["doc{}".format(d) for d in range(250)],
        data=np.random.randint(1, 5, (25, 250)))
)
  • df is a Bag of Words (BoW) representation of the documents.

  • df_key_list is a list of BoW DataFrames; all of them must share the same columns (documents) as df. In this case keywords, authors and features carry the additional (beyond words) information about the documents; a quick consistency check is sketched below.
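
Since every layer has to be indexed by the same set of documents, a minimal sanity check, using only the DataFrames defined above, might look like this:

# every additional BoW must share df's document columns
assert all(df.columns.equals(key_df.columns) for key_df in df_key_list)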

Create and fit a model

Create a model

model = nsbm()
# build the graph linking documents to words and to each additional feature layer
model.make_graph_multiple_df(df, df_key_list)

Fit the model

model.fit(n_init=1, B_min=50, verbose=False)

Parameters:

  • n_init the number of initializations: only the one with the shortest description length (DL) will be kept

  • B_min minimum number of blocks

  • B_max maximum number of blocks

  • parallel whether to fit the model with heavy parallelization

  • verbose if True, print progress information

The fit is performed using graph_tool.inference.minimize_nested_blockmodel_dl().
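
Putting these parameters together, a fuller call could look like the sketch below; the values are purely illustrative, not recommendations:

# keep the best of 5 initializations and bound the number of blocks
model.fit(n_init=5, B_min=20, B_max=100, parallel=True, verbose=True)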

Save the results

model.save_data()

Stochastic Block Models on graph_tool

For a complete tutorial on how to infer network structure using stochastic block models, see the graph_tool tutorial.
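
As a self-contained taste of the underlying machinery, here is a minimal sketch of nested SBM inference done directly in graph_tool, on one of its bundled example networks:

import graph_tool.all as gt

# load a small bundled network (American college football)
g = gt.collection.data["football"]

# fit a nested SBM by minimizing the description length
state = gt.minimize_nested_blockmodel_dl(g)

# inspect the inferred hierarchy and its description length
state.print_summary()
print("description length:", state.entropy())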