nsbm

triSBM

Copyright(C) 2021 fvalle1

This program is free software: you can redistribute it and / or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see < http: // www.gnu.org/licenses/>.

class trisbm.trisbm.trisbm[source]

Class to run trisbm

_get_shape()[source]
Returns:

list of tuples (number of documents, number of words, (number of keywords,…))

clusters(l=0, n=10)

Get n ‘most common’ documents from each document cluster. most common refers to largest contribution in group membership vector. For the non-overlapping case, each document belongs to one and only one group with prob 1.

clusters_query(doc_index, l=0)

Get all documents in the same group as the query-document. Note: Works only for non-overlapping model. For overlapping case, we need something else.

draw(*args, **kwargs) None[source]

Draw the network

Parameters:
  • *args – positional arguments to pass to self.state.draw

  • **kwargs – keyword argument to pass to self.state.draw

dump_model(filename='trisbm.pkl')[source]

Dump model using pickle

To restore the model:

import cloudpickle as pickle file=open(“trisbm.pkl” ,”rb”) model = pickle.load(file)

file.close()

fit(n_init=5, verbose=True, deg_corr=True, overlap=False, parallel=True, B_min=3, B_max=None, *args, **kwargs) None[source]

Fit using minimize_nested_blockmodel_dl

Parameters:
  • n_init – number of initialisation. The best will be kept

  • verbose – Print output

  • deg_corr – use deg corrected model

  • overlap – use overlapping model

  • parallel – perform parallel moves

  • *args – positional arguments to pass to gt.minimize_nested_blockmodel_dl

  • **kwargs – keywords arguments to pass to gt.minimize_nested_blockmodel_dl

fit_overlap(n_init=1, hierarchical=True, B_min=20, B_max=160, parallel=True, verbose=True)

Fit the sbm to the word-document network.

Parameters:
  • hierarchical – bool (default: True). Hierarchical SBM or Flat SBM. Flat SBM not implemented yet.

  • Bmin – int (default:20): pass an option to the graph-tool inference specifying the minimum number of blocks.

get_D()
Returns:

number of doc-nodes == number of documents

get_N()
Returns:

number of edges == tokens

get_V()
Returns:

number of word-nodes == types

get_groups(l=0)[source]
Parameters:

l – hierarchy level

Returns:

groups

get_mdl()[source]

Get minimum description length

Proxy to self.state.entropy()

group_membership(l=0)
Return the group-membership vectors for
  • document-nodes, p_td_d, array with shape Bd x D

  • word-nodes, p_tw_w, array with shape Bw x V

It gives the probability of a nodes belonging to one of the groups.

group_to_group_mixture(l=0, norm=True)
load_graph(filename='graph.xml.gz') None[source]

Load a presaved graph

Parameters:

filename – graph to load

load_model(filename='topsbm.pkl')
make_graph(df: DataFrame, get_kind) None[source]

Create a graph from a pandas DataFrame

Parameters:
  • df – DataFrame with words on index and texts on columns. Actually this is a BoW.

  • get_kind – function that returns 1 or 2 given an element of df.index. [1 for words 2 for keywords]

make_graph_from_BoW_df(df, counts=True, n_min=None)

Load a graph from a Bag of Words DataFrame

Parameters:
  • df – DataFrame should be a DataFrame with where df.index is a list of words and df.columns a list of documents

  • counts – save edge-multiplicity as counts (default: True)

  • n_min – filter all word-nodes with less than n_min counts (default None)

make_graph_multiple_df(df: DataFrame, df_keyword_list: list) None[source]

Create a graph from two dataframes one with words, others with keywords or other layers of information

Parameters:
  • df – DataFrame with words on index and texts on columns

  • df_keyword_list – list of DataFrames with keywords on index and texts on columns

metadata(l=0, n=10, kind=2)[source]

get the n most common keywords for each keyword-group in level l.

Returns:

tuples (keyword,P(kw|tk))

metadatumdist(doc_index, l=0, kind=2)[source]
multiflip_mcmc_sweep(n_steps=1000, beta=inf, niter=10, verbose=True)

Fit the sbm to the word-document network. Use multtiplip_mcmc_sweep

Parameters:

n_steps – int (default:1): number of steps.

plot(filename=None, nedges=1000)

Plot the graph and group structure.

Parameters:
  • filename – str; where to save the plot. if None, will not be saved

  • nedges – int; subsample to plot (faster, less memory)

plot_topic_dist(l)
print_summary(tofile=True)

Print hierarchy summary

print_topics(l=0, format='csv', path_save='')[source]

Print topics, topic-distributions, and document clusters for a given level in the hierarchy.

Parameters:
  • l – level to store

  • format – csv (default) or html

  • path_save – path/to/store/file

save_data()
save_graph(filename='graph.xml.gz') None[source]

Save the graph

Parameters:

filename – name of the graph stored

search_consensus(force_niter=100000, niter=100)
topicdist(doc_index, l=0)
topics(l=0, n=10)

get the n most common words for each word-group in level l. return tuples (word,P(w|tw))