sbmtm

This module is cloned from https://github.com/martingerlach/hSBM_Topicmodel/commit/261d870cfc884c4f23ddaa213d07ccbddf348c78

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

class trisbm.trisbm.sbmtm[source]

Class for topic modeling with stochastic block models (SBMs).

clusters(l=0, n=10)[source]: Get the n ‘most common’ documents from each document cluster. ‘Most common’ refers to the largest contribution in the group-membership vector. In the non-overlapping case, each document belongs to exactly one group with probability 1.
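
The ranking rule described for clusters() can be sketched in plain Python. This is a minimal illustration, not the module's implementation: the helper name top_documents_per_cluster, the nested-list layout of p_td_d (rows are doc-groups, columns are documents) and the titles are all assumptions made for the example.

```python
def top_documents_per_cluster(p_td_d, titles, n=10):
    """Return, per doc-group, up to n (title, weight) pairs ranked by membership weight."""
    result = {}
    for group, row in enumerate(p_td_d):
        # Keep only documents with non-zero membership in this group.
        members = [(titles[d], w) for d, w in enumerate(row) if w > 0]
        # Largest contribution first; equal weights keep document order (stable sort).
        members.sort(key=lambda pair: pair[1], reverse=True)
        result[group] = members[:n]
    return result


# Hypothetical non-overlapping example: every column contains a single 1.0.
p_td_d = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
titles = ["doc_a", "doc_b", "doc_c", "doc_d"]
clusters = top_documents_per_cluster(p_td_d, titles, n=1)
```

In the non-overlapping case all members of a group carry weight 1, so the ranking within a group reduces to document order in this sketch.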

clusters_query(doc_index, l=0)[source]: Get all documents in the same group as the query document. Note: this works only for the non-overlapping model; the overlapping case needs a different approach.
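
As a rough sketch of the lookup in the non-overlapping case, assume each document already has a single hard group label. The function name same_group_documents and the list assignments are hypothetical, and excluding the query document from the result is an assumption of this sketch rather than documented behaviour.

```python
def same_group_documents(assignments, doc_index):
    """Return indices of all other documents that share the query document's group."""
    target = assignments[doc_index]
    return [
        i for i, group in enumerate(assignments)
        if group == target and i != doc_index  # assumption: the query itself is skipped
    ]


# Hypothetical hard assignments for six documents.
assignments = [0, 1, 0, 2, 1, 0]
neighbours = same_group_documents(assignments, doc_index=0)
```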

dump_model(filename='topsbm.pkl')[source]

fit(overlap=False, hierarchical=True, B_min=2, B_max=None, n_init=1, parallel=False, verbose=False)[source]

Fit the sbm to the word-document network.

Parameters:

overlap – bool (default: False). Overlapping or Non-overlapping groups. Overlapping implemented in fit_overlap
hierarchical – bool (default: True). Hierarchical SBM or Flat SBM. Flat SBM not implemented yet.
B_min – int (default: 2): passed to the graph-tool inference to set the minimum number of blocks.
n_init – int (default: 1): number of different initial conditions to run in order to avoid local minima of the MDL.
parallel – bool (default: False); passed to mcmc_sweep. If parallel == False, each vertex move is attempted sequentially, with vertices visited in random order. Otherwise, moves are attempted by sampling vertices randomly, so the same vertex can be moved more than once before other vertices have had a chance to move.
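
The purpose of n_init can be illustrated with a generic best-of-n restart loop: run the inference several times from different starting points and keep the state with the smallest description length. The sketch below is not the module's code; best_of_n and toy_run are hypothetical names, and toy_run is only a stand-in for a real inference run.

```python
import random


def best_of_n(run_once, n_init=1, seed=0):
    """Call run_once n_init times and keep the result with the lowest description length."""
    rng = random.Random(seed)
    best_state, best_mdl = None, float("inf")
    for _ in range(n_init):
        state, mdl = run_once(rng)
        if mdl < best_mdl:
            best_state, best_mdl = state, mdl
    return best_state, best_mdl


def toy_run(rng):
    """Stand-in for one inference run: returns (state, description length)."""
    n_blocks = rng.randint(2, 10)
    return {"B": n_blocks}, abs(n_blocks - 6) + 100.0


state, mdl = best_of_n(toy_run, n_init=20)
```

More restarts can only lower (or keep) the best description length found, at the cost of proportionally more run time.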

fit_overlap(n_init=1, hierarchical=True, B_min=20, B_max=160, parallel=True, verbose=True)[source]

Fit the sbm to the word-document network.

Parameters:

hierarchical – bool (default: True). Hierarchical SBM or Flat SBM. Flat SBM not implemented yet.
B_min – int (default: 20): passed to the graph-tool inference to set the minimum number of blocks.

get_D()[source]

Returns: number of doc-nodes == number of documents

get_N()[source]

Returns: number of edges == tokens

get_V()[source]

Returns: number of word-nodes == types
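
The three quantities returned by get_D(), get_N() and get_V() can be checked by hand on a small tokenized corpus. The corpus below is hypothetical, and the sketch assumes that every token contributes one edge, with no n_min filtering applied.

```python
# Hypothetical corpus: one list of tokens per document.
list_texts = [
    ["graph", "topic", "graph"],
    ["topic", "model"],
    ["graph", "model", "block", "model"],
]

D = len(list_texts)                                     # documents (doc-nodes)
N = sum(len(text) for text in list_texts)               # tokens (edges)
V = len({word for text in list_texts for word in text}) # types (word-nodes)
```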

get_groups(l=0)[source]

Extract statistics on the group membership of nodes from the inferred state.

Returns:

dictionary with the following entries:

B_d – int, number of doc-groups
B_w – int, number of word-groups
p_tw_w – array B_w x V; word-group membership: prob that word-node w belongs to word-group tw: P(tw | w)
p_td_d – array B_d x D; doc-group membership: prob that doc-node d belongs to doc-group td: P(td | d)
p_w_tw – array V x B_w; topic distribution: prob of word w given topic tw: P(w | tw)
p_tw_d – array B_w x D; doc-topic mixtures: prob of word-group tw in doc d: P(tw | d)
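
To make the meaning of P(tw | d) concrete, the sketch below column-normalizes a small count table so that each document's column sums to 1. The table n_tw_d (tokens of document d whose word belongs to word-group tw) and the helper normalize_columns are hypothetical; how the module accumulates such counts from the inferred state is not shown here.

```python
def normalize_columns(counts):
    """Divide every column of a nested-list table by its sum (all-zero columns stay zero)."""
    n_rows, n_cols = len(counts), len(counts[0])
    col_sums = [sum(counts[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [
        [counts[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Hypothetical counts: 2 word-groups (rows) x 3 documents (columns).
n_tw_d = [
    [3, 0, 1],
    [1, 2, 1],
]
p_tw_d = normalize_columns(n_tw_d)
```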

get_mdl()[source]

group_membership(l=0)[source]

Return the group-membership vectors for:

document-nodes, p_td_d, array with shape B_d x D
word-nodes, p_tw_w, array with shape B_w x V

Each entry gives the probability that a node belongs to a given group.
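
A membership array of this shape can be turned into one hard label per node by taking the most probable group in each column. The sketch below is an illustration under assumed data: hard_assignments is a hypothetical helper and p_td_d is a made-up non-overlapping example stored as nested lists.

```python
def hard_assignments(p_td_d):
    """Return, for each column (node), the index of the group with the largest probability."""
    n_groups, n_nodes = len(p_td_d), len(p_td_d[0])
    return [
        max(range(n_groups), key=lambda g: p_td_d[g][d])
        for d in range(n_nodes)
    ]


# Hypothetical B_d x D array with B_d = 2 groups and D = 3 documents.
p_td_d = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
labels = hard_assignments(p_td_d)
```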

group_to_group_mixture(l=0, norm=True)[source]

load_graph(filename='graph.gt.gz')[source]: Load a word-document network generated by make_graph() and saved with save_graph().

load_model(filename='topsbm.pkl')[source]

make_graph(list_texts, documents=None, counts=True, n_min=None)[source]

Load a corpus and generate the word-document network.

Parameters:

documents – list of str; titles of the documents
counts – bool; save edge multiplicity as counts (default: True)
n_min – int; filter out all word-nodes with fewer than n_min counts (default: None)
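
The structure that make_graph() builds can be approximated without graph-tool as a multiset of (document, word) edges. The sketch below is only an illustration: build_edge_counts is a hypothetical helper, documents are assumed to be given as lists of tokens, and falling back to the document index as a title is an assumption.

```python
from collections import Counter


def build_edge_counts(list_texts, documents=None, n_min=None):
    """Count (document, word) edges, dropping words that occur fewer than n_min times overall."""
    if documents is None:
        # Assumption: untitled documents are labelled by their index.
        documents = [str(i) for i in range(len(list_texts))]
    word_totals = Counter(word for text in list_texts for word in text)
    edges = Counter()
    for title, text in zip(documents, list_texts):
        for word in text:
            if n_min is None or word_totals[word] >= n_min:
                edges[(title, word)] += 1  # edge multiplicity == token count
    return edges


list_texts = [["graph", "topic", "graph"], ["topic", "model"]]
edges = build_edge_counts(list_texts, documents=["d0", "d1"], n_min=2)
```

Here "model" occurs only once in the whole corpus, so with n_min=2 it is filtered out and contributes no edge.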

make_graph_from_BoW_df(df, counts=True, n_min=None)[source]

Load a graph from a Bag of Words DataFrame

Parameters:

df – DataFrame where df.index is a list of words and df.columns is a list of documents
counts – bool; save edge multiplicity as counts (default: True)
n_min – int; filter out all word-nodes with fewer than n_min counts (default: None)

multiflip_mcmc_sweep(n_steps=1000, beta=inf, niter=10, verbose=True)[source]

Fit the sbm to the word-document network using multiflip_mcmc_sweep.

Parameters: n_steps – int (default: 1000): number of steps.

plot(filename=None, nedges=1000)[source]

Plot the graph and group structure.

Parameters:

filename – str; where to save the plot. If None, the plot is not saved
nedges – int; number of edges to subsample for plotting (faster, less memory)

plot_topic_dist(l)[source]

print_summary(tofile=True)[source]: Print hierarchy summary

print_topics(l=0, format='csv', path_save='')[source]

Print topics, topic-distributions, and document clusters for a given level in the hierarchy.

Parameters: format – csv (default) or html

save_data()[source]

save_graph(filename='graph.gt.gz')[source]: Save the word-document network generated by make_graph() as filename. Allows for loading the graph without calling make_graph().

search_consensus(force_niter=100000, niter=100)[source]

topicdist(doc_index, l=0)[source]

topics(l=0, n=10)[source]: Get the n most common words for each word-group at level l. Returns tuples (word, P(w | tw)).
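
The selection of the n most common words per word-group can be sketched from a V x B_w array of P(w | tw). This is a minimal illustration, not the module's implementation: top_words, the vocabulary list and the probabilities are hypothetical, and dropping zero-probability words is an assumption of the sketch.

```python
def top_words(p_w_tw, vocabulary, n=10):
    """Return {topic: [(word, P(w | tw)), ...]} with at most n words per topic."""
    n_topics = len(p_w_tw[0])
    topics = {}
    for tw in range(n_topics):
        ranked = sorted(
            ((vocabulary[w], p_w_tw[w][tw]) for w in range(len(vocabulary))),
            key=lambda pair: pair[1],
            reverse=True,
        )
        # Assumption: words with zero probability in this topic are not listed.
        topics[tw] = [pair for pair in ranked[:n] if pair[1] > 0]
    return topics


# Hypothetical V x B_w array with V = 4 words and B_w = 2 topics; columns sum to 1.
vocabulary = ["graph", "block", "topic", "model"]
p_w_tw = [
    [0.75, 0.0],
    [0.25, 0.0],
    [0.0, 0.5],
    [0.0, 0.5],
]
topics = top_words(p_w_tw, vocabulary, n=2)
```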