SE Graph Data

Extract data from Stack Exchange archives for analysis.

Stack Exchange Graph Data generates the data needed to plot graphs in Gephi, to analyse how interconnected posts are. The aim is to find questions that need tags, or that should have additional references. Improving the latter will improve our ability to find the former.

Currently most meta posts sit in networks of 10 posts or fewer, which suggests to me that we need more references between our meta posts.

SEGD is a bit overkill in what it does, so here’s a short rundown:

  1. Download all site metadata from the archive.

  2. Download the data dump for the site we filtered for.

  3. Extract the data from the 7z archive downloaded in (2).

  4. Extract all links between posts on that site. If you download a main site’s data, it will only find connections between posts on main; if a meta site is specified, it will be limited to meta.

    It extracts links from both posts and comments, giving them different weightings in the exported data (a sketch of the weighting follows this list).

    Q&A
    1. Q&A links have the highest weighting, at 3, because an answer directly relates to its question.

    Post Links
    1. Post links have the second-highest weight, as they provide evidence from other meta posts. They don’t carry the same weight as Q&A links, as the linked post may be incorrect or discouraged; so whilst the links are still valid, they should be taken with a grain of salt.

    Comment Links
    1. Comment links have the lowest weight: whilst they are normally used to link to similar posts, they are sometimes used for humour or other things that may reduce the validity of the connection with regard to what SEGD is trying to achieve.

  5. Once all the links have been accumulated, they are exported to the output file for use in Gephi.

  6. We extract the tag information from each post so that we can observe the connections of each tag in the graph.

  7. All node data is extracted to a separate output file.
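
To make the weighting in step 4 concrete, here is a minimal sketch. The names and the non-Q&A weight values are assumptions; the text above only fixes the Q&A weight at 3.

from dataclasses import dataclass

# Only the Q&A weight of 3 is stated above; the other two values are
# placeholders for "second highest" and "lowest".
QA_WEIGHT = 3
POST_LINK_WEIGHT = 2     # assumed value
COMMENT_LINK_WEIGHT = 1  # assumed value

@dataclass
class Link:
    source: int  # id of the post the link was found in
    target: int  # id of the post the link points to
    weight: int

def qa_link(answer_id: int, question_id: int) -> Link:
    # An answer implicitly links to its question: the strongest signal.
    return Link(answer_id, question_id, QA_WEIGHT)

def post_link(source_id: int, target_id: int) -> Link:
    # A link in a post body: valid, but possibly incorrect or discouraged.
    return Link(source_id, target_id, POST_LINK_WEIGHT)

def comment_link(source_id: int, target_id: int) -> Link:
    # A link in a comment: sometimes humour, so the weakest signal.
    return Link(source_id, target_id, COMMENT_LINK_WEIGHT)

The accumulated links can then be written as an edge list with source, target, and weight columns, which Gephi’s spreadsheet import understands.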

The diagram below shows how SEGD’s modules and packages interact. The colours show which package each module belongs to.

Black: stack_exchange_graph_data
Green: stack_exchange_graph_data.coroutines
Yellow: stack_exchange_graph_data.helpers
Blue: stack_exchange_graph_data.segd

digraph G {
    rankdir=LR;
    "__main__";
    cli;
    driver;

    node [color="#05930C"];
        data_sources;
        links;
        nodes;

    node [color="#FFE050"];
        h_cache [label="helpers.cache"];
        coroutines;
        curl;
        progress;
        si;
        xref;

    node [color="#0074C1"];
        s_cache [label="segd.cache"];
        file_system;
        "graph";
        models;
        site_info;


    "__main__" -> {cli, driver};
    driver -> {
      data_sources,
      links,
      nodes,
      file_system,
      site_info,
      coroutines,
      progress
    };

    data_sources -> {models, "graph", xref, coroutines};
    links -> {"graph", coroutines};
    nodes -> coroutines;

    s_cache -> {site_info, h_cache};
    file_system -> {s_cache, site_info};

    h_cache -> {curl, si};
    curl -> progress;
    progress -> si;
}

As can be seen, the program is somewhat split: half the modules are only involved in the iterator side of the program, while the other half are only involved in the coroutine side.

Main

Extract links from SE data dumps.

SEGD exposes a command-line interface to change the program’s behaviour. The available arguments can be seen by using the --help flag, or in the CLI section below.

stack_exchange_graph_data.__main__.main() → None

Run the program from the CLI interface.

CLI

Argument parser functions.

The exposed arguments are:

--help

show this help message and exit

--no-expand-meta

don’t include links that use the old domain name structure

--download

redownload data, even if it exists in the cache

--min MIN

minimum sized networks to include in output

--max MAX

maximum sized networks to include in output

--output OUTPUT

output file name

--cache-dir CACHE_DIR

cache directory

stack_exchange_graph_data.cli.make_parser() → argparse.ArgumentParser

Make parser for CLI arguments.
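
For illustration, here is a sketch of how make_parser() could assemble the arguments listed above with argparse. The defaults, types, and dest names are assumptions, not the actual source:

import argparse

def make_parser() -> argparse.ArgumentParser:
    """Make parser for CLI arguments. (Sketch, not the real implementation.)"""
    parser = argparse.ArgumentParser(prog="stack_exchange_graph_data")
    parser.add_argument(
        "--no-expand-meta",
        action="store_false",
        dest="expand_meta",  # assumed dest name
        help="don't include links that use the old domain name structure",
    )
    parser.add_argument(
        "--download",
        action="store_true",
        help="redownload data, even if it exists in the cache",
    )
    parser.add_argument("--min", type=int,
                        help="minimum sized networks to include in output")
    parser.add_argument("--max", type=int,
                        help="maximum sized networks to include in output")
    parser.add_argument("--output", help="output file name")
    parser.add_argument("--cache-dir", help="cache directory")
    return parser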

Driver

Connect and run the control flow of the program.

All the coroutines wrapped with stack_exchange_graph_data.helpers.coroutines.coroutine() are stand-alone coroutines, which require the target coroutine to be passed at creation time. This means that when we want to interact with them we have to create the entire control flow up front, which we do here.
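
To illustrate the pattern, here is a minimal, self-contained sketch of such a decorator and two stages. The stage bodies are simplified stand-ins, not the real stack_exchange_graph_data code:

def coroutine(func):
    # Prime a generator-based coroutine so it can receive .send() immediately.
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # advance to the first yield
        return gen
    return start

@coroutine
def file_sink(path):
    # Terminal stage: write whatever is sent to it.
    with open(path, "w") as sink:
        while True:
            sink.write((yield))

@coroutine
def sheet_prep(target):
    # Middle stage: format each link and pass it downstream.
    while True:
        source, dest, weight = yield
        target.send(f"{source};{dest};{weight}\n")

# The sink must exist before the stage that feeds it,
# so the pipeline is built back to front.
pipeline = sheet_prep(file_sink("links.csv"))
pipeline.send((1, 2, 3))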

The entire control flow of the coroutines is:

digraph G {
    file_sink_l [label="file_sink"];
    file_sink_n [label="file_sink"];

    subgraph cluster_0 {
        label="links_driver";
        color="#0074C1";

        node [color="#0074C1"];
            handle_links;
            filter_links;
            filter_duplicates;
            filter_network_size;
            sheet_prep;

        sheet_prep -> file_sink_l;
    }

    subgraph cluster_1 {
        label="nodes_driver";
        color="#FFE050";

        node [color="#FFE050"];
            handle_nodes;

        handle_nodes -> file_sink_n;
    }

    subgraph cluster_2 {
        label="navigate";
        color="#05930C";

        node [color="#05930C"];
            load_posts;
            get_post_links;
            load_comments;
            get_comment_links;
    }

    handle_links -> filter_links -> filter_duplicates
        -> filter_network_size -> sheet_prep;
    handle_links -> filter_duplicates;

    load_posts -> get_post_links -> {handle_links, handle_nodes};
    load_comments -> get_comment_links -> handle_links;
}

stack_exchange_graph_data.driver.links_driver(arguments: argparse.Namespace) → Generator

Build the control flow for links.

stack_exchange_graph_data.driver.load_xml_stream(file_path: pathlib.Path, progress_message: Optional[str] = None) → stack_exchange_graph_data.helpers.progress.ItemProgressStream

Load an iterable xml file with a progress bar.
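
A hypothetical usage, assuming the data dump’s Posts.xml has already been extracted into the cache directory:

from pathlib import Path

from stack_exchange_graph_data.driver import load_xml_stream

# "cache/Posts.xml" and the progress message are assumptions for illustration.
for element in load_xml_stream(Path("cache/Posts.xml"), "Reading posts"):
    ...  # each iteration advances the progress bar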

stack_exchange_graph_data.driver.navigate(_file_system: stack_exchange_graph_data.segd.file_system.FileSystem, arguments: argparse.Namespace) → None

Build and navigate the coroutine control flow.

stack_exchange_graph_data.driver.nodes_driver(arguments: argparse.Namespace) → Generator

Build the control flow for nodes.