SE Graph Data¶
Extract data from Stack Exchange archives for analysis.
Stack Exchange Graph Data generates the data needed to plot graphs in Gephi to analyse how interconnected posts are. The aim is to find questions that require tags, or that should have additional references. Improving the latter will improve the ability to find the former.
Currently most meta posts sit in networks of 10 posts or fewer, which suggests to me that we need more references between our meta posts.
SEGD is a bit overkill in what it does, so here's a short rundown:
1. Download all site metadata from the archive.
2. Download the data dump for the requested site.
3. Extract the data from the 7z archive downloaded in (2).
4. Extract all links between posts on that site. If main-site data is downloaded, only connections between posts on main are found; if a meta site is specified, the links are limited to meta.
5. Extract links from posts and comments, giving each link type a different weighting in the exported data.
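Limiting links to the current site amounts to matching post URLs against that site's host. SEGD's actual extraction code isn't shown here, but the idea can be sketched with a regular expression (the pattern and the helper name are illustrative, not SEGD's real implementation):

```python
import re

# Matches /questions/<id>, /q/<id> and /a/<id> URLs on an SE host.
# Illustrative only; SEGD's real pattern may differ.
POST_LINK = re.compile(
    r"https?://(?P<host>[\w.-]+)/(?:questions|q|a)/(?P<post_id>\d+)"
)


def extract_post_ids(body: str, host: str) -> list:
    """Return ids of posts on `host` that are linked from `body`."""
    return [
        int(match.group("post_id"))
        for match in POST_LINK.finditer(body)
        if match.group("host") == host
    ]
```

Filtering on the host is what restricts a meta run to meta-to-meta connections and a main run to main-to-main connections.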
- Q&A
Questions have the highest weighting at 3. This is because an answer directly relates to the question.
- Post Link
Post links have the second highest weight, as they provide evidence from other meta posts. They don't carry the same weight as Q&A links because the linked post may be incorrect or discouraged; so whilst the links are still valid, they should be taken with a grain of salt.
- Comment Links
Comment links have the lowest weight: whilst they are normally used to provide links to similar posts, they are sometimes used for humour or other things that reduce the validity of the connection with regard to what SEGD is trying to achieve.
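The weighting scheme above can be sketched as a small lookup plus a merge step. Only the Q&A weight (3) is documented; the post-link and comment-link values below are assumptions that preserve the stated ordering, and whether SEGD keeps the strongest link or sums duplicates is also an assumption:

```python
# Edge weights by link type.  Only the Q&A value is documented;
# the others are assumed, keeping Q&A > post link > comment link.
LINK_WEIGHTS = {
    "qa": 3,       # an answer linked to its question
    "post": 2,     # a link in a post body to another post
    "comment": 1,  # a link in a comment
}


def weigh_edges(links):
    """Collapse (source, target, kind) triples into weighted edges,
    keeping the strongest link type seen for each pair."""
    edges = {}
    for source, target, kind in links:
        key = (source, target)
        edges[key] = max(edges.get(key, 0), LINK_WEIGHTS[kind])
    return edges
```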
Once all the links have been accumulated, they are exported to the output file for use in Gephi.
We extract the tag information from each post so that we can observe the connections of each tag in the graph.
All node data is extracted to a separate output file.
The diagram below shows how SEGD interacts with all the different modules and packages within it; the colours (black, green, yellow and blue) show which package each module is in.
As can be seen, the program is somewhat split: half the modules are only involved in the iterator side of the program, while the other half are only involved in the coroutine side.
Submodules¶
Main¶
Extract links from SE data dumps.
SEGD exposes a command-line interface to change the functionality of the program. The options can be seen by using the --help flag, or in the CLI section.
stack_exchange_graph_data.__main__.main() → None¶
Run the program from the CLI interface.
CLI¶
Argument parser functions.
The exposed arguments are:
- --help
show this help message and exit
- --no-expand-meta
don’t include links that use the old domain name structure
- --download
redownload data, even if it exists in the cache
- --min MIN
minimum sized networks to include in output
- --max MAX
maximum sized networks to include in output
- --output OUTPUT
output file name
- --cache-dir CACHE_DIR
cache directory
stack_exchange_graph_data.cli.make_parser() → argparse.ArgumentParser¶
Make parser for CLI arguments.
Driver¶
Connect and run the control flow of the program.
All the coroutines wrapped with
stack_exchange_graph_data.helpers.coroutines.coroutine()
are stand-alone coroutines, which require the target coroutine to be passed at
creation time. This means that when we want to interact with them we
need to create the entire control flow, which we do here.
The control flow of the coroutines is built by the driver functions below:
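The coroutine() wrapper referred to above is presumably the usual generator-priming decorator; the sketch below shows the pattern, and why the whole control flow must be built back-to-front before any data is sent through it. The pipeline stages here (filter_links, collect) are illustrative names, not SEGD's actual coroutines:

```python
import functools


def coroutine(func):
    """Prime a generator-based coroutine so it is ready to receive
    values via .send() immediately after creation."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # advance to the first `yield`
        return gen
    return wrapper


@coroutine
def filter_links(target):
    """Forward only pairs whose endpoints differ (drop self-links)."""
    while True:
        source, dest = yield
        if source != dest:
            target.send((source, dest))


@coroutine
def collect(results):
    """Terminal coroutine: append everything it receives."""
    while True:
        results.append((yield))


# Because each stage needs its target at creation time, the pipeline
# is assembled from the last stage backwards.
results = []
pipeline = filter_links(collect(results))
for pair in [(1, 1), (1, 2)]:
    pipeline.send(pair)
```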
stack_exchange_graph_data.driver.links_driver(arguments: argparse.Namespace, _site_info: stack_exchange_graph_data.segd.site_info.SiteInfo) → Generator¶
Build the control flow for links.
stack_exchange_graph_data.driver.load_xml_stream(file_path: pathlib.Path, progress_message: Optional[str] = None) → stack_exchange_graph_data.helpers.progress.ItemProgressStream¶
Load an iterable xml file with a progress bar.
Build and navigate the coroutine control flow.
stack_exchange_graph_data.driver.nodes_driver(arguments: argparse.Namespace) → Generator¶
Build the control flow for nodes.