SEGD

Business logic for getting and extracting Stack Exchange information.

This package holds the SEGD-specific business logic rather than reasonably generic code. Code that could reasonably be its own PyPI package is stored in the Helpers package. The benefits of this approach are described in the Helpers section.

This package also doesn’t include any coroutines, because coroutines have a very different design from normal code. Keeping them separate should help reduce confusion: you can enter either package with one design mentality and not have to change your thought process at random points in the package.

An example of the difference is stack_exchange_graph_data.segd.cache, which uses stack_exchange_graph_data.helpers.cache. The latter exposes a simple reusable interface, whereas the former uses that interface to build the bespoke endpoints for SEGD.

Cache

SEGD cache endpoints.

class stack_exchange_graph_data.segd.cache.Cache(cache_dir: pathlib.Path, archive: str)

SEGD Cache object.

site_archive(site: stack_exchange_graph_data.segd.site_info.SiteInfo) → stack_exchange_graph_data.helpers.cache.FileCache

Endpoint for the site’s 7z archive.

Parameters

site – The site info object of the wanted archive.

site_file(site: stack_exchange_graph_data.segd.site_info.SiteInfo, file_path: str) → stack_exchange_graph_data.helpers.cache.Archive7zCache

Endpoint for the site’s unarchived data.

These are files such as Comments.xml or Posts.xml. These contain the actual data from the Stack Exchange data dump.

Parameters
  • site – The site info object of the wanted data dump data.

  • file_path – The unarchived file wanted, for example Comments.xml.

property sites

Endpoint for the Sites.xml file.

This contains metadata about all sites. This allows us to use a shorthand name to get the required site domain. It also allows us to easily check whether the requested site exists at all.
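The shorthand lookup described above could be sketched roughly as follows. This is only an illustration, not SEGD's implementation: the function name find_site_url, the sample XML, and the TinyName and Url attribute names are all assumptions about how Sites.xml is laid out.

```python
import io
import xml.etree.ElementTree as ET
from typing import IO, Optional

# Hypothetical sample data; the real Sites.xml layout may differ.
SAMPLE_SITES_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<sites>
  <row TinyName="codereview" Url="https://codereview.stackexchange.com" />
  <row TinyName="stackoverflow" Url="https://stackoverflow.com" />
</sites>
"""


def find_site_url(sites_file: IO[bytes], short_name: str) -> Optional[str]:
    """Return the URL for a shorthand site name, or None if no site matches."""
    # iterparse streams the file, so the whole document is never held at once.
    for _, element in ET.iterparse(sites_file):
        if element.tag == "row" and element.get("TinyName") == short_name:
            return element.get("Url")
    return None


url = find_site_url(io.BytesIO(SAMPLE_SITES_XML), "codereview")
missing = find_site_url(io.BytesIO(SAMPLE_SITES_XML), "not-a-site")
```

Returning None for an unknown name gives the caller a cheap way to tell that the requested site does not exist.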

File System

Holds the driving segments of the program.

class stack_exchange_graph_data.segd.file_system.FileSystem(segd_cache: stack_exchange_graph_data.segd.cache.Cache)

File System interactions.

_get_site_info(site_name: str, use_cache: bool) → Any

Filter sites to just the wanted site.

Parameters
  • site_name – Name of site to get data for.

  • use_cache – Set to false to force redownload of data.

Returns

Raw site data.

get_site_archive(site: stack_exchange_graph_data.segd.site_info.SiteInfo, use_cache: bool = True) → pathlib.Path

Ensure the site’s 7z archive is in cache.

Parameters
  • site – The site info object of the wanted archive.

  • use_cache – Set to false to force redownload of data.

Returns

The location of the cached 7z archive.

get_site_file(site: stack_exchange_graph_data.segd.site_info.SiteInfo, file_path: str, use_cache: bool = True) → pathlib.Path

Get a data file from the site’s data dump.

Parameters
  • site – The site info object of the wanted data dump.

  • file_path – The unarchived file wanted, for example Comments.xml.

  • use_cache – Set to false to force redownload of data.

Returns

The location of the cached data dump file.

get_site_info(site_name: str, use_cache: bool = True) → stack_exchange_graph_data.segd.site_info.SiteInfo

Get site information for the provided site.

Parameters
  • site_name – Name of site to get data for.

  • use_cache – Set to false to force redownload of data.

Returns

Object containing site information.

get_sites(use_cache: bool = True) → IO[bytes]

Open the Sites.xml file from cache.

Parameters

use_cache – Set to false to force redownload of data.

Returns

The Sites.xml file in binary read mode.

Graph

Graph mutations.

class stack_exchange_graph_data.segd.graph.LinkType

Graph link types.

class stack_exchange_graph_data.segd.graph.Node(value: int, links: List[Edge], inv_links: List[Node])

Graph Node.

stack_exchange_graph_data.segd.graph.find_graph_nodes(node: stack_exchange_graph_data.segd.graph.Node) → Set[int]

Find all the nodes in a subgraph.

Parameters

node – Start node to find subgraph.

Returns

Collection of node names in the subgraph.
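A minimal sketch of how such a subgraph search might work is shown below. The Node fields mirror the signature above, but the Edge class (with a target attribute), the connect helper, and the choice to follow both outgoing and inverse links are assumptions made for illustration; the real function may differ.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(eq=False, repr=False)
class Edge:
    # Hypothetical edge shape: only the target node is modelled here.
    target: "Node"


@dataclass(eq=False, repr=False)
class Node:
    value: int
    links: List[Edge] = field(default_factory=list)
    inv_links: List["Node"] = field(default_factory=list)


def connect(source: Node, target: Node) -> None:
    """Hypothetical helper: add a link and its matching inverse link."""
    source.links.append(Edge(target))
    target.inv_links.append(source)


def find_graph_nodes(node: Node) -> Set[int]:
    """Collect the values of every node reachable in either direction."""
    seen: Set[int] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current.value in seen:
            continue
        seen.add(current.value)
        # Follow outgoing links and inverse links alike (an assumption).
        stack.extend(edge.target for edge in current.links)
        stack.extend(current.inv_links)
    return seen


a, b, c, d = Node(1), Node(2), Node(3), Node(4)
connect(a, b)
connect(c, b)
connected = find_graph_nodes(a)
isolated = find_graph_nodes(d)
```

An explicit stack is used instead of recursion so that large subgraphs do not hit Python's recursion limit.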

Models

Common models used in control flow.

class stack_exchange_graph_data.segd.models.Comment

Comment data.

_asdict()

Return a new OrderedDict which maps field names to their values.

classmethod _make(iterable)

Make a new Comment object from a sequence or iterable

_replace(**kwds)

Return a new Comment object replacing specified fields with new values

property body

Alias for field number 1

property id

Alias for field number 0

Alias for field number 2

class stack_exchange_graph_data.segd.models.Post

Post data.

_asdict()

Return a new OrderedDict which maps field names to their values.

classmethod _make(iterable)

Make a new Post object from a sequence or iterable

_replace(**kwds)

Return a new Post object replacing specified fields with new values

property body

Alias for field number 1

property id

Alias for field number 0

Alias for field number 2

property parent_id

Alias for field number 4

property tags

Alias for field number 3
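The _asdict, _make and _replace helpers and the "Alias for field number N" properties listed for Comment and Post are the standard named tuple interface. The sketch below shows how they behave using ExamplePost, a hypothetical stand-in whose fields and order are chosen for illustration and do not match the real Post model.

```python
from typing import NamedTuple, Optional


class ExamplePost(NamedTuple):
    """Hypothetical stand-in; not the real Post model."""

    id: int
    body: str
    tags: str
    parent_id: Optional[int]


# _make builds an instance from any iterable of field values.
post = ExamplePost._make([1, "<p>Hi</p>", "<python>", None])

# Each property is an alias for a positional field.
same_value = post.body == post[1]

# _asdict maps field names to values.
as_dict = post._asdict()

# _replace returns a new object; the original is left unchanged.
edited = post._replace(body="<p>Hello</p>")
```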

Site Information

Stack Exchange site information.

class stack_exchange_graph_data.segd.site_info.SiteInfo(domain: str)

Site information object.

_gen_domain(old_meta: bool = False) → str

Generate domain for the site.

Parameters

old_meta – This determines if we should use the alternate URL form for the output.

Returns

The site’s domain in the wanted format.

_get_meta(segments: List[str]) → Tuple[bool, bool]

Determine if domain is a meta site.

Parameters

segments – Unknown domain name segments.

Returns

  • If the site is a meta site.

  • The form the URL is in.
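One way a check like this could work is sketched below. The get_meta function here is hypothetical, and it assumes that the alternate (old) form puts meta first, as in meta.codereview.stackexchange.com, while the other form puts it second, as in codereview.meta.stackexchange.com. The real method may use different rules.

```python
from typing import List, Tuple


def get_meta(segments: List[str]) -> Tuple[bool, bool]:
    """Return (is_meta, old_form) for domain segments split on '.'."""
    # Assumed old form: "meta" is the first segment.
    if segments and segments[0] == "meta":
        return True, True
    # Assumed new form: "meta" is the second segment.
    if len(segments) > 1 and segments[1] == "meta":
        return True, False
    return False, False


old_style = get_meta("meta.codereview.stackexchange.com".split("."))
new_style = get_meta("codereview.meta.stackexchange.com".split("."))
main_site = get_meta("codereview.stackexchange.com".split("."))
```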

_split_domain(domain: str) → Tuple[str, str, bool, bool, bool]

Split domain into reusable segments.

Parameters

domain – The domain of the site.

Returns

All information extractable from the domain.