SEGD¶

Business logic for getting and extracting Stack Exchange information.

This package holds SEGD-specific business logic rather than reasonably generic code. Code that could reasonably be its own PyPI package is stored in the Helpers package. The benefits of this approach are described in the Helpers section.

This code also doesn’t include any coroutines, because coroutines have a very different design from normal code. Keeping them separate should help reduce confusion: you can enter either package with one design mentality and not have to change your thought process at random points in the package.

An example of the difference is stack_exchange_graph_data.segd.cache, which uses stack_exchange_graph_data.helpers.cache. The latter exposes a simple reusable interface, whereas the former uses that interface to build the bespoke endpoints for SEGD.
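The layering can be sketched roughly as below. This is an illustration only, not the package's code: GenericFileCache and SiteCache are hypothetical stand-ins for the helpers-level and SEGD-level classes, and the "one archive named after the domain" scheme is an assumption.

```python
import pathlib
import tempfile


class GenericFileCache:
    """Hypothetical reusable cache: knows nothing about Stack Exchange."""

    def __init__(self, path: pathlib.Path) -> None:
        self.path = path

    def exists(self) -> bool:
        return self.path.exists()

    def write(self, data: bytes) -> pathlib.Path:
        # Create parent directories on demand, then store the bytes.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_bytes(data)
        return self.path


class SiteCache:
    """Hypothetical bespoke layer: maps domain concepts onto cache paths."""

    def __init__(self, cache_dir: pathlib.Path) -> None:
        self.cache_dir = cache_dir

    def site_archive(self, domain: str) -> GenericFileCache:
        # Assumed naming scheme: one 7z archive per site domain.
        return GenericFileCache(self.cache_dir / f"{domain}.7z")


with tempfile.TemporaryDirectory() as tmp:
    cache = SiteCache(pathlib.Path(tmp))
    endpoint = cache.site_archive("codereview.stackexchange.com")
    was_cached = endpoint.exists()
    endpoint.write(b"placeholder bytes")
    is_cached = endpoint.exists()
```

The generic class could be reused by any project, while the bespoke class is the only place that knows how SEGD names its files.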

Cache¶

SEGD cache endpoints.

class stack_exchange_graph_data.segd.cache.Cache(cache_dir: pathlib.Path, archive: str)¶

SEGD Cache object.

site_archive(site: stack_exchange_graph_data.segd.site_info.SiteInfo) → stack_exchange_graph_data.helpers.cache.FileCache¶

Endpoint for the site’s 7z archive.

Parameters: site – The site info object of the wanted archive.

site_file(site: stack_exchange_graph_data.segd.site_info.SiteInfo, file_path: str) → stack_exchange_graph_data.helpers.cache.Archive7zCache¶

Endpoint for the site’s unarchived data.

These are files such as Comments.xml or Posts.xml. These contain the actual data from the Stack Exchange data dump.

Parameters

site – The site info object of the wanted data dump data.
file_path – The unarchived file wanted, e.g. Comments.xml.

property sites¶

Endpoint for the Sites.xml file.

This contains metadata about all sites. This allows us to use a shorthand name to get the required site domain. It also allows us to easily check whether the requested site exists.

File System¶

Holds the driving segments of the program.

class stack_exchange_graph_data.segd.file_system.FileSystem(segd_cache: stack_exchange_graph_data.segd.cache.Cache)¶

File System interactions.

_get_site_info(site_name: str, use_cache: bool) → Any¶

Filter sites to just the wanted site.

Parameters

site_name – Name of site to get data for.
use_cache – Set to false to force redownload of data.

Returns

Raw site data.

get_site_archive(site: stack_exchange_graph_data.segd.site_info.SiteInfo, use_cache: bool = True) → pathlib.Path¶

Ensure the site’s 7z archive is in cache.

Parameters

site – The site info object of the wanted archive.
use_cache – Set to false to force redownload of data.

Returns

The location of the cached 7z archive.

get_site_file(site: stack_exchange_graph_data.segd.site_info.SiteInfo, file_path: str, use_cache: bool = True) → pathlib.Path¶

Get a data file from the site’s data dump.

Parameters

site – The site info object of the wanted data dump.
file_path – The unarchived file wanted, e.g. Comments.xml.
use_cache – Set to false to force redownload of data.

Returns

The location of the cached data dump file.

get_site_info(site_name: str, use_cache: bool = True) → stack_exchange_graph_data.segd.site_info.SiteInfo¶

Get site information for the provided site.

Parameters

site_name – Name of site to get data for.
use_cache – Set to false to force redownload of data.

Returns

Object containing site information.

get_sites(use_cache: bool = True) → IO[bytes]¶

Open the Sites.xml file from cache.

Parameters: use_cache – Set to false to force redownload of data.
Returns: The Sites.xml file in binary read mode.
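The use_cache flag shared by these methods can be sketched as follows. This is a minimal sketch of the general pattern, not the FileSystem implementation: ensure_cached and fake_fetch are hypothetical names, and the real methods download from the data dump rather than calling a local function.

```python
import pathlib
import tempfile
from typing import Callable


def ensure_cached(
    target: pathlib.Path,
    fetch: Callable[[], bytes],
    use_cache: bool = True,
) -> pathlib.Path:
    """Return target, fetching it only when missing or when forced."""
    if use_cache and target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch())
    return target


calls = []


def fake_fetch() -> bytes:
    # Stand-in for a download; records each call so reuse is observable.
    calls.append(1)
    return b"<sites/>"


with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "Sites.xml"
    ensure_cached(path, fake_fetch)                   # first call fetches
    ensure_cached(path, fake_fetch)                   # second call reuses the file
    ensure_cached(path, fake_fetch, use_cache=False)  # forced refetch
    fetch_count = len(calls)
```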

Graph¶

Graph mutations.

class stack_exchange_graph_data.segd.graph.LinkType¶: Graph link types.

class stack_exchange_graph_data.segd.graph.Node(value: int, links: List[Edge], inv_links: List[Node])¶: Graph Node.

stack_exchange_graph_data.segd.graph.find_graph_nodes(node: stack_exchange_graph_data.segd.graph.Node) → Set[int]¶

Find all the nodes in a subgraph.

Parameters: node – Start node from which to find the subgraph.
Returns: Collection of node names in the subgraph.
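One way such a traversal could work is sketched below. It rests on several assumptions: the Node here is simplified so that links holds target nodes directly (the real links field holds Edge objects, whose shape is not documented on this page), connect is a hypothetical helper, and "subgraph" is taken to mean everything reachable through links or inv_links.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(eq=False)
class Node:
    """Simplified node: links hold target nodes directly (no Edge type)."""

    value: int
    links: List["Node"] = field(default_factory=list)
    inv_links: List["Node"] = field(default_factory=list)


def connect(source: Node, target: Node) -> None:
    """Hypothetical helper: record a link and its inverse."""
    source.links.append(target)
    target.inv_links.append(source)


def find_graph_nodes(node: Node) -> Set[int]:
    """Collect the value of every node reachable via links or inverse links."""
    seen: Set[int] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current.value in seen:
            continue
        seen.add(current.value)
        stack.extend(current.links)
        stack.extend(current.inv_links)
    return seen


a, b, c, d = (Node(v) for v in (1, 2, 3, 4))
connect(a, b)
connect(c, b)
component = find_graph_nodes(a)  # reaches c through b's inverse link
isolated = find_graph_nodes(d)
```

Following inverse links as well as forward links is what lets the walk from node 1 reach node 3, which only points at node 2.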

Models¶

Common models used in control flow.

class stack_exchange_graph_data.segd.models.Comment¶

Comment data.

_asdict()¶: Return a new OrderedDict which maps field names to their values.

classmethod _make(iterable)¶: Make a new Comment object from a sequence or iterable

_replace(**kwds)¶: Return a new Comment object replacing specified fields with new values

property body¶: Alias for field number 1

property id¶: Alias for field number 0

property links¶: Alias for field number 2

class stack_exchange_graph_data.segd.models.Post¶

Post data.

_asdict()¶: Return a new OrderedDict which maps field names to their values.

classmethod _make(iterable)¶: Make a new Post object from a sequence or iterable

_replace(**kwds)¶: Return a new Post object replacing specified fields with new values

property body¶: Alias for field number 1

property id¶: Alias for field number 0

property links¶: Alias for field number 2

property parent_id¶: Alias for field number 4

property tags¶: Alias for field number 3
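The field numbers above imply the following tuple layouts. The sketch below recreates them with typing.NamedTuple; the field order comes from the documented aliases, while the field types are assumptions, since this page does not state them.

```python
from typing import List, NamedTuple, Optional


class Comment(NamedTuple):
    id: int           # field number 0
    body: str         # field number 1
    links: List[str]  # field number 2 (element type assumed)


class Post(NamedTuple):
    id: int                   # field number 0
    body: str                 # field number 1
    links: List[str]          # field number 2 (element type assumed)
    tags: List[str]           # field number 3 (element type assumed)
    parent_id: Optional[int]  # field number 4 (type assumed)


# _make builds an instance from any iterable, in field order.
post = Post._make([7, "text", [], ["python"], None])
# _replace returns a new tuple; the original is left unchanged.
answer = post._replace(parent_id=3)
# _asdict maps field names to values, preserving field order.
fields = list(post._asdict())
```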

Site Information¶

Stack Exchange site information.

class stack_exchange_graph_data.segd.site_info.SiteInfo(domain: str)¶

Site information object.

_gen_domain(old_meta: bool = False) → str¶

Generate domain for the site.

Parameters: old_meta – This determines if we should use the alternate URL form for the output.
Returns: The site’s domain in the wanted format.

_get_meta(segments: List[str]) → Tuple[bool, bool]¶

Determine if domain is a meta site.

Parameters

segments – Unknown domain name segments.

Returns

If the site is a meta site.
The form the URL is in.

_split_domain(domain: str) → Tuple[str, str, bool, bool, bool]¶

Split domain into reusable segments.

Parameters: domain – The domain of the site.
Returns: All information extractable from the domain.
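The meta detection could look something like the sketch below. It is a guess at the general idea, not the SiteInfo logic: detect_meta is a hypothetical function, it takes a full domain rather than pre-split segments, and it only distinguishes whether "meta" is the first or the second label. Which of those forms this package treats as the old one is not stated on this page.

```python
from typing import Tuple


def detect_meta(domain: str) -> Tuple[bool, bool]:
    """Return (is_meta, meta_is_first_label) for a domain string."""
    labels = domain.lower().split(".")
    if labels[0] == "meta":
        return True, True
    if len(labels) > 1 and labels[1] == "meta":
        return True, False
    return False, False


prefix_form = detect_meta("meta.codereview.stackexchange.com")
infix_form = detect_meta("codereview.meta.stackexchange.com")
main_site = detect_meta("codereview.stackexchange.com")
```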
