SEGD¶
Business logic for getting and extracting Stack Exchange information.
This package contains business logic rather than reasonably generic code. Code that could reasonably be its own PyPI package is stored in the Helpers package. The benefits of this approach are described in the Helpers section.
This code also doesn’t include any coroutines, because coroutines have a very different design from normal code. Keeping them separate should help reduce confusion: you can enter either package with one design mentality and not have to change your thought process at random points in the package.
An example of the difference is stack_exchange_graph_data.segd.cache, which utilizes stack_exchange_graph_data.helpers.cache. The latter exposes a simple reusable interface, whereas the former uses that interface to build the bespoke endpoints for SEGD.
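The split between a reusable helper and a bespoke layer can be illustrated with a minimal sketch. This is not the package's implementation: `SimpleFileCache`, `ProjectCache`, `ensure` and the fixed-bytes loader are hypothetical names invented for the illustration, and only the idea of a generic interface wrapped by named endpoints comes from the text above.

```python
import pathlib
import tempfile
from typing import Callable


class SimpleFileCache:
    """Hypothetical generic helper: one cached file backed by a loader."""

    def __init__(self, path: pathlib.Path, load: Callable[[], bytes]) -> None:
        self.path = path
        self._load = load

    def ensure(self, use_cache: bool = True) -> pathlib.Path:
        """Return the cached path, (re)creating the file when needed."""
        if not use_cache or not self.path.exists():
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_bytes(self._load())
        return self.path


class ProjectCache:
    """Hypothetical bespoke layer: names the endpoints one project needs."""

    def __init__(self, cache_dir: pathlib.Path) -> None:
        self.cache_dir = cache_dir

    @property
    def sites(self) -> SimpleFileCache:
        # A real loader would download the file; this one returns fixed bytes.
        return SimpleFileCache(self.cache_dir / "Sites.xml", lambda: b"<sites/>")


with tempfile.TemporaryDirectory() as tmp:
    cache = ProjectCache(pathlib.Path(tmp))
    sites_path = cache.sites.ensure()
    sites_name = sites_path.name
    sites_bytes = sites_path.read_bytes()
```

The generic class knows nothing about Stack Exchange, so it could live in a separate package, while the bespoke class only decides which files exist and where they come from.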
Cache¶
SEGD cache endpoints.
-
class
stack_exchange_graph_data.segd.cache.
Cache
(cache_dir: pathlib.Path, archive: str)¶ SEGD Cache object.
-
site_archive
(site: stack_exchange_graph_data.segd.site_info.SiteInfo) → stack_exchange_graph_data.helpers.cache.FileCache¶ Endpoint for the site’s 7z archive.
- Parameters
site – The site info object of the wanted archive.
-
site_file
(site: stack_exchange_graph_data.segd.site_info.SiteInfo, file_path: str) → stack_exchange_graph_data.helpers.cache.Archive7zCache¶ Endpoint for the site’s unarchived data.
These are files such as Comments.xml or Posts.xml. These contain the actual data from the Stack Exchange data dump.
- Parameters
site – The site info object of the wanted data dump.
file_path – The unarchived file wanted, e.g. Comments.xml.
-
property
sites
¶ Endpoint for the Sites.xml file.
This contains metadata about all sites. This allows us to use a shorthand name to get the required site domain. It also allows us to easily see if the requested site even exists.
-
File System¶
Holds the driving segments of the program.
-
class
stack_exchange_graph_data.segd.file_system.
FileSystem
(segd_cache: stack_exchange_graph_data.segd.cache.Cache)¶ File System interactions.
-
_get_site_info
(site_name: str, use_cache: bool) → Any¶ Filter sites to just the wanted site.
- Parameters
site_name – Name of site to get data for.
use_cache – Set to false to force redownload of data.
- Returns
Raw site data.
-
get_site_archive
(site: stack_exchange_graph_data.segd.site_info.SiteInfo, use_cache: bool = True) → pathlib.Path¶ Ensure the site’s 7z archive is in cache.
- Parameters
site – The site info object of the wanted archive.
use_cache – Set to false to force redownload of data.
- Returns
The location of the cached 7z archive.
-
get_site_file
(site: stack_exchange_graph_data.segd.site_info.SiteInfo, file_path: str, use_cache: bool = True) → pathlib.Path¶ Get a data file from the site’s data dump.
- Parameters
site – The site info object of the wanted data dump.
use_cache – Set to false to force redownload of data.
- Returns
The location of the cached data dump file.
-
get_site_info
(site_name: str, use_cache: bool = True) → stack_exchange_graph_data.segd.site_info.SiteInfo¶ Get site information for the provided site.
- Parameters
site_name – Name of site to get data for.
use_cache – Set to false to force redownload of data.
- Returns
Object containing site information.
-
get_sites
(use_cache: bool = True) → IO[bytes]¶ Open the Sites.xml file from cache.
- Parameters
use_cache – Set to false to force redownload of data.
- Returns
The Sites.xml file in binary read mode.
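Filtering a binary `Sites.xml` stream down to one site might look like the sketch below. The `row` element and the `TinyName` and `Url` attributes are assumptions about the file's layout, the sample data is invented, and `find_site` is a hypothetical helper rather than a function of this package.

```python
import io
import xml.etree.ElementTree as ET
from typing import IO, Dict, Optional

# Invented sample; the element and attribute names are assumptions.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<sites>
  <row Id="1" TinyName="stackoverflow" Url="https://stackoverflow.com" />
  <row Id="2" TinyName="codereview" Url="https://codereview.stackexchange.com" />
</sites>
"""


def find_site(sites_file: IO[bytes], site_name: str) -> Optional[Dict[str, str]]:
    """Return the attributes of the first matching row, or None."""
    # iterparse streams the file, so the whole document is never held at once.
    for _, element in ET.iterparse(sites_file):
        if element.tag == "row" and element.get("TinyName") == site_name:
            return dict(element.attrib)
    return None


found = find_site(io.BytesIO(SAMPLE), "codereview")
missing = find_site(io.BytesIO(SAMPLE), "nosuchsite")
```

Returning `None` for an unknown name is one way to check whether the requested site exists at all.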
-
Graph¶
Graph mutations.
-
class
stack_exchange_graph_data.segd.graph.
LinkType
¶ Graph link types.
-
class
stack_exchange_graph_data.segd.graph.
Node
(value: int, links: List[Edge], inv_links: List[Node])¶ Graph Node.
-
stack_exchange_graph_data.segd.graph.
find_graph_nodes
(node: stack_exchange_graph_data.segd.graph.Node) → Set[int]¶ Find all the nodes in a subgraph.
- Parameters
node – Start node to find subgraph.
- Returns
Collection of node names in the subgraph.
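One way to collect every node of a subgraph is an iterative traversal that follows both outgoing and inverse links. The sketch below assumes that behaviour; the documentation does not say whether the real function follows `inv_links`. The `Edge` class with a `target` attribute and the `connect` helper are hypothetical, since the actual `Edge` type is not described here.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Edge:
    """Hypothetical edge: only the target node matters for this sketch."""
    target: "Node"


@dataclass
class Node:
    value: int
    links: List[Edge] = field(default_factory=list)
    inv_links: List["Node"] = field(default_factory=list)


def connect(source: Node, target: Node) -> None:
    """Record a link in both directions (hypothetical helper)."""
    source.links.append(Edge(target))
    target.inv_links.append(source)


def find_subgraph_values(start: Node) -> Set[int]:
    """Collect the values of all nodes reachable from start in either direction."""
    seen: Set[int] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node.value in seen:
            continue  # already visited, which also guards against cycles
        seen.add(node.value)
        stack.extend(edge.target for edge in node.links)
        stack.extend(node.inv_links)
    return seen


a, b, c, d = Node(1), Node(2), Node(3), Node(4)
connect(a, b)
connect(c, b)
subgraph = find_subgraph_values(a)
```

Node 3 is reached only through the inverse link of node 2, and node 4 is left out because nothing connects it to the others.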
Models¶
Common models used in control flow.
-
class
stack_exchange_graph_data.segd.models.
Comment
¶ Comment data.
-
_asdict
()¶ Return a new OrderedDict which maps field names to their values.
-
classmethod
_make
(iterable)¶ Make a new Comment object from a sequence or iterable
-
_replace
(**kwds)¶ Return a new Comment object replacing specified fields with new values
-
property
body
¶ Alias for field number 1
-
property
id
¶ Alias for field number 0
-
property
links
¶ Alias for field number 2
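The `_asdict`, `_make` and `_replace` entries are the standard named tuple methods. The sketch below rebuilds an equivalent tuple with `collections.namedtuple`, using the field order listed above (`id`, `body`, `links`). The field values and their types are assumptions chosen for illustration.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented Comment model.
Comment = namedtuple("Comment", ["id", "body", "links"])

first = Comment(1, "See https://example.com", ["https://example.com"])

# _make builds an instance from any iterable of field values.
second = Comment._make([2, "No links here", []])

# _replace returns a new tuple; the original is left unchanged.
edited = first._replace(body="Edited")

# _asdict maps field names to values.
as_dict = first._asdict()
```

Because fields are also positional, `first[0]` and `first.id` refer to the same value, which is what "Alias for field number 0" means.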
-
-
class
stack_exchange_graph_data.segd.models.
Post
¶ Post data.
-
_asdict
()¶ Return a new OrderedDict which maps field names to their values.
-
classmethod
_make
(iterable)¶ Make a new Post object from a sequence or iterable
-
_replace
(**kwds)¶ Return a new Post object replacing specified fields with new values
-
property
body
¶ Alias for field number 1
-
property
id
¶ Alias for field number 0
-
property
links
¶ Alias for field number 2
-
property
parent_id
¶ Alias for field number 4
Alias for field number 3
-
Site Information¶
Stack Exchange site information.
-
class
stack_exchange_graph_data.segd.site_info.
SiteInfo
(domain: str)¶ Site information object.
-
_gen_domain
(old_meta: bool = False) → str¶ Generate domain for the site.
- Parameters
old_meta – This determines if we should use the alternate URL form for the output.
- Returns
The site’s domain in the wanted format.
-
_get_meta
(segments: List[str]) → Tuple[bool, bool]¶ Determine if domain is a meta site.
- Parameters
segments – Unknown domain name segments.
- Returns
If the site is a meta site.
The form the URL is in.
-
_split_domain
(domain: str) → Tuple[str, str, bool, bool, bool]¶ Split domain into reusable segments.
- Parameters
domain – The domain of the site.
- Returns
All information extractable from the domain.
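As a rough illustration of splitting a domain and detecting a meta site, consider the sketch below. It is not the library's logic: `detect_meta` and `split_domain` are hypothetical functions, the returned tuples are simplified compared with the signatures above, and the code assumes a single-label top-level domain such as `.com`.

```python
from typing import List, Tuple


def detect_meta(segments: List[str]) -> Tuple[bool, bool]:
    """Return (is_meta, meta_first) for the domain segments before the TLD."""
    if "meta" not in segments:
        return False, False
    return True, segments[0] == "meta"


def split_domain(domain: str) -> Tuple[str, bool, bool]:
    """Return (site name, is_meta, meta_first) for a site domain."""
    # Drop the top-level domain; assumes it is a single label like "com".
    segments = domain.lower().split(".")[:-1]
    is_meta, meta_first = detect_meta(segments)
    # The site name is the first segment that is not the "meta" marker.
    names = [segment for segment in segments if segment != "meta"]
    return names[0], is_meta, meta_first


plain = split_domain("codereview.stackexchange.com")
meta_inner = split_domain("codereview.meta.stackexchange.com")
meta_leading = split_domain("meta.stackoverflow.com")
```

The second flag only records where the `meta` segment sits in the domain, which is enough to regenerate the domain in either form.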
-