The VisitManager
of iudex-core contains the central thread pool,
work prioritizer and politeness enforcer. Prioritization aspects are
handled both via an external WorkPollStrategy
implementation and in
internal (in memory) structure of host and visit queues.
The iudex-core module includes a prioritized visit queue and executor with the following features:
Per-host fetch rate limiting for politeness.
HostQueue(s)
containing visit orders for the same host, to be
processed in priority order.
A VisitQueue
of ready and sleeping (delay for politeness)
HostQueue(s)
. The ready queue is prioritized by HostQueue
topmost
priority. The sleeping queue is ordered by least next visit time.
A custom threaded, concurrent VisitManager
for processing the
VisitQueue
while upholding host politeness constraints.
A ThreadPoolExecutor
executes VisitTask(s)
which are simply
UniMap
orders run through a FilterContainer
. Tasks may block if
using blocking HTTP implementation or be short lived,
reentrant with an asynchronous implementation.
A pluggable WorkPollStrategy
(see below)
A GenericWorkPollStrategy
including support for minimum poll
interval, and minimum remaining ratios of orders and hosts before
new work is polled.
The VisitManager
supports generations of VisitQueue(s) and visitor
threads, for high concurrency, and avoiding over-commitment to any
single host.
The iudex-da modules provides a WorkPoller
implementation of
WorkPollStrategy
which obtains prioritized visit orders from a
Postgres database. Features:
Only visit orders (urls) with NEXT_VISIT_AFTER
the current time are
considered.
Visit orders are considered in descending PRIORITY
order.
The priority of the highest orders associated with a each host (by URL) is discounted by a fixed factor of per-host depth. This biases the work polled toward greater breadth of hosts and thus concurrency of execution, given the per-host politeness constraint.
SQL 2003 Window Functions are utilized for efficient calculation of host depth priority adjustments.