Purchase Pro Theory Problems

Use Case: Search for an Item

Example Product: Down Comforters

1) User/Community: Create the template for down comforters if one does not already exist

2) User: Fill in parameters for the search:

   Parameter      Importance    Value
   Fill power     High          700 or more
   Percent down   High          90% or more
   Shell          don't care
   Warranty       Low           10 years or more
   Price          High          $500 or less

3) User: Query the database

4) Database: Generate and return results

5) User: Interactively view results

6) Search Engine: Continuously crawl the web and fill out templates

   a) Parse the HTML into the intermediate format (IF)

   b) Determine whether the page offers an item for sale

   c) Determine which template to use (there could be several)

   d) Make a best-effort attempt to fill out the appropriate template
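
One possible reading of steps 2 through 5 is an importance-weighted match between the user's constraints and the filled-out templates. The sketch below is only an illustration under that reading: the numeric weights for High and Low, the Constraint class, the score and rank functions, and the two sample products are all assumptions, not part of the design above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical weights for the importance levels used in the example search.
IMPORTANCE_WEIGHT = {"High": 3.0, "Low": 1.0}


@dataclass
class Constraint:
    parameter: str                      # template field name, e.g. "Fill power"
    importance: str                     # "High" or "Low"; "don't care" fields are omitted
    satisfied: Callable[[float], bool]  # test applied to the product's value


def score(product: Dict[str, float], constraints: List[Constraint]) -> float:
    """Return the fraction of total importance weight that the product satisfies."""
    total = sum(IMPORTANCE_WEIGHT[c.importance] for c in constraints)
    met = sum(
        IMPORTANCE_WEIGHT[c.importance]
        for c in constraints
        # A field the crawler could not fill counts as not satisfied.
        if c.parameter in product and c.satisfied(product[c.parameter])
    )
    return met / total if total else 0.0


def rank(products: Dict[str, Dict[str, float]],
         constraints: List[Constraint]) -> List[Tuple[str, Dict[str, float]]]:
    """Sort products from best to worst match."""
    return sorted(products.items(),
                  key=lambda item: score(item[1], constraints),
                  reverse=True)


# The example search from step 2 (Shell is "don't care", so it is left out).
SEARCH = [
    Constraint("Fill power", "High", lambda v: v >= 700),
    Constraint("Percent down", "High", lambda v: v >= 90),
    Constraint("Warranty", "Low", lambda v: v >= 10),
    Constraint("Price", "High", lambda v: v <= 500),
]

# Made-up products, standing in for templates filled out by the crawler.
PRODUCTS = {
    "comforter A": {"Fill power": 750, "Percent down": 95, "Warranty": 5, "Price": 450},
    "comforter B": {"Fill power": 600, "Percent down": 90, "Warranty": 15, "Price": 300},
}

for name, fields in rank(PRODUCTS, SEARCH):
    print(name, round(score(fields, SEARCH), 2))
```

Returning a score rather than a hard pass/fail keeps near misses in the result list, which seems to fit the interactive viewing in step 5.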

Example Template:

Parameter           Value Type   Value Range       Units
Fill power          Integer      0-1000            n/a
Percent down        Integer      0-100             %
Shell material      String       variable*
Shell threadcount   String       variable
Warranty            Integer      0-100; lifetime   years
Price               Integer      0-10,000          $

* The template may include some hints about which values are likely to be valid, e.g., cotton but not metal

** A template does not attempt to be exhaustive but only captures the "most important" characteristics of a product, as determined by the creator of the template (probably a human)
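
As a rough illustration, the example template could be represented as a list of typed fields with a range check. This is a minimal sketch, not a committed schema: the Field class, the extra_values attribute (used here for "lifetime"), and the is_valid function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Field:
    name: str
    value_type: type                          # int or str
    value_range: Optional[Tuple[int, int]]    # inclusive bounds; None means "variable"
    units: str = ""                           # empty string stands for n/a
    extra_values: Tuple[str, ...] = ()        # non-numeric values that are also allowed


# The down comforter template from the table above.
DOWN_COMFORTER = (
    Field("Fill power", int, (0, 1000)),
    Field("Percent down", int, (0, 100), "%"),
    Field("Shell material", str, None),
    Field("Shell threadcount", str, None),
    Field("Warranty", int, (0, 100), "years", extra_values=("lifetime",)),
    Field("Price", int, (0, 10_000), "$"),
)


def is_valid(field: Field, value) -> bool:
    """Check a candidate value against the field's type and range."""
    if isinstance(value, str) and value in field.extra_values:
        return True
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, field.value_type):
        return False
    if field.value_range is None:
        return True
    low, high = field.value_range
    return low <= value <= high


by_name = {f.name: f for f in DOWN_COMFORTER}
print(is_valid(by_name["Warranty"], "lifetime"))   # True
print(is_valid(by_name["Fill power"], 1200))       # False
```

The hints mentioned in the first footnote (cotton but not metal) are not modeled here; they could be added as another optional attribute on the field.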

Problems

IF Generation Problem

Summary: Given an HTML page, parse it into our intermediate format (IF)

Input: An HTML page

Output: An IF page
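
The notes do not specify what the IF looks like. Purely as an assumption for illustration, the sketch below treats the IF as a flat list of (tag path, text) records and builds it with Python's standard html.parser module. The IFBuilder class and html_to_if function are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class IFBuilder(HTMLParser):
    """Collect visible text together with the path of enclosing tags."""

    SKIP = {"script", "style"}                          # content that is never shown
    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag

    def __init__(self) -> None:
        super().__init__()
        self.path: List[str] = []
        self.records: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not (self.SKIP & set(self.path)):
            self.records.append(("/".join(self.path), text))


def html_to_if(html: str) -> List[Tuple[str, str]]:
    builder = IFBuilder()
    builder.feed(html)
    builder.close()
    return builder.records


page = ("<html><body><h1>Down Comforter</h1>"
        "<p>Fill power: <b>750</b></p>"
        "<script>var x=1;</script></body></html>")
for record in html_to_if(page):
    print(record)
```

Keeping the tag path with each piece of text preserves some page structure for the later problems without committing to a full DOM.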

Product Distillation Problem

Summary: Given a page in IF, split the page into sections, each pertaining to a single product for sale (a leaf in our product hierarchy)

Input: A page in IF; possibly contextual information

Output: Product Sections
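
A deliberately naive baseline for this problem is to start a new section at every heading. The sketch below assumes, for illustration only, that an IF page is a list of (tag path, text) records; that record shape, the HEADING_TAGS set, and the split_into_sections function are hypothetical and not part of the notes. Real pages will need stronger cues (repeated layout, prices, contextual information).

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (tag path, text); an assumed shape for IF records

HEADING_TAGS = {"h1", "h2", "h3", "h4"}


def split_into_sections(records: List[Record]) -> List[List[Record]]:
    """Start a new candidate product section at every heading record."""
    sections: List[List[Record]] = []
    current: List[Record] = []
    for path, text in records:
        last_tag = path.rsplit("/", 1)[-1]
        if last_tag in HEADING_TAGS and current:
            sections.append(current)
            current = []
        current.append((path, text))
    if current:
        sections.append(current)
    return sections


page = [
    ("body/h2", "Comforter A"),
    ("body/p", "750 fill power"),
    ("body/h2", "Comforter B"),
    ("body/p", "600 fill power"),
]
for section in split_into_sections(page):
    print(section)
```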

Template Selection Problem

Summary: Given a page, determine which products are for sale (and hence which templates should be used)

Input:

A page in intermediate format

Contextual information about the web page (CategoryPageRank, what site it’s on, etc.)

Output: A set of templates to use (and possibly which parts of the page)
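
A simple starting point for template selection is keyword overlap between the page text and each template. The sketch below is an assumption-laden baseline: the keyword sets, the 0.5 threshold, and the select_templates function are hypothetical, and it ignores the contextual information (such as CategoryPageRank) listed in the input.

```python
import re
from typing import Dict, List, Set

# Hypothetical keyword sets, one per template.
TEMPLATE_KEYWORDS: Dict[str, Set[str]] = {
    "down comforter": {"comforter", "down", "fill", "duvet"},
    "spoon": {"spoon", "flatware", "stainless"},
}


def select_templates(page_text: str,
                     keywords: Dict[str, Set[str]] = TEMPLATE_KEYWORDS,
                     threshold: float = 0.5) -> List[str]:
    """Return every template whose keyword overlap reaches the threshold."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    selected = []
    for name, template_words in keywords.items():
        overlap = len(template_words & words) / len(template_words)
        if overlap >= threshold:
            selected.append(name)
    return selected


print(select_templates("Goose down comforter, 750 fill power"))
print(select_templates("Down comforter and stainless spoon set"))
```

Because the function returns a list, a page that sells several kinds of product can select several templates, matching the "could be multiple" case.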

Template Completion Problem

Summary: Given a page and a template, fill out the template on a best-effort basis

Input: A page in our intermediate format (IF)[1]

Output: Completed templates
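
To make "best effort" concrete, the sketch below fills the integer fields of the down comforter template from plain text using regular expressions and leaves a field empty (None) when nothing matches. The patterns and the complete_template function are hypothetical; they are one naive approach, not a proposed solution to the problem.

```python
import re
from typing import Dict, Optional

# Hypothetical extraction patterns, one per integer field of the template.
PATTERNS = {
    "Fill power": re.compile(r"(\d{3,4})\s*fill\s*power", re.I),
    "Percent down": re.compile(r"(\d{1,3})\s*%\s*(?:goose\s+|duck\s+)?down", re.I),
    "Warranty": re.compile(r"(\d{1,3})[-\s]*year\s+warranty", re.I),
    "Price": re.compile(r"\$\s*(\d[\d,]*)"),
}


def complete_template(text: str) -> Dict[str, Optional[int]]:
    """Fill each field with the first match, or None if nothing matches."""
    result: Dict[str, Optional[int]] = {}
    for field_name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[field_name] = int(match.group(1).replace(",", ""))
        else:
            result[field_name] = None
    return result


text = "Luxury comforter: 750 fill power, 90% goose down, $449. 10-year warranty."
print(complete_template(text))
```

Returning None for missing fields keeps partially completed templates usable, so the search can still match on the fields that were found.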

Glossary of Terms:

CategoryPageRank

CategoryPageRank is a guess of what product a page is “about.” Whereas PageRank measures the “importance” of a page, CategoryPageRank categorizes a page into our hierarchy of products.

Product Hierarchy

We maintain a rooted tree of products. Each leaf is a specific product with a corresponding template. Internal nodes are product categories. Example: root->house wares->flatware->spoons (cf. pricegrabber.com, yahoo.com).
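
The hierarchy can be sketched as a small tree in which only leaves carry a template. The Node class, the add_path and leaves helpers, and the "bedding" category are assumptions made for this example; only the spoons path comes from the definition above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class Node:
    name: str
    children: Dict[str, "Node"] = field(default_factory=dict)
    template: Optional[str] = None  # set only on leaves (specific products)

    def is_leaf(self) -> bool:
        return not self.children


def add_path(root: Node, path: Sequence[str], template: str) -> Node:
    """Create any missing categories along the path and attach the template to the leaf."""
    node = root
    for name in path:
        node = node.children.setdefault(name, Node(name))
    node.template = template
    return node


def leaves(node: Node, prefix: Tuple[str, ...] = ()) -> List[str]:
    """List every leaf as an arrow-separated path from the root."""
    here = prefix + (node.name,)
    if node.is_leaf():
        return ["->".join(here)]
    found: List[str] = []
    for child in node.children.values():
        found.extend(leaves(child, here))
    return found


root = Node("root")
add_path(root, ["house wares", "flatware", "spoons"], "spoon template")
# "bedding" is a made-up category for illustration.
add_path(root, ["house wares", "bedding", "down comforters"], "down comforter template")
print(leaves(root))
```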

Product Section

The portion of a page's IF that pertains to a single product

References
Look at papers that cite Bharat and Henzinger, “Improved Algorithms for Topic Distillation in a Hyperlinked Environment”

[1] The IF is the output of our HTML parsing engine