Purchase Pro Theory Problems
Use Case: Search for an Item
Example Product: Down Comforters
1)User/Community: Create the template for down comforters if one does not already exist
2)User: Fill in parameters for the search:
i)Parameter | Importance | Value
ii)Fill power | High | 700 or more
iii)Percent down | High | 90% or more
iv)Shell | don't care |
v)Warranty | Low | 10 years or more
vi)Price | High | $500 or less
3)User: Query the database
4)Database: Generate and return results
5)User: Interactively view results
6)Search Engine: continuously crawls the web and fills out templates
a)Parse the HTML into our intermediate format (IF)
b)Determine if the page offers an item for sale
c)Determine what template to use (could be multiple)
d)Best effort attempt to fill out the appropriate template
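As a rough illustration of steps 2 through 5, the sketch below scores filled-out templates against the search parameters from the down comforter example. It is only one possible reading: the numeric weights attached to High, Low, and "don't care" are assumptions, as are the names `Criterion`, `score`, and `rank`; the notes do not say how importance should be combined.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Assumed weights; the notes only name the importance levels.
IMPORTANCE_WEIGHT = {"High": 3.0, "Low": 1.0, "don't care": 0.0}


@dataclass
class Criterion:
    parameter: str
    importance: str
    accepts: Optional[Callable[[float], bool]]  # None when the user doesn't care


# The search parameters from step 2 of the use case.
CRITERIA = [
    Criterion("Fill power", "High", lambda v: v >= 700),
    Criterion("Percent down", "High", lambda v: v >= 90),
    Criterion("Shell", "don't care", None),
    Criterion("Warranty", "Low", lambda v: v >= 10),
    Criterion("Price", "High", lambda v: v <= 500),
]


def score(item: Dict[str, float], criteria: List[Criterion]) -> float:
    """Return the fraction of total importance weight that the item satisfies."""
    total = sum(IMPORTANCE_WEIGHT[c.importance] for c in criteria)
    if total == 0:
        return 0.0
    earned = 0.0
    for c in criteria:
        value = item.get(c.parameter)
        # Templates are filled on a best-effort basis, so a missing value earns nothing.
        if c.accepts is not None and value is not None and c.accepts(value):
            earned += IMPORTANCE_WEIGHT[c.importance]
    return earned / total


def rank(items: List[Dict[str, float]], criteria: List[Criterion]) -> List[Dict[str, float]]:
    """Order items from best to worst match."""
    return sorted(items, key=lambda item: score(item, criteria), reverse=True)
```

A soft score like this, rather than a hard filter, lets an item that misses only the low-importance warranty requirement still rank near the top.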
Example Template:
Parameter | Value Type | Value Range | Units
Fill power | Integer | 0-1000 | n/a
Percent down | Integer | 0-100 | %
Shell material | String | variable* |
Shell threadcount | String | variable |
Warranty | Integer | 0-100; lifetime | years
Price | Integer | 0-10,000 | $
*The template may carry some hints about what values are likely to be valid, e.g., cotton but not metal
** A template does not attempt to be exhaustive but only captures the "most important" characteristics of a product, as determined by the creator of the template (probably a human)
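One way the example template might be represented in code is sketched below. The class name `TemplateField`, the `special_values` field (used here for the "lifetime" warranty), and the `is_valid` helper are illustrative assumptions, not part of the notes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TemplateField:
    name: str
    value_type: type
    value_range: Optional[Tuple[int, int]] = None  # inclusive bounds; None means "variable"
    units: str = ""
    special_values: Tuple[str, ...] = ()  # non-numeric values that are also allowed


# The down comforter template from the table above.
DOWN_COMFORTER_TEMPLATE = (
    TemplateField("Fill power", int, (0, 1000)),
    TemplateField("Percent down", int, (0, 100), "%"),
    TemplateField("Shell material", str),
    TemplateField("Shell threadcount", str),
    TemplateField("Warranty", int, (0, 100), "years", ("lifetime",)),
    TemplateField("Price", int, (0, 10_000), "$"),
)


def is_valid(field: TemplateField, value) -> bool:
    """Check a candidate value against the field's type and range."""
    if value in field.special_values:
        return True
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, field.value_type):
        return False
    if field.value_range is None:
        return True
    low, high = field.value_range
    return low <= value <= high
```

A check like `is_valid` would let the crawler discard obviously wrong extractions (a fill power of 1200, say) before they reach the database.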
Problems
IF Generation Problem
Summary: Given an HTML page, parse it into our intermediate format (IF)
Input: An HTML page
Output: An IF page
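The notes do not specify what IF looks like. Purely as a stand-in, the sketch below uses Python's standard `html.parser` module to reduce a page to a list of (tag path, text) pairs; that representation, and the names `TextBlockParser` and `to_if`, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextBlockParser(HTMLParser):
    """Collect visible text together with the path of open tags around it."""

    SKIP = {"script", "style"}  # content that is never shown to the reader
    VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not (self.SKIP & set(self.stack)):
            self.blocks.append(("/".join(self.stack), text))


def to_if(html: str):
    """Return the page as a list of (tag path, text) blocks."""
    parser = TextBlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```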
Product Distillation Problem
Summary: Given a page in IF, parse the page into sections that each pertain to a single product for sale (a leaf in our product hierarchy)
Input: A page in IF; possibly contextual information
Output: Product Sections
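A very simple heuristic for this problem, offered only as a sketch, is to start a new product section at every heading. It assumes the IF is a list of (tag path, text) pairs, which the notes do not confirm, and the function name `split_into_sections` is hypothetical. Real pages will need something more robust.

```python
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def split_into_sections(blocks):
    """Group (tag path, text) blocks into sections, one per heading.

    Blocks that appear before the first heading form their own leading section.
    """
    sections = []
    current = []
    for path, text in blocks:
        innermost_tag = path.rsplit("/", 1)[-1]
        if innermost_tag in HEADING_TAGS and current:
            sections.append(current)
            current = []
        current.append((path, text))
    if current:
        sections.append(current)
    return sections
```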
Template Selection Problem
Summary: Given a page, what Products are for sale (and hence what templates should be used)
Input:
A page in intermediate format
Contextual information about the web page (CategoryPageRank, what site it’s on, etc.)
Output: A set of templates to use (and possibly which parts of the page)
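As a baseline for this problem, one could score each template by keyword overlap with the page text and keep every template above a threshold, which naturally allows multiple templates per page. The vocabularies, the 0.5 threshold, and the name `select_templates` below are all assumptions for illustration; contextual signals such as CategoryPageRank are not modeled.

```python
import re

# Hypothetical per-template vocabularies; in practice these might come from
# the template's parameter names and value hints.
TEMPLATE_KEYWORDS = {
    "down comforter": {"fill", "power", "down", "comforter", "shell", "warranty"},
    "spoon": {"spoon", "flatware", "stainless", "teaspoon"},
}


def select_templates(page_text, keywords=TEMPLATE_KEYWORDS, threshold=0.5):
    """Return the names of templates whose vocabulary is well covered by the page."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    selected = set()
    for name, vocabulary in keywords.items():
        coverage = len(vocabulary & words) / len(vocabulary)
        if coverage >= threshold:
            selected.add(name)
    return selected
```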
Template Completion Problem
Summary: Given a page and a template, fill out the template on a best-effort basis
Input: A page in our intermediate format (IF)[1]
Output: Completed templates
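A best-effort fill can be approximated with one regular expression per template parameter, leaving a parameter empty when nothing matches. The patterns and the name `complete_template` below are illustrative assumptions; they cover only the numeric parameters of the down comforter template and operate on plain text rather than on IF.

```python
import re

# One hypothetical pattern per numeric parameter of the down comforter template.
PATTERNS = {
    "Fill power": re.compile(r"(\d{3,4})\s*fill\s*power", re.IGNORECASE),
    "Percent down": re.compile(r"(\d{1,3})\s*%\s*(?:\w+\s+){0,2}down", re.IGNORECASE),
    "Warranty": re.compile(r"(\d{1,3})[\s-]*year\s+warranty", re.IGNORECASE),
    "Price": re.compile(r"\$\s*(\d[\d,]*)"),
}


def complete_template(text):
    """Fill each parameter with the first matching integer, or None if not found."""
    filled = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        filled[name] = int(match.group(1).replace(",", "")) if match else None
    return filled
```

Returning None for unmatched parameters keeps the "best-effort" contract explicit: a partially filled template is still usable for search.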
Glossary of Terms:
CategoryPageRank
CategoryPageRank is a guess of what product a page is “about.” Whereas PageRank measures the “importance” of a page, CategoryPageRank categorizes a page into our hierarchy of products.
Product Hierarchy
We maintain a rooted tree of products. Each leaf is a specific product with a corresponding template. Internal nodes are product categories. Example: root->housewares->flatware->spoons. cf. pricegrabber.com, yahoo.com
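The hierarchy described above could be held in a small tree structure like the one sketched here. The class name `ProductNode`, its methods, and the use of a string to stand for a template are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class ProductNode:
    name: str
    template: Optional[str] = None  # set only on leaves (specific products)
    children: Dict[str, "ProductNode"] = field(default_factory=dict)

    def add_path(self, path: List[str], template: str) -> "ProductNode":
        """Create any missing categories along path and attach a template to the leaf."""
        node = self
        for name in path:
            node = node.children.setdefault(name, ProductNode(name))
        node.template = template
        return node

    def find(self, path: List[str]) -> Optional["ProductNode"]:
        """Follow path from this node; return None if any step is missing."""
        node = self
        for name in path:
            node = node.children.get(name)
            if node is None:
                return None
        return node

    def leaves(self) -> Iterator["ProductNode"]:
        """Yield every node without children, in insertion order."""
        if not self.children:
            yield self
        for child in self.children.values():
            yield from child.leaves()
```

For example, `root.add_path(["housewares", "flatware", "spoons"], "spoon template")` would build the path from the example above.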
Product Section
IF pertaining to a single product
References
Look at papers that cite Bharat and Henzinger, “Improved Algorithms for Topic Distillation in a Hyperlinked Environment”
[1] The IF is the output of our HTML parsing engine