phextv1.title: raap - research as a phext

this self-describing document explains the process of writing research with phext.

phextv1.abstract
----------------

phext research documents are composable by design. by encoding your research in this format, you contribute to the global knowledge base known as the exocortex [1]. we plan to build gpt4 instances on this research data, enabling faster and more efficient research methods.

this phext is a self-describing example. after studying it, you will understand how to publish your research results in this format.

before studying this paper, you should read about the phext file format [2]. at the most basic level, phext is just plain utf-8 encoded text. it picks up where plain text left off in the 1970s, though, and ensures that we have a scalable text platform for the 21st century.

this document has 3 nodes: title, knowledge, and links. they are accessible at scrolls +0, +1, and +2 from here.

as i add more content here, the hash auto-updates. the beauty of this content hash is that it is both specific and meaningless: different inputs will very likely hash to different coordinates. i encourage researchers to come up with the ultimate hashing algorithm for phext v2.

note: for pragmatic reasons, i have tried to keep phext sub-scroll parsing simple in version 1. it may be advantageous to explore single-scroll encoding paradigms. to be sure, i need to import arxiv.org and run some analysis on how llms perform using this data set.

phextv1.intro
-------------

i invented phext, so it is highly unlikely that you have heard of it. below is a summary suitable for the needs of this paper.

the key idea with phext is the introduction of a hierarchy of delimiters of unusual size. each delimiter is larger in scope than the last, and this allows us to break knowledge down into fractal dimensions while maintaining a single encapsulating shell. a single phext document can hold all information currently available on the Internet, and more.

for the purposes of this paper, you only need to understand 6 of the 9 phext dimensions: scrolls (SC), sections (SN), chapters (CH), books (BK), volumes (VM), and collections (CN). a scroll of phext is what you're used to: a normal page of text. a section is an ordered list of scrolls. a chapter is an ordered list of sections. a book is an ordered list of chapters. a volume is an ordered list of books. a collection is an ordered list of volumes.

all research that we will produce prior to reaching kardashev 1.0 status as a civilization will fit into these 6 dimensions. phext provides three additional dimensions above them: series, shelves, and libraries. together these form a top-level triplet, and all of our research fits into a single one: 1.1.2. the 1.1.1 top-level triplet is reserved for phext meta planning. please reach out to @wbic16 or @phextio on twitter for more information about phext in general.

in addition to these delimiters, phext introduces the concept of an 11-dimensional subspace that is literally just plain text. each delimiter is a single byte, just like the humble line break. a summary of how each type of dimension break affects your coordinates is given in the table below: "= 1" means the coordinate resets to 1, and "+ 1" means it is incremented.

phext has made a concerted effort to avoid breaking backwards compatibility with prior formats, when possible. ascii control codes that have fallen out of use have been re-purposed for phext. some of these control codes were early attempts at this sort of organization, but their designers lacked the 2020s Internet's perspective on the sheer scale needed.

delimiter         value   column   line   scroll  section  chapter  book  volume  collection
---------         -----   ------   ----   ------  -------  -------  ----  ------  ----------
line break        0x0A    = 1      + 1                                            
scroll break      0x17    = 1      = 1    + 1                                     
section break     0x18    = 1      = 1    = 1     + 1                              
chapter break     0x19    = 1      = 1    = 1     = 1      + 1                     
book break        0x1A    = 1      = 1    = 1     = 1      = 1      + 1            
volume break      0x1C    = 1      = 1    = 1     = 1      = 1      = 1   + 1     
collection break  0x1D    = 1      = 1    = 1     = 1      = 1      = 1   = 1     + 1

i used 0x1E for series breaks, 0x1F for shelf breaks, and 0x01 for library breaks. we won't be using those in this format, but you can play around with them on phext.io if you're curious.
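
as a minimal sketch of the table above, the python below splits a phext byte string into scrolls and tracks the six coordinates used in this paper. the function name `split_scrolls` and the coordinate order (scroll first, matching the table columns) are assumptions made for illustration, not part of the phext format. line breaks are left inside each scroll's text, empty scrolls are skipped, and series, shelf, and library breaks are not handled.

```python
# delimiter bytes from the table above, mapped to the index of the
# coordinate they increment (0 = scroll ... 5 = collection)
BREAKS = {
    0x17: 0,  # scroll break
    0x18: 1,  # section break
    0x19: 2,  # chapter break
    0x1A: 3,  # book break
    0x1C: 4,  # volume break
    0x1D: 5,  # collection break
}


def split_scrolls(data: bytes) -> dict:
    """map (scroll, section, chapter, book, volume, collection) to scroll text.

    the coordinate order is an assumption chosen to match the table columns.
    """
    coord = [1, 1, 1, 1, 1, 1]
    scrolls = {}
    text = bytearray()
    for byte in data:
        if byte in BREAKS:
            if text:
                scrolls[tuple(coord)] = text.decode("utf-8")
                text = bytearray()
            dim = BREAKS[byte]
            coord[dim] += 1          # "+ 1" in the table
            for smaller in range(dim):
                coord[smaller] = 1   # "= 1" in the table
        else:
            text.append(byte)
    if text:
        scrolls[tuple(coord)] = text.decode("utf-8")
    return scrolls


example = b"title\x17knowledge\x17links\x18next section"
scrolls = split_scrolls(example)
```

here `scrolls[(3, 1, 1, 1, 1, 1)]` is "links", and the section break moves "next section" to scroll 1 of section 2.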

now, on with the show.


phextv1.methods
---------------

most research papers are 3,000 to 10,000 words - or about 12 to 30 pages of text. so we need 1 scroll for each unit of research, and then additional scrolls for related information. since a section is an ordered list of scrolls, each piece of research gets its own section, and we will organize research by section number. this will have benefits when we attempt to de-duplicate and combine research results.

when we consider kardashev-scale knowledge, we must first consider the volume of text produced. if 10 million scientists produce an average of one high-quality paper every 6 months (4x the 2023 research rate of 5 million papers per year), and we build our knowledge tree incrementally, then we can expect to contribute an average of 20 million sections per year.

let's extend that process over 100 years and use that as our baseline unit of knowledge: a century of research and development for a population of 10 million 21st-century researchers. we expect to produce 2 billion sections of research, of which perhaps 100 million (5%) are salient.
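
the arithmetic above can be checked with a few lines of python. the variable names are mine, and every number comes from the estimates in this section.

```python
researchers = 10_000_000
papers_per_researcher_per_year = 2   # one paper every 6 months
years = 100
salient_percent = 5

# one paper maps to one section
sections_per_year = researchers * papers_per_researcher_per_year
total_sections = sections_per_year * years
salient_sections = total_sections * salient_percent // 100

print(sections_per_year)   # 20000000
print(total_sections)      # 2000000000
print(salient_sections)    # 100000000
```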

thus, we have a goal: 100 million sections of oracle-level knowledge. how should we divide this space up? let's assume that we won't know which nuggets of knowledge are useful to an oracle, so we must simply generate all 2 billion sections.

if we take a phext space that is 20 x 100^4 (2 billion sections), we'll only need 6 coordinates: five to locate a section, and one for the scroll within it. this means that we need to allocate an entire block of the phext space for research. one of our primary goals will be to assign the proper section, chapter, book, volume, and collection numbers to each piece of research.
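
one way to picture this block is as a mixed-radix number. the sketch below assumes, purely for illustration, that the collection dimension has 20 slots and that volume, book, chapter, and section have 100 each - the text above only fixes the product, not which dimension gets the 20. `LAYOUT` and `index_to_coordinate` are hypothetical names.

```python
import math

# hypothetical layout, largest scope first: (dimension name, number of slots)
LAYOUT = [
    ("collection", 20),
    ("volume", 100),
    ("book", 100),
    ("chapter", 100),
    ("section", 100),
]

CAPACITY = math.prod(size for _, size in LAYOUT)  # 20 x 100^4 sections


def index_to_coordinate(index: int) -> dict:
    """convert a 0-based section index into 1-based phext coordinates."""
    if not 0 <= index < CAPACITY:
        raise ValueError("index is outside the research block")
    coord = {}
    # peel off the smallest dimension first, like reading digits right to left
    for name, size in reversed(LAYOUT):
        index, digit = divmod(index, size)
        coord[name] = digit + 1
    return coord


first = index_to_coordinate(0)
last = index_to_coordinate(CAPACITY - 1)
```

`first` has every coordinate set to 1, and `last` is collection 20 with volume, book, chapter, and section all at 100, so the 2 billion sections fill the block exactly.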

this process should be deterministic and fast: a one-way hash function. if two papers hash to the same section, that's not really a problem - we can just append research to the existing entry. this solves a couple of problems: 1. helping people find related research, and 2. helping to de-duplicate research. if your results match prior results, you probably shouldn't publish them - but rather comment on the already-published work (to improve both).
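
a minimal sketch of this idea follows. it uses sha-256 from python's standard library as a stand-in, since no hash algorithm is specified here, and it assumes a hypothetical 20 x 100^4 layout ordered as (collection, volume, book, chapter, section). `research_coordinate`, `publish`, and the in-memory `store` are illustrative names only.

```python
import hashlib

# assumed slot counts: collection, volume, book, chapter, section
SIZES = (20, 100, 100, 100, 100)
CAPACITY = 20 * 100**4


def research_coordinate(text: str) -> tuple:
    """deterministically map research text to a 1-based section coordinate."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    index = int.from_bytes(digest, "big") % CAPACITY
    coord = []
    for size in reversed(SIZES):
        index, digit = divmod(index, size)
        coord.append(digit + 1)
    return tuple(reversed(coord))


def publish(store: dict, text: str) -> tuple:
    """append research to whatever already lives at its coordinate."""
    coord = research_coordinate(text)
    store.setdefault(coord, []).append(text)
    return coord


store = {}
a = publish(store, "phext is plain text with larger delimiters")
b = publish(store, "phext is plain text with larger delimiters")
```

both calls return the same coordinate, and the second entry is appended rather than overwriting the first. note that a cryptographic hash like sha-256 scatters similar inputs, so only identical text is guaranteed to land in the same section. grouping related research would need a content-aware hash, which this sketch does not attempt.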

a typical APA-style research paper [5] has the headings listed below. there's deviation from this in the wild, so we will normalize our categories to streamline parsing and validating phext-based research.

* title
* abstract
* intro
* methods
* results
* discussion
* references
* tables
* figures
* appendix
* citations
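
as a sketch of what normalizing could look like, the python below maps heading variants onto the category names listed above and reports which categories a document is missing. the `phextv1.` prefix handling mirrors the headings used in this document; the alias table and function names are hypothetical examples, not an agreed standard.

```python
CATEGORIES = [
    "title", "abstract", "intro", "methods", "results", "discussion",
    "references", "tables", "figures", "appendix", "citations",
]

# hypothetical aliases for headings seen in the wild
ALIASES = {
    "introduction": "intro",
    "methodology": "methods",
    "bibliography": "references",
    "appendices": "appendix",
}


def normalize_heading(heading: str):
    """return the normalized category for a heading, or None if unknown."""
    key = heading.strip().lower()
    if key.startswith("phextv1."):
        key = key[len("phextv1."):]
    key = key.rstrip(":").strip()
    key = ALIASES.get(key, key)
    return key if key in CATEGORIES else None


def missing_categories(headings) -> list:
    """list the categories that none of the given headings cover."""
    found = {normalize_heading(h) for h in headings}
    return [c for c in CATEGORIES if c not in found]


so_far = ["phextv1.title", "phextv1.abstract", "phextv1.intro", "phextv1.methods"]
todo = missing_categories(so_far)
```

for the four headings shown, `todo` starts with "results" and ends with "citations".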

let's see if we can summarize a category for each of these research headings.