6. Development

6.1. Architecture

[Figure: block diagram of DXR's architecture (_images/block-diagram.png)]

DXR divides into two halves, with stored indices in the middle:

  1. The indexer, run via dxr index, is a batch job which analyzes code and builds indices in elasticsearch, one per tree, plus a catalog index that keeps track of them. The indexer hosts various plugins which handle everything from syntax coloring to static analysis.

    Generally, the indexer is kicked off asynchronously—often even on a separate machine—by cron or a build system. It’s up to deployers to come up with strategies that make sense for them.

  2. The second half is a Flask web application which lets users run queries. dxr serve runs a toy instance of the application for development purposes; a more robust method should be used for Deployment.

6.1.1. How Indexing Works

We store every line of source code as an elasticsearch document of type line (hereafter called a “LINE doc” after the name of the constant used in the code). This lends itself to the per-line search results DXR delivers. In addition to the text of the line, indexed into trigrams for fast substring and regex search, a LINE doc contains some structural data.

  • First are needles, search targets that structural queries can hunt for. For example, if we indexed the following Python source code, the indicated (simplified) needles might be attached:

    def frob():     # py-function: frob
        nic(ate())  # py-callers:  [nic, ate]
    

    If the user runs the query function:frob, we look for LINE docs with “frob” in their “py-function” properties. If the user runs the query callers:nic, we look for docs with “py-callers” properties containing “nic”.

    These needles are offered up by plugins via the needles_by_line() API. For the sake of sanity, we’ve settled on the convention of a language prefix for language-specific needles. However, the names are technically arbitrary, since the plugin emitting the needle is also its consumer, through the implementation of a Filter.

  • Also attached to a LINE doc are offsets/metadata pairs that attach CSS classes and contextual menus to various spans of the line. These also come out of plugins, via refs() and regions(). Views of entire source-code files are rendered by stitching multiple LINE docs together.
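
To make the needle lookup described above concrete, here is a minimal in-memory sketch. It is not how DXR talks to elasticsearch: the LINE docs are plain dicts, and the `NEEDLE_FOR_TERM` table and `run_query()` helper are hypothetical names invented for illustration.

```python
# Simplified stand-ins for LINE docs; the real ones live in elasticsearch.
line_docs = [
    {'number': 1, 'content': 'def frob():', 'py-function': ['frob']},
    {'number': 2, 'content': '    nic(ate())', 'py-callers': ['nic', 'ate']},
]

# Hypothetical table mapping a query-term name to the needle it searches.
NEEDLE_FOR_TERM = {'function': 'py-function', 'callers': 'py-callers'}


def run_query(query, docs):
    """Return the line numbers of docs matching a query like 'function:frob'."""
    name, _, arg = query.partition(':')
    needle = NEEDLE_FOR_TERM[name]
    return [doc['number'] for doc in docs if arg in doc.get(needle, [])]


print(run_query('function:frob', line_docs))
print(run_query('callers:nic', line_docs))
```

In DXR itself the mapping from query term to needle is carried by each plugin's Filter classes rather than by a global table.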

The other major kind of entity is the FILE doc. These support directory listings and the storage of per-file rendering data like navigation-pane entries (given by links()) or image contents. FILE docs may also contain needles, supporting searches like ext:cpp which return entire files rather than lines. Plugins provide these needles via needles().

6.2. Setting Up

Here is the fastest way to get hacking on DXR.

6.2.1. Downloading DXR

Using git, clone the DXR repository:

git clone https://github.com/mozilla/dxr.git

6.2.2. Booting And Building

DXR runs only on Linux at the moment (and possibly other UNIX-like operating systems). The easiest way to get things set up is to use the included, preconfigured Docker setup. If you’re not running Linux on your host machine, you’ll need a virtualization provider. We recommend VirtualBox.

After you’ve installed VirtualBox (or ignored that bit because you’re on Linux), grab the three Docker tools you’ll need: docker, docker-compose, and, if you’re not on Linux, docker-machine. If you’re running the homebrew package manager on the Mac, this is as easy as...

brew install docker docker-compose docker-machine

Otherwise, visit the Docker Engine page for instructions.

Next, unless you’re already on Linux, you’ll need to spin up a Linux VM to host your Docker containers:

docker-machine create --driver virtualbox --virtualbox-disk-size 50000 --virtualbox-cpu-count 2 --virtualbox-memory 512 default
eval "$(docker-machine env default)"

Feel free to adjust the resource allocation numbers above as you see fit.

Note

Next time you reboot (or run make docker_stop), you’ll need to restart the VM:

docker-machine start default

And each time you use a new shell, you’ll need to set the environment variables that tell Docker how to find the VM:

eval "$(docker-machine env default)"

When you’re done with DXR and want to reclaim the RAM taken by the VM, run...

make docker_stop

Now you’re ready to fire up DXR’s Docker containers, one to run elasticsearch and the other to interact with you, index code, and serve web requests:

make shell

This drops you at a shell prompt in the interactive container. Now you can build DXR and run the tests to make sure it works. Type this at the prompt within the container:

# Within the docker container...
make test

6.2.3. Running A Test Index

The folder-based test cases make decent workspaces for development, suitable for manually trying out your changes. test_basic is a good one to start with. To get it running...

cd ~/dxr/tests/test_basic
dxr index
dxr serve -a

If you’re using docker-machine, run docker-machine ip default to find the address of your VM. Then surf to http://<that IP address>:8000/ from the host machine, and explore the index. If you’re not using docker-machine, the index should be accessible from http://localhost:8000/.

When you’re done, stop the server with Control-C.

6.3. Workflow

The repository on your host machine is mirrored over to the interactive container via Docker volume mounting. Changes you make in the DXR repository on your host machine will be instantly available within /home/dxr/dxr on the container and vice versa, so you can edit using your usual tools on the host and still use the container to run DXR.

After making changes to DXR, a build step is sometimes needed to see the effects of your work:

Changes to C++ code or to HTML templates in the nunjucks folder:
    make (at the root of the project)
Changes to the format of the elasticsearch index:
    Re-run dxr index inside your test folder (e.g., tests/test_basic). Before committing, you should increment the format version.

Stop dxr serve, run any applicable build steps, and then fire up the server again. If you’re changing Python code that runs only at request time, you shouldn’t need to do anything; dxr serve will notice and restart itself a few seconds after you save.

6.4. Coding Conventions

Follow PEP 8 for Python code, but don’t sweat the line length too much. Follow PEP 257 for docstrings, and use Sphinx-style argument documentation. Single quotes are preferred for strings; use 3 double quotes for docstrings and multiline strings or if the string contains a single quote.
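
As an illustration of these conventions, here is a small function written in that style. The function itself is hypothetical and not part of DXR; the `:arg` fields are one of the Sphinx-recognized ways to document arguments.

```python
def count_lines(text, skip_blank=False):
    """Return the number of lines in a string of source code.

    :arg text: The text to count lines in
    :arg skip_blank: Whether to leave whitespace-only lines out of the count

    """
    lines = text.splitlines()
    if skip_blank:
        lines = [line for line in lines if line.strip()]
    return len(lines)


# Single quotes by default; 3 double quotes when the string contains one.
GREETING = 'hello'
APOLOGY = """can't stop"""
```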

6.5. Testing

DXR has a fairly mature automated testing framework, and all server-side patches should come with tests. (Tests for client-side contributions are welcome as well, but we haven’t got the harness set up yet.)

6.5.1. Writing Tests for DXR

DXR supports two kinds of integration tests:

  1. A lightweight sort with a single file’s worth of analyzed code. This kind stores the code as a Python string within a subclass of SingleFileTestCase. At test time, it instantiates the file on disk in a temp folder, builds it, and makes assertions about it. If the stop_for_interaction class variable is falsy (the default), it then deletes the index. If you want to browse the instance manually for troubleshooting, set stop_for_interaction to True.
  2. A heavier sort of test: a folder containing one or more source trees and a DXR config file. These are useful for tests that require a multi-file tree to analyze or more than one tree. test_ignores is an example. Within these folders are also one or more Python files containing subclasses of DxrInstanceTestCase which express the actual tests. These trees can be built like any other using dxr index, in case you want to do manual exploration.
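
The lifecycle of the lightweight kind can be sketched with nothing but the standard library. The `TempSourceTestCase` class below is a hypothetical stand-in that only mimics the write-to-temp-folder and clean-up behavior described above; it is not DXR's SingleFileTestCase and does no indexing.

```python
import os
import shutil
import tempfile
import unittest


class TempSourceTestCase(unittest.TestCase):
    """Hypothetical stand-in, not DXR's SingleFileTestCase."""

    source_filename = 'main.py'
    source = ''
    stop_for_interaction = False

    @classmethod
    def setUpClass(cls):
        # Instantiate the source string as a file in a temp folder.
        cls.folder = tempfile.mkdtemp()
        cls.path = os.path.join(cls.folder, cls.source_filename)
        with open(cls.path, 'w') as source_file:
            source_file.write(cls.source)

    @classmethod
    def tearDownClass(cls):
        # Leave the folder in place if someone wants to poke around in it.
        if not cls.stop_for_interaction:
            shutil.rmtree(cls.folder)


class FrobTests(TempSourceTestCase):
    source = 'def frob():\n    nic(ate())\n'

    def test_function_is_present(self):
        with open(self.path) as source_file:
            self.assertIn('def frob', source_file.read())
```

Setting `stop_for_interaction = True` on `FrobTests` would leave the temp folder behind for manual inspection.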

6.5.2. Running the Tests

To run all the tests, run this from the root of the DXR repository (in the container):

make test

To run just the tests in tests/test_functions.py...

nosetests tests/test_functions.py

To run just the tests from a single class...

nosetests tests/test_functions.py:ReferenceTests

To run a single test...

nosetests tests/test_functions.py:ReferenceTests.test_functions

If you have trouble, make sure you didn’t mistranscribe any colons or periods.

To omit the often distracting elasticsearch logs that nose typically presents when a test fails, add the --nologcapture flag.

6.6. Writing Plugins

Plugins are the way to add new types of analysis, indexing, searching, or display to DXR. In fact, even DXR’s basic capabilities, such as text search and syntax coloring, are implemented as plugins. Want to add support for a new language? A new kind of search to an existing language? A new kind of contextual menu cross-reference? You’re in the right place.

At the top level, a Plugin class binds together a collection of subcomponents which do the actual work:

digraph plugin {
"Plugin" -> "FolderToIndex";
"Plugin" -> "TreeToIndex" -> "FileToIndex";
"Plugin" -> "FileToSkim";
"Plugin" -> "filters";
"Plugin" -> "mappings";
"Plugin" -> "analyzers";
}

6.6.1. Registration

A Plugin class is registered via a setuptools entry point called dxr.plugins. For example, here are the registrations for the built-in plugins, from DXR’s own setup.py:

entry_points={'dxr.plugins': ['urllink = dxr.plugins.urllink',
                              'buglink = dxr.plugins.buglink',
                              'clang = dxr.plugins.clang',
                              'omniglot = dxr.plugins.omniglot',
                              'pygmentize = dxr.plugins.pygmentize']},

The keys in the key/value pairs, like “urllink” and “buglink”, are the strings the deployer can use in the enabled_plugins config directive to turn them on or off. The values, like “dxr.plugins.urllink”, can point to either...

  1. A Plugin class which itself points to filters, skimmers, indexers, and such. This is the explicit approach—more lines of code, more opportunities to buck convention—and thus not recommended in most cases. The Plugin class itself is just a dumb bag of attributes whose only purpose is to bind together a collection of subcomponents that should be used together.
  2. Alternatively, an entry point value can point to a module which contains the subcomponents of the plugin, each conforming to a naming convention by which it can be automatically found. This method saves boilerplate and should be used unless there is a compelling need otherwise. Behind the scenes, an actual Plugin object is constructed implicitly: see from_namespace() for details of the naming convention.

Here is the Plugin object’s API, in case you do decide to construct one manually:

class dxr.plugins.Plugin(filters=None, folder_to_index=None, tree_to_index=None, file_to_skim=None, mappings=None, analyzers=None, direct_searchers=None, refs=None, badge_colors=None, config_schema=None)[source]

Top-level entrypoint for DXR plugins

A Plugin is an indexer, skimmer, filter set, and other miscellany meant to be used together; it is the deployer-visible unit of pluggability. In other words, there is no way to subdivide a plugin via configuration; there would be no sense running a plugin’s filters if the indexer that was supposed to extract the requisite data never ran.

If the deployer should be able to independently enable parts of your plugin, consider exposing those as separate plugins.

Note that Plugins may be instantiated multiple times; don’t assume otherwise.

Parameters:
  • filters – A list of filter classes
  • folder_to_index – A FolderToIndex subclass
  • tree_to_index – A TreeToIndex subclass
  • file_to_skim – A FileToSkim subclass
  • mappings – Additional Elasticsearch mapping definitions for all the plugin’s elasticsearch-destined data. A dict with keys for each doctype and values reflecting the structure described at http://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html. Since a FILE-domain query will be promoted to a LINE query if any other query term triggers a line-based query, it’s important to keep field names and semantics the same between lines and files. In other words, a LINE mapping should generally be a superset of a FILE mapping.
  • analyzers – Analyzer, tokenizer, and token and char filter definitions for the elasticsearch mappings. A dict with keys “analyzer”, “tokenizer”, etc., following the structure outlined at http://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html.
  • direct_searchers

    Functions that provide direct search capability. Each must take a single query term of type ‘text’, return an elasticsearch filter clause to run against LINEs, and have a direct_search_priority attribute. Filters are tried in order of increasing priority. Return None from a direct searcher to skip it.

    Note

    A more general approach may replace direct search in the future.

  • refs – An iterable of Ref subclasses supported by this plugin. This is used at request time, to turn abbreviated ES index data back into HTML.
  • badge_colors – Mapping of Filter.lang -> color for menu badges.
  • config_schema – A validation schema for this plugin’s configuration. See https://pypi.python.org/pypi/schema/ for docs.

mappings and analyzers are recursively merged into other plugins’ mappings and analyzers using the algorithm described at deep_update(). This is mostly intended so you can add additional kinds of indexing to fields defined in the core plugin using multifields. Don’t go too crazy monkeypatching the world.
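
For intuition, a recursive merge of this general kind might look like the sketch below. It is an illustrative stand-in under the assumption that nested dicts are merged and other values are overwritten; consult deep_update() for DXR's actual rules. The `words` subfield is hypothetical.

```python
def merge_recursively(dest, source):
    """Overlay source onto dest in place, descending into nested dicts.

    An illustrative stand-in, not DXR's deep_update().
    """
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(dest.get(key), dict):
            merge_recursively(dest[key], value)
        else:
            dest[key] = value
    return dest


core = {'line': {'properties': {'content': {
    'type': 'string',
    'fields': {'trigrams': {'type': 'string'}}}}}}
# A second plugin adds one more multifield to the same property.
extra = {'line': {'properties': {'content': {
    'fields': {'words': {'type': 'string'}}}}}}

merged = merge_recursively(core, extra)
print(sorted(merged['line']['properties']['content']['fields']))
```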

classmethod from_namespace(namespace)[source]

Construct a Plugin whose attrs are populated by naming conventions.

Parameters:namespace – A namespace from which to pick components

Filters are taken to be any class whose name ends in “Filter” and doesn’t start with “_”.

Refs are taken to be any class whose name ends in “Ref” and doesn’t start with “_”.

The tree indexer is assumed to be called “TreeToIndex”. If there isn’t one, one will be constructed which does nothing but delegate to the class called FileToIndex (if there is one) when file_to_index() is called on it.

The file skimmer is assumed to be called “FileToSkim”.

Mappings are pulled from the mappings attribute and analyzers from the analyzers attribute.

If these rules don’t suit you, you can always instantiate a Plugin yourself.
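
The naming conventions above can be approximated in a few lines. This is only a sketch of the idea, not the body of from_namespace(): `pick_components()` is a hypothetical helper, and it omits the step that constructs a delegating TreeToIndex when the namespace lacks one.

```python
from types import SimpleNamespace


def pick_components(namespace):
    """Collect plugin parts from a namespace by name alone (a sketch)."""
    classes = {name: value for name, value in vars(namespace).items()
               if isinstance(value, type) and not name.startswith('_')}
    return {
        'filters': [c for n, c in classes.items() if n.endswith('Filter')],
        'refs': [c for n, c in classes.items() if n.endswith('Ref')],
        'tree_to_index': classes.get('TreeToIndex'),
        'file_to_skim': classes.get('FileToSkim'),
        'mappings': getattr(namespace, 'mappings', None),
        'analyzers': getattr(namespace, 'analyzers', None),
    }


class FunctionFilter(object):
    pass


class _HelperFilter(object):  # Leading underscore: skipped.
    pass


class FunctionRef(object):
    pass


class FileToSkim(object):
    pass


# A SimpleNamespace stands in for the plugin's module here.
plugin_module = SimpleNamespace(FunctionFilter=FunctionFilter,
                                _HelperFilter=_HelperFilter,
                                FunctionRef=FunctionRef,
                                FileToSkim=FileToSkim,
                                mappings={'line': {}})
parts = pick_components(plugin_module)
```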

Actual plugin functionality is implemented within file indexers, tree indexers, folder indexers, filters, and skimmers.

6.6.2. Folder Indexers

class dxr.indexers.FolderToIndex(plugin_name, tree, path)[source]

The FolderToIndex generates needles for folders and provides an optional list of headers to display in browse view as browse_headers.

6.6.3. Tree Indexers

class dxr.indexers.TreeToIndex(plugin_name, tree, vcs_cache)[source]

A TreeToIndex performs build environment setup and teardown and serves as a repository for scratch data that should persist across an entire indexing run.

Instances must be pickleable so as to make the journey to worker processes. You might also want to keep the size down. It takes on the order of 2s for a 150MB pickle to make its way across process boundaries, including pickling and unpickling time. For this reason, we send the TreeToIndex once and then have it index several files before sending it again.

Parameters:
  • tree – The configuration of the tree to index: a TreeConfig
  • vcs_cache – A VcsCache that describes any VCSes used by this tree. May be None if tree does not contain any VCS repositories.

environment(vars)[source]

Return environment variables to add to the build environment.

This is where the environment is commonly twiddled to activate and parametrize compiler plugins which dump analysis data.

Parameters:vars – A dict of the already-set variables. You can make decisions based on these.

You may return a new dict or scribble on vars and return it. In either case, the returned dict is merged into those from other plugins, with later plugins taking precedence in case of conflicting keys.
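
Both return styles, and the precedence rule, can be sketched as follows. The two classes and `build_environment()` are hypothetical stand-ins, the variable names are made up, and passing each plugin a copy of the variables merged so far is an assumption about the framework rather than documented behavior.

```python
class DumpTree(object):
    """Hypothetical plugin piece that returns a new dict of additions."""

    def __init__(self, folder):
        self.folder = folder

    def environment(self, vars):
        return {'EXAMPLE_DUMP_FOLDER': self.folder}


class VerboseTree(object):
    """Hypothetical plugin piece that scribbles on vars and returns it."""

    def environment(self, vars):
        vars['EXAMPLE_DUMP_FOLDER'] = '/tmp/override'
        vars['EXAMPLE_VERBOSE'] = '1'
        return vars


def build_environment(base, trees):
    """Merge each plugin's variables in order; later plugins win conflicts."""
    env = dict(base)
    for tree in trees:
        env.update(tree.environment(dict(env)))
    return env


env = build_environment({'PATH': '/usr/bin'},
                        [DumpTree('/tmp/dump'), VerboseTree()])
```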

file_to_index(path, contents)[source]

Return an object that provides data about a given file.

Return an object conforming to the interface of FileToIndex, generally a subclass of it.

Parameters:
  • path – A path to the file to index, relative to the tree’s source folder
  • contents – What’s in the file: unicode if we managed to guess an encoding and decode it, None otherwise

Return None if there is no indexing to do on the file.

Being a method on TreeToIndex, this can easily pass along the location of a temp directory or other shared setup artifacts. However, beware of passing mutable things; while the FileToIndex can mutate them, visibility of those changes will be limited to objects in the same worker process. Thus, a TreeToIndex-dwelling dict might be a suitable place for a cache but unsuitable for data that can’t evaporate.

If a plugin omits a TreeToIndex class, from_namespace() constructs one dynamically. The method implementations of that class are inherited from this class, with one exception: a file_to_index() method is dynamically constructed which returns a new instance of the FileToIndex class the plugin defines, if any.

post_build()[source]

Hook called after the tree’s build command completes

This is a good place to do any whole-program analysis, storing it on me or on disk.

pre_build()[source]

Hook called before the tree’s build command is run

This is a good place to make a temp folder to dump analysis data in. You can stash away a reference to it on me so later methods can find it.

6.6.4. File Indexers

class dxr.indexers.FileToIndex(path, contents, plugin_name, tree)[source]

A source of search and rendering data about one source file

Analyze a file or digest an analysis that happened at compile time.

Parameters:
  • path – The (bytestring) path to the file to index, relative to the tree’s source folder
  • contents – What’s in the file: unicode if we managed to guess at an encoding and decode it, None otherwise. Don’t return any by-line data for None; the framework won’t have succeeded in breaking up the file by line for display, so there will be no useful UI for those data to support. Think more along the lines of returning EXIF data to search by for a JPEG. For unicode, split the file into lines using universal newlines (dxr.utils.split_content_lines()); that’s what the rest of the framework expects.
  • tree – The TreeConfig of the tree to which the file belongs

Initialization-time analysis results may be socked away on an instance var. You can think of this constructor as a per-file post-build step. You could do this in a different method, using memoization, but doing it here makes for less code and less opportunity for error.

FileToIndex classes of plugins may take whatever constructor args they like; it is the responsibility of their TreeToIndex objects’ file_to_index() methods to supply them. However, the path and contents instance vars should be initialized and have the above semantics, or a lot of the provided convenience methods and default implementations will break.

needles()[source]

Return an iterable of key-value pairs of search data about the file as a whole: for example, modification date or file size.

Each pair becomes an elasticsearch property and its value. If the framework encounters multiple needles of the same key (whether coming from the same plugin or different ones), all unique values will be retained using an elasticsearch array.
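
The effect of that merging can be pictured with a small sketch. `merge_needles()` is a hypothetical helper, and always storing values as lists is a simplification; the framework's exact representation may differ.

```python
def merge_needles(pairs):
    """Group (key, value) needles by key, keeping each unique value once."""
    merged = {}
    for key, value in pairs:
        values = merged.setdefault(key, [])
        if value not in values:
            values.append(value)
    return merged


# Needles as several plugins might emit them for one file.
needles = [('ext', 'cpp'), ('size', 1024), ('ext', 'cpp'), ('ext', 'h')]
merged_needles = merge_needles(needles)
```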

needles_by_line()[source]

Return per-line search data for one file: for example, markers that indicate a function called “foo” is defined on a certain line.

Yield an iterable of key-value pairs for each of a file’s lines, one iterable per line, in order. The data might be data to search on or data stowed away for a later realtime thing to generate refs or regions from. In any case, each pair becomes an elasticsearch property and its value.

If the framework encounters multiple needles of the same key on the same line (whether coming from the same plugin or different ones), all unique values will be retained using an elasticsearch array. Values may be dicts, in which case common keys get merged by append_update().

This method is not called on symlink files, to maintain the illusion that they do not have contents, seeing as they cannot be viewed in file browsing.
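
A minimal sketch of the shape needles_by_line() produces follows. `PyFunctionIndexer` is a hypothetical stand-in rather than a real FileToIndex subclass, the regex is deliberately naive, and `str.splitlines()` only approximates the universal-newline splitting the framework expects.

```python
import re

FUNCTION_DEF = re.compile(r'^\s*def\s+(\w+)\s*\(')


class PyFunctionIndexer(object):
    """Hypothetical stand-in for a FileToIndex subclass."""

    def __init__(self, path, contents):
        self.path = path
        self.contents = contents

    def needles_by_line(self):
        # One iterable of (key, value) pairs per line, in order. Lines with
        # nothing to report still yield an empty list so positions line up.
        for line in self.contents.splitlines():
            match = FUNCTION_DEF.match(line)
            yield [('py-function', match.group(1))] if match else []


indexer = PyFunctionIndexer('frob.py', 'def frob():\n    nic(ate())\n')
per_line = list(indexer.needles_by_line())
```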

FileToIndex also has all the methods of its superclass, FileToSkim.

6.6.4.1. Looking Inside Elasticsearch

While debugging a file indexer, it can help to see what is actually getting into elasticsearch. For example, if you are debugging needles_by_line(), you can see all the data attached to each line of code (up to 1000) with this curl command:

curl -s -XGET "http://localhost:9200/dxr_10_code/line/_search?pretty&size=1000"

Be sure to replace “dxr_10_code” with the name of your DXR index. You can see which indexes exist by running...

curl -s -XGET "http://localhost:9200/_status?pretty"

Similarly, when debugging needles(), you can see all the data attached to files-as-a-whole with...

curl -s -XGET "http://localhost:9200/dxr_10_code/file/_search?pretty&size=1000"

6.6.5. File Skimmers

class dxr.indexers.FileToSkim(path, contents, plugin_name, tree, file_properties=None, line_properties=None)[source]

A source of rendering data about a file, generated at request time

This is appropriate for unindexed files (such as old revisions pulled out of a VCS) or for data so large or cheap to produce that it’s a bad tradeoff to store it in the index. An instance of me is mostly an opportunity for a shared cache among my methods.

Parameters:
  • path – The (bytestring) conceptual path to the file, relative to the tree’s source folder. Such a file might not exist on disk. This is useful mostly as a hint for syntax coloring.
  • contents – What’s in the file: unicode if we knew or successfully guessed an encoding, None otherwise. Don’t return any by-line data for None; the framework won’t have succeeded in breaking up the file by line for display, so there will be no useful UI for those data to support. In fact, most skimmers won’t be able to do anything useful with None at all. For unicode, split the file into lines using universal newlines (dxr.utils.split_content_lines()); that’s what the rest of the framework expects.
  • tree – The TreeConfig of the tree to which the file belongs

If the file is indexed, there will also be...

Parameters:
  • file_properties – Dict of file-wide needles emitted by the indexer
  • line_properties – List of per-line needle dicts emitted by the indexer

absolute_path()[source]

Return the (bytestring) absolute path of the file to skim.

Note: in skimmers, the returned path may not exist if the source folder moved between index and serve time.

annotations_by_line()[source]

Yield extra user-readable information about each line, hidden by default: compiler warnings that occurred there, for example.

Yield a list of annotation maps for each line:

{'title': ..., 'class': ..., 'style': ...}

char_offset(row, col)[source]

Return the from-BOF unicode char offset for the char at the given row and column of the file we’re indexing.

This is handy for translating row- and column-oriented input to the format refs() and regions() want.

Parameters:
  • row – The 1-based line number, according to splitting in universal newline mode
  • col – The 0-based column number
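
The arithmetic behind such a conversion can be sketched as a free function. This is an illustration, not DXR's implementation: it takes the file contents as an argument, and `str.splitlines(True)` only approximates universal-newline splitting.

```python
def char_offset(contents, row, col):
    """Return the from-start-of-file offset of the char at (row, col).

    row is 1-based and col is 0-based, matching the parameters above.
    """
    lines = contents.splitlines(True)  # Keep line endings so lengths add up.
    return sum(len(line) for line in lines[:row - 1]) + col


source = 'def frob():\n    nic(ate())\n'
offset = char_offset(source, 2, 4)
print(source[offset:offset + 3])  # prints "nic"
```
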
contains_text()[source]

Return whether this file can be decoded and divided into lines as text. Empty files contain text.

This may come in handy as a component of your own is_interesting() methods.

is_interesting()[source]

Return whether it’s worthwhile to examine this file.

For example, if this class knows about how to analyze JS files, return True only if self.path.endswith('.js'). If something falsy is returned, the framework won’t call data-producing methods like links(), refs(), etc.

The default implementation selects only text files that are not symlinks. Note: even if a plugin decides that symlinks are interesting, it should remember that links, refs, regions and by-line annotations will not be called because views of symlinks redirect to the original file.
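
For example, a skimmer that cares only about decodable JS files might combine the two checks as sketched here. `JsSkimmer` is a hypothetical stand-in, not a FileToSkim subclass, and it uses a str path for simplicity where DXR documents a bytestring.

```python
class JsSkimmer(object):
    """Hypothetical stand-in for a FileToSkim subclass."""

    def __init__(self, path, contents):
        self.path = path          # str here for simplicity
        self.contents = contents  # None when the file could not be decoded

    def contains_text(self):
        # Empty files count as text; only undecodable ones do not.
        return self.contents is not None

    def is_interesting(self):
        return self.contains_text() and self.path.endswith('.js')
```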

is_link()[source]

Return whether the file is a symlink.

Note: symlinks are never displayed in file browsing; a request for a symlink redirects to its target.

links()[source]

Return an iterable of links for the navigation pane:

(sort order, heading, [(icon, title, href), ...])

File views will replace any {{line}} within the href with the last-selected line number.
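
A sketch of one such entry follows. The sort order, heading, icon name, and URL are all made up for illustration, and `fill_in_line()` merely imitates the substitution the file view performs.

```python
def links():
    """Yield navigation-pane entries in the shape described above."""
    yield (5,
           'Example VCS',
           [('example-icon',
             'Blame',
             'https://vcs.example.com/blame/main.py#l{{line}}')])


def fill_in_line(href, line_number):
    """Imitate the file view's replacement of {{line}} (hypothetical)."""
    return href.replace('{{line}}', str(line_number))


sort_order, heading, items = next(links())
icon, title, href = items[0]
print(fill_in_line(href, 42))
```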

refs()[source]

Provide cross references for various spans of text, accessed through a context menu.

Yield an ordered list of extents and menu items:

(start, end, ref)

start and end are the bounds of a slice of a Unicode string holding the contents of the file. (refs() will not be called for binary files.)

ref is a Ref.

regions()[source]

Yield instructions for syntax coloring and other inline formatting of code.

Yield an ordered list of extents and CSS classes (encapsulated in Region instances):

(start, end, Region)

start and end are the bounds of a slice of a Unicode string holding the contents of the file. (regions() will not be called for binary files.)
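
The triples might be produced along these lines. The `Region` namedtuple is a hypothetical stand-in for dxr.lines.Region, the CSS class `k` is an arbitrary choice, and the function takes the contents as an argument rather than reading them from an instance.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for dxr.lines.Region, which wraps a CSS class.
Region = namedtuple('Region', ['css_class'])

KEYWORD = re.compile(r'\b(def|return|import)\b')


def regions(contents):
    """Yield (start, end, Region) triples for a few Python keywords."""
    for match in KEYWORD.finditer(contents):
        # Offsets are slice bounds into the whole file's contents.
        yield match.start(), match.end(), Region('k')


triples = list(regions('def frob():\n    return 1\n'))
```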

class dxr.lines.Ref(tree, menu_data, hover=None, qualname=None, qualname_hash=None)[source]

Abstract superclass for a cross-reference attached to a run of text

Carries enough data to construct a context menu, highlight instances of the same symbol, and show something informative on hover.

Parameters:
  • menu_data – Arbitrary JSON-serializable data from which we can construct a context menu
  • hover – The contents of the <a> tag’s title attribute. (The first one wins.)
  • qualname – A hashable unique identifier for the symbol surrounded by this ref, for highlighting
  • qualname_hash – The hashed version of qualname, which you can pass instead of qualname if you have access to the already-hashed version

es()[source]

Return a serialization of myself to store in elasticsearch.

static es_to_triple(es_data, tree)[source]

Convert ES-dwelling ref representation to a (start, end, Ref subclass) triple.

Return a subclass of Ref, chosen according to the ES data. Into its attributes “menu_data”, “hover” and “qualname_hash”, copy the ES properties of the same names, JSON-decoding “menu_data” first.

Parameters:
  • es_data – An item from the array under the ‘refs’ key of an ES LINE document
  • tree – The TreeConfig representing the tree from which the es_data was pulled

menu_items()[source]

Return an iterable of menu items to be attached to a ref.

Return an iterable of dicts of this form:

{
    html: the HTML to be used as the menu item itself
    href: the URL to visit when the menu item is chosen
    title: the tooltip text given on hovering over the menu item
    icon: the icon to show next to the menu item: the name of a PNG
        from the ``icons`` folder, without the .png extension
}

Typically, this pulls data out of self.menu_data.
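
A subclass might build its menu like this. The `Ref` base class below is a pared-down hypothetical stand-in for dxr.lines.Ref, the tree is a plain string rather than a TreeConfig, and the URL pattern and icon name are invented.

```python
class Ref(object):
    """Hypothetical stand-in for dxr.lines.Ref; only what the sketch needs."""

    def __init__(self, tree, menu_data, hover=None, qualname=None):
        self.tree = tree
        self.menu_data = menu_data
        self.hover = hover
        self.qualname = qualname


class FunctionRef(Ref):
    def menu_items(self):
        # Everything the menu needs was stashed in menu_data at index time.
        name = self.menu_data['name']
        yield {'html': 'Find callers of %s' % name,
               'href': '/%s/search?q=callers:%s' % (self.tree, name),
               'title': 'Find functions that call this one',
               'icon': 'example-icon'}


items = list(FunctionRef('code', {'name': 'frob'}).menu_items())
```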

opener()[source]

Emit the opening anchor tag for a cross reference.

Menu item text, links, and metadata are JSON-encoded and dumped into a data attr on the tag. JS finds them there and creates a menu on click.

class dxr.lines.Region(css_class)[source]

A <span> tag with a CSS class, wrapped around a run of text

classmethod es_to_triple(es_region)[source]

Convert ES-dwelling region representation to a (start, end, Region) triple.

6.6.6. Filters

class dxr.filters.Filter(term, enabled_plugins)[source]

A provider of search strategy and highlighting

Filter classes, which roughly correspond to the items in the Filters dropdown menu, tell DXR how to query the data stored in elasticsearch by needles() and needles_by_line(). An instance is created for each query term whose name matches and persists through the querying and highlighting phases.

This is an optional base class that saves code on many filters. It also serves to document the filter API.

Variables:
  • name – The string prefix used in a query term to activate this filter. For example, if this were “path”, this filter would be activated for the query term “path:foo”. Multiple filters can be registered against a single name; they are ORed together. For example, it is good practice for a language plugin to query against a language specific needle (like “js-function”) but register against the more generic “function” here. (This allows us to do language-specific queries.)
  • domain – Either LINE or FILE. LINE means this filter returns results that point to specific lines of files; FILE means they point to files as a whole. Default: LINE.
  • description – A description of this filter for the Filters menu: unicode or Markup (in case you want to wrap examples in <code> tags). Of filters having the same name, the description of the first one encountered will be used. An empty description will hide a filter from the menu. This should probably be used only internally, by the TextFilter.
  • union_only – Whether this filter will always be ORed with others of the same name, useful for filters where the intersection would always be empty, such as extensions
  • is_reference – Whether to include this filter in the “ref:” aggregate filter
  • is_identifier – Whether to include this filter in the “id:” aggregate filter

This is a good place to parse the term’s arg (if it requires further parsing) and stash it away on the instance.

Parameters:
  • term – a query term as constructed by a QueryVisitor
  • enabled_plugins – an iterable of the enabled Plugin instances, for use by filters that build upon the filters provided by plugins

Raise BadTerm to complain to the user: for instance, about an unparseable term.

filter()[source]

Return the ES filter clause that applies my restrictions to the found set of lines (or files and folders, if domain is FILE).

To quietly do no filtration, return None. This would be suitable for path:*, for example.

To do no filtration and complain to the user about it, raise BadTerm.

We might even make this return a list of filter clauses, for things like the RegexFilter which want a bunch of match_phrases and a script.

highlight_content(result)[source]

Return an unsorted iterable of extents that should be highlighted in the content field of a search result.

Parameters:result – A mapping representing properties from a search result, whether a file or a line. With access to all the data, you can, for example, use the extents from a ‘c-function’ needle to inform the highlighting of the ‘content’ field.

highlight_path(result)[source]

Return an unsorted iterable of extents that should be highlighted in the path field of a search result.

Parameters:result – A mapping representing properties from a search result, whether a file or a line. With access to all the data, you can, for example, use the extents from a ‘c-function’ needle to inform the highlighting of the ‘content’ field.
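
Putting the pieces of the filter API together, a bare-bones filter might look like the sketch below. It does not subclass the real Filter, and several details are assumptions made for illustration: that a term is a dict with an `arg` key, that the result's `content` is a single string, that extents are (start, end) tuples, and that a simple `term` clause suits the needle.

```python
LINE = 'line'  # Stand-in for DXR's LINE constant


class BadTerm(Exception):
    """Stand-in for the exception DXR uses to complain about a term."""


class PyFunctionFilter(object):
    name = 'function'  # Activated by query terms like "function:frob"
    domain = LINE
    description = 'Python function definition'

    def __init__(self, term, enabled_plugins):
        # Parse the term's arg once and stash it on the instance.
        self._arg = term['arg']
        if not self._arg:
            raise BadTerm('function: needs a name to look for.')

    def filter(self):
        if self._arg == '*':
            return None  # Quietly do no filtration.
        return {'term': {'py-function': self._arg}}

    def highlight_content(self, result):
        start = result['content'].find(self._arg)
        return [] if start == -1 else [(start, start + len(self._arg))]


frob_filter = PyFunctionFilter({'arg': 'frob'}, [])
```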

6.6.7. Mappings

When you’re laying down data to search upon, it’s generally not enough just to write needles() or needles_by_line() implementations. If you want to search case-insensitively, for example, you’ll need elasticsearch to fold your data to lowercase. (Don’t fall into the trap of doing this in Python; the Lucene machinery behind ES is better at the complexities of Unicode.) The way you express these instructions to ES is through mappings and analyzers.

ES mappings are schemas which specify the type of data (string, int, datetime, etc.) and how to index it. For example, here is an excerpt of DXR’s core mapping, defined in the core plugin:

mappings = {
    # Following the typical ES mapping format, `mappings` is a hash keyed
    # by doctype. So far, the choices are ``LINE`` and ``FILE``.
    LINE: {
        'properties': {
            # Line number gets mapped as an integer. Default indexing is fine
            # for numbers, so we don't say anything explicitly.
            'number': {
                'type': 'integer'
            },

            # The content of the line itself gets mapped 3 different ways.
            'content': {
                # First, we store it as a string without actually putting it
                # into any ordered index structure. This is for retrieval and
                # display in search results, not for searching on:
                'type': 'string',
                'index': 'no',

                # Then, we index it in two different ways: broken into
                # trigrams (3-letter chunks) and either folded to lowercase or
                # not. This cleverness takes care of substring matching and
                # accelerates our regular expression search:
                'fields': {
                    'trigrams_lower': {
                        'type': 'string',
                        'analyzer': 'trigramalyzer_lower'
                    },
                    'trigrams': {
                        'type': 'string',
                        'analyzer': 'trigramalyzer'
                    }
                }
            }
        }
    },
    FILE: ...
}

Mappings follow exactly the same structure as required by ES’s “put mapping” API. The choice of mapping types is also outlined in the ES documentation.

Warning

Since a FILE-domain query will be promoted to a LINE query if any other query term triggers a line-based query, it’s important to keep field names and semantics the same between lines and files. In other words, a LINE mapping should generally be a superset of a FILE mapping. Otherwise, ES will guess mappings for the undeclared fields, and surprising search results will likely ensue. Worse, the bad guesses will likely happen intermittently.
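
One way to catch the problem described in this warning early is a small consistency check over a plugin's mappings. The following is a minimal sketch, not part of DXR: the function name is hypothetical, and the doctype keys 'line' and 'file' are assumed stand-ins for the LINE and FILE constants.

```python
def undeclared_line_fields(mappings, line_doctype='line', file_doctype='file'):
    """Return the sorted names of fields that the FILE mapping declares but
    the LINE mapping does not.

    Hypothetical helper; the default doctype keys are assumptions.
    """
    file_props = mappings.get(file_doctype, {}).get('properties', {})
    line_props = mappings.get(line_doctype, {}).get('properties', {})
    return sorted(set(file_props) - set(line_props))


# A made-up mapping in which 'size' is declared for files only:
example = {
    'line': {'properties': {'path': {'type': 'string'},
                            'number': {'type': 'integer'}}},
    'file': {'properties': {'path': {'type': 'string'},
                            'size': {'type': 'integer'}}},
}
print(undeclared_line_fields(example))  # ['size']
```

An empty list means every FILE field is also declared for LINE docs. The check compares names only, so it would not notice a field declared with different types or analyzers in the two mappings.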

6.6.7.1. The Format Version

In the top level of the dxr package (not the top of the source checkout, mind you) lurks a file called format. Its role is to facilitate the automatic deployment of new versions of DXR using dxr deploy. The format file contains an integer which represents the index format expected by dxr serve. If a change in the code requires a mapping or semantics change in the index, the format version must be incremented. In response, the deployment script will wait until new indices, of the new format, have been built before deploying the change.

If you aren’t sure whether to bump the format version, you can always build an index using the old code, then check out the new code and try to serve the old index with it. If it works, you’re probably safe not bumping the version.
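
The comparison at the heart of this mechanism can be sketched in a few lines. This is an illustration under stated assumptions, not DXR's deployment code: both function names are hypothetical, the file is assumed to hold nothing but the integer, and "compatible" is assumed to mean that the two versions are equal.

```python
def read_format_version(path):
    """Parse the integer stored in a format file (hypothetical helper)."""
    with open(path) as format_file:
        return int(format_file.read().strip())


def index_is_servable(index_format, code_format):
    """Assumption: an index can be served only when its format version
    equals the one the code expects."""
    return index_format == code_format
```

How an existing index records the format it was built with is not covered here, so index_format is simply passed in as a parameter.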

6.6.8. Analyzers

In Mappings, we alluded to custom indexing strategies, like breaking strings into lowercase trigrams. These strategies are called analyzers and are the final component of a plugin. ES has strong documentation on defining analyzers. Declare your analyzers (and building blocks of them, like tokenizers) in the same format the ES documentation prescribes. For example, the analyzers used above are defined in the core plugin as follows:

analyzers = {
    'analyzer': {
        # A lowercase trigram analyzer:
        'trigramalyzer_lower': {
            'filter': ['lowercase'],
            'tokenizer': 'trigram_tokenizer'
        },
        # And one for case-sensitive things:
        'trigramalyzer': {
            'tokenizer': 'trigram_tokenizer'
        }
    },
    'tokenizer': {
        'trigram_tokenizer': {
            'type': 'nGram',
            'min_gram': 3,
            'max_gram': 3
            # Keeps all kinds of chars by default.
        }
    }
}
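
To get a feel for what these analyzers emit, the tokenization can be approximated in plain Python. The sketch below is only an emulation for illustration, with a hypothetical function name: the real work happens inside ES, and Python's str.lower() is not guaranteed to agree with Lucene's lowercase filter on every Unicode input.

```python
def trigrams(text, fold_case=False):
    """Approximate the tokens of an nGram tokenizer with min_gram and
    max_gram both set to 3, optionally folding to lowercase first.

    Hypothetical emulation; no characters are filtered out, mirroring the
    tokenizer's default of keeping all kinds of characters.
    """
    if fold_case:
        text = text.lower()
    # Strings shorter than 3 characters produce no tokens at all.
    return [text[i:i + 3] for i in range(len(text) - 2)]


print(trigrams('frob()'))                # ['fro', 'rob', 'ob(', 'b()']
print(trigrams('Frob', fold_case=True))  # ['fro', 'rob']
```

The first call corresponds roughly to trigramalyzer and the second to trigramalyzer_lower. A line can contain a given substring only if every trigram of that substring is among the line's trigrams, which is what makes the index useful for narrowing down substring and regex searches.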

6.7. Contributing Documentation

We use Read the Docs to build and host the documentation. It uses Sphinx to generate HTML from reStructuredText markup.

To edit documentation:

  • Edit *.rst files in docs/source/ in your local checkout. See the reStructuredText primer for help with syntax.
  • Run cd ~/dxr/docs && make html in the VM to preview the docs.
  • When you’re satisfied, submit a pull request as usual.

6.8. Troubleshooting

Why is my copy of DXR acting erratically, failing at searches, making requests for JS templates that shouldn’t exist, and just generally not appearing to be in sync with my changes?
Did you run python setup.py install for DXR at some point? Never, ever do that in development; use python setup.py develop instead. Otherwise, you will end up with various files copied into your virtualenv, and your edits to the originals will have no effect.
How can I use pdb to debug indexing?
In the DXR config file for the tree you’re building, add workers = 0 to the [DXR] section. That will keep DXR from spawning multiple worker processes, something pdb doesn’t tolerate well.
I pulled a new version of the code that’s supposed to have a new plugin (or I added one myself), but it’s acting like it doesn’t exist.
Re-run python setup.py develop to register the new setuptools entry point.