crsp_stk_subset

crsp_stk_subset creates a new CRSPAccess database from an existing database by subsetting data using date range, frequency, and identifier screens. The program allows screening by date range, exchange, share type, NASDAQ National Market inclusion, and when-issued status, and can convert the frequency of time-series data.

Usage

crsp_stk_subset inpath outpath insetid outsetid paramfile logfile [permfile]

Parameters

inpath
Input CRSPDB directory path. The directory where the database is stored. Standard environment names can be used such as $CRSP_DSTK or $CRSP_MSTK on UNIX, %crsp_dstk% or %crsp_mstk% on Windows.

outpath
Output CRSPDB directory path. The directory where the new output CRSPAccess will be created. The directory must not include existing CRSPAccess data and the user must have permission and enough disk space to create the resultant database.

insetid
Input Setid. The input database set type. Use:

10  for daily stock data

20  for monthly stock data

outsetid
Output Setid. The output database set type. Input and output index setids should be the same unless the frequency of the standard time-series is changing from daily to monthly or less frequent calendar.

paramfile
Parameter file. The name of a text file containing specifications of the subsetting to be done in converting the input database to the output database. See the Parameter Options Specifications table on the next page, for the subsetting options and the specifications of this file.

logfile
Log file. The name of an output file to be created with logging information about the input securities. Each line in the log file will contain a PERMNO and a two-letter code on the status of the input PERMNO in the output database. The codes are:

OK  if the security is kept in the output database with no changes to header information.

O#  if the security is kept, but header information is changed because the most recent
information changed after removing some part of the history.

1  if the header CUSIP changed

2  if header exchange code changed, and

4  if header SIC code changed.

DT  if the security is excluded due to date range

EX  if the security is excluded due to exchange

SH  if the security is excluded due to share type

WI  if the security is excluded due to when-issued screening

NM  if the security is excluded due to NASDAQ National Market screening

permfile
(optional) An optional file containing a list of PERMNOs, one to a line, of the securities in the input database to be subsetted. If this option is not given, all the securities in the database will be used.

Parameter File Options

The crsp_stk_subset program uses an input text file to select subsetting options. The input file consists of one or more lines, each with a keyword and a value. The keywords, definitions, and rules for use are as follows:

begdate

Beginning date. The first date of valid data, in YYYYMMDD format, if a date restriction is made. If begdate is used it must be a trading date in the price calendar of the input database. enddate must also be used and must be after begdate. If begdate is not used, there is no restriction by date.

enddate
Ending date. The last date of valid data, in YYYYMMDD format, if a date restriction is made. If enddate is used it must be a trading date in the price calendar of the input database. begdate must also be used and must precede enddate. If enddate is not used, there is no restriction by date.

want_exch
A binary flag indicating which exchanges are kept in the output database. The following codes are used to indicate the exchanges to keep:

1  NYSE

NYSE MKT

NASDAQ

ARCA

If want_exch is not specified, no exchange restriction is made.

ex_subflag
Modifies want_exch. Use:

(default) all data while trading on unwanted exchanges is not included in the new database.

the entire issue is removed if it ever traded on an unwanted exchange.

no restrictions are made if ever trading on a wanted exchange.

shrcode
A code that determines which share types are kept in the result database. The possible values are:

1  restrict based on CRSP NYSE and NYSE MKT file restrictions, including share codes with a first digit of 1,2,3, 4, and 7, and any second digit.

3  restrict based on CRSP Cap-Based Portfolios, including the same restrictions are 1, but also excluding ADRs, foreign-incorporated issues, REITs, and closed end investment funds.

4  restrict based on CRSP Total Return Indexes, including the same restrictions as 1, but also including share codes with a first digit of 9, including units including non-common components.

5  restrict based on specific digits of the CRSP share code. If this option is chosen, shrcodel and shrcoder must be specified.

shrcodel
A string indicating which first digits of share codes are valid. The string is a 10-character string, with each character a 0 or 1. If the nth character in the string is a 0, securities where the first digit of the share code is n are excluded. If the nth character in the string is 1, securities where the first digit of the share code is n are kept.

For example, the line shrcodel 0101000000 would be used to keep only ordinary common shares and ADRs, with CRSP share codes with a first digit of 1 or 3. Shrcodel can only be used if shrcode and shrcoder are specified.

shrcoder
A string indicating which second digit CRSP of share codes are valid. The string is a 10-character string, with each character a 0 or 1. If the nth character in the string is a 0, securities where the second digit of the share code is n are excluded. If the nth character in the string is 1, securities where the second digit of the share code is n are kept.

For example, the line shrcoder 1101101111 would be used to keep all secondary share types except foreign incorporated securities and closed-end funds incorporated outside the U.S. (share codes ending in 2 or 5) shrcoder can only be used if shrcode and shrcodel are specified.

sc_subflag modifies shrcode. Use:

(default) All data while classified as an unwanted share code is erased.

The entire issue is removed if ever classified as an unwanted share code.

No restrictions are made if ever classified as a wanted share code.

nmscode
A numeric code that can further restrict issues trading on NASDAQ. The codes are:

keep all Global Markets (Global Market and Global Select Market, and National Market before July 1, 2006. (NMSIND = 2, 5, or 6)

2  keep all Capital Markets (named Small-Cap before July 1, 2006) (NMSIND = 1, 3, or 4)

3  keep all All Trade Reported Tiers, excluding only Small-Cap before June 15, 1992 (NMSIND = 2, 3, 4, 5, or 6)

4  keep all Non-Trade-Reported Tiers, including only Small-Cap before June 15, 1992 (NMSIND = 1)

5  keep all Global Select Market and National Market (NMSIND = 2 or 6)

6  keep all Global Market Only(named National Market before July 1, 2006) (NMSIND = 2 or 5)

Global Select Market Only (NMSIND = 6)

nms_subflag modifies nmscode. Use:

(default) all data while trading on unwanted NASDAQ market is erased.

the entire issue is removed if ever trading on an unwanted NASDAQ market.

2  no restrictions are made if ever trading on a wanted NASDAQ market.

wicode
A three character code used to restrict types of when-issued trading. When-issued trading is trading supported by an exchange of an issue that does not officially exist but is expected to exist in the future. The program supports three types of when-issued trading:

initial - an anticipated new issue is traded before its trading status becomes official.

ex-distributed - a post-split or post-reorganization version of a security is traded before the ex-date, simultaneously with the regular issue, with prices independent of the regular issue.

reorganization - a security undergoing a reorganization, such as a Chapter 11, trades with the expectation of returning under a plan of reorganization.

CRSP subscriber databases currently include only reorganization when-issued trading. The default is to make no further restrictions. Each of the three characters in wicode refers to the restrictions made for that type of when-issued trading.

1st digit
0 to make no restrictions, 1 to erase when-issued price range and erase name information, 2 to erase when-issued price range but keep name information.

2nd digit
0 to make no restrictions, 1 to delete ex-distributed issues

3rd digit
0 to make no restrictions, 1 to erase reorganization when-issued price ranges but keep name information, 2 to keep reorganization when-issued price ranges but delete name information, and 3 to erase price ranges and name information

nameflag
A numeric code determining how name structures are restricted when restrictions are made using begdate and enddate. The values are:

keep entire name history

delete names no longer valid before range starts

delete names beginning after range ends

3  delete names before and after ranges

shareflag
A numeric code determining how shares observations are restricted when price ranges are restricted. The values are:

erase raw shares observations out of range

1  keep raw shares observations outside of valid price range if they are used to derive shares outstanding for any time within the kept price range

2  keep the last raw shares observation that predates the first trading on NYSE, NYSE MKT, NASDAQ, or ARCA if there are no valid raw shares observations once trading starts and the first exchange is valid according to exchange restrictions

pct
Can be used to summarize NASDAQ information structures by number of market makers. If 0 or unspecified, then all NASDAQ information structures are kept. Otherwise pct is treated as a percentage change. If the only change in a NASDAQ information event is a market maker change from the last kept NASDAQ information structure less than pct, that structure is not copied to the new database.

adjdt
Base date if price, volume, or share values are adjusted. Values will be as is on this date, and adjusted in the source data using splits or other events before or after the adjustment date. The date must be in YYYYMMDD format. adjdt can be 0 to adjust each period so the last date in the period is used for the base date. This can be used to adjust data to the same basis before summarizing when changing the base frequency of the database.

factype
Type of adjustments made for prices. Possible values are:

-1  no adjustments will be made, cancels adjdt

0  prices are adjusted for all distributions with nonzero price factors

prices are adjusted only for stock splits and stock dividends

sum_code
Set to 0 if no frequency conversion will be done to create the new database and set to 1 if frequency conversion will be done. Currently only conversion from daily to monthly is supported.

sum_prc
Sets rules for loading the closing price time-series when changing the base frequency of the database. Possible values are:

0  the source price on the last day of the target period

1  the average of the absolute values of source prices during the target period

2  the median of the source prices during the target period. Absolute values of prices are used for ranking. Finding medians has a high cost in time and resources.

no prices are loaded to the target database

4  the nonmissing price from the source prices closest to the end of the period. The program will look in the previous and next target periods up to one hundred source periods in either direction if the last price is missing. If there is a price equally distant forward and backward, the earlier price is used. If a price is used that is not the last day of the period it is adjusted for all price factors between the last day of the period and the actual date of that price.

sum_sp
Sets rules for loading the Bid or Low Price and Ask or High Price time-series when changing the base frequency of the database. Possible values are:

the last source Bid or Low Price and Ask or High Price are loaded to the target Bid or Low Price and Ask or High Price time-series.

the highest askhi in the source time-series within the target range is loaded to askhi, and the lowest bidlo in the source time-series within the target range is loaded to bidlo

the highest price in the source time-series within the target range is loaded to askhi, and the lowest bidlo in the source time-series within the target range is loaded to bidlo. If bid/ask averages marked as negative prices are present, the absolute value of them are used for ranking, but if chosen the negative sign is kept.

3  no Bid or Low Price or Ask or High Price data is loaded to the target database

sum_vol
Sets rules for loading the volume time-series when changing the base frequency of the database. Possible values are:

The sum of all volumes in the target period are loaded to the target volume time-series

The average of source nonmissing volumes in the target range is loaded to the target volume time-series

Median of source nonmissing volumes in the target range is loaded to the target volume time-series

No volume data is loaded to the target database

sum_ret
Sets rules for loading the returns time-series when changing the frequency of the database. Possible values are:

0  No returns data is loaded to the target database

1  Source returns in the target range are compounded and loaded to the target returns time-series

2  Source returns and returns without dividends are compounded and loaded to the target returns timeseries

Holding Period Total Returns and returns without dividends are recalculated from the price time-series (sum_prc cannot
be 3)

sum_spread
Sets rules for loading auxiliary time-series, including Bid, Ask, Number of Trades, Price Alternate, and Spread between Bid and Ask, when changing the frequency of the database. Possible values are:

Load the last spread in each source price range to the target database. Only the Bids and Asks stored in the Bid or Low Price and Ask or High Price time-series are used.

Bid, Ask, Number of Trades, Price Alternate, and Spread between Bid and Ask time-series are not loaded in the target database

2  Bid, Ask, Number of Trades, Price Alternate, and Spread between Bid and Ask time-series are loaded with the following rules:

The last nonmissing Price or Bid/Ask Average from the source within the target range is loaded to the Price Alternate time-series. The Number of Trades time-series is loaded with the corresponding dates within the source where the last nonmissing Price or Bid/Ask Average was found. Bid and Ask are loaded with the corresponding value in the last target period of the source bid and ask time-series. Spread between Bid and Ask is loaded as in option 0.

Example

Windows

The parameter file is an ASCII file where users can specify the various parameters. Here is an example of a parameter file, param.txt:

begdate 19940103
enddate 19950131
want_exch 2
shrcode 5
shrcodel 0100000000
shrcoder 0100000000
nmscode 0
wicode 0
nameflag 0
shareflag 1
pct 25
adjdt 0
factype -1
sum_code 0
sum_prc 0
sum_sp 2
sum_vol 1
sum_ret 0
sum_spread 2

This file will result in a database with NYSE MKT data for securities with a share code of 11 with data from January 3, 1994 until January 31, 1995.

To create the new database in c:\dasub\ using the daily stock database as input, using these parameters loaded to a file called param.txt and using all PERMNOs,

crsp_stk_subset %crsp_dstk% c:\dasub\ 10 10 param.txt subset.log